Circular reasoning (part 2)
The code with the greatest entanglement across the Clojure codebase comes in the static classes clojure.lang.RT and clojure.lang.Util. How can these classes be restructured to reduce cyclic dependence and improve clarity?
Digging in
Line counts may not be a good indicator of complexity, but let’s be clear: RT and Util are big. RT.cs is over 3500 lines of text; Util.cs is another 900 lines. We have a lot of code to wade through.
These classes provide services to several distinct areas of Clojure: collections; the readers; the compiler; and the Clojure source code in core.clj and related files. I am trying to structure the code so that collections are implemented in their entirety before the other pieces. So my analysis below is based on how RT and Util are used in the collection classes.
Util
Let’s start with Util. Breaking the code into categories:
- hashing
- equiv/equals
- numeric support
- conversions
- definition of categories
- little utilities
- miscellaneous
The problem of hashing, equality and the difference between equiv and equals is going to take its own post. For the moment, what’s most concerning is that these things involve numerics: what is a number? how do you compare numbers for equality? how do you hash numbers? A lot of these problems are handled in Util by calls to methods in clojure.lang.Numbers – and suddenly we have 4700 more lines of text to analyze.
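To make the numeric concern concrete, here is a small C# sketch of why plain Equals is not enough on its own. The NumericEquivSketch class and its Equiv method are hypothetical and cover only a few signed integer types. This is not the code in Util or Numbers, just an illustration of the kind of category-aware comparison that has to live somewhere.

```csharp
using System;

// Hypothetical illustration only; not the actual Util/Numbers code.
public static class NumericEquivSketch
{
    // True for the boxed signed integer types this sketch knows about.
    static bool IsInteger(object o) =>
        o is sbyte || o is short || o is int || o is long;

    // Compare integers by value regardless of their boxed type;
    // everything else falls back to ordinary Equals.
    public static bool Equiv(object a, object b)
    {
        if (ReferenceEquals(a, b)) return true;
        if (a == null || b == null) return false;
        if (IsInteger(a) && IsInteger(b))
            return Convert.ToInt64(a) == Convert.ToInt64(b);
        return a.Equals(b);
    }
}
```

In .NET, ((object)1).Equals(1L) is false because Int32.Equals(object) only accepts another Int32, while NumericEquivSketch.Equiv(1, 1L) is true. Values that are equal under such a comparison also have to hash alike, so the hashing code needs the same kind of numeric normalization.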
We could extract some things from Numbers and just use them; Numbers.hasheq, for example. But also referenced are Numbers.equal and Numbers.compare, which use the whole machinery that Numbers sets up. It might be possible to pull out just those pieces for now. If we have to implement all of Numbers, we’re going to be here for a while.
Other than this, I think Util is not so deeply entwined with the other parts of Clojure that we can’t implement it and make it available early in the compile sequence.
RT
I will spare you a full-blown analysis. There is plenty of code that is not needed for implementing collections and can be deferred until later. There are a few trouble spots in collections.
The first is a set of methods that focus on maps: RT.map, RT.assoc, RT.dissoc primarily. These use some map classes. But it looks like the map classes that call these methods are not the ones the methods themselves refer to. Just splitting them out and introducing them into the right place in the compile sequence should suffice.
RT.seq
The second area concerns RT.seq. This method converts its argument into an ISeq. It is like Seqable.seq – in fact, it uses that, but also special cases things like String. It also puts some special cases up front for efficiency. Here is the C# code:
public static ISeq seq(object coll)
{
    if (coll is ASeq aseq)
        return aseq;
    return coll is LazySeq lseq ? lseq.seq() : seqFrom(coll);
}
// N.B. canSeq must be kept in sync with this!
private static ISeq seqFrom(object coll)
{
    if (coll == null)
        return null;
    if (coll is Seqable seq)
        return seq.seq();
    if (coll.GetType().IsArray)
        return ArraySeq.createFromObject(coll);
    if (coll is string str)
        return StringSeq.create(str);
    if (coll is IEnumerable ie) // java: Iterable -- reordered clauses so others take precedence.
        return chunkEnumeratorSeq(ie.GetEnumerator()); // chunkIteratorSeq
    // The equivalent for Java:Map is IDictionary. IDictionary is IEnumerable, so is handled above.
    //else if(coll isntanceof Map)
    //    return seq(((Map) coll).entrySet());
    // Used to be in the java version:
    //else if (coll is IEnumerator) // java: Iterator
    //    return EnumeratorSeq.create((IEnumerator)coll);
    throw new ArgumentException("Don't know how to create ISeq from: " + coll.GetType().FullName);
}
You see our good friend ASeq at the front, followed by LazySeq. The problem is that ASeq really needs RT.seq for itself. As do Cons, EmptyList, and PersistentList, either directly or through RT.first, RT.count and other RT methods that use RT.seq.
If this were all there was to it, I could think of some easy solutions, such as having two versions of RT.seq, one for Cons, EmptyList and PersistentList, and another for everyone defined after (perhaps with a cyclic dependency on ASeq). Or we could indirect through a static mutable binding to be updated at a later point.
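For concreteness, the static-mutable-binding option could look roughly like the sketch below. Everything in it is an assumption made for illustration: ISeq is a cut-down stand-in for clojure.lang.ISeq, and the SeqHook and SketchCons names are hypothetical, not taken from ClojureCLR.

```csharp
using System;

// Stand-in for clojure.lang.ISeq, reduced to what the sketch needs.
public interface ISeq
{
    object first();
    ISeq next();
}

// Tiny cons cell so the sketch has something that is already an ISeq.
public sealed class SketchCons : ISeq
{
    readonly object _first;
    readonly ISeq _next;
    public SketchCons(object first, ISeq next) { _first = first; _next = next; }
    public object first() => _first;
    public ISeq next() => _next;
}

// Hypothetical holder for the indirection.
public static class SeqHook
{
    // Early, limited implementation: handles null and values that are already ISeq.
    static ISeq EarlySeq(object coll)
    {
        if (coll == null) return null;
        if (coll is ISeq s) return s;
        throw new ArgumentException(
            "Don't know how to create ISeq from: " + coll.GetType().FullName);
    }

    // The mutable binding. Code defined later replaces it once the
    // array, string and enumerator seqs exist.
    public static Func<object, ISeq> Impl = EarlySeq;

    // What the early collection classes would call in place of RT.seq.
    public static ISeq seq(object coll) => Impl(coll);
}
```

Later code would assign a fuller function to SeqHook.Impl, typically one that handles its own cases and delegates everything else to the previous value. The early classes compile against SeqHook only, so they carry no compile-time reference to the later types.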
But take a closer look. RT.seq uses ArraySeq, StringSeq and RT.chunkEnumeratorSeq. And guess what? Those all either are themselves based on ASeq or create things that are.
Shall we do an inventory of our cyclic dependencies?
- Cons
- EmptyList
- PersistentList
- ASeq
- ArraySeq derivatives (There are a bunch, including one for each primitive numeric type.)
- StringSeq
- ChunkedCons – used by RT.chunkEnumeratorSeq
- RT.seq()
The solution of having multiple versions of RT.seq seems unachievable. Indirecting through a static mutable binding might be possible, but a little distasteful.
One solution is protocols. If you are familiar with protocols in Clojure, this would be some variation on that theme. Here, types would be registered as supporting the protocol corresponding to RT.seq.
You will recall my mentioning in ClojureCLR reconsidered that Rich Hickey would use protocols from the bottom up if rewriting Clojure from scratch. Poster child here.
Based on what is in ClojureCLR right now, our version would need enhancements to handle generic classes (open or closed), maybe for types that are .IsArray, and some other variations. It would need to be performant.
I don’t know how to do that yet. And I don’t want to wait until I figure it out. So the way forward is just to hack in a solution for RT.seq that will get us out of this cycle of despair, or rather dependency. The hack will be a version of RT.seq with a lookup mechanism that participating types can register with.
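A minimal sketch of what such a lookup mechanism might look like follows. It is an assumption-laden illustration, not the actual hack: ISeq is a cut-down stand-in for clojure.lang.ISeq, and SeqRegistry, Register and CharSeq are hypothetical names. The linear scan is there for clarity only; a real version would need something faster, such as a per-type cache.

```csharp
using System;
using System.Collections.Generic;

// Stand-in for clojure.lang.ISeq, reduced to what the sketch needs.
public interface ISeq
{
    object first();
    ISeq next();
}

// Minimal seq over a string, just so there is something to register.
public sealed class CharSeq : ISeq
{
    readonly string _s;
    readonly int _i;
    public CharSeq(string s, int i) { _s = s; _i = i; }
    public object first() => _s[_i];
    public ISeq next() => _i + 1 < _s.Length ? new CharSeq(_s, _i + 1) : null;
}

// Hypothetical registry standing in for the lookup inside RT.seq.
public static class SeqRegistry
{
    static readonly object Gate = new object();
    static readonly List<KeyValuePair<Type, Func<object, ISeq>>> Entries =
        new List<KeyValuePair<Type, Func<object, ISeq>>>();

    // Register a converter for every value assignable to 'type'.
    // Earlier registrations take precedence over later ones.
    public static void Register(Type type, Func<object, ISeq> converter)
    {
        lock (Gate)
            Entries.Add(new KeyValuePair<Type, Func<object, ISeq>>(type, converter));
    }

    public static ISeq seq(object coll)
    {
        if (coll == null) return null;
        // Simplified fast path; the real RT.seq special-cases ASeq and LazySeq here.
        if (coll is ISeq s) return s;

        Func<object, ISeq> converter = null;
        Type t = coll.GetType();
        lock (Gate)
        {
            foreach (var entry in Entries)
            {
                if (entry.Key.IsAssignableFrom(t)) { converter = entry.Value; break; }
            }
        }
        if (converter == null)
            throw new ArgumentException("Don't know how to create ISeq from: " + t.FullName);
        return converter(coll); // invoked outside the lock
    }
}
```

A type defined later in the compile sequence would register itself once it exists, for example SeqRegistry.Register(typeof(string), o => ((string)o).Length == 0 ? null : new CharSeq((string)o, 0)). Because registrations are matched in order, the registration order plays the role that the clause order plays in seqFrom.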
RT.printString
And we’re going to come to the opposite conclusion on this one.
RT.printString is called by pretty much all the collection data structures in their ToString methods. The difficulty of writing this method ahead of the collections is that it relies on machinery that will be created later, MultiFn and Var in particular. It is definitely not possible to implement those types ahead of collections.
RT.printString is extensible via MultiFn. The code defining the extensions themselves is in the Clojure source, specifically clojure/core_print.clj. And RT.printString has a fallback in case it is used prior to that file being loaded, in RT.print, which does the real work. In the C# source,
static public void print(Object x, TextWriter w)
{
    //call multimethod
    if (PrintInitializedVar.isBound && RT.booleanCast(PrintInitializedVar.deref()))
    {
        PrOnVar.invoke(x, w);
        return;
    }
    bool readably = booleanCast(PrintReadablyVar.deref());
    // A bunch of code to print things in a default way
    // ...
}
There is a dependence on two Vars, *print-initialized* and *print-readably*, in this code.
We’ll have to finesse those calls until Var has been defined. I’ll have to come up with some workaround.
There is just no way to bring the use of Var and MultiFn into the code at this point without creating a cyclic dependency involving more types than I can count.
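One possible shape for such a workaround, offered purely as a sketch: replace the direct Var and MultiFn references with delegates that later code fills in. The PrintHooks class and its members are hypothetical names, the default printing is deliberately skeletal (no string escaping, no collections), and defaulting the readably flag to true is an assumption of the sketch.

```csharp
using System;
using System.IO;

// Hypothetical stand-in for the printing entry points; not the actual RT code.
public static class PrintHooks
{
    // Set later, once Var and MultiFn exist, to route through the multimethod.
    // Null plays the role of *print-initialized* being unbound or false.
    public static Action<object, TextWriter> PrintMethod;

    // Stands in for dereferencing *print-readably*; later code can rebind it
    // to read the real Var.
    public static Func<bool> PrintReadably = () => true;

    public static void print(object x, TextWriter w)
    {
        var printMethod = PrintMethod;
        if (printMethod != null)
        {
            printMethod(x, w);
            return;
        }
        bool readably = PrintReadably();
        // Skeletal default printing, just enough to show the control flow.
        if (x == null)
            w.Write("nil");
        else if (x is string s && readably)
        {
            w.Write('"');
            w.Write(s); // no escaping in this sketch
            w.Write('"');
        }
        else
            w.Write(x.ToString());
    }

    public static string printString(object x)
    {
        using (var sw = new StringWriter())
        {
            print(x, sw);
            return sw.ToString();
        }
    }
}
```

The collections would call PrintHooks.printString from their ToString methods without naming Var or MultiFn. The code that defines those types would assign PrintMethod and PrintReadably when it initializes.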
Summary
The Util class seems relatively straightforward other than its reliance on clojure.lang.Numbers. We will analyze that in an upcoming post, along with a look at hashing and equality in Clojure.
As for RT, of the things we need right now, RT.seq and RT.printString are the problem areas. RT.seq we will look at solving by having an extensibility mechanism. RT.printString will require a way to work around the lack of Var at the beginning.
Both solutions to our RT problem are going to require initializations further along in the source code when the needed pieces become available. This suggests ultimately some kind of static initializer that must be triggered before using the collections. That is not a problem when the collections are being used in the context of ClojureCLR as a whole – we already have RT.init being called. (It is called in Clojure.Main when you start a REPL. If you are calling into Clojure.dll from a C# program, say, you likely will do something like IFn load = clojure.clr.api.Clojure.var("clojure.core", "load"); and the static initializer in RT gets triggered pretty quickly.) It could be a problem in a testing context when we are isolating the collections. I’ll work it out.
What is clear is that the monoliths of Util and RT will not exist as single modules in this rewrite. These are not advertised interfaces and, outside of the base (C#/F#/Java) source code, should be referenced only in the Clojure source code of core.clj and the like. But we already do plenty of platform-specific modifications to that code. A little more won’t be a bother. But we should exercise some care in how we package this functionality into discrete modules.