A minimal implementation of Cons
Seeing the complexity of the Clojure interface/data-structure ecosystem as we did in the last post can be a bit daunting. But if we start gently we can tease out some of the basic interactions and techniques that underlie how the real Clojure versions of these data structures are implemented.
Despite perhaps putting the scare into you with the diagram of Cons in the last post, a simple cons cell is the place to start. A cons cell is just a node for singly-linked list – a field containing an item and a field pointing to the rest of the list, perhaps the simplest notion of sequence we can come up with. However, as we will see, we need more than just this node type to meet the needs of the Clojure interfaces.
The interfaces
To simplify our task, we will implement only the trio of interfaces that sits at the heart of Clojure’s sequence paradigm – ISeq, IPersistentCollection, and Seqable.
If you need a refresher on how Clojure works with sequences, you can take a gander at Sequences. We are about to expose what lies underneath, what a data structure must implement in order to participate in actions such as interating through a sequence.
Let’s start with Seqable:
type [<AllowNullLiteral>] Seqable =
abstract seq : unit -> ISeq
A Seqable is anything that can produce an ISeq to iterate over its contents. Calling seq on a Sequable will hand you an ISeq, an object that represents a sequence, suitable for iteration and other sequence-y things. For some ojbects, seq might return the object itself: its class implements ISeq. For other objects, calling its seq() will produce a different object to handle the iteration. Thus, something could be Seqable without itself being an ISeq or an IPersistentCollection.
Side note: If you think
Sequable.seqsounds like theseqfunction in Clojure, well, yes. But the Clojureseqcan be called on many types of objects, first and foremost those that implementSequable. However,Stringdoes not implementSeqablebut nevertheless evaluate(seq "abcd")and you will get back a sequence comprising the characters in the string. Clojure’sseqis implemented as a call to a runtime library function (clojure.lang.RT.seqif you must know).RT.seqfirst checks if its argument implementsSeqable, in which case it calls itsSeqable.seqmethod. Otherwise, it runs through a bunch of special cases – strings, arrays,IEnumerable(or the Java equivalent) – and creates appropriateISeqobjects for them. (Real inside baseball: this is where protocols could come into play. Shhhh.)
The IPersistentCollection interface is straightforward. An object could be an IPersistentCollection without being an ISeq, hence these interfaces cannot be merged. But an IPersistentCollection must be Seqable; its essence as a collection is that you can iterate over it. It’s just that the iterator (its seq) may be different kind of object.
and [<AllowNullLiteral>] IPersistentCollection =
inherit Seqable
abstract count : unit -> int
abstract cons: obj -> IPersistentCollection
abstract empty: unit -> IPersistentCollection
abstract equiv: obj -> bool
A quick look at the methods in IPersistentCollection:
count()returns a count of items in the collection – for anISeq, that would be the number of items in the sequence from that point on.empty()yields an empty collection of the appropriate type. (You’d have to read comments in the Clojure code to get this.) A hash map would return an empty hash map, for example. In what follows, because aConscannot be empty –it has an element – some other type will have to be returned.x.cons(o)returns a newISeqwhich has the itemofirst, followed by the items inx. It is up to each type to figure out how to implement this. In actual Clojure ,PersistentListreturns a newPersistentList.EmptyListdoes also. Other types might use aCons.equiv(o)is used for equality checking on collections. Each collection defines its own. (Discussingequivand equality in Clojure will require a separate post.)
The ISeq interface captures the essence of iteration across a sequence.
and [<AllowNullLiteral>] ISeq =
inherit IPersistentCollection
abstract first : unit -> obj
abstract next : unit -> ISeq
abstract more : unit -> ISeq
abstract cons : obj -> ISeq
The difference between next and more is subtle. The best place to learn a little more (no pun) is here but you’ll have to read carefully. Originlly, Clojure had just first and rest. When laziness was added into the picture, a redesign was needed.
Essentially, more gives you an object that represents the rest of sequence, without actually triggering a computation to compute the next element in the sequence. next goes to the trouble of figuring out if there is more to the sequence or not (so to speak) – it will determine if there is a next element or if we have reached the end of the sequence. more is provided to avoid a potentially significant computation that might not be necessary. If you are doing something like iterating through a list, for example, you would use next – you know you are going to use the next element.
From Sequences we have:
The Seq interface
( first coll)
Returns the first item in the collection. Calls seq on its argument. If coll is nil, returns nil.
( rest coll)
Returns a sequence of the items after the first. Calls seq on its argument. If there are no more items, returns a logical sequence for which seq returns nil.
The doc string for the next function in Clojure states:
Returns a seq of the items after the first. Calls seq on its argument. If there are no more items, returns nil.
The rest function in Clojure is essentially ISeq.more. Note that it does not return nil. Thus, we need something to represent an empty list that is not nil. A cons cell is never empty. Thus, our minimum is a cons cell and ‘a logical sequence for which seq returns nil’ – an EmptyList.
You should note that the null value in Clojure –
nil– is not an empty list. And the empty list –()– is notnil.
(nil? ()) ; -> false
(nil? nil) ; -> true
(seq? ()) ; -> true
(seq? nil) ; -> false
(list? ()) ; -> true
(list? nil) ; -> false
Getting started
Rather than just jump directly into implementing SimpleCons and SimpleEmptyList, I’ll complicate things a bit and introduce some helper functions. We could do without these for our simple example, but this mimics how it is actually done in the Clojure code and also will set us up to implement more complicated data structures down the line. I put these helpers into a module named, of course, Util.
Equality first.
let checkEquals o1 o2 = obj.ReferenceEquals(o1,o2) || not (isNull o1) && o1.Equals(o2)
let rec seqEquals (s1:ISeq) (s2:ISeq) =
match s1, s2 with
| null, null -> true
| null, _ -> false
| _, null -> false
| _ -> checkEquals (s1.first()) (s2.first()) && seqEquals (s1.next()) (s2.next())
checkEquals just checks if two objects are equal. As written, it short-circuits through Object.ReferenceEquals to both handle null values properly and avoid unnecessary calls to Equals.
seqEquals is prototypical example of iteration through a sequence using first and next. In this case, it happens to iterate through two sequences simultaneously. The recursion stops when then end of one of the sequences is reached. (And if they are not both at the end, of course they cannot be equal.)
There is a notion of equiv (equivalent) that is separate from Equals, mostly in how numeric values are handled. We ignore that here, but reserve a place now for elaboration in a later post.
let seqEquiv s1 s2 = seqEquals s1 s2
While we are on basics, we need to make hash codes consistent with equality for sequences.
let getHashCode (s:ISeq) =
let combine hc x = 31*hc + if isNull x then 0 else x.GetHashCode()
let rec step (s:ISeq) hc =
if isNull s then hc
else step (s.next()) (combine hc (s.first()))
step s 1
This walks the sequence and combines hash codes for each element. Finally, we need a standard way to compute the count of a sequence. (This is a slight simplification of the code in RT.)
let seqCount (s:ISeq) =
let rec step (s:ISeq) cnt = if isNull s then cnt else step (s.next()) (cnt+1)
step s 0
Finally, let’s provide a little help for printing sequences. (Feel free to skip this on the first pass.)
let rec seqToString (s: ISeq) =
let sb = new StringBuilder()
let rec appendItem (o: obj) =
match o with
| :? Seqable as s -> appendSeq (s.seq ())
| _ -> sb.Append(o.ToString()) |> ignore
and appendSeq (s: ISeq) =
match s with
| null -> ()
| _ ->
appendItem (s.first ())
sb.Append(" ") |> ignore
appendSeq (s.next ())
sb.Append("(") |> ignore
appendSeq s
sb.Append(")") |> ignore
sb.ToString()
Again, a sequence iteration implemented by recursion. Please note this emphatically is not how printing is handled for in the Clojure code; this will serve as a placeholder for now.
The empty sequence
And now we can write our two collection classes. These are mutually recursive and so are joined by and in the actual code. Let’s start with the simpler ‘empty sequence’.
type SimpleEmptySeq() =
override _.GetHashCode() = 1
override _.ToString() = "()"
override _.Equals(o) =
match o with
| :? Seqable as s -> s.seq() |> isNull
| _ -> false
The basic overrides are simple: Fixed hashcode value; ToString as required by Clojure. (That () should look familiar.) For equality, we can only be equal to an empty sequence. As Clojure is set up, an empty sequence is something that returns nil (= null in .NET) when seq is called on it. That means we need something that seq can be called on: a Seqable.
Yes, in Clojure, when you call
seqon an empty sequence, you get backnil:
(seq ()) ; -> nil
You can read all kinds of dicussions on Clojure groups about why one tests (seq s) for various conditions – keep in mind that nil counts as false in boolean conditions.
On to the sequence goodies. The defining characteristic of an empty sequence object is that calling seq on it returns null, as we just saw.
interface Seqable with
member _.seq() = null
Next, an empty list is a very simple sort of collection.
interface IPersistentCollection with
member _.count() = 0
member this.cons(o) = upcast (this:>ISeq).cons(o)
member this.empty() = upcast this
member this.equiv(o) = this.Equals(o)
The object is its own empty. Obviously the count is zero. equiv equates to Equals here.
For cons, the definition is a bit odd. We have two different conses floating around: IPersistentCollection.cons and ISeq.cons. They have distinct return types. Because an ISeq is an IPersistentCollection, you will often see what have done here: implement the IPersistentCollection version in terms of the ISeq version, with an upcast.
Finally,
interface ISeq with
member _.first() = null
member _.next() = null
member this.more() = upcast this
member this.cons(o) = upcast SimpleCons(o,this)
The values for first and next are by definition. We need to return an emtpy sequence–not null– for more; this object itself is such a thing. Finally, with cons we see our entanglement with SimpleCons.
A simple cons cell
And so on to SimpleCons:
and type SimpleCons(head: obj, tail: ISeq ) =
The main constructor is all we need, along with two (immutable) fields to hold the values. For the Object overrides, we can defer mostly to our utility functions:
override this.GetHashCode() = Util.getHashCode this
override this.ToString() = Util.seqToString this
override this.Equals(o) =
match o with
| :? Seqable as s -> Util.seqEquals (this:>ISeq) (s.seq())
| _ -> false
We will compare affirmatively only to other sequences, and they must have equal elements in order.
interface Seqable with
member this.seq() = upcast this
Because SimpleCons implements ISeq, the object can be its own seq.
interface IPersistentCollection with
member _.count() = 1 + Util.seqCount tail
member this.cons(o) = upcast (this:>ISeq).cons(o)
member _.empty() = upcast SimpleEmptySeq()
member this.equiv(o) =
match o with
| :? Seqable as s -> Util.seqEquiv (this:>ISeq) (s.seq())
| _ -> false
As a collection, SimpleCons has its own element plus the elements in its tail; thus count.
For cons we defer to ISeq.cons as mentioned above.
And empty gets us the reverse entanglement with SimpleEmptySeq.
Finally,
interface ISeq with
member _.first() = head
member this.next() = (this:>ISeq).more().seq()
member _.more() = if isNull tail then upcast SimpleEmptySeq() else tail
member this.cons(o) = upcast SimpleCons(o,this)
I hope it is not a surprise that to cons something onto the front of ourselves, we use a SimpleCons.
The dance here between more and next is common.
more cannot return null, so it cannot just return the tail. If the tail is null, we must return ‘a logical sequence for which seq returns nil’: our SimpleEmptySeq. That test for emptiness here is very specific to SimpleCons: a null tail.
next is often defined in terms of more in the manner shown here. If more returns an empty sequence, the seq on it will return null, which is what next’s contract says. And our tail is an ISeq, which is an IPersistentCollection, which is an ISeq, so the call to seq() on the more will always be safe.
And we are done.
Source code for these examples are available at ClojureCLR-Next repo.