|

Generically adding position information to a datatype

Posted on Thursday, 24 June 2010 at 13:06.

Every now and then datatype fixpoints come up, especially in the context of generic programming:

newtype Fix f = In { out :: f (Fix f) }

Most explanations of this datatype I have read or heard start with this definition and then proceed to explain it, using various examples. In today’s post I will also introduce you to this datatype, but I want to take a different approach: I will show you a problem to which the Fix datatype is the natural solution, deriving its definition along the way.

The Haskell code in this post does not use very advanced features: there are no type functions or even type classes, only datatypes and parameters. If you are familiar with datatypes, type parameters and their syntax, it should not be hard to follow. If you have any questions, feel free to post them!

The problem

Let’s take our trusty old friend the arithmetic expression datatype:

data BareExpr
  = Num Int
  | Add BareExpr BareExpr
  | Sub BareExpr BareExpr
  | Mul BareExpr BareExpr
  | Div BareExpr BareExpr

I’ve called it BareExpr here for a reason: we are going to change it in such a way that we can also store position information in it, resulting in type PosExpr, so that when a PosExpr is produced by a parser, we can trace back where in the original source code the tree nodes were. This is useful in various applications. For example, compilers that output error messages generally provide position information about where the error occurred exactly. It is also useful in tools that need to understand text selections in the source code, such as editors that feature refactoring.

Adding position information to a single datatype is not very difficult. After we have done so for BareExpr, we will look at the real problem: how to do this for any datatype.

Expressions with positions

There are several ways to add position information to a datatype. In our case we will couple every occurrence of BareExpr with a location. Let’s call the type of locations SrcSpan and the annotated version of the expression datatype PosExpr:

type PosExpr = (SrcSpan, PosExpr’)
data PosExpr'
  = Num Intr’
  | Add PosExpr PosExpr
  | Sub PosExpr PosExpr
  | Mul PosExpr PosExpr
  | Div PosExpr PosExpr

In a series of steps, we will reach our final solution.

Step 1: Capture the similarities between BareExpr and PosExpr

BareExpr and PosExpr' are very similar: they both contain five constructors, and each constructor has the same number of fields. Can we capture this structure somehow? Yes, we can: the two types only differ in the types of their recursive positions, and in a very regular way. We can do here what we would do in any similar case: make the parts that differ arguments, and then express the original entities in terms of this new, general entity by providing specific arguments.

Haskell allows us to do that with datatypes: simply introduce a new type argument r. We call the resulting type ExprF:

data ExprF r
  = Num Int
  | Add r r
  | Sub r r
  | Mul r r
  | Div r r

The F in ExprF stands for functor, and such a datatype is usually called a base functor. Base functors determine the shape of the top level of a tree, but the shape of their children is determined by the type argument.

Step 2: Express BareExpr in terms of ExprF

Now we need to recover BareExpr and PosExpr by expressing them in terms of ExprF. For BareExpr, we want the child positions of ExprF also to be bare expressions. This leads to an infinite type:

BareExpr ~ ExprF (ExprF (ExprF ...))

This says that to get bare expressions back, we want to take ExprF and have its children be ExprFs again, and those children’s children to be ExprFs again, and so on. In Haskell we can encode infinite types by introducing new datatypes (we reuse the name BareExpr here):

newtype BareExpr = BareExpr { runBareExpr :: ExprF BareExpr }

If you repeatedly expand this definition, you will see that it results in the infinite type above.

Step 3: Express PosExpr in terms of ExprF

For PosExpr we can think of a similar infinite type:

PosExpr ~ (SrcSpan, ExprF (SrcSpan, ExprF ...))

Again, we write this down using a new datatype:

newtype PosExpr = PosExpr  { runPosExpr  :: (SrcSpan, ExprF PosExpr) }

Step 4: Generalize BareExpr

Currently BareExpr works only for the ExprF shape. Let’s create such a ‘bare’ version for any shape instead of just ExprFs. We can do this by making the base functor an argument:

newtype BareExpr f = BareExpr { runBareExpr :: f (BareExpr f) }

On the right-hand side, we have replaced ExprF by the argument f. In step 2 we supplied the type we were defining as argument to ExprF; in this new version we do the same to f, but since this new version has a type argument, we need to supply this argument in the recursive position as well.

But… this datatype is no longer specific for arithmetic expressions, so the name BareExpr is not very appropriate. In fact, the type we have just defined is the famous Fix disguised under a different name!

newtype Fix f = In { out :: f (Fix f) }

So now you know what Fix does: it takes a base functor, such as ExprF, and recursively applies it to itself, creating a tree that is of the same shape at every level.

Our new definition of BareExpr doesn’t need to introduce any new datatypes but can now be a simple type synonym:

type BareExpr = Fix ExprF

Step 5: Generalize PosExpr

For PosExpr we can make two generalizations. The first is to not just store source locations, but allow any type of annotation:

newtype AnnExpr x = AnnExpr { runAnnExpr :: (x, ExprF (AnnExpr x)) }

The second is similar to the one we made to BareExpr: have it work for any base functor instead of just ExprFs:

newtype AnnFix x f = AnnFix { runAnnFix	:: (x, f (AnnFix x f)) }

To recover PosExpr, we give AnnFix the two appropriate type arguments:

type PosExpr = AnnFix SrcSpan ExprF

Step 6: Express AnnFix in terms of Fix

The Fix type captured the idea of take a functor and applying it to itself recursively. AnnFix does something similar. Can we perhaps express AnnFix in terms of Fix to make this explicit?

It turns out we can, if we introduce a helper datatype Ann:

data Ann x f a = Ann x (f a)
type AnnFix x f	= Fix (Ann x f)

Ann couples an annotation x with a functor value. It’s kind of a tuple type, lifted to a higher order on the right side.

Summing up

We have seen many (intermediate) definitions of datatypes, but in the end only two of them matter:

newtype Fix f     = In { out :: f (Fix f) }
data    Ann x f a = Ann x (f a)

And of course, we have our expression example expressed in terms of these two building blocks:

type BareExpr = Fix ExprF
type PosExpr  = Fix (Ann SrcSpan ExprF)

With just these two building blocks, we can express generically annotated trees and unannotated trees. What is the point of generalizing this far? Well, by making these types not specific to a particular tree shape (such as ExprF), you can build all sorts of tools that work on many kinds of trees. In my Masters thesis I explore this concept further, developing parser combinators that automatically insert the position information for you at the appropriate places, catamorphisms that automatically couple errors with position information, conversions between text selections and tree selections and a couple of other things.

Read more

If you’re interested in datatype fixpoints and would like to know more, here is a collection of interesting tutorials, applications and papers:

Comments

By Edward Kmett on Thursday, 24 June 2010 at 20:07:

AnnFix is the f-branching stream/cofree comonad (with the arguments flipped around).

I tend to like to express it with:

data f :> a = a :< f (f :> a)

it is available in Control.Comonad.Cofree in category-extras

or in Numeric.AD.Types from ‘ad’ as Stream.

You may find the following bits and pieces enlightening:

http://comonad.com/reader/2009/incremental-folds/ puts the (:>) type above to use, and calculates those annotations using f-algebras using smart constructors.

http://splonderzoek.blogspot.com/2010/03/final-push-ups-push-downs-and-passing.html is Leather, Loeh, and Jeuring’s take on the same thing. They, like you, and my code in category-extras, chose to separate the fixed point from Ann. They provide a similar incremental fold to the code in my blog, but with less type safety.

By Aycan iRiCAN on Saturday, 26 June 2010 at 01:12:

Very nice blog. I was reading Sebastiaan Visser’s msc thesis which also uses FixF for annotating recursive types. Thank you.

By Andrew on Tuesday, 04 January 2011 at 23:20:

Thanks for the blog post (and the WGP paper is also nice). Looks like that’s exactly what I need. Are you still supporting the Annotations library?

Leave a comment!

Martijn loves to receive comments! Add yours by filling out the fields below.