Saturday, August 20, 2011

What is the meaning of source code?

Today I was thinking about literals in source code and what their meaning is. I ended up in a conversation with @msimoni where we were talking past each other, so I figured it was better to write it down somewhere.


Let's start with a language based on the untyped lambda calculus; the syntax is pretty simple:


    e = x | λx.e | e e
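
As a concrete reading of that grammar, it maps directly onto a small algebraic data type. I'll use Haskell for the sketches in this post, with constructor names chosen just for illustration:

    -- A minimal AST for the calculus above: a variable, an abstraction,
    -- or an application. Nothing else exists in the core language.
    data Expr
      = Var String        -- x
      | Lam String Expr   -- λx.e
      | App Expr Expr     -- e e
      deriving (Show, Eq)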


It's not the most pleasant syntax to program in (although it is many times better than the Turing machine). So we decide to add some sugar on top of it: data types, special forms, literals, etc. Hopefully we define these in terms of the calculus primitives so the language does not become harder to reason about. Essentially the meaning of the additional syntax is its translation to the calculus (the sketch below makes this concrete). Now we have two choices: either we add this sugar as special syntax or we bake some form of syntax extensibility into the compiler/interpreter/front-end. Before we go down this road I'll digress a bit and talk about source code.
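Here is a minimal sketch of that "meaning is the translation" point, reusing the Expr type above. The let-form is a hypothetical piece of sugar whose whole meaning is its expansion into the core constructors:

    -- Sugared syntax: either a core term, or a let-binding we treat as sugar.
    data Sugared
      = Core Expr
      | Let String Sugared Sugared   -- let x = e1 in e2

    -- The meaning of the sugar is just its translation into the calculus:
    -- let x = e1 in e2  ≡  (λx.e2) e1
    desugar :: Sugared -> Expr
    desugar (Core e)      = e
    desugar (Let x e1 e2) = App (Lam x (desugar e2)) (desugar e1)
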

Most languages use source code as a way to express programs and as an input to compilers/interpreters, but when we start viewing it as one representation of the program we need to define what a program is. Eventually we'll use some syntax (usually the same) to denote the program's meaning, but we need to think of the program as an object we can manipulate instead of text to be fed to the compiler. If we're going to store and render the program we don't need to use the same form for both needs. Let's talk about a concrete issue: how do we store and represent numbers?

If we encode naturals in our language using Church encoding we can easily render 2 to the programmer and accept it as valid input, while we store the program using the complete encoding (i.e. λf.λx.f (f x)). We can even represent λx.2+x while internally keeping it as the expanded form (left as an exercise to the reader :). Of course this makes the storage less efficient than it could be, which leads us to the first observation: literals are just compression mechanisms. They work as compression both for the compiler/interpreter and for us (as it's simpler to read/write 2 than the equivalent encoding). If we store only the encoded number, though, we also need to keep something to remind us to compress it back to 2 when rendering it. We could rely on pattern matching and equational reasoning, but we'd find out that many compression schemes work with the same encoding (e.g. the user would see either the number 2 or part of a data structure).
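A minimal sketch of that idea, again reusing the Expr type above (encodeNat/decodeNat are names I'm making up here): the stored form is the full Church encoding, and rendering is an attempt to pattern-match that shape back into a numeral.

    -- Store the expanded Church encoding: encodeNat 2 ≡ λf.λx.f (f x).
    encodeNat :: Int -> Expr
    encodeNat n = Lam "f" (Lam "x" (iterate (App (Var "f")) (Var "x") !! n))

    -- Rendering side: recognise the shape λf.λx.f (f ... x) and "compress"
    -- it back to a number. Other encodings that happen to share this shape
    -- would render as a number too, which is exactly the ambiguity above.
    decodeNat :: Expr -> Maybe Int
    decodeNat (Lam f (Lam x body)) = go body
      where
        go (Var v)            | v == x = Just 0
        go (App (Var g) rest) | g == f = succ <$> go rest
        go _                           = Nothing
    decodeNat _ = Nothing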


Viewing it in this light we can go back to literals with a new perspective: our language can have a limited set of literals with built-in encoders, or a simpler literal form and an extensible encoding mechanism. Most languages go the first way, but library programmers then end up using strings plus some parsing, or EDSLs. If we go the second way we end up with LISP-like macros for literals (a rough sketch follows below). Even if the language is going to have a fixed set of literal forms, it's better to have almost no intelligence in them than to add support for many specific concerns. For example, it's better to have plain string literals without any embedded escaping mechanism than to have escaping for special characters (which are only special because of the escaping concerns) and mechanisms to control terminals or do layout (e.g. \n and friends).
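
As a rough sketch of what the second (extensible) route might look like, assuming the definitions above and with all names invented for illustration: the surface syntax carries a tag plus raw text, and programmers register encoders from that text into the core calculus.

    import qualified Data.Map as Map
    import Text.Read (readMaybe)

    -- An encoder turns the raw text of a literal into a core term.
    type Encoder = String -> Maybe Expr

    -- A user-extensible table of literal encoders, keyed by a tag.
    encoders :: Map.Map String Encoder
    encoders = Map.fromList
      [ ("nat", \s -> encodeNat <$> readMaybe s)   -- e.g. nat"2" → λf.λx.f (f x)
      ]

    -- Expanding a literal is just a lookup plus the registered translation.
    expandLiteral :: String -> String -> Maybe Expr
    expandLiteral tag raw = Map.lookup tag encoders >>= ($ raw)

In a scheme like this a plain string encoder would just wrap the raw characters as data; any escaping or layout concern would live in its own encoder rather than in the literal syntax itself.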