Lets make a programming language

Series: Implementing Ligature


February 9th 2020 | ~7 min read

Recently I became intensely interested in how programming languages are implemented. While most people will never have the need to write a production ready programming language I'm finding writing a language a really fun exercise and that its given me a basis for understanding unexpected behavior or limitations of the existing languages I use every day.

A lot of the content around this topic assumes familiarity with lower level languages such as C/C++ and/or tools such as Bison, Yuacc, Antlr to name just a few. Coming from a strong javascript background I wanted to see if I could do the same thing but in a higher level language. The goal here is accessibility, not necessarily raw performance. That said I don't think using js to write the compiler precludes the possibility of bootstraping the creation of a lower level, more efficient language - especially if used in conjunction with LLVM but more on that latter.

Hello Ligature

To provide context we'll go through the process of implementing a toy language called Ligature. I could probably write a whole post just on what features my dream language would have but I'll try and hit a balance of simplicity and novelty. Ligature will be similar to Javascript but we'll try and avoid some of the messier parts of the language that often create bugs and tricky interview questions ;)

Besides the basics these are the two more ambitious / experimental features we're going to try to implement:

  • Cloud native: With the advent of 'infrastructure as code' it is now possible to declaratively express the necessary infrastructure required to deploy distributed systems. By involving a compiler in this process it can potentially be made more accessible to software devs unfamiliar with IaC and remove some of the manual effort required to provision cloud resources. Basically, given a program and some config I'd like to be able to 'compile' its infrastructure. Imagine if you could deploy a aws lambda function automatically by adding a @lambda annotation to it for example?
  • Built in Relational algebra - I'd like to add built in querying capabilities to ligature, similar to what you'd typically see in something like sql. Interfacing with databases is a common but can sometimes be clunky. Worst case this just becomes a language level ORM but I think there are some potential advantages. There have been a couple experiments at making sql dialects that behave more like sql but I'd like to go the other way and make a programming language that has features of sql. Imagine if you could write stored procedures in a functional programming language or query in memory data structures the same way you'd query a database.

By bringing these two usually separate functions together inside a single tool Im hoping that it could make it significantly simpler to write distributed data systems and is metaphorically the reason I decided to call the language Ligature.

The Theory

Trying to go from raw text to an executable program is non trivial so traditionally its broken down into smaller, more manageable steps. Note that I personally found the official academic distinctions became much more useful after working the problem a bit but its useful to have a mental model for where we are in the process. For our purposes we will break this process down into 4ish steps.

  1. Parsing:

    Using a tool called a parser generator (in this case peg.js) we define a grammar and from it generate a parser. The parser transforms the raw text string representing our program and turns it into a intermediate form (this is a recurring theme) called An AST. An AST or Abstract syntax tree is just like it sounds, a recursive data structure that will make our program easier to work with in future steps.

    Input: PEG Grammar + Ligature script

    Output: Ligature AST

  2. Transpilation:

    To me being able to quickly prototype ideas is essential, so next we'll discuss how to transpile our AST to a high level language like javascript because its easier than compiling all the way down to assembly. A transpiler is a portmanteau of transform and compiler. You may see the terms used interchangeably which is kinda confusing but here we're defining a transpiler as a program that takes code form one high level language and transforms it into the code of another. For anyone who's used Babel you've used a Transpiler before.

    Essentially we will take that recursive data structure representing our code (eg [['add', 2, 4]]) and return a string of javascript that does the equivalent operations, eg: 2+4.

    Input: Ligature AST

    Output: Javascript.

  3. Evaluation:

    now that we've derived valid js from Ligature code we'll take a quick detour an provide a simple way of running arbitrary Ligature code from the command line:
    $ ligature ./helloWorld.ls

  4. Compilation:

    For some projects stopping here should be totally fine but if you want to take it to the next level or just need lower level control over the performance characteristics of your language compilation is the next step. To make this step easier we'll use a tool called LLVM. LLVM website is retro to say the least but its a useful tool with a number of good resources online to describe how to use it. The primary reason we're using it is that it will alow us to run Optimized Ligature on multiple machines and has a couple of nice features such as supporting JIT compilers and can compile to web assembly which is cool.

    LLVM takes 'LLVM intermediate representation' and is capable of compiling our language so that its independently runnable on various platforms(mac, pc, web assembly etc)

    Input: Ligature AST

    Output: LLVM IR -> Assembly or Web assembly

The process

I was recently inspired by one of Allen Holub tweets:

Its a relatively simple observation about product development but its surprisingly easy to forget and one that I realized I've ignored before. Writing a language from scratch could easily suffer from the same problem - we implement all of the parser then the transpiler then give up after reading 5 different conflicting blog posts about llvm ir...

Instead we're going to implement each of the following features, in vertical slices down to the transpiler level as thats the quickest way to produce something that runs. Each of these features will constitute a separate article and I'll update the links below as I work though them. Not everyone need will want / need to compile down to assembly and it adds significant technical complexity so for some features I may treat the compilation as a separate optimization step:

Basics:

  • Intro to Peg.js
  • Assignment and basic math
  • Functions
  • Objects / Records
  • Type system

Next steps:

  • Partial evaluation of Operators
  • Immutability
  • Async abstractions (Algebraic effects??)
  • Relational algebra operators
  • Infrastructure annotations

Quality of life improvements:

  • Readable error messages
  • Assertions (testing)
  • Basic Module system
  • language server / syntax highlighting

If there are any features you'd like to see added to ligature leave a comment below!

This list is not final and may change.

Getting started

We'll be using Peg.js extensively for parsing so I'd recommend checking out my Intro to Peg.js if you've never used it before.

I'll be releasing the first article on ligature: Assignment and basic math shortly so stay tuned!

If enjoyed it let me know by applauding the article and follow me on youtube and twitter to stay updated on all my latest content.