18 October 2019 - 10 minute read

So I recently got my parser for O to a feature-complete point and of course I'm really chuffed about it, but I hit all sorts of walls along the way and learned a tremendous amount, so I wanted to write about some of these issues I had and how I ended up putting this parser together.


At the start of the parser project, I wanted to try designing it from scratch. The design I landed on unaided by the internet was very similar to a recursive descent parser. The basic idea was that I specified a set of "forms" for each node type that followed a simple regex-like format, supporting optional or repeated components. This made the grammar very easy to write. The parser then started at the top and tried to build a File node by trying to build each form of a File and then keeping the longest one that succeeded. If none succeed, display a parser error.

The advantage this had was that it was very intuitive to write. The end goal is a File, so we start by looking at what we need to get to, and then work our way down, resolving the "dependencies" required to put it together.

The problem with this though was that it was agonisingly slow. With the old MiniO syntax which was honestly fairly simple, it would take nearly 10 seconds to parse even a very small test file, and because it was essentially doing a depth-first search of the entire syntax schema, it was running in factorial time complexity. In a real-world project with 10s of larger files, parsing would've taken 10s of minutes. This only gets worse when moving up to the dramatically more complicated O grammar which in its current form has 547 rules in it. It's then very obvious that this really isn't the right solution.


After a bit of research into parsing algorithms, I found the LR algorithm, so I had a go at implementing that.

What I created wasn't actually an LR parser, but it worked in a similar sort of way. It read in tokens from start to finish, and the idea was that it looked at every rule, then for each new token it removed rules that didn't match it until the last token that had a matching rule, at which point it would reduce that rule, then go back to the beginning and use the new node as the first thing to narrow down the rules again.

If you stop and think about this for a few minutes, you'll probably realise that this is doomed to fail. As soon as there is a rule that has a non-terminal (node that isn't a token) anywhere other than the start, it's not going to work. So this solution went in the bin too.


While doing more research, I came across this page that I found was actually quite a good explanation of how these parsers worked.

The key thing I found out at this point was that the bulk of the processing can be done external to runtime of the compiler. Because parsing only relies on the action table and list of rules, the action table can be generated externally. I later learned this is how most parsers of this kind are built anyway, but this project is all about learning how everything is done, so I'm trying to implement every part of it myself.

So my LALR generator took a tremendously long time to run. With earlier iterations of the grammar it would take nearly half an hour. That said, it works. I get an action table out, which the tool formats as D source code and puts in a file in my lib project. Then when I build the compiler, the whole action table is processed at build time, meaning execution time is much shorter.

However, this action table file is very big. My current one is 2.9MiB and around 50k lines long, but earlier on in the process it was exceeding double that. This doesn't sound like much of a problem - and for compiler performance it really isn't - but it is horrible for build time. The compiler takes about 5.5 minutes to build, and that's almost entirely a result of the action table. Without it, it would be building in a few seconds. This is still a problem with the finished parser as I haven't been able to find a reasonable way of reducing the build time without also increasing the execution time, but it's bearable - I'm just finding I'm continuously building and running, and editing code while it's building, so I'm fixing bugs from the run before last. It's a little weird, but you get used to it.

On the parsing end though, it's a tremendously simple algorithm as it's just a matter of having it remember where it's been and then just follow the trail through the action table until it either finishes or errors. However, working with it made me discover something rather annoying about my grammar: it's an ambiguous grammar, but more ambiguous than a lot of other grammars. The problem this caused is that when the LALR generator is building the action table, it's overwriting a lot of cells, meaning some perfectly valid constructions cause errors, because the action to get to them doesn't exist any more. More research would be needed.


A GLR parser does basically the same thing as an LR parser*, but the idea is that if it hits an ambiguity (somewhere where the action table would have multiple options), then rather than just take one route and hope it's the right one, a GLR parser will take all of them at once, breaking up into multiple parsing "threads". Most threads will reach a reject state, but rather than print the error and exit the compiler, they fail silently. It's only if the last remaining thread fails that the whole compiler does.

* Note that a LALR parser and LR parser use exactly the same parsing algorithm. They differ in how the action table is generated, where the LALR parser does some extra stuff which I don't know enough to properly explain.

In a "normal" GLR parser, threads are processed in lock-step, meaning every thread is moved forward by a single token before any threads move forward again. This was the way I originally tried to build the parser, but this became much more challenging that I had anticipated as there are lots of places where to advance by one token, several actions are required. For instance, if a thread reads in a token and that corresponds to a reduce action, then not only will there be the reduce action, there will also be a goto action after it before it's ready to read the next token. It's also a challenge to create new threads. When creating new threads, each new thread has to complete 1 action immediately as that's what gets them all over the point of ambiguity.

The solution I ended up on was arguably simpler: just run each thread entirely in turn. So the way it works is I start with a single thread on the first token and it starts going through. If it hits an ambiguity, it creates new threads for all but 1 of the actions and adds them to the thread queue, the current thread then takes the remaining action and carries on until it either fails or succeeds. If it fails, we move to the next thread in the queue and run that from its starting point. The process continues until we reach the first thread that succeeds and we take the output AST from that and move onto the next part of the compiler chain.

The way errors are done is we keep track of the furthest point any thread has reached. When a thread fails, we have a CompilerError object. We can compare its location with the current error, and if the new one is further through the file then we make the new one the one we're remembering. If the final thread fails, then this saved error is the one that is displayed as it is going to be the only important one.

It is this GLR parser that now powers the O compiler. It was a very interesting process and I learned a tremendous amount about various parsing algorithms and how they're implemented.

If someone reading this happens to be playing with parsers, I cannot recommend learning about and implementing the whole thing yourself. You're not going to get the most performant parser at the end of it, but you will learn much more about how they're put together and come out of it with a much better understanding of how your grammar works, which can make it much easier to think about the following compiler stages and how to implement them.


Olang - Programming


Copyright Oliver Ayre 2019. Site licensed under the GNU Affero General Public Licence version 3 (AGPLv3).