Billipede.net

"You just won't believe how vastly, hugely, mind-bogglingly big it is."

filed under:

2015-02-14 Let's Write a Parser

Parsers are super useful. Without them, we would all have to operate and program our computers by twiddling switches directly in machine language like Grace Hopper or Mel did in the Dark Ages.
If you're like me, though, you never really stopped to think much about parsing besides in some CS class in college. If you're even more like me, you never took a CS class in college, and so you'd never really thought about it at all until one day you decided to start thinking about it, and then reading about it, and then writing some code in Javascript to get it all straight in your head and see if it works.
Ultimately, I hope to use what I learn here to build a clean parser for another project I'm working on (which should be in a publicly-sharable state Real Soon Now), but as a learning exercise I decided to tackle parsing fully parenthesized expressions. From the number of course websites you find when doing a search for that phrase, this is a very popular homework assignment in compiler classes like the ones I never took.
Fully parenthesized expressions are essentially regular math notation but without operator precedence (usually called "order of operations" when high school math teachers talk about it) because there are parentheses everywhere to distinguish what should be computed first. This, for example, is not a fully parenthesized expression:
2 + 4 * 10 / 2
Because you probably went to high school, you know that you should do the multiplication, then the division, and then do the addition last. But in a fully parenthesized expression, you have to do all that grouping explicitly with parentheses:
(2 + ((4 * 10) / 2))
Actually, I decided to use square brackets in mine, and not just because I like to be contrary. It is a significant and tragic fact that parentheses and curly braces are the main pairs of grouping symbols used in programming notation while, at the same time, square brackets are the only grouping operators that are accessible without having to use the shift key on a standard (US and others) keyboard layout. There's also the fact that "brackets" is easier to type and spell than "parentheses." Therefore, in my "fully bracketed expression" syntax, the same expression would look like this:
[2 + [[4 * 10] / 2]]
Great, we've decided what the syntax is that we're going to parse! Now, we probably just need to write it up in code and we'll be all set. From here on out, it's just a small matter of programming, right? Actually, no, we still have to do basically all the actual work of defining the grammar since, while you and I could read my description above and completely understand how to interpret [2 + [2 + 2]] or [100 * [4 / 1.927]], it's not actually specific enough to satisfy a computer.
Instead, we'll need to come up with a "formal grammar", which sounds intimidating, and kind of is, but is really not that hard to figure out. There's something called Backus-Naur Form which also sounds intimidating, and also kind of is, but similarly is not very hard to grok if you just focus for a second (i.e. put down your phone, close your mail client and crack a beer, pour a coffee, or brew some tea).
There's this funny operator :: that you can think of as a symbol that is used to specify the rules of the grammar. It could just as well be an equals sign, but I guess Backus and Naur liked to be contrary, which is something I can respect. I like to think of :: as just meaning "is made up of."
There are some other operators too, but if you're relatively programming- and computer-literate, the meanings of those shouldn't really be surprising. | means "or" for example, * means just what it usually does in grep or sed, and depending on the variation you're working with, parentheses can be used for grouping and brackets (these handsome fellas: []) used to indicate optional parts of an expression. Anyway, it doesn't really matter to get too bound up with syntax for the grammar "productions" (this is the fancy CS term that is used instead of "rules"), since nothing will be parsing this right now except your own brain, so we can get a little sloppy with syntax if it helps us.
Let's start with some simple statements like "Whitespace is made up of a run of one or more spaces, newlines, tabs, carriage returns, or line feeds," and "A digit is made up of a single 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9 character." This is how Backus and Naur would write that:
DIGIT ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" WHITESPACE ::= (" " | "\n" | "\t" | "\r" | "\f" )?
From there we build something called a DIGITSTRING as one or more digits, and finally define a number to be either one of those or two of them separated by a period. This allows for both integers and decimal numbers. We also define an operator to be any of the 4 arithmetic operations:
DIGITSTRING ::= DIGIT? NUMBER ::= DIGITSTRING | ( DIGITSTRING "." DIGITSTRING ) OPERATOR ::= "+" | "-" | "*" | "/"
So far, there's nothing tricky, really. Each of the productions matches a specific kind of thing and the more complex ones are made up of Now we get to the tricky, recursive bit. I should mention that the kind of parse we're building here is called a "recursive descent" parser. The "descent" part will make more sense later, but we need to talk about the "recursive" part now that we're looking at the last two productions in our grammar:
ATOM ::= EXPRESSION | NUMBER EXPRESSION ::= [WHITESPACE] "[" [WHITESPACE] ATOM [WHITESPACE] OPERATOR [WHITESPACE] ATOM [WHITESPACE] "]" [WHITESPACE]
These are relatively straightforward too, not much different from the other rules we've seen so far, but with a little twist. We're saying an atom is an expression or a number, and an expression is made up of an opening bracket, an atom, and operator, and a closing bracket, with optional whitespace in-between all of them. Notice that EXPRESSION appears in the definition of ATOM and vice-versa, so this is a recursive definition, which is unavoidable since the expressions we're trying to parse are recursive. Note that this doesn't create an infinite loop because we can eventually resolve each ATOM to be a number.
Now we actually have basically finished all of the necessary intellectual work. The work of actually creating a functional parser from here on out actually is just a small matter of programming. In fact, there is a class of software called "parser generators" or "compiler compilers" that just take in BNF statements like these and spit out the code for an actual parser. YACC is historical Unix-y one that outputs C code, but there are similar packages for most languages and environments.
That's cheating, though, so let's actually do the leg-work of writing it for ourselves. A "recursive descent" structure makes this pretty easy, since all we have to do is write one function for each of the rules we already have.
For example, the DIGIT rule becomes this relatively simple little function:
function parseDigit(input) { if( input[0] = "0" || input[0] = "1" || input[0] = "2" || input[0] = "3" || input[0] = "4" || input[0] = "5" || input[0] = "6" || input[0] = "7" || input[0] = "8" || input[0] = "9" ) { return input.slice(0, 1); } else { return false; } }
You can see that this just returns the character it recognizes as a digit, and returns false otherwise. This isn't especially useful, however, since the rest of the input string kind of just disappears. In order to give ourselves some more useful semantics for dealing with the input string, let's define a kind of utility object we'll call a charStream:
function charStream(str) { this.chars = str.split(""); } charStream.prototype.peek = function () { if(this.chars.length > 0) { return this.chars[0]; } else { return null; } }; charStream.prototype.eat = function() { return this.chars.shift(); };
Now, instead of dealing with strings directly, we have an object to keep some state for us and provide the nice methods "peek" and "eat", which allow us to look at the next character in the input, or consume the next character in the input, respectively. Now parseDigit() looks like this:
function parseDigit(inpt) { if(["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"].indexOf(inpt.peek()) != -1) { return inpt.eat(); } return false; }
There, that's a lot better. Now that we have parseDigit(), let's do parseDigitString():
function parseDigitString(inpt) { var out = []; var next; while (next = parseDigit(inpt)) { out.push(next); } if (out.length > 0) { return out.join(""); } else { return false; } }
Here we just collect digits from the input as long as there are digits to be had. After that, we check to see if we actually caught any, and if so return that. If not, we return false, indicating that there's no DIGITSTRING here.
Looking back at the grammar, I see now that we have everything we need to write parseNumber():
function parseNumber(inpt) { var whole_part, fractional_part; if (!(whole_part = parseDigitString(inpt))) { return false; } if(inpt.peek() ! ".") { return Number(whole_part); } inpt.eat(); //eat the decimal place so we can parse for the rest of the digits if(!(fractional_part = parseDigitString(inpt))) { return false; } return Number(whole_part + "." + fractional_part); }
Now, instead of just looking for single characters or strings of like characters, we look for a specific format of input. Namely, we need either a run of digits or two runs of digits with a period character in-between. We do this by trying to parse out each part individually and failing if any necessary pieces are missing. Finally, we use the Javascript Number() function to turn the strings into an actual Javascript number.
So now we should be able to parse numbers! Let's do some quick checks to make sure things are working like we expect them to:
> console.log(parseNumber(new charStream("100"))); 100 > console.log(parseNumber(new charStream("3.4"))); 3.4 > console.log(parseNumber(new charStream("2..2"))); false > console.log(parseNumber(new charStream(".1"))); false > console.log(parseNumber(new charStream("this is not a number"))); false
Everything looks good. We get numbers out when we put numbers in, and false out when we put in nonsense. The next logical step would be to write parseAtom():
function parseAtom(inpt) { var out; if (out = parseExpression(inpt)) { return out; } else if(out = parseNumber(inpt)) { return out; } return false; }
We can't actually test this yet, since it depends on parseExpression() and we haven't written that yet (notice that we'd be in the same situation even if we had written parseExpression() first, since they both depend on one another). So let's write it!
You'll notice from the grammar that we need also need to parse operators and whitespace in order to write parseExpression(). parseOperator() is easy enough:
function parseOperator(inpt) { if(inpt.peek() "+" || inpt.peek() "-" || inpt.peek() "*" || inpt.peek() "/") { return inpt.eat(); } return false; }
As for whitespace, a close look at the grammar reveals that it only contains optional whitespace; it's never required. I'm therefore going to preemptively make parseExpression() a lot less cluttered by extending charStream with this helpful method:
charStream.prototype.eatWhitespace = function() { while(this.peek() " " || this.peek() "\n" || this.peek() "\t" || this.peek() "\s") { this.eat(); } }
As you can see, this just advances the input tape to the next non-whitespace character. Now we can write parseExpression():
function parseExpression(inpt) { var out = {}; inpt.eatWhitespace(); if(inpt.peek() ! "[") { return false; } inpt.eat();//the brackets need to be there, but we don't actually need to keep them around for anything inpt.eatWhitespace(); if(!(out.operand1 = parseAtom(inpt))) { return false; } inpt.eatWhitespace(); if(!(out.operator = parseOperator(inpt))) { return false; } inpt.eatWhitespace(); if(!(out.operand2 = parseAtom(inpt))) { return false; } inpt.eatWhitespace(); if(inpt.peek() ! "]") { return false; } inpt.eat(); inpt.eatWhitespace(); return out; }
You can see that it looks a lot like parseNumber but slightly more complex. Since I want to output JSON objects at the end of all of this parsing, we initialize an empty object, and then check to see if we can parse each part of the expression in turn, eating whitespace in-between. We fail if we can't parse them. Otherwise, we stuff them into the appropriate object property. We return the out object at the end if everything's gone right. Let's try it out and see what we can make of the expression we so laboriously converted to fully-bracketed notation earlier:
> console.log( JSON.stringify( parseExpression( new charStream("[2 + [[4 * 10] / 2]]")), null, " ")); { "operand1": 2, "operator": "+", "operand2": { "operand1": { "operand1": 4, "operator": "*", "operand2": 10 }, "operator": "/", "operand2": 2 } }
It works!
Now, while this helps to show the basic shape of a recursive descent parser, it isn't really a very practical thing to use in production since it doesn't provide very good error messages or really any kind of exception handling at all. In a real-world parser, you would probably want the operand and operator properties of these objects themselves be objects instead of strings and numbers, and of course there's the small matter of actually operating on the resulting parse tree to do something useful with its output.
All of these are left as an exercise for the reader, or maybe me at some point, but not today.