Learning Go: Lexing and Parsing

Aug 24, 2016   #batch2  #golang  #parsers 

If you make up your own data format, you’re going to have to figure out how to parse it. I expanded last week’s warm-up problem to include accepting the input as a file. I’ll show you the format I came up with and then the code I wrote to parse it.

The exciting thing about my solution to the warm-up problem was that it was the first one I’d written from scratch. After I wrote that, I went back to following a tutorial, but I fumbled a lot less and learned a lot more, due to my greater understanding of Go syntax. It didn’t take long to solve the warm-up problem, but it took me the rest of the afternoon to write this lexer and parser. A large part of that time was spent figuring out how best to organize my Go code such that the tools would be maximally helpful.

Expanding The Problem

The original problem dealt with scaling the ingredients list for a recipe. While my Recipe type happens to have a nice syntax for writing literals, I want to be able to write my ingredients lists in text files in a format close to that of a real recipe. Before I write the parser, I need to define the format of my input files.

Defining the format precisely is key to making the implementation easy. My recipe format is pretty simple. One ingredient per line, in the form: <Number> <Unit> <Ingredient Name>.

Example File

4 t   sugar
5 T   brown sugar
1 q   heavy whipping cream
1 c   nuts
30T   butter

My formatting in this example is inconsistent in the sense that sometimes there’s whitespace between the number and the unit, and sometimes there isn’t. I’m also taking advantage of allowing any amount of tabs/spaces between words by making my ingredient names all start at the same column.

Whitespace

There can be whitespace between the number and the unit, but there doesn’t have to be (e.g. 3c and 3 c are both allowed). There has to be whitespace between the unit and the ingredient name. There can be whitespace within the ingredient name (e.g. “brown sugar”). Any whitespace before the number or after the last word of the ingredient name is ignored.

There are two kinds of whitespace for the purpose of this format: intraline whitespace (tabs and spaces) and interline whitespace (newlines). Any amount and mixture of intraline whitespace is allowed wherever the format allows “whitespace”. Interline whitespace is only allowed at the end of an ingredient name. No empty lines are allowed (except for a single empty line caused by a newline after the last ingredient in the list).

Numbers

The number can be an integer or a decimal. A number is made of digits and optionally a period (.) and more digits. If you include a period after a series of digits, you must have more digits after the period (e.g. 2. c is illegal).
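Just to pin that rule down, here's a small standalone check of the number format written in Go. This function isn't part of my parser – the name isValidQuantity is made up purely for illustration:

// isValidQuantity reports whether s is one or more digits, optionally
// followed by a period and one or more digits ("30" and "2.5" are valid, "2." is not).
func isValidQuantity(s string) bool {
	i := 0
	for i < len(s) && s[i] >= '0' && s[i] <= '9' {
		i++
	}
	if i == 0 {
		return false // must start with at least one digit
	}
	if i == len(s) {
		return true // plain integer
	}
	if s[i] != '.' {
		return false
	}
	i++ // skip the period
	start := i
	for i < len(s) && s[i] >= '0' && s[i] <= '9' {
		i++
	}
	return i > start && i == len(s) // digits after the period, and nothing else
}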

I am ignoring fractions entirely because they’re not part of the original warmup problem and because it’s easier to just deal in decimals when programming. However, I do prefer fractions when using actual recipes, so maybe I’ll play with that some other time.

Units

The abbreviations for the supported units are keywords in this ingredient list mini-language.

  • t for teaspoon
  • T for tablespoon
  • c for cup
  • q for quart

The supported units and their abbreviations are taken from the warm-up problem spec.

Ingredient Names

One of the reasons the ingredient name comes last is that it can have multiple parts – multiple words separated by whitespace. The parser knows that the ingredient name is over when it encounters a new line. Ingredients can only contain letters and intraline whitespace, no fancy punctuation marks or unicode characters.

Designing My Parser

I went with a standard 2-stage process: a lexer and a parser. A lexer takes in the stream of characters from the file and produces a stream of tokens (which we’ll define). The parser takes the stream of tokens and produces a Recipe, which can then be used by my recipe scaling code.

Data Flow

This seems a bit like overkill for such a simple format – this is the same structure that “real” parsers for programming languages use. It’s true that I could have written one file/function that goes straight from a stream of characters to a Recipe, but that would be messier. If I wanted to add more features in the future, this design choice makes it easier.

The lexer contains all the concerns about whitespace and characters. It absorbs each stretch of spaces into a single “whitespace” token. It decides whether something is a word (part of an ingredient name) or a keyword (a unit). The lexer only cares about the structure at a very low level:

  • only alphanumeric characters and whitespace should appear
  • t, T, c, and q are keywords when used alone (“quite ripe tomatoes” is a legal ingredient, no keywords involved)
  • words are separated by whitespace and don’t include digits

The parser has more high level concerns. It thinks in terms of tokens – a number, not a digit; a word, not a character. A parser thinks about things like:

  • each line should have a number, then a unit, then an ingredient name
  • ignore whitespace, except for newlines

This arrangement is easier to debug because each part is separately testable. The lexer can be given some sample text and have its output checked. The parser can be given a sample list of tokens and have its output checked. (I’m not that careful – I tested the lexer independently and then tested the parser on text – assuming that the lexer works. Writing lists of tokens is much more annoying than writing sample text.)

Organizing The Code

The code is split up along similar lines to the data flow diagram, but the call stack will be a bit different. The main function calls the parser with an open file (asking for a Recipe); the parser calls the lexer each time it wants another token; the lexer reads characters from the file until it finds a token. After the main function gets its Recipe, it can do whatever it wants, probably using the little recipe scaling library from the previous post.

Code Organization

This is also a fairly standard way to split up code. Each box on my diagram is a separate file. In a larger project, each color would be a different package. Because this problem is so small, I’m going to just have the main package (for the main function) and the tasty package (for the lexer, parser, and library code).

Arranging The Files

All of this code lives inside my rose-warmups repo. There are five files involved: a main file (executable main function), a lexer, a parser, a library file (for scaling the ingredients), and a file for shared types (e.g. Recipe). From the start, I knew I wanted all of this code to live together in a tasty folder – having each problem’s solution in one folder lets me keep all these small projects together without having their unrelated solutions get jumbled together.

Unfortunately, my first attempt at this did not work. Putting all of the files in a directory rose-warmups/tasty doesn’t work because the main file has to be in the main package, but I want my parser and library code to be in a tasty package. Go doesn’t allow multiple packages to live in the same folder.

After glancing at some tutorials online, my second attempt was to make a separate folder – rose-warmups/cmd – and put the main file in there. This compiles, since the tasty package is in the tasty folder and the main package is in the cmd folder. I figured I’d just keep adding executables to the cmd folder and give each supporting library (1 per problem) its own folder.

This idea continued until I had an executable that I actually built (go install rose-warmups/cmd). The executable ended up being named cmd, which is not what I wanted. The file was named tasty.go, which I hoped would result in a tasty executable. This threw into doubt the entire strategy of keeping the main function files in a single directory.

Through more confused tutorial skimming, I discovered that you can have subdirectories inside a folder that contains a package. The package names declared in the source files (e.g. package tasty) don’t even have to be related. (I expected to have to have a package tasty and then a package tasty/foo.) This allowed me to go back to my original plan of having one folder for the entire solution.

The folder for the whole solution is rose-warmups/tasty, and it has subfolder rose-warmups/tasty/lib. I put my tasty.go main function file in rose-warmups/tasty, which means I get an executable named tasty when I build. I put the library files (lexer, parser, etc) in the lib directory. This allows the code to all live together despite being split between the main and tasty packages.
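To make the call stack from earlier concrete, here's roughly the shape of the main function under that layout. The import path, the hard-coded filename, and the lack of flag handling or scaling are my simplifications for illustration – this isn't the real tasty.go:

package main

import (
	"fmt"
	"log"
	"os"

	tasty "rose-warmups/tasty/lib" // assumed import path; the package in lib is named tasty
)

func main() {
	f, err := os.Open("sample.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// The parser pulls tokens from the lexer, which reads characters from the file.
	recipe, err := tasty.NewParser(f).Parse()
	if err != nil {
		log.Fatal(err)
	}

	// Do whatever we want with the Recipe, e.g. scale it and print it.
	fmt.Println(*recipe)
}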

It took a surprising number of tries to get to the point I wanted to be at. I’m glad that it turns out Go supports what I wanted to do, but it’s frustrating that it was so hard to find that out. Usually, I’m pretty good at picking Google queries, but I guess I have worse luck with Go (or the documentation has some holes).

Here’s the directory tree for my rose-warmups repo. There are more than five files in lib because that count didn’t include the LICENSE file for the code based on the tutorial or the three test files.

rose-warmups/
├── LICENSE
├── README.md
└── tasty
    ├── lib
    │   ├── LICENSE.txt
    │   ├── lexer.go
    │   ├── lexer_test.go
    │   ├── parser.go
    │   ├── parser_test.go
    │   ├── recipe.go
    │   ├── scale.go
    │   ├── scale_test.go
    │   └── tokens.go
    ├── sample.txt
    └── tasty.go

Writing a Lexer in Go

I like writing my parsers by hand. It gives me the most control over how things get parsed, and doesn’t require learning a new language just for this one task (e.g. lex/yacc, parsec). While it matters more in bigger projects like programming languages, writing a parser by hand also gives you much better control over your error messages – it’s a lot easier to make them beautiful and human-friendly if you’re hand-writing the parser.

Go makes hand-writing lexers easier by including bufio in the standard library; this “buffered io” means I can read one character at a time without having to worry about the performance implications. The bufio.Reader will take care of reading the file in more reasonable (i.e. bigger) chunks.

Because I’m new to Go, I started from this sql parser tutorial. The code there is already structured into a lexer and parser, so large parts of my lexer are copy-pasted with minor edits. Therefore, I made sure to include the MIT license file from the sql parser’s GitHub repo in my tasty/lib folder with a note about what files it applies to.

I’ve written parsers by hand before, so I largely skipped straight to the code examples rather than reading the prose part of the tutorial. This seems to have worked out fine, since I did end up with working code. However, it may affect my depth of understanding in this blog post.

The Types

The first code I write in most programs is the type definitions. I can’t write the functions/design the algorithm until I know the shape of the data. For fancy/fast algorithms, there’s some back and forth in the design process, but it always starts with a data structure.

With a lexer, the first thing to define is the tokens. These are the “words” of our language, and they include keywords, numbers, words, and some special characters. The definitions of the tokens depends on what is useful to the parser and what the lexer can know from looking only at the characters. Our language is very simple, so the lexer can find most of the interesting constructs – like the unit keywords – very easily. There are words for how easy a language is to parse, like “context-free grammar”, but I’m not interested in the theory at the moment.

// Represents a lexical token
type Token int

// Represents the kinds of tokens we can have in a recipe
const (
	// for anything we can't parse otherwise
	ILLEGAL Token = iota

	// special characters
	EOF
	NEWLINE
	PERIOD

	// unit keywords
	KW_TEASPOON
	KW_TABLESPOON
	KW_CUP
	KW_QUART

	// multicharacter constructs
	WS
	WORD
	INTEGER
)

An ILLEGAL token will only be produced when the file is invalid – this is what happens when the lexer can’t understand a character. There are three special characters - EOF for the end of the file/input, NEWLINE for \n, and PERIOD for .. Because the quantity of an ingredient can be written as a floating point number, the lexer/parser need to handle two forms of number: DIGITS and DIGITS PERIOD DIGITS. I chose to handle this by having the lexer produce tokens for a series of digits and for a period. The parser will have to put the floating point number back together from three tokens.

There are four units handled by the Recipe type I want to parse, so there are four unit keywords in the token enum. These are pretty straightforward - a c that’s not part of a word will parse to KW_CUP. The lexer will take care of making sure the c in the word cat doesn’t end up as KW_CUP.

The last three token types are for whitespace, words, and digits. Whitespace is only intraline whitespace – spaces and tabs. Words are whitespace-separated words in an ingredient name. INTEGER is the name I picked for a series of digits; on reflection, DIGITS is possibly a better name. In the same vein, LETTERS is probably a clearer name for WORD, in the sense that it focuses on what the lexer is actually promising when it produces the tokens.

I went into more detail about the model-related types in the previous post, but I’ll repeat the relevant one here: the unit enum.

type Unit int
const (
	TEASPOON Unit = iota
	TABLESPOON
	CUP
	QUART
)

There’s one more type to cover, but it’s a type for representing the lexer itself rather than something customized to the ingredient list language.

// Scanner represents a lexical scanner.
type Scanner struct {
	r *bufio.Reader
}

// NewScanner returns a new instance of Scanner.
func NewScanner(r io.Reader) *Scanner {
	return &Scanner{r: bufio.NewReader(r)}
}

The type representing my lexer is Scanner. I think Go ascribes some special meaning to the term scanner, but I don’t know what that is. I just copied the type name from that tutorial. The neat thing here is the bufio.Reader that does the work of reading from a file and buffering small reads for me. NewScanner just wraps a bufio.Reader around a normal io.Reader, which is what most file opening functions return. Having a constructor here avoids boilerplate (i.e. wrapping all my files in bufio.Readers before lexing) and hides an implementation detail.

The new Go syntax here is *, which has the same meaning as in C/C++. * and & are the two pointer special characters. * in a type name means that it’s a pointer to that type – Scanner is an instance of the Scanner type and *Scanner is a pointer to an instance of the Scanner type. When working with values, * dereferences a pointer (e.g. if sp is a *Scanner, then *sp is a Scanner) and & does the opposite (e.g. if s is a Scanner, then &s is a *Scanner). It took me a long time to consistently remember which character does which thing, but it’s useful just to remember that if you see & or * in front of a variable name or otherwise used as a single-argument operator, then something pointer-related is going on. It’s a lot easier to look up “pointer syntax in Go” via Google than to try to search for the meaning of punctuation marks.
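Here’s that syntax in isolation, using the Scanner type from above (a toy example, not code from the lexer):

func pointerExample() {
	var s Scanner            // a Scanner value (its bufio.Reader is nil here)
	var sp *Scanner = &s     // & takes the address of s, giving a *Scanner
	var copied Scanner = *sp // * dereferences sp, copying the Scanner it points to
	_ = copied               // only here so the example compiles
}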

The Functions

Now let’s look at the actual code that turns characters into tokens. The first thing I did was define classes of characters that are relevant to my little ingredient list language: intraline whitespace, letters, and digits.

// Returns true if ch is a space or a tab.
func isWhitespace(ch rune) bool {
	return ch == ' ' || ch == '\t'
}

// Returns true if ch is an upper or lowercase letter.
func isLetter(ch rune) bool {
	return (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z')
}

// Returns true if ch is a numeric digit
func isDigit(ch rune) bool {
	return (ch >= '0' && ch <= '9')
}

The odd thing you might notice is that these functions take runes, but I said we were categorizing characters. This is a neat feature of Go – runes cover both ascii characters (the only ones allowed in my language) and unicode code points. I’m glad that unicode handling is built-in to the normal way of handling characters/strings. Because runes represent characters in UTF-8, I can play the normal character range tricks for checking if a character is a letter/number without needing to check all the possible letters/numbers individually.

Characters are represented by numbers (as is everything a computer understands). There are 128 ASCII characters, represented by the numbers 0-127. There are standard meanings for each of those values as ASCII characters (e.g. A is 65). The largest value needed to represent an ASCII character (i.e. 127) takes 7 bits to represent. The 8 in UTF-8 refers to its 8-bit code units: the first 128 values are used to represent ASCII characters, and then different meanings are assigned to bytes with higher values.

I’m not sure exactly how UTF-8 uses the higher values to represent all the unicode characters, but the important thing about UTF-8 for compatibility is that if you represent normal ASCII characters in it, they have the same values as they normally would. There’s no difference between ASCII characters stored as 8-bit values (which is standard, since a byte is 8 bits) and those characters stored as UTF-8.
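A quick illustration of runes just being numbers (assuming fmt is imported):

func runeExample() {
	fmt.Println('A')                      // prints 65, the ASCII/UTF-8 value of 'A'
	fmt.Println('a' <= 'q' && 'q' <= 'z') // prints true – the same range trick isLetter uses
	fmt.Println(string(rune(98)))         // prints "b", since 98 is the value for 'b'
}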

Next, I wrote the two most basic functions of the Scanner – reading a character and un-reading a character. The other functions for lexing a token from the bufio.Reader will be reading one character at a time. If they read a character they don’t want (i.e. when the whitespace parser finds a non-whitespace character), they can put the character back (so the next function can use it) and return.

// read reads the next rune from the buffered reader.
// Returns the rune(0) if an error occurs (or io.EOF is returned).
func (s *Scanner) read() rune {
	ch, _, err := s.r.ReadRune()
	if err != nil {
		return rune(0)
	}
	return ch
}

// unread places the previously read rune back on the reader.
func (s *Scanner) unread() { _ = s.r.UnreadRune() }

The (s *Scanner) before the function names seems to make this a method of *Scanner because I call these methods using the syntax sc.read(), where sc is a *Scanner. The error handling is pretty basic – read returns the same rune (rune(0)) for both a normal condition (EndOfFile) and any error condition. Given the intent of this program, this meets my reliability criteria – “works well in normal circumstances on my computer”. unread is interesting in that it uses an _ to catch the return value of bufio.Reader.UnreadRune; assigning to _ tells the compiler I’m deliberately discarding that value rather than forgetting about it. The discarded value is an error – UnreadRune complains if the most recent operation on the reader wasn’t a successful ReadRune.

I know we just started at the very bottom of the scanner (the lowest level functions), but I’m going to jump to the highest level function (the one other files will call) so you can see the control flow. This function is called Scan, and it returns the next token in the file/input. It doesn’t lex the file all at once, it waits to be asked for each new token.

When asked to find the next token, the first thing to do is read the next character. If that character is a digit, letter, or tab/space, Scan calls the matching function – the implementation of which I’ll show you in a moment. If none of those functions apply, it checks for special single-character tokens (EOF, newline, period). If the character still hasn’t matched to anything, it’s an illegal character so Scan returns ILLEGAL.

// Scan returns the next token and literal value.
func (s *Scanner) Scan() (tok Token, lit string) {
	// Read the next rune.
	ch := s.read()

	// If we see whitespace then consume all contiguous whitespace.
	// If we see a letter then consume as a word or reserved word.
	// If we see a digit then consume as a number.
	if isWhitespace(ch) {
		s.unread()
		return s.scanWhitespace()
	} else if isLetter(ch) {
		s.unread()
		return s.scanString()
	} else if isDigit(ch) {
		s.unread()
		return s.scanDigit()
	}

	// Otherwise read the individual character.
	switch ch {
	case rune(0):
		return EOF, ""
	case '\n':
		return NEWLINE, string(ch)
	case '.':
		return PERIOD, string(ch)
	}

	return ILLEGAL, string(ch)
}

There are six possible categories a legal character could fall into – tab/space, letter, digit, EOF, newline, or period. This code checks whether a character is in each category in that order, but the order doesn’t matter since the categories are exclusive (so only one will match any given character). The functions for scanning whitespace/digits/letters all expect to use s.read to get all of their characters, so Scan uses unread before calling them.

There are three functions for lexing the tokens that can take up more than one character: scanWhitespace, scanString, and scanDigit. The structure of each of these is similar – keep absorbing characters into a buffer until you find a character that doesn’t match the criteria (e.g. a character other than a space or \t).

// scanWhitespace consumes the current rune and all contiguous whitespace.
func (s *Scanner) scanWhitespace() (tok Token, lit string) {
	// Create a buffer and read the current character into it.
	var buf bytes.Buffer
	buf.WriteRune(s.read())

	// Read every subsequent whitespace character into the buffer.
	// Non-whitespace characters and EOF will cause the loop to exit.
	for {
		if ch := s.read(); ch == eof {
			break
		} else if !isWhitespace(ch) {
			s.unread()
			break
		} else {
			buf.WriteRune(ch)
		}
	}

	return WS, buf.String()
}

The bytes.Buffer is some kind of expandable array. I’ve read that Go has fixed-size arrays and adjustable slices, and watched my boyfriend struggle with various edge cases of slices. This seems to be something different, since there’s no [] involved, but I’m not sure what’s different underneath. Maybe it’s a nice API on top of a slice of bytes?
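Whatever it is underneath (the docs describe it as a variable-sized buffer of bytes), on its own bytes.Buffer is easy to use – a quick sketch, assuming bytes and fmt are imported:

func bufferExample() {
	var buf bytes.Buffer // the zero value is an empty buffer, ready to use
	buf.WriteRune('h')
	buf.WriteRune('i')
	buf.WriteString(" there")
	fmt.Println(buf.String()) // prints "hi there"
}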

The main control flow in this function is a loop. Because the function might absorb any number of tabs/spaces, it uses a forever loop – equivalent to while true in languages with while loops and for(;;) in C. Go only has one loop keyword - for. This loop covers all cases – forever, while, do, for, for-each. This makes Go’s own parser simpler by removing some keywords/syntax, but I’m not sure if it’s that much of a simplification for programmers. You still need to remember the right syntax to make it do the thing you want, even if it always starts with the same keyword.
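For reference, here are the shapes that the one for keyword covers (a toy example, assuming fmt is imported; none of this is lexer code):

func loopExamples(items []string) {
	for { // forever loop, like while(true)
		break
	}
	for len(items) > 3 { // while-style loop: just a condition
		items = items[1:]
	}
	for i := 0; i < 3; i++ { // classic C-style loop
		fmt.Println(i)
	}
	for i, item := range items { // for-each over a slice (also works on maps, strings, channels)
		fmt.Println(i, item)
	}
}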

The next function, scanDigit, is very similar, except that it absorbs digits (0-9) rather than tabs and spaces.

// scanDigit consumes the current rune and all contiguous digit runes.
func (s *Scanner) scanDigit() (tok Token, lit string) {
	// Create a buffer and read the current character into it.
	var buf bytes.Buffer
	buf.WriteRune(s.read())

	// Read every subsequent ident character into the buffer.
	// Non-ident characters and EOF will cause the loop to exit.
	for {
		if ch := s.read(); ch == eof {
			break
		} else if !isDigit(ch) {
			s.unread()
			break
		} else {
			_, _ = buf.WriteRune(ch)
		}
	}

	// Otherwise return as a regular identifier.
	return INTEGER, buf.String()
}

I’m not sure why the else case here is any different than in scanWhitespace. I’m pretty sure I copy-pasted that, but probably from a different function (scanIdent) in the same source.

The last function, scanString, is a little different. After pulling all the contiguous letters into a buffer, it checks for the unit keywords (t, T, c, q) and returns special token types for those. Anything else becomes a WORD token.

// scanString consumes the current rune and all contiguous letter runes.
func (s *Scanner) scanString() (tok Token, lit string) {
	// Create a buffer and read the current character into it.
	var buf bytes.Buffer
	buf.WriteRune(s.read())

	// Read every subsequent ident character into the buffer.
	// Non-ident characters and EOF will cause the loop to exit.
	for {
		if ch := s.read(); ch == eof {
			break
		} else if !isLetter(ch) {
			s.unread()
			break
		} else {
			_, _ = buf.WriteRune(ch)
		}
	}

	// If the string matches a keyword then return that keyword.
	switch buf.String() {
	case "t":
		return KW_TEASPOON, buf.String()
	case "T":
		return KW_TABLESPOON, buf.String()
	case "c":
		return KW_CUP, buf.String()
	case "q":
		return KW_QUART, buf.String()
	}

	// Otherwise return as a regular identifier.
	return WORD, buf.String()
}

The switch statement here is the code that actually makes certain letters mean particular units, so it’s good that it’s easy to see what letter goes to what unit. This seems like a potential spot to have a bug if you’re writing your first parser, because all of the keywords are 1 character, so it’s a bit tempting to just check for “keyword characters” rather than whole words. That would create a bug where you couldn’t have ingredient names that contain or start with the unit letters.

The Tests

You’ve now seen all the code for the lexer; it’s under 150 lines total. The last thing to do before moving on to the parser is to write some tests. It’s easier to do a round of debugging on the lexer now than have to deal with lexer and parser bugs at the same time later.

I’ve got three unit tests written for the lexer, but I’m going to just show you one because they aren’t that interesting or different from each other. Each test has a string literal for the input and a slice of TokenInstances for the expected output. TokenInstance is a pair of a Token type and the string that was matched. This is basically what Scanner.Scan returns, but Go doesn’t have tuples except as function return types, so I needed to create a struct.

// Represents a token type and the literal characters it was parsed from.
type TokenInstance struct {
	Type    Token
	Literal string
}

While it is a bit annoying to be defining new types every time I just want to make a list of pairs, I do like that there are more names on my types. I’ve read enough legacy ruby code to know how frustrating tracking down the type of random arguments can be. It wouldn’t be as bad in Go, since you do have to declare argument types, but it’s nice to have the exact shape of a type documented by a definition (vs. passing around maps everywhere).

The test will set up the string and expected output, and then compare the actual output to the expected output. It uses t.Error, where t is a *testing.T, for any mismatches. t.Error causes the test to fail and its argument is the error message.

func TestLexingIngredients(t *testing.T) {
	var r io.Reader = strings.NewReader("2c mayo\n  8 T brown sugar\n 2t cayenne\n")
	var correct_parsed []TokenInstance
	correct_parsed = []TokenInstance{
				TokenInstance{INTEGER, "2"},
				TokenInstance{KW_CUP, "c"},
				TokenInstance{WS, " "},
				TokenInstance{WORD, "mayo"},
				TokenInstance{NEWLINE, "\n"},
				TokenInstance{WS, "  "},
				TokenInstance{INTEGER, "8"},
				TokenInstance{WS, " "},
				TokenInstance{KW_TABLESPOON, "T"},
				TokenInstance{WS, " "},
				TokenInstance{WORD, "brown"},
				TokenInstance{WS, " "},
				TokenInstance{WORD, "sugar"},
				TokenInstance{NEWLINE, "\n"},
				TokenInstance{WS, " "},
				TokenInstance{INTEGER, "2"},
				TokenInstance{KW_TEASPOON, "t"},
				TokenInstance{WS, " "},
				TokenInstance{WORD, "cayenne"},
				TokenInstance{NEWLINE, "\n"},
			}

	s := NewScanner(r)
	for _, ti := range correct_parsed {
		token, literal := s.Scan()
		if token != ti.Type {
			t.Error("Expected ", ti.Type, " but got ", token, " with value ", literal)
		}
		if literal != ti.Literal {
			t.Error("Expected ", ti.Literal, " but got ", literal, " with type ", token)
		}
	}
	token, _ := s.Scan()
	if token != EOF {
		t.Error("Expected EOF, but got ", token)
	}

}

The []TokenInstance literal here is really annoying to type, but someone has to generate the correct answer, and I haven’t implemented a test generator for the lexer yet. In my correct answer, it was important to make sure my whitespace strings had the correct number of spaces – for example, the token before 8 has two spaces. Having the scanner return the matching substring for all tokens (even obvious ones like NEWLINE) was a design choice from the tutorial I used; I appreciate how easy it makes debugging the lexer – you can see exactly what happened to make the lexer output that token.

Writing a Parser in Go

I’ve shown you the lexer and its test, so now I’m ready to show you the parser and its test. The code for the parser is about the same size as the lexer’s (150 lines). You should notice a similarity in structure – the lexer goes from a stream of characters to a stream of tokens, and the parser goes from a stream of tokens to a list of ingredients. They both absorb one or more items from their input into one item of output.

The Types

The type Scanner is a struct containing a bufio.Reader; the Parser type contains a *Scanner – the lexer type. But that’s not all it has!

The type to represent the parser is a bit more complicated than Scanner, because Scanner doesn’t implement an un-read for tokens the way bufio.Reader does for the stream of characters. Instead, Parser saves the last token it got from Scanner, providing one level of undo.

// Parser represents a parser.
type Parser struct {
	s   *Scanner
	buf struct {
		tok         Token  // last read token
		lit         string // last read literal
		isUnscanned bool   // true if you should read buf first
	}
}

tok and lit are the return values from Scanner.Scan; isUnscanned is the flag for whether the Parser has officially read the token yet. If the token is in buf then the token has been read, but if isUnscanned is set to true, we’ve hit undo and are pretending not to have seen it yet.

This design is, once again, from the tutorial. I think it’s pretty neat to realize you only need one level of undo for a language this simple and to see how easy it is to implement.

I’ll again repeat the relevant types from the previous post. In this case, that means the model type - Recipe - and a type it depends on, Amount.

type Amount struct {
	Quantity float64
	Unit     Unit
}
type Recipe map[string]Amount

Recipe is a map from strings (ingredient names) to Amounts (i.e. 4 cups). I like the built-in syntax for maps in Go, and the fact that there is just one built-in map to default to.
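For context, this is what that built-in map syntax looks like with Recipe (a standalone sketch, assuming fmt is imported – not code from the parser):

func recipeMapExample() {
	r := make(Recipe)                            // or: r := Recipe{}
	r["heavy whipping cream"] = Amount{1, QUART} // add or overwrite an entry
	amt, ok := r["nuts"]                         // ok is false for missing keys
	fmt.Println(amt, ok)                         // prints the zero Amount and false
	for name, amount := range r {                // iteration order is not guaranteed
		fmt.Println(name, amount)
	}
}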

The Functions

When I want to use my parser, the first thing I’ll need to call is the constructor. This takes an io.Reader, which is both what opening a file gives you and what the Scanner's constructor is expecting.

// NewParser returns a new instance of Parser.
func NewParser(r io.Reader) *Parser {
	return &Parser{s: NewScanner(r)}
}

The constructor makes a new Scanner from the io.Reader, and then makes a Parser from the Scanner. Interestingly, even though the Parser struct has two elements, only one of them is being set here, and it’s being set by name (s).

As I did when showing you the lexer, I’ll start with the lowest level functions – the ones that read and unread one token. The Parser.scan function is the one that actually interacts with the lexer; it calls Scanner.Scan to get the next token. Before calling Scanner.Scan though, the Parser.scan function checks whether there’s already a token waiting in the buffer. If there is, it returns that token. Otherwise, Parser.scan gets the next token, stores it in the buffer, and returns it.

// scan returns the next token from the underlying scanner.
// If a token has been unscanned then read that instead.
func (p *Parser) scan() (tok Token, lit string) {
	// If we have a token on the buffer, then return it.
	if p.buf.isUnscanned {
		p.buf.isUnscanned = false
		return p.buf.tok, p.buf.lit
	}

	// Otherwise read the next token from the scanner.
	tok, lit = p.s.Scan()

	// Save it to the buffer in case we unscan later.
	p.buf.tok, p.buf.lit = tok, lit

	return
}

// unscan pushes the previously read token back onto the buffer.
func (p *Parser) unscan() { p.buf.isUnscanned = true }

The syntax for scan's return type here is interesting: it has variable names and types, rather than just types. Also strange: the return statement is just return, without any return values. This is a piece of Go-magic (every language has some magic) – if you name the variables you’re going to return, their values when you get to the return line will be the return values of the function. In scan, the values of tok and lit will be used as the return values if it reaches the return line. If the function finishes early, at return p.buf.tok, p.buf.lit, then those specified values will be used instead.
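Named returns are easier to see in a toy function than in scan itself (this is just an illustration, not parser code):

// divmod names its return values: q and r start at zero, and whatever they
// hold when a bare `return` runs becomes the function's return values.
func divmod(a, b int) (q int, r int) {
	if b == 0 {
		return 0, 0 // an early return can still list explicit values
	}
	q = a / b
	r = a % b
	return // bare return: the current q and r are returned
}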

I’m not sure I like Go’s “local variable names should be very short” style preference. I don’t find p.s.Scan() to be as easy to read as parser.scanner.Scan(), but it does seem less annoying in general than the super long variable names I remember being encouraged to use in college.

On top of scan and unscan, there’s a function scanIgnoreWhitespace that skips a WS token if it’s next in the stream and instead returns the token after the WS token. Because the lexer absorbs all neighboring tabs and spaces into a single token, there will only ever be one WS token, not two in a row.

// scanIgnoreWhitespace scans the next non-whitespace token.
func (p *Parser) scanIgnoreWhitespace() (tok Token, lit string) {
	tok, lit = p.scan()
	if tok == WS {
		tok, lit = p.scan()
	}
	return
}

This function again uses the named return variables in the signature. I can see the convenience better here – there are two places that tok and lit are set, but only one return line. You get a little extra convenience from making your variable names consistent, which probably makes larger functions more readable.

I did a double take when I first looked at this function because it wasn’t obvious to me that there would only ever be one whitespace token in a row. I’m glad I was following a tutorial, because assuming they were right helped me see more quickly why you only have to look one token ahead. If I were writing this unguided, I might have written a version that absorbed any number of whitespace tokens. While that wouldn’t have intrinsically caused bugs, it would have made the function more complex than necessary and could have hidden bugs in the lexer.

As with the lexer, I’ll jump straight from the low-level functions up to the top-level entry point before showing the helper functions in between. The Parse function is the entry point to the parser; it returns an entire ingredients list. It depends on two functions I’ll show you later (parseNumber and parseIngredientName), but otherwise you’ve seen all the types and functions Parse uses.

The return value is a tuple – (*Recipe, error). The parser would like to return a Recipe, but in case of an error, it will return a (nil, error). To allow for nil as a possible value, you use a pointer type (the *). The lexer didn’t need to do this, since all read errors were indicated as EOF and all other errors were illegal characters, represented by an ILLEGAL token.

// Parses one entire ingredients list.
// Expects one ingredient per line, in the form: Number Unit IngredientName NewLine
func (p *Parser) Parse() (*Recipe, error) {
	var r Recipe = make(Recipe)
	for {
		value, err := p.parseNumber()
		if err != nil {
			return nil, err
		}

		tok, lit := p.scanIgnoreWhitespace()
		var unit Unit
		switch tok {
		case KW_TEASPOON:
			unit = TEASPOON
		case KW_TABLESPOON:
			unit = TABLESPOON
		case KW_CUP:
			unit = CUP
		case KW_QUART:
			unit = QUART
		default:
			return nil, fmt.Errorf("found %q, expected a unit", lit)
		}

		ingredient, err := p.parseIngredientName()
		if err != nil {
			return nil, err
		}

		r[*ingredient] = Amount{*value, unit}

		tok, lit = p.scanIgnoreWhitespace()
		if tok == EOF {
			return &r, nil // we got it! :)
		} else {
			p.unscan()
		}
	}
}

The function starts by creating an empty Recipe and then enters a forever loop. The loop will continue until it either encounters an error (which means the file’s syntax isn’t valid) or it reaches the end of the stream of tokens. All of the return nil, err lines represent stopping early due to an error; the return &r, nil line represents successfully reaching the end of the token stream and returning the complete ingredients list.

The unit parsing could have gone in a separate function, but I chose to put it directly in Parse. I like seeing the keyword to unit conversion right at the top, and I feel that since it always only takes one token, it’s simple enough not to need a separate function.

The line r[*ingredient] = Amount{*value, unit} is where each individual ingredient entry is created. Since this happens incrementally (one line at a time), I’ve been considering returning the partial recipe in the error cases, to allow providing better debugging info to users. It would be helpful to be able to say “this is how far we got before the error”, but that’s partially because the lexer/parser don’t keep track of line & character numbers – the most basic information for debugging is “where were you when the bad thing happened?”.

Next, I’ll show you parseNumber and then parseIngredientName. The recipe language allows integers or decimal numbers, no fractions. The parseNumber function expects to see an INTEGER token, representing a contiguous series of digits. After that, it expects to optionally see a PERIOD and then another INTEGER. If the token after the first INTEGER is not a PERIOD, the function unscans it, because that token is not useful for parsing a number. If the first token it tries to read is not an INTEGER, or if the token after the PERIOD is not an INTEGER, then it returns an error. The erroring setup is the same as the Parse function’s – in the good case it returns (*float64, nil) and in the bad cases it returns (nil, error).

func (p *Parser) parseNumber() (*float64, error) {
	tok, lit := p.scanIgnoreWhitespace()
	if tok != INTEGER {
		return nil, fmt.Errorf("found %q, expected number", lit)
	}
	var first_half string = lit
	tok, _ = p.scan()
	if tok == PERIOD {
		tok, lit = p.scan()
		if tok != INTEGER {
			return nil, fmt.Errorf("found %q, expected second half of float", lit)
		}
		f, err := strconv.ParseFloat(first_half+"."+lit, 64)
		if err != nil {
			return nil, err
		} else {
			return &f, nil
		}
	} else {
		p.unscan()
		f, err := strconv.ParseFloat(first_half, 64)
		if err != nil {
			return nil, err
		} else {
			return &f, nil
		}
	}
}

There’s some string concatenation and parsing here – f, err := strconv.ParseFloat(first_half+"."+lit, 64) – for handling decimals as floating point numbers. The 64 is the kind of floating point number I’m choosing to parse from the decimal string – I chose float64s, rather than float32s, because I default to using float64s. Taking my INTEGER PERIOD INTEGER string and having Go parse it as a floating point number seems like the easiest way to go from the text to a number.

The other helper function, parseIngredientName, has a lot in common with the lexer functions. parseIngredientName scans through the tokens, collecting all the words and ignoring intraline whitespace tokens. The general structure, of making sure the first token is a word and then continuing in a forever loop until a token that’s not a word or whitespace is found, is the same as what the lexer functions have. The non-error ending condition is more specific here, since an ingredient name is at the end of a line. The function returns successfully if it finds a NEWLINE or EOF token, but returns an error if it finds a token other than a word or whitespace.

func (p *Parser) parseIngredientName() (*string, error) {
	var name *string
	tok, lit := p.scanIgnoreWhitespace()
	if tok == WORD {
		n := lit
		name = &n
	} else {
		return nil, fmt.Errorf("found %q, expected word", lit)
	}

	for {
		tok, lit = p.scanIgnoreWhitespace()
		if tok == NEWLINE {
			return name, nil
		} else if tok == EOF {
			p.unscan()
			return name, nil
		} else if tok != WORD {
			return nil, fmt.Errorf("found %q, expected word", lit)
		} else {
			n := *name + " " + lit
			name = &n
		}
	}
}

The string concatenation here looks a little funny – it’s spread over two lines with a useless-looking helper variable, n. The issue here is that Go requires a named variable when you take a reference (&) to something: if I in-line the *name + " " + lit expression into name = &n, I get a compile error, because you can’t take the address of an arbitrary expression. The workaround is easy enough, it just looks weird, which is a sign that I don’t really understand Go’s pointers yet.
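Here’s the pattern in isolation (a sketch of my understanding – the compiler error wording isn’t quoted from anywhere):

func appendWord(name *string, lit string) *string {
	// name = &(*name + " " + lit) // does not compile: you can't take the
	//                             // address of an arbitrary expression
	n := *name + " " + lit // put the expression's result in a named variable...
	return &n              // ...which is addressable, so & works
}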

The Tests

You’ve seen all the code for the parser, so now I can show you one of the tests. I have two, but they’re pretty similar, so I’ll just show you one of them. This test uses the same built-in testing framework as the lexer test did, with the *testing.T argument.

The general outline of the test is:

  1. Set up the input string
  2. Set up the correct output (a Recipe)
  3. Parse the input string
  4. Check that the parsed output is the same as the correct output

For #4, I have two foreach loops – one to loop over the keys in the correct answer and a second one to loop over the keys in the parsed answer. This lets me check that both all the correct keys are there and that there aren’t any extra keys in the parsed version.

func TestParsingIngredients(t *testing.T) {
	var r io.Reader = strings.NewReader("2c mayo\n  8 T brown sugar\n 2t cayenne\n")
	correct_recipe := Recipe{
		"mayo":        Amount{2, CUP},
		"brown sugar": Amount{8, TABLESPOON},
		"cayenne":     Amount{2, TEASPOON},
	}

	var p *Parser = NewParser(r)
	pr, err := p.Parse()
	if err != nil {
		t.Error("Got parsing error: ", err)
		return
	}
	parsed_recipe := *pr
	for key, value := range correct_recipe {
		if parsed_recipe[key] != value {
			t.Error("Expected ", value, ", but got ", parsed_recipe[key])
		}
	}
	for key, value := range parsed_recipe {
		if correct_recipe[key] != value {
			t.Error("Expected ", correct_recipe[key], ", but got ", value)
		}
	}
}

The main difference from the lexer test is that the lexer’s Scan function produces one token at a time, but the parser’s Parse function consumes the entire input at once. It would also be fine for the parser to consume one line at a time – many programming language parsers return one expression at a time, and one ingredient here would be the equivalent.

Go is helpful in making sure you don’t assume your map’s keys will come out in any particular order. I ran into this when printing out Recipes, not writing this test, but Go actually randomizes the order of the keys when you iterate over a map. This is a feature to keep programmers from accidentally depending on the order. I was frustrated that my printed-out keys kept coming out in different orders (making it hard to manually compare them), but I’m really happy that I’ve found out that Go is helping me avoid non-deterministic bugs in my code.
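You can watch this happen by ranging over the same map twice in one program (assuming fmt is imported; the orders genuinely vary between iterations, not just between runs):

func printTwice(r Recipe) {
	for name := range r {
		fmt.Print(name, " ") // one order...
	}
	fmt.Println()
	for name := range r {
		fmt.Print(name, " ") // ...and quite possibly a different order
	}
	fmt.Println()
}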

Conclusion

I enjoyed writing in Go more than I expected to. When I’ve heard Go described as “a better version of C” or “C++ done right”, I expected it to be a hassle to write or to learn – like memory management in C or the infinite minutiae of C++. However, it turns out that the key words were “better” and “done right”, not C/C++.

Go does borrow a lot of syntax from C/C++, but I’m a lot more likely to choose to start a new project in Go than either of those, barring special constraints. I wrote C++ when I was at Google, and I really enjoyed learning it. Google is a great place to write C++ – they keep up with the new versions, they have all the tooling, they use the same subset of the language throughout their codebase, and there are lots of expert C++ programmers to go to with questions. I would not have qualms about working on another project at Google in C++.

However, despite a year of writing C++ at work, I wouldn’t feel comfortable writing C++ on a for-fun project. I was adding code to an existing project, and had a lot of guidance on how to structure the new code I wrote. The Google tools and libraries that I learned to use are nearly all closed-source and only available at Google. (Exception: asan/tsan/msan were very useful, and are available outside of Google.)

Go is a lot easier to use, and I don’t feel like I need expert guidance to help me pick which features and libraries to use. I have not yet run into seemingly inexplicable syntax or header file race conditions. C++ is a good language for certain circumstances – large companies with experienced programmers writing performance-sensitive code – but not for the rest of my Recurse batch. I’m still planning on playing with HaikuOS (an operating system written in C++) a bit more, but probably less than I would have if I hadn’t taken up Go.