=================
== stopsky.net ==
=================

Regex vs. Lexing

dev pl howto

When All You Have is Regex, Everything is a Pattern

I often see developers use regular expressions to parse languages such as DSLs, data serialization formats, and config file formats. Regular expressions are powerful, concise, available in lots of languages as a sort of embedded sub-language, and already familiar to many programmers, so they can be a good solution here. Clearly some people are finding success using them this way.

But there’s another way to approach the problem of scanning a program for its syntax elements: a native-code lexical analyzer, or lexer. I want to do an in-depth comparison of lexing with the regular expression approach.

Syntax Scanning in a Nutshell

The part of the overall parsing process that typically employs regex involves the identification of every element of syntax in a program; assume we’re leaving the task of interpretation and/or execution to later stages (not everyone does this in practice, but it simplifies the exploration). Consider the language behind this example program:

c = 1;
m = 21 ;
d = c^2;
e=m*d;

About this syntax: let’s call an indivisible element of syntax a “token.” This concept of a token has two attributes - its type, which is a hint used in later semantic analysis, and its lexeme: the literal value in the program (4). There are a few different kinds of tokens in the language in which this program is written:

Type Example Lexeme
symbol The c in c = 1; is a symbol token. A symbol could be any letter.
equals The = in c = 1; is an equals token.
integer The 21 in m = 21 ; is an integer token; so is the 2 in d = c^2.
exponentiate The ^ in d = c^2; is an exponentiate token.
multiply The * in e = m * d; is a multiply token.
eos The ; in c = 1; is an eos token; it marks the end of the statement.

In general, the type must always be determined as a result of lexing. For some types the lexeme doesn’t affect interpretation, e.g., an equals is always an = and always has the same meaning. For other types, e.g. integer, the type covers a range of meanings that is made specific by the lexeme value: it could be 1 or 21, or any other integer. For types like integer, the lexeme carries information that must be preserved and passed on so that later stages can interpret the token correctly. Whitespace characters like spaces and newlines have no meaning here, so they are not tokenized; they only serve to make the program easier for a human to read.
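
To make that concrete, here is one minimal way a token could be represented; the Token name and its fields are placeholders for this post, not anything standard:

from collections import namedtuple

# a token is just a (type, lexeme) pair
Token = namedtuple('Token', ['type', 'lexeme'])

# the first statement of the example program, c = 1;, would tokenize to:
# [Token('symbol', 'c'), Token('equals', '='), Token('integer', '1'), Token('eos', ';')]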

Those are all the rules. You could of course have more rules, or different rules, but these are the ones we’re using for the exercise. This seems fair enough: usually, you don’t get to decide the rules for the language you’re parsing, and the few features included in this example are fairly typical.

Regex Tokenization

The regex for naively tokenizing by whitespace is very concise: [^\s]+ (meaning: match any contiguous run of one or more non-whitespace characters), but it fails for the above example because it yields the following incorrect set of tokens, delimited by angle brackets <>:

<c><=><1;><m><=><21><;><d><=><c^2;><e=m*d;>
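
For reference, here is that naive split in Python (using \S+, which is equivalent to [^\s]+), assuming the example program is stored in a string:

import re

program = 'c = 1;\nm = 21 ;\nd = c^2;\ne=m*d;'
print(re.findall(r'\S+', program))
# ['c', '=', '1;', 'm', '=', '21', ';', 'd', '=', 'c^2;', 'e=m*d;']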

Based on the given syntax rules, the correct token stream would be:

<c><=><1><;><m><=><21><;><d><=><c><^><2><;><e><=><m><*><d><;>

Bummer! We can’t just add spaces around every printable character either, because <21> would become <2><1>, which is incorrect. So the expression needs to be slightly more complicated. Based on our above table of tokens, here are the sub-patterns for each:

Token Sub-Pattern
symbol [A-Za-z]
equals =
integer \d+
exponentiate \^
multiply \*
eos ;

We combine all of these patterns into a single expression, [A-Za-z]|=|\d+|\^|\*|; (meaning: match the pattern for symbol, or the pattern for equals… etc.); this yields the correct result:

<c><=><1><;><m><=><21><;><d><=><c><^><2><;><e><=><m><*><d><;>
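
In Python, for example, a single findall with that combined expression performs the whole lexeme-extraction pass:

import re

program = 'c = 1;\nm = 21 ;\nd = c^2;\ne=m*d;'
lexemes = re.findall(r'[A-Za-z]|=|\d+|\^|\*|;', program)
print(lexemes)
# ['c', '=', '1', ';', 'm', '=', '21', ';', 'd', '=', 'c', '^', '2', ';',
#  'e', '=', 'm', '*', 'd', ';']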

This is half of what we need: the lexeme for each token. The type of each token is not identified by evaluating the regular expression, so we’ll have to make a second pass over the lexemes, testing each one against the individual sub-patterns to determine its type.

program = 'c = 1;\nm = 21 ;\nd = c^2;\ne=m*d;'
lexemes = matchAll('[A-Za-z]|=|\d+|\^|\*|;', program)
// that gives us: [<c>, <=>, <1>, <;>, <m>, ... , <;>]

patterns = [
    {type=symbol, pattern='[A-Za-z]'},
    {type=integer, pattern='\d+'},
    ...
    {type=eos, pattern=';'},
]

tokens = []

for lexeme in lexemes
    for pattern in patterns
        if pattern.match(lexeme)
            tokens.append(
                Token{type=pattern.type, lexeme=lexeme})
            break  // if found, move on to next lexeme
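
Here is roughly what that two-pass approach looks like as runnable Python, using re.fullmatch for the per-lexeme type test; the pattern list and Token shape mirror the pseudocode above:

import re
from collections import namedtuple

Token = namedtuple('Token', ['type', 'lexeme'])
program = 'c = 1;\nm = 21 ;\nd = c^2;\ne=m*d;'

# pass 1: pull out the lexemes with the combined expression
lexemes = re.findall(r'[A-Za-z]|=|\d+|\^|\*|;', program)

# pass 2: test each lexeme against the individual sub-patterns to find its type
patterns = [
    ('symbol', r'[A-Za-z]'),
    ('equals', r'='),
    ('integer', r'\d+'),
    ('exponentiate', r'\^'),
    ('multiply', r'\*'),
    ('eos', r';'),
]

tokens = []
for lexeme in lexemes:
    for type_, pattern in patterns:
        if re.fullmatch(pattern, lexeme):
            tokens.append(Token(type_, lexeme))
            break  # first matching pattern wins; move on to the next lexeme

print(tokens[:4])
# [Token(type='symbol', lexeme='c'), Token(type='equals', lexeme='='),
#  Token(type='integer', lexeme='1'), Token(type='eos', lexeme=';')]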

This algorithm makes a key assumption: that each lexeme unambiguously matches exactly one pattern. If a lexeme is matched by more than one token pattern, you might need additional information from context in order to decide which of the matches is the correct one (2)(3). Many languages do not have this kind of contextual ambiguity, or if they do, it can be fairly easily worked around. (A good example might be Python’s use of meaningful indentation: matching indent tokens requires contextual information, such as the number and type of characters in the lexemes of previous indent tokens.)

Evaluating the type of each lexeme gives us complete tokens in the order they appear in the program:

<symbol:c><equals:=><integer:1><eos:;><symbol:m>
<equals:=><integer:21><eos:;><symbol:d><equals:=>
<symbol:c><exponentiate:^><integer:2><eos:;><symbol:e>
<equals:=><symbol:m><multiply:*><symbol:d><eos:;>

Done; this can be passed on to a parser for semantic analysis.

Hints/Lessons

Some observations from this exercise:

  • Regex gives a very concise way of getting tokens out of this context-free set of syntax rules, and requires very little surrounding native code to evaluate each token type. It might get a bit more difficult for languages with more complex rules.

  • A minimum of two passes is made: one for the tokenization, one for the evaluation. As long as it’s fast enough in practice, that might not matter too much, but it’s something to consider.

  • There are already some hints about another approach. We make two passes here; if only the regex engine could tell us which sub-pattern matched, it could give us the type as well. Instead, the second pass effectively splits the regex engine into many copies that don’t know about each other, which is probably less efficient because the state of each individual search is isolated from the others (1)(2).

It seems like the more we try to bend this general tool to our specific purpose, the closer we get to turning the regex part inside out to add the features we need. This is the same dilemma encountered whenever you choose between using something closely related to your problem and writing a custom solution. In this case: how much of the regex search machinery would we have to reimplement to get direct access to token types without having to repeat searches? (Or is there a way around this with regex that I may not be aware of? Let me know, please.) Let’s see what it’s like to write something that can efficiently search for our specific patterns and solve both problems at once.
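
(As a partial answer to that parenthetical question: some regex engines can report which alternative matched. In Python’s re module, for instance, naming each sub-pattern as a group and reading m.lastgroup yields both the type label and the lexeme in one pass. A sketch for completeness; it isn’t the route the rest of this post takes:)

import re

program = 'c = 1;\nm = 21 ;\nd = c^2;\ne=m*d;'
combined = (r'(?P<symbol>[A-Za-z])|(?P<equals>=)|(?P<integer>\d+)'
            r'|(?P<exponentiate>\^)|(?P<multiply>\*)|(?P<eos>;)')

for m in re.finditer(combined, program):
    print(m.lastgroup, m.group())
# symbol c
# equals =
# integer 1
# eos ;
# symbol m
# ... and so on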

Note that this doesn’t mean reimplementing regular expressions! Regex is a factory that creates whatever matcher you specify. We just want a very specific matcher for our language rules.

Native-Code Tokenization

We’re being led naturally toward the native-code lexer option: a pattern matcher, like the ones generated by regex, but built for just this specific set of rules. Writing it ourselves lets us get the lexemes and the token types in a single pass (1)(4).

I could just show how this is done, since it’s an established art (1)(3)(4), but let’s iterate on the regex example in the hope that the differences in approach become clearer. Let’s start by dropping the combined tokenization regex and doing the tokenization alongside the type evaluation, using each token pattern separately.

program = 'c = 1;\nm = 21 ;\nd = c^2;\ne=m*d;'
// lexemes = matchAll('[A-Za-z]|=|\d+|\^|\*|;', program)
// UPDATED - ^^^ skip the one-shot match ^^^

patterns = [
    {type=symbol, pattern='[A-Za-z]'},
    {type=integer, pattern='\d+'},
    ...
    {type=eos, pattern=';'},
]

tokens = []

i := 0
while i < len(program)
    // NEW - skip past whitespace
    if program[i] in '\n\t '
        i += 1
        continue

    for pattern in patterns
        // UPDATED - find match in program starting at i
        if lexeme := pattern.match(program[i:])
            tokens.append(
                Token{type=pattern.type, lexeme=lexeme})
            // NEW - skip ahead past match
            i += len(lexeme)
            break  // if found, move on past the match
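
As runnable Python, that single-pass, per-pattern version might look something like this (compiled sub-patterns matched directly against the input at position i; the error branch at the end is an addition to keep the sketch from looping forever on unexpected input):

import re
from collections import namedtuple

Token = namedtuple('Token', ['type', 'lexeme'])
program = 'c = 1;\nm = 21 ;\nd = c^2;\ne=m*d;'

patterns = [(type_, re.compile(p)) for type_, p in [
    ('symbol', r'[A-Za-z]'),
    ('equals', r'='),
    ('integer', r'\d+'),
    ('exponentiate', r'\^'),
    ('multiply', r'\*'),
    ('eos', r';'),
]]

tokens = []
i = 0
while i < len(program):
    if program[i] in '\n\t ':        # skip past whitespace
        i += 1
        continue
    for type_, regex in patterns:
        m = regex.match(program, i)  # match anchored at position i
        if m:
            tokens.append(Token(type_, m.group()))
            i = m.end()              # skip ahead past the match
            break
    else:
        raise SyntaxError('unexpected character at position %d' % i)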

Now we get complete tokens in a single pass over the input. The only problem remaining is that the patterns still run separately. This is where we part ways with regex and write the specific token searcher we need: something functionally equivalent to the engine compiled from the regex pattern [A-Za-z]|=|\d+|\^|\*|;.

State Machines

Regex matchers dynamically generate a state machine based on the given expression and then run that against the input text (1)(2). A state machine is a collection of states, each with a transition function (1). There is a special start state, and one or more finish states (3). You run the machine like this:

  1. feed (input, position) into the transition function for the current state
  2. the transition function may advance the current position in the input
  3. the result of the transition function indicates the next state
  4. if a next state was found and more input is available, make the next state the current state and continue from 1.
  5. otherwise, halt the machine
  6. if the machine halted in a finish state, the run was successful; otherwise it failed

There are different classes of machines, as well as formal definitions of state machines as part of computation theory, but generally the above features should be enough to implement a matcher (2)(3). Indeed, for this simple language, we only need a few states:

  • startState: the beginning of the program, or the point just after a token has been emitted. Transitions to startState, integerState, errorState, or nothing. Emits a symbol, equals, eos, multiply, or exponentiate token, or nothing. It is a finish state.
  • errorState: entered when the machine needs to halt but the current state is a finish state. Transitions to nothing and emits nothing. Not a finish state.
  • integerState: entered at the start of an integer. Transitions to startState or nothing. Emits an integer token, or nothing. Not a finish state.

…we could add states if we wanted, or get rid of integerState, but this is a good compromise to demonstrate various techniques used for larger languages.

// ITEM 1)
// input: program text, current position index
// output: next state function, updated position index

func startState(program, i)
...

func errorState(program, i)
...

func integerState(program, i)
...

currentState := startState
finishStates := [startState]
tokens = []
i := 0

// ITEM 4) + 5)
// run machine in a loop as long as there is input and a state
while currentState and i < len(program)

    // ITEM 3)
    // determine next state from result of current transition
    nextState, i := currentState(program, i)
    currentState = nextState

// ITEM 6)
// indicate failure if halted in non-finish state
if currentState not in finishStates:
    raise SyntaxError
else:
    // success!
    assert tokens

We can create a to-do list with most items already scratched off!

  1. state functions take (input, position) as parameters
  2. transition functions advance the position
  3. the result of a transition determines the next state
  4. run in a loop as long as there is a next state
  5. the machine halts if there is no more input or no next state
  6. indicate failure if halted in a non-finish state

The only thing left is to make the transitions work correctly, without regex help. Most of the work can be done in startState (1):

func startState(program, i)
    if i == len(program):
        return None, i
    r := program[i]

    // ignore whitespace
    if r in '\n\t ':
        return startState, i + 1

    if r.isAlpha():
        tokens.append(
            Token{type=symbol, lexeme=r})
        return startState, i + 1
    ..
    ..
    if r == '=':
        tokens.append(
            Token{type=equals, lexeme=r})
        return startState, i + 1

    if r.isDecimalDigit():
        return integerState, i

    return errorState, i

The state for parsing integers is a good example of delegating a more involved task to its own state (1):

func integerState(program, i)
    for j in i..len(program)
        if not program[j].isDecimalDigit()
            break
    if j > i:
        tokens.append(
            Token{type=integer, lexeme=program[i:j]})
        return startState, j
    else:
        return None, j

…note that errorState is not used here: if integerState fails, it returns no next state, the machine halts, and the halt is reported as a syntax error since we did not stop in a finish state. This implies a syntax rule we didn’t explicitly mention: a program can’t end with an integer. If that were not the rule, we could add integerState to the set of finish states.

The errorState, as we’ve defined it, is really simple:

func errorState(program, i)
    return None, i
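
Putting the pieces together, here is one way the whole machine might look in runnable Python. It follows the pseudocode above, with the branches elided from startState filled in for this post’s token set, and with integerState treating digits that run to the end of the input as a failure, per the note above:

from collections import namedtuple

Token = namedtuple('Token', ['type', 'lexeme'])
tokens = []

# each state takes (program text, current position) and returns
# (next state function or None, updated position)
def startState(program, i):
    if i == len(program):
        return None, i
    r = program[i]
    if r in '\n\t ':                      # ignore whitespace
        return startState, i + 1
    if r.isalpha():
        tokens.append(Token('symbol', r))
        return startState, i + 1
    if r == '=':
        tokens.append(Token('equals', r))
        return startState, i + 1
    if r == '^':
        tokens.append(Token('exponentiate', r))
        return startState, i + 1
    if r == '*':
        tokens.append(Token('multiply', r))
        return startState, i + 1
    if r == ';':
        tokens.append(Token('eos', r))
        return startState, i + 1
    if r.isdigit():
        return integerState, i            # delegate integers to their own state
    return errorState, i                  # unexpected character

def integerState(program, i):
    j = i
    while j < len(program) and program[j].isdigit():
        j += 1
    if j == i or j == len(program):
        # no digits found, or the input ended mid-integer (see the note above): fail
        return None, j
    tokens.append(Token('integer', program[i:j]))
    return startState, j

def errorState(program, i):
    return None, i

program = 'c = 1;\nm = 21 ;\nd = c^2;\ne=m*d;'
currentState, i = startState, 0
finishStates = [startState]

# run the machine as long as there is a state and input left
while currentState and i < len(program):
    currentState, i = currentState(program, i)

# indicate failure if we halted in a non-finish state
if currentState not in finishStates:
    raise SyntaxError('tokenization failed near position %d' % i)

print(tokens[:4])
# [Token(type='symbol', lexeme='c'), Token(type='equals', lexeme='='),
#  Token(type='integer', lexeme='1'), Token(type='eos', lexeme=';')]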

Outcome

Now that we have access to the internals of the scanner, we have a lot of flexibility. We can:

  • emit complete tokens in a single pass of the input
  • make the syntax rules as precise or as lenient as we need
  • wire in, or interface with, a parser in whatever manner we need to
  • defer some validation to the later stages
  • insert detailed logging at any point for the user to see where/why tokenization failed

I did a lazy benchmark of these two scanners, each ported to Python and run on CPython 3.8.2 (using a pre-compiled stdlib regex). 1e5 scans of the same test program show a roughly 1.4x slowdown for the regex version of the scanner:

$ time python regex.py ; time python state-machine.py 

real    0m20.210s
user    0m20.156s
sys     0m0.004s

real    0m13.605s
user    0m13.506s
sys     0m0.059s

$ bc <<< "scale=1; 20.156/13.506"
1.4

In this very narrow case, performance is better for the static state machine, but not dramatically so, given that a single scan averages under 0.5 ms in either case. A more thorough analysis across inputs, languages, and regex engines would be interesting, but is out of scope here.

Alternatives

(Or: This seems like a lot of work)

The time you save may not be just your own. You only have to do it ~once for a given language. And this pattern can be applied like a template if you need to do it for another language.

Still, if you think you can learn a new tool faster than you can learn this technique, there are lexer generators for many popular languages (e.g. flex, ANTLR, PLY, RLTK) (4)(5). You give them a token set (sometimes as regex!), and they generate a lexer. You can generally expect varying levels of performance and code clarity with a generator (1). While they become essential for the later stages of parsing, they can be more trouble than they are worth for scanning (1). It depends on you and the available tools; a survey of non-human lexer generators is out of the scope of this post.

Acknowledgements

This work owes a lot to Rob Pike’s ca. 2011 talk on the topic of lexers in Go, and presumably his original blog post, which I have not had a chance to read. I watched the talk so I could write a lexer in Go for DNS zone files (which I did), and I did this write-up because I didn’t understand a point made in his presentation – about why switch statements are not great – and I wanted to zoom in a bit to understand why.

I should like to thank the people at work who saw my RLTK parser demo in 2016 and put at least one or two more regex-based parsers into the wild. Somewhere between “TL;DR” and “unimpressive demo” is a sweet spot where you win friends and influence people.

Also I ought to thank Karen Lemone, from whom I first learned about lexing.

References