Event driven Context Free Grammar (CFG) parsing

As a software engineer and a semi regular computer science guy, I like to take various software challenges that get presented at work and dig a little bit deeper. One such issue was that of exploring context free grammars. We sometimes need to express a CFG and then use a ready made parsing library to check the validity of the string to see if it is accepted by a specific CFG. One of the well known parsers to do CFG parsing is the Earley Parser. The Earley Parser has clearly laid out approach to check if a CFG can express a string.

Some notation before we get started, A CFG consists of a series of rules. The LHS of the rule is a non-terminals and the RHS is a combination of terminals and non-terminals.

For example (quoting from Wikipedia): The following CFG rules

<P> ::= <S>      # the start rule
<S> ::= <S> "+" <M> | <M>
<M> ::= <M> "*" <T> | <T>
<T> ::= "1" | "2" | "3" | "4"

expresses some basic arithmetic expression such as 1+3*4+2 . Note that we are not covering operator precedence here. Just that the string should conform to the grammar above. Here the symbols like 1 , 2 , + etc. are the terminals whereas <T> , <M> represent the non-terminals.

It consists of 3 stages:

  1. Prediction: List out expectation of the next symbol from the in-process rules since that is what we “expect” to get as we scan the input string
  2. Scanning: Read through the input string and update the ruleset which are in various states of completion (due to having being updated by previous symbols from the input string)
  3. Completion: If any rule reaches the last symbol in the RHS, then that rule is considered complete and it in turn updates rules which consumes this rule’s LHS

I have tried to give an intuitive sense of the algorithm here, rather than be super formal about the description. For more formality refer to the Wikipedia explanation. While the algorithm can look a bit unintuitive, squint at a bit and what emerges is a making of an event-driven system!

Here are the components of the event-driven architecture:

So you have a stream of input events consisting of the tokens from the input string. These will be read and fed in a left to right order into the Symbol queue. The events in the Symbol queue will in turn be sent to event listeners. What are the event listeners? Why they are the in-process rules of the CFG of course. The list of event listeners is initialized with the list of base rules from the the CFG. An example of a base rule would look like: <S> ::= <S> + <M>.
The base rules always fire for each symbol received from the Symbol Queue.

A Rule/Event Listener maintains the following state:

  • The input position, i, of the Symbol in the input string from where onwards the Rule should match
  • The position, r of the RHS symbol which is to be matched

When a Rule/Event Listener receives a Symbol

  • If a rule is completely satisfied, i.e. r==len(rhs) the entire RHS of a rule is satisfied, then it emits its own LHS as an event into the Symbol queue
  • If a rule is partially satisfied, then it adds a new event listener, which has the input position set to i+1 and rhs position is r+1 and add it to the list of event listeners

After the entire input string is parsed, if the final event symbol emitted is the start symbol ( <P> ) then we can say the input string is generated by the CFG. There are a lot of nuances to the simplified explanation above which I have elided for the sake of simplicity. For more details refer to the Parser code I have checked into github or view the code here:

Especially beautiful is how it handles the case of matching parenthesis. See example in the code above.

One point to be noted is that null symbols are problematic and I tried a few quick ways to deal with it and did not find a mechanism to make it work. The good part however is that null symbol grammar can be converted into a non-null grammar expression. For example:

<P> ::= <S>
<S> ::= <S> <S>
<S> ::= ( <S> )
<S> ::= [ <S> ]
<S> ::= ε #null terminal condition

can be converted to

<P> ::= <S>
<S> ::= <S> <S>
<S> ::= ( <S> )
<S> ::= [ <S> ]
<S> ::= ( ) # terminal condition
<S> ::= [ ] # terminal condition

which makes it compatible with a non null supporting parser (albeit a bit verbose)

It is possible in theory to build a null supporting parser to by auto-converting the rules into a non-null format, but since this was a hobby project, I decided to call it “good enough” at this point.

Hope you enjoyed this excursion!

A lazy coder who works hard to build good software that does not page you in the middle of the night!