CSP LR(1) Parser Generator

For this tutorial, we'll create a simple parser for the 4 basic math operations in an equation. Just to make things interesting, we'll use floating point (doubles) because we'll need to expand the definition of a token to keep the result as it's parsed. We'll also go further into some of the more advanced features while still keeping it simple.

First, we need a grammar for the 4 operations of additon, substraction, multiplication and division. Here is a basic grammar for what we want.

expr:
  expr '+' num
  expr '-' num
  expr '*' num
  expr '/' num
  num
  
num:
  ###!([0-9]*\.)?[0-9]+([eE][-+]?[0-9]+)?##

A few things to note here is that all 4 operations have the same precedence level. Also, it would be nice to include parentheses to create explicit orders of operations.

addsub:
  addsub '+' muldiv
  addsub '-' muldiv
  muldiv
  
muldiv:
  muldiv '*' num
  muldiv '/' num
  num
  
num:
  ###!([0-9]*\.)?[0-9]+([eE][-+]?[0-9]+)?##
  '(' addsub ')'

Now we have the order of preference and we've added parentheses. However, this will only check for syntax. No computations will actually take place. For this, we need action routines. Here we should explain how action routines get called. Action routines only get called after a reduction. This is an LR(1) parser, so we don't know what production we're on until a reduction is done. Therefore, all action routines on a production get called one after the other in the order they appear in the production.

The arguments to the action routine are a start and end iterator for the tokens read in, one for each item in the production. Note that conditional section take up one token. Even if a production or conditional section reduced to epsilon, there is still an empty token there so that you always know what index is a token. After a reduction and all action routines have been called, only the first token will remain. So any values you want to pass up to other productions will have to be located in this first token. The third and last argument is a pointer to the token immediately preceeding the action routine.

The tokens however, only contain the NUMBER of the token and where in the input file it was read in. A parser doesn't care about the actual strings read in. It only cares about the token NUMBER so that it can continue going through the parse table. So we'll have to extend the CSScanner class to store the values read in. To store this value, we'll also need to extend the CSToken class. As for the four ops, we only need the NUMBER to tell them apart, so we just need to assign them ID's.

addsub:
  addsub #ADD##+## muldiv _action_Compute
  addsub #SUB##-## muldiv _action_Compute
  muldiv
  
muldiv:
  muldiv #MUL##*## num _action_Compute
  muldiv #DIV##/## num _action_Compute
  num
  
num:
  #NUMBER##!([0-9]*\.)?[0-9]+([eE][-+]?[0-9]+)?#number#
  '(' addsub ')' _action_MoveValue

You may have noticed few things here. One is that we use that same action routine for all four operations. This is because we can just check token->tokenID for the values of CSPTOK_ADD, CSPTOK_SUB, CSPTOK_MUL, CSPTOK_DIV and do the appropriate computation. No need for multiple action routines although we could have done it the other way. The other way is slightly faster because no checking is done. But it requires more functions and unecessarily clutters the code.

Another thing is the MoveValue action routine. The addsub production will have a token assigned to it for the purposes of the action routine. And it will contain the results of previous calculations. But to keep passing values along, it must be in the first token. The first token here is the '(' token. So we simply copy the value from the addsub token to the '(' token. That's why we need an action routine. Note that this is not necessary for the production immediately above it for numbers because there is only one token, therefore the value is already in the first token.

We also added the word "number" as the expected string so that the parser can tell the user what it was expecting rather than saying it expected "([0-9]*\\.)?[0-9]+([eE][-+]?[0-9]+)?". We also gave it an ID of NUMBER so that we know to save its value in the scanner. For the four operators, we don't mention an expect string because printing the operator itself is easier to understand.

So far, we have simply worked on the grammar. No code or nothing has actually been done. But there is one last thing to do. When we are done computing, we would like to store the value somewhere not in a token so that we can print it when the parser returns. This is because when the parser returns, all tokens are deleted and the value from the computations will be gone. So we simply add another production with an action routine to save the value.

expr:
  addsub _action_SaveValue EOL
  
addsub:
  addsub #ADD##+## muldiv _action_Compute
  addsub #SUB##-## muldiv _action_Compute
  muldiv
  
muldiv:
  muldiv #MUL##*## num _action_Compute
  muldiv #DIV##/## num _action_Compute
  num
  
num:
  #NUMBER##!([0-9]*\.)?[0-9]+([eE][-+]?[0-9]+)?#number#
  '(' addsub ')' _action_MoveValue

We forgot one thing: spaces. We can't use any spaces in this grammar right now. Here we use the dot where we can use spaces and have to add the SPACES production. We'll also add an end of line production. However, there is one thing to watch for. We cannot use dots in front of the operators. Because CSP LR(1) chooses shifting instead of reducing in a conflict to handle the dangling else problem, we would get a parser that doesn't quite do what we want. Look at the following INCORRECT grammar.

expr:
  . addsub _action_SaveValue . EOL
  
addsub:
  addsub . #ADD##+## . muldiv _action_Compute
  addsub . #SUB##-## . muldiv _action_Compute
  muldiv
  
muldiv:
  muldiv . #MUL##*## . num _action_Compute
  muldiv . #DIV##/## . num _action_Compute
  num
  
num:
  #NUMBER##!([0-9]*\.)?[0-9]+([eE][-+]?[0-9]+)?#number#
  '(' . addsub _action_MoveValue . ')' 

SPACES:
  ###![ \t]*#Space#
  @

EOL:
  #EOL##![\r\n]#End of line#
  @

After reading in a floating point number, it would reduce to num. Now if a space follows, does it reduce to muldiv or reduce all the way to addsub? The problem is that it would reduce to muldiv and since this can shift spaces, it would not reduce further and expect a * or / to follow even though a + or - is actually in the input stream. And this would cause an error. If this were a LR(2) parser generator, then we'd be fine, but since we only look one token ahead, we have to be careful of these situations. We'll create a regular expression for each of the operators so that it can "eat" the spaces. If you want to remove the spaces entirely from the input, you can override the GetToken function in CSScanner to do just that.

expr:
  . addsub _action_SaveValue . EOL
  
addsub:
  addsub #ADD##![ \t]*+[ \t]*#+# muldiv _action_Compute
  addsub #SUB##![ \t]*-[ \t]*#-# muldiv _action_Compute
  muldiv
  
muldiv:
  muldiv #MUL##![ \t]*\*[ \t]*#*# num _action_Compute
  muldiv #DIV##![ \t]*/[ \t]*#/# num _action_Compute
  num
  
num:
  #NUMBER##!([0-9]*\.)?[0-9]+([eE][-+]?[0-9]+)?#number#
  '(' . addsub _action_MoveValue . ')' 

SPACES:
  ###![ \t]*#Space#
  @

EOL:
  #EOL##![\r\n]#End of line#
  @

We moved the action routine for MoveValue so that we have direct access to the addsub value. The epsilon production are there because spaces aren't required. Although the regular expression can match 0 characters, the parser does not recognise this as a match. The parser requires at least 1 character, otherwise you must use an epsilon production.

That's it for the grammar. And it's to be expected that the grammar would be where the most planning is required. Everything revolves around your grammar.

Now we get down to coding. First thing is to create the core of the parser. Create an empty directory and save the above grammar in a file in that directory. I will call it grammar.txt for simplicity. It also helps to copy csparser.exe or whatever executable you have if on another OS into this directory. Type "csparser grammar.txt" in a windows shell or "./csparser grammar.txt" on Linux. This should simply say "Success!" and have created 5 units (10 files) in this directory along with other files that show the internals of the states and productions before and after table compression.

Use your favorite compiler and add the 5 .cpp files and 5 header files if your compiler needs it.

Now we need to add a double value to the CSToken class. We don't actually touch any of the generated code otherwise, if we change the grammar, our changes will be lost. The CSScanner class is what allocates tokens, thus we'll need to subclass this. We may as well subclass it now (and we'll override all its functions). So we create a new unit (.cpp and .h file). We'll call this unit MyScanner.cpp and MyScanner.h. The CSToken class is defined in the CSParser.h file, so we need to include that in the MyScanner.h file. This is what MyScanner.h will look like.

#ifndef MyScannerH
#define MyScannerH

#include "CSParser.h"
#include "CSScanner.h"

class MyToken : public CSToken
{
public:
  double value;
  MyToken();
  ~MyToken();
};

class MyScanner : public CSScanner
{
public:
  bool error;
  MyScanner(FILE *fh);
  void GetToken(CSToken &token, int *tokens, int tokens_count);
  CSToken* AllocToken();
  void Error(CSParser *parser, char *str, CSToken *tok);
};
#endif

In the .cpp file, we don't need to do much, just create a constructor that sets value to 0 to be safe and an empty destructor.

Now we move on to the scanner. For the constructor, we just pass the argument to the base class. The AllocToken function is what allocates tokens. Therefore, we need to override AllocToken so that we allocate MyToken instead of CSToken.

For the Error function, we simply need to call the base class function and then set error to true. This is so we know if there was an error or not when the parser is done.

For the GetToken function, we need to store the value if the token read in is a number.

This may seem like a bit of work, but it's done this way so that you can customise the parser any which way you please. There is nothing in this parser that cannot be customised. Also, we are going into some of the more advanced features of the parser. In any case, as you'll see, most functions are empty or only have one line.

#include <stdio.h>
#include <stdlib.h>
#include <string>
#include "MyScanner.h"
#include "CSTokens.h"
#include "CSRegEx.h"

MyToken::MyToken()
  : CSToken()
{
  value = 0.0;
}

MyToken::~MyToken()
{
}

MyScanner::MyScanner(FILE *fh)
  : CSScanner(fh)
{
}

CSToken* MyScanner::AllocToken()
{
  return new MyToken();
}

void MyScanner::GetToken(CSToken &token, int *tokens, int tokens_count)
{
  CSScanner::GetToken(token,tokens,tokens_count);
  
  // Store the value if it's a number.
  if (token.tokenID==CSPTOK_NUMBER)
  {
    char c = line[charID];
    line[charID] = '\0';
    MyToken *tok = dynamic_cast<MyToken*>(&token);
    tok->value = atof(line+token.column-1);
    line[charID] = c;
  }
}

void MyScanner::Error(CSParser *parser, char *str, CSToken *tok)
{
  CSScanner::Error(parser,str,tok);
  error = true;
}

The action routines are super easy to do. We now subclass the CSAction class. Create a new unit called MyAction.cpp and MyAction.h. The action routine prototypes are copy and pasted from the CSAction.h file without the "=0" part. We'll just show the two classes as they're extremely simple.

#ifndef MyActionH
#define MyActionH
#include "CSAction.h"
#include "MyScanner.h"

class MyAction : public CSAction
{
protected:
  virtual bool SaveValue(CSTokenIterator start, CSTokenIterator end, CSToken *token);
  virtual bool Compute(CSTokenIterator start, CSTokenIterator end, CSToken *token);
  virtual bool MoveValue(CSTokenIterator start, CSTokenIterator end, CSToken *token);
public:
  bool error;
  double value; // the final result.
  MyAction();
  ~MyAction();
};
#endif

#include "MyAction.h"
#include "CSTokens.h"

MyAction::MyAction()
  : CSAction()
{
  value = 0.0;
  error = false;
}

MyAction::~MyAction()
{
}

// Action routines
bool MyAction::SaveValue(CSTokenIterator start, CSTokenIterator end, CSToken *token)
{
  MyToken *tok = dynamic_cast<MyToken*>(token);
  value = tok->value;
  return true;
}

bool MyAction::Compute(CSTokenIterator start, CSTokenIterator end, CSToken *token)
{
  // Get token for the operator
  MyToken *op = dynamic_cast<MyToken*>(*(start+1));
  MyToken *arg1 = dynamic_cast<MyToken*>(*start);
  MyToken *arg2 = dynamic_cast<MyToken*>(token);
  if (op->tokenID==CSPTOK_ADD)
  {
    arg1->value += arg2->value;
  }
  else if (op->tokenID==CSPTOK_SUB)
  {
    arg1->value -= arg2->value;
  }
  else if (op->tokenID==CSPTOK_MUL)
  {
    arg1->value *= arg2->value;
  }
  else if (op->tokenID==CSPTOK_DIV)
  {
    if (arg2->value==0.0)
    {
      printf("Error: Division by zero.\n");
      error = true;
      return false;
    }
    arg1->value/=arg2->value;
  }
  return true;
}

bool MyAction::MoveValue(CSTokenIterator start, CSTokenIterator end, CSToken *token)
{
  MyToken *tok = dynamic_cast<MyToken*>(token);  // get addsub token
  MyToken *firsttok = dynamic_cast<MyToken*>(*start); // get first token
  firsttok->value = tok->value;  // copy value over
  return true;
}

#include <stdio.h>
#include "MyAction.h"
#include "MyScanner.h"
#include "CSParser.h"

int main(int argc, char *argv[])
{
  if (argc!=2)
  {
    printf("Usage: calc {filename}\n");
    return 0;
  }
  
  FILE *fh = fopen(argv[1],"r");
  if (fh==NULL)
  {
    printf("Could not open file \"%s\" for reading.", argv[1]);
    return 0;
  }
  
  // Create scanner
  MyScanner *scanner = new MyScanner(fh);
  
  // Create action routines
  MyAction *actions = new MyAction();
  
  // Create parser
  CSParser *parser = new CSParser(scanner,actions);

  // Parse input file.
  parser->Parse();
  
  // Check if no errors.
  if ((!scanner->error)&&(!actions->error))
  {
    printf("Result: %f\n",actions->value);
  }

  delete parser;
  delete scanner;
  delete actions;

  fclose(fh);
  return 0;
}

Now that you have a working parser (a framework if you will), you can extend it to use a grammar of your choice.