Tutorial for CSP Canonical LR(1) Parser Generator
For this tutorial, we'll create a simple parser for the 4 basic math operations in
an equation. Just to make things interesting, we'll use floating point (doubles) because
we'll need to expand the definition of a token to keep the result as it's parsed. We'll
also go further into some of the more advanced features while still keeping it simple.
First, we need a grammar for the 4 operations of additon, substraction, multiplication and
division. Here is a basic grammar for what we want.
expr:
expr '+' num
expr '-' num
expr '*' num
expr '/' num
num
num:
###!([0-9]*\.)?[0-9]+([eE][-+]?[0-9]+)?##
A few things to note here is that all 4 operations have the same precedence level. Also, it
would be nice to include parentheses to create explicit orders of operations.
addsub:
addsub '+' muldiv
addsub '-' muldiv
muldiv
muldiv:
muldiv '*' num
muldiv '/' num
num
num:
###!([0-9]*\.)?[0-9]+([eE][-+]?[0-9]+)?##
'(' addsub ')'
Now we have the order of preference and we've added parentheses. However, this will only
check for syntax. No computations will actually take place. For this, we need action routines.
Here we should explain how action routines get called. Action routines only get called after
a reduction. This is an LR(1) parser, so we don't know what production we're on until a reduction
is done. Therefore, all action routines on a production get called one after the other in the order
they appear in the production.
The arguments to the action routine are a start and end iterator for the tokens read in, one
for each item in the production. Note that conditional section take up one token. Even if a
production or conditional section reduced to epsilon, there is still an empty token there so that
you always know what index is a token. After a reduction and all action routines have been called,
only the first token will remain. So any values you want to pass up to other productions will have
to be located in this first token. The third and last argument is a pointer to the token
immediately preceeding the action routine.
The tokens however, only contain the NUMBER of the token and where in the input file it was read in.
A parser doesn't care about the actual strings read in. It only cares about the token NUMBER so that
it can continue going through the parse table. So we'll have to extend the CSScanner class to
store the values read in. To store this value, we'll also need to extend the CSToken class. As for the
four ops, we only need the NUMBER to tell them apart, so we just need to assign them ID's.
addsub:
addsub #ADD##+## muldiv _action_Compute
addsub #SUB##-## muldiv _action_Compute
muldiv
muldiv:
muldiv #MUL##*## num _action_Compute
muldiv #DIV##/## num _action_Compute
num
num:
#NUMBER##!([0-9]*\.)?[0-9]+([eE][-+]?[0-9]+)?#number#
'(' addsub ')' _action_MoveValue
You may have noticed few things here. One is that we use that same action routine for all
four operations. This is because we can just check token->tokenID for the values of CSPTOK_ADD,
CSPTOK_SUB, CSPTOK_MUL, CSPTOK_DIV and do the appropriate computation. No need for multiple
action routines although we could have done it the other way. The other way is slightly faster
because no checking is done. But it requires more functions and unecessarily clutters the code.
Another thing is the MoveValue action routine. The addsub production will have a token assigned
to it for the purposes of the action routine. And it will contain the results of previous calculations.
But to keep passing values along, it must be in the first token. The first token here is the '('
token. So we simply copy the value from the addsub token to the '(' token. That's why we need
an action routine. Note that this is not necessary for the production immediately above it for numbers
because there is only one token, therefore the value is already in the first token.
We also added the word "number" as the expected string so that the parser can tell the user
what it was expecting rather than saying it expected "([0-9]*\\.)?[0-9]+([eE][-+]?[0-9]+)?". We also gave it an ID of NUMBER so
that we know to save its value in the scanner. For the four operators, we
don't mention an expect string because printing the operator itself is easier to understand.
So far, we have simply worked on the grammar. No code or nothing has actually been done.
But there is one last thing to do. When we are done computing, we would like to store the value
somewhere not in a token so that we can print it when the parser returns. This is because when
the parser returns, all tokens are deleted and the value from the computations will be gone.
So we simply add another production with an action routine to save the value.
expr:
addsub _action_SaveValue EOL
addsub:
addsub #ADD##+## muldiv _action_Compute
addsub #SUB##-## muldiv _action_Compute
muldiv
muldiv:
muldiv #MUL##*## num _action_Compute
muldiv #DIV##/## num _action_Compute
num
num:
#NUMBER##!([0-9]*\.)?[0-9]+([eE][-+]?[0-9]+)?#number#
'(' addsub ')' _action_MoveValue
We forgot one thing: spaces. We can't use any spaces in this grammar right now. Here we use the dot where
we can use spaces and have to add the SPACES production. We'll also add an end of line production.
However, there is one thing to watch for. We cannot use dots in front of the operators. Because
CSP LR(1) chooses shifting instead of reducing in a conflict to handle the dangling else problem,
we would get a parser that doesn't quite do what we want. Look at the following INCORRECT grammar.
expr:
. addsub _action_SaveValue . EOL
addsub:
addsub . #ADD##+## . muldiv _action_Compute
addsub . #SUB##-## . muldiv _action_Compute
muldiv
muldiv:
muldiv . #MUL##*## . num _action_Compute
muldiv . #DIV##/## . num _action_Compute
num
num:
#NUMBER##!([0-9]*\.)?[0-9]+([eE][-+]?[0-9]+)?#number#
'(' . addsub _action_MoveValue . ')'
SPACES:
###![ \t]*#Space#
@
EOL:
#EOL##![\r\n]#End of line#
@
After reading in a floating point number, it would reduce to num. Now if a space follows,
does it reduce to muldiv or reduce all the way to addsub? The problem is that it would reduce
to muldiv and since this can shift spaces, it would not reduce further and expect a * or /
to follow even though a + or - is actually in the input stream. And this would cause an error.
If this were a LR(2) parser generator, then we'd be fine, but since we only look one token
ahead, we have to be careful of these situations. We'll create a regular expression for each
of the operators so that it can "eat" the spaces. If you want to remove the spaces entirely from
the input, you can override the GetToken function in CSScanner to do just that.
expr:
. addsub _action_SaveValue . EOL
addsub:
addsub #ADD##![ \t]*+[ \t]*#+# muldiv _action_Compute
addsub #SUB##![ \t]*-[ \t]*#-# muldiv _action_Compute
muldiv
muldiv:
muldiv #MUL##![ \t]*\*[ \t]*#*# num _action_Compute
muldiv #DIV##![ \t]*/[ \t]*#/# num _action_Compute
num
num:
#NUMBER##!([0-9]*\.)?[0-9]+([eE][-+]?[0-9]+)?#number#
'(' . addsub _action_MoveValue . ')'
SPACES:
###![ \t]*#Space#
@
EOL:
#EOL##![\r\n]#End of line#
@
We moved the action routine for MoveValue so that we have direct access to the addsub value.
The epsilon production are there because spaces aren't required. Although the regular expression
can match 0 characters, the parser does not recognise this as a match. The parser requires at
least 1 character, otherwise you must use an epsilon production.
That's it for the grammar. And it's to be expected that the grammar would be where the most
planning is required. Everything revolves around your grammar.
Now we get down to coding. First thing is to create the core of the parser. Create an empty
directory and save the above grammar in a file in that directory. I will call it grammar.txt for
simplicity. It also helps to copy csparser.exe or whatever executable you have if on another OS
into this directory. Type "csparser grammar.txt" in a windows shell or "./csparser grammar.txt" on Linux.
This should simply say "Success!" and have created 5 units (10 files) in this directory along with
other files that show the internals of the states and productions before and after table compression.
Use your favorite compiler and add the 5 .cpp files and 5 header files if your compiler needs it.
Now we need to add a double value to the CSToken class. We don't actually touch any of the generated
code otherwise, if we change the grammar, our changes will be lost. The CSScanner class is what
allocates tokens, thus we'll need to subclass this. We may as well subclass it now (and we'll override all
its functions). So we create a new unit (.cpp and .h file).
We'll call this unit MyScanner.cpp and MyScanner.h. The CSToken class is defined in the CSParser.h file,
so we need to include that in the MyScanner.h file. This is what MyScanner.h will look like.
#ifndef MyScannerH
#define MyScannerH
#include "CSParser.h"
#include "CSScanner.h"
class MyToken : public CSToken
{
public:
double value;
MyToken();
~MyToken();
};
class MyScanner : public CSScanner
{
public:
bool error;
MyScanner(FILE *fh);
void GetToken(CSToken &token, int *tokens, int tokens_count);
CSToken* AllocToken();
void Error(CSParser *parser, char *str, CSToken *tok);
};
#endif
In the .cpp file, we don't need to do much, just create a constructor that sets value to 0
to be safe and an empty destructor.
Now we move on to the scanner. For the constructor, we just pass the argument to the base class.
The AllocToken function is what allocates tokens. Therefore,
we need to override AllocToken so that we allocate MyToken instead of CSToken.
For the Error function, we simply need to call the base class function and then set error to
true. This is so we know if there was an error or not when the parser is done.
For the GetToken function, we need to store the value if the token read in is a number.
This may seem like a bit of work, but it's done this way so that you can customise the parser
any which way you please. There is nothing in this parser that cannot be customised. Also,
we are going into some of the more advanced features of the parser. In any case, as you'll see,
most functions are empty or only have one line.
#include <stdio.h>
#include <stdlib.h>
#include <string>
#include "MyScanner.h"
#include "CSTokens.h"
#include "CSRegEx.h"
MyToken::MyToken()
: CSToken()
{
value = 0.0;
}
MyToken::~MyToken()
{
}
MyScanner::MyScanner(FILE *fh)
: CSScanner(fh)
{
}
CSToken* MyScanner::AllocToken()
{
return new MyToken();
}
void MyScanner::GetToken(CSToken &token, int *tokens, int tokens_count)
{
CSScanner::GetToken(token,tokens,tokens_count);
// Store the value if it's a number.
if (token.tokenID==CSPTOK_NUMBER)
{
char c = line[charID];
line[charID] = '\0';
MyToken *tok = dynamic_cast<MyToken*>(&token);
tok->value = atof(line+token.column-1);
line[charID] = c;
}
}
void MyScanner::Error(CSParser *parser, char *str, CSToken *tok)
{
CSScanner::Error(parser,str,tok);
error = true;
}
The hard part is now done. And the only real code was storing the number.
The action routines are super easy to do. We now subclass the CSAction class. Create
a new unit called MyAction.cpp and MyAction.h. The action routine prototypes are copy and pasted
from the CSAction.h file without the "=0" part. We'll just show the two classes as they're
extremely simple.
MyAction.h:
#ifndef MyActionH
#define MyActionH
#include "CSAction.h"
#include "MyScanner.h"
class MyAction : public CSAction
{
protected:
virtual bool SaveValue(CSTokenIterator start, CSTokenIterator end, CSToken *token);
virtual bool Compute(CSTokenIterator start, CSTokenIterator end, CSToken *token);
virtual bool MoveValue(CSTokenIterator start, CSTokenIterator end, CSToken *token);
public:
bool error;
double value; // the final result.
MyAction();
~MyAction();
};
#endif
MyAction.cpp
#include "MyAction.h"
#include "CSTokens.h"
MyAction::MyAction()
: CSAction()
{
value = 0.0;
error = false;
}
MyAction::~MyAction()
{
}
// Action routines
bool MyAction::SaveValue(CSTokenIterator start, CSTokenIterator end, CSToken *token)
{
MyToken *tok = dynamic_cast<MyToken*>(token);
value = tok->value;
return true;
}
bool MyAction::Compute(CSTokenIterator start, CSTokenIterator end, CSToken *token)
{
// Get token for the operator
MyToken *op = dynamic_cast<MyToken*>(*(start+1));
MyToken *arg1 = dynamic_cast<MyToken*>(*start);
MyToken *arg2 = dynamic_cast<MyToken*>(token);
if (op->tokenID==CSPTOK_ADD)
{
arg1->value += arg2->value;
}
else if (op->tokenID==CSPTOK_SUB)
{
arg1->value -= arg2->value;
}
else if (op->tokenID==CSPTOK_MUL)
{
arg1->value *= arg2->value;
}
else if (op->tokenID==CSPTOK_DIV)
{
if (arg2->value==0.0)
{
printf("Error: Division by zero.\n");
error = true;
return false;
}
arg1->value/=arg2->value;
}
return true;
}
bool MyAction::MoveValue(CSTokenIterator start, CSTokenIterator end, CSToken *token)
{
MyToken *tok = dynamic_cast<MyToken*>(token); // get addsub token
MyToken *firsttok = dynamic_cast<MyToken*>(*start); // get first token
firsttok->value = tok->value; // copy value over
return true;
}
Now all we have to do is create the main function. We'll put this in main.cpp.
#include <stdio.h>
#include "MyAction.h"
#include "MyScanner.h"
#include "CSParser.h"
int main(int argc, char *argv[])
{
if (argc!=2)
{
printf("Usage: calc {filename}\n");
return 0;
}
FILE *fh = fopen(argv[1],"r");
if (fh==NULL)
{
printf("Could not open file \"%s\" for reading.", argv[1]);
return 0;
}
// Create scanner
MyScanner *scanner = new MyScanner(fh);
// Create action routines
MyAction *actions = new MyAction();
// Create parser
CSParser *parser = new CSParser(scanner,actions);
// Parse input file.
parser->Parse();
// Check if no errors.
if ((!scanner->error)&&(!actions->error))
{
printf("Result: %f\n",actions->value);
}
delete parser;
delete scanner;
delete actions;
fclose(fh);
return 0;
}
Compile and create a file with an equation. The parser should print the result.
Now that you have a working parser (a framework if you will), you can extend it to use a grammar of your choice.
Webmaster: Cléo Saulnier
|