Coco/R for C#

H. Mössenböck
Johannes Kepler University Linz
Institute of Practical Computer Science

This version amended (23 October 2002) and extended by
P.D. Terry
Computer Science Department
Rhodes University, Grahamstown

You may obtain the latest version of this document from http://www.scifac.ru.ac.za/coco/cshcoco.htm

Coco/R is a compiler generator which takes a compiler description in the form of an LL(1) attributed grammar and generates a scanner and parser for the described language. This paper describes a version of Coco/R for C# that slightly extends the original, and is a modified version of the original documentation by Mössenböck that you can read here. There are also implementations of Coco/R for Java, C, C#, Pascal, Delphi and Modula-2 and for Oberon.

Contents

  1. Introduction
  2. The specification language Cocol/R
    2.1 Vocabulary
    2.2 Structure of a compiler description
    2.3 Scanner specification
    2.4 Parser specification
  3. User Guide
    3.1 The Parser interface
    3.2 The Token interface
    3.3 The Scanner interface
    3.4 The Errors interface
    3.5 The driver program
    3.6 Grammar tests
    3.7 Trace output
  4. Hints for Advanced Users of Coco/R
  5. A Sample Compiler
  6. Useful applications of Coco/R
  7. Sources of Coco/R
  8. Portability issues
  9. Literature

1. Introduction

Coco/R takes an LL(1) attributed grammar for a language, expressed in augmented EBNF, and generates C# code for a recursive descent parser and a scanner for that language. To generate a complete application, such as a compiler, a user has also to supply a simple main method that calls the parser, as well as semantic classes that are used from within the grammar (e.g., a symbol table handler and a code generator).


The following example gives you a rough impression of the input language (Cocol/R). It shows a grammar rule for the processing of declarations.

  VarDeclaration<ref int adr>
                         (. Obj obj, obj1;
                            Struct typ;
                            int n, a;
                            string name; .)
  = Ident<out name>      (. obj = SymTab.Find(name);
                            obj.next = null;
                            n = 1; .)
  { ',' Ident<out name>  (. obj1 = SymTab.Find(name);
                            obj1.next = obj;
                            obj = obj1;
                            n++; .)
  }
  ':'
  Type<out typ>          (. adr += n * typ.size;
                            a = adr;
                            while (obj != null) {
                              a -= typ.size;
                              obj.adr = a;
                              obj = obj.next;
                            } .)
    ';'.
The core of this specification is the EBNF rule
  VarDeclaration = Ident {',' Ident} ':' Type ';'.
It is augmented by attributes and semantic actions. The attributes (e.g. <out name>) specify the parameters of the symbols. There are input attributes (e.g. <x, y>) and output attributes (e.g. <out z>) that are written like parameters in C#. The semantic actions are arbitrary C# statements between "(." and ".)". Such semantic actions will be executed by the generated parser at the position in the grammar at which the associated syntactic element is encountered.

Coco/R generates a parser method for every grammar rule. The parser method of the above rule looks as follows:

  static void VarDeclaration (ref int adr) {
    Obj obj, obj1;
    Struct typ;
    int n, a;
    string name;
    name = Ident();
    obj = SymTab.Find(name);
    obj.next = null;
    n = 1;
    while (t.kind == ',') {
      Get();
      name = Ident();
      obj1 = SymTab.Find(name);
      obj1.next = obj;
      obj = obj1;
      n++;
    }
    Expect(':');
    typ = Type();
    adr += n * typ.size;
    a = adr;
    while (obj != null) {
      a -= typ.size;
      obj.adr = a;
      obj = obj.next;
    }
    Expect(';');
  }

In this code can also be seen the statements that interact with the generated scanner, whose responsibility it will be to read the input file and return a stream of tokens to the parser.

2. The specification language Cocol/R

A compiler description can be viewed as consisting of imports, declarations and grammar rules that describe the lexical and syntactical structure of a source language as well as its translation into a target language.

2.1 Vocabulary

The vocabulary of this version of Cocol/R uses identifiers, strings, characters and numbers:

  CHARACTERS
    letter   = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" .
    digit    = "0123456789" .
    cntl     = CHR(0) .. CHR(31) .
    tab      = '\t' .
    lf       = '\n' .
    cr       = '\r' .
    back     = CHR(92) .
    noQuote1 = ANY - '"' - cntl - back .
    noQuote2 = ANY - "'" - cntl - back .
    graphic  = ANY - cntl .
  TOKENS
    ident  = letter {letter | digit} .
    string =   '"' {noQuote1 | back graphic } '"'
             | "'" {noQuote2 | back graphic } "'" .
    number = digit {digit} .

Upper case letters are distinct from lower case letters. Strings must not cross line borders.

Within strings C# escape sequences are interpreted as such, thus for example '\t' representss CHR(9), and '\u0041' represents 'A'. This represents a departure from some other versions of Coco/R and has been implemented slightly differently in different releases.

Keywords are

   ANY         CASE        CHARACTERS  CHR         COMMENTS
   COMPILER    CONTEXT     END         EOF         FROM
   IGNORE      NAMES       NESTED      PRAGMAS     PRODUCTIONS
   SYNC        TO          TOKENS      WEAK        using

The following metacharacters are used to form EBNF expressions:

( )
{ }
[ ]
< >
(. .)
= . | + -
for grouping
for iterations
for options
for attributes
for semantic parts
as explained below
(a | b) c ... a c | b c
{a} b ... b | a b | a a b | a a a b | ...
[a] b ... b | a b

Comments are enclosed in "/*" and "*/" and may be nested.

2.2 Structure of a compiler description

A compiler description in Cocol/R is made up of the following parts:

   Cocol =
     "COMPILER" GoalIdentifier
     ArbitraryCSharpDeclarations
     ScannerSpecification
     ParserSpecification
     "END" GoalIdentifier "." .
   GoalIdentifier = ident .

The name after the keyword COMPILER is the grammar name and must match the name after the keyword END. The grammar name also denotes the topmost nonterminal (the start symbol).

Arbitrary C# code may follow the grammar name. This is not checked by Coco/R. It usually contain declarations of fields and methods to be used in the semantic actions of the parser class that is to be generated, for example.

  static int sum;

  static void Add(int x) {
    sum = sum + x;
  }
In front of the keyword COMPILER one can import namespaces such as:
  using System;
  using System.Collections;

The remaining parts of the compiler description specify the lexical and syntactical structure of the language to be processed. Effectively two grammars are specified - one for the lexical analyzer or scanner, and the other for the syntax analyzer or parser. The nonterminals (token classes) recognized by the scanner are regarded as terminals by the parser.

2.3 Scanner specification

A scanner has to read source text, skip meaningless characters, and recognize tokens which are then passed to a parser. Tokens may be classified either as literals or as token classes. Literals (like "while" and ">=") may be incorporated directly into productions as strings, and do not have to be named. Token classes (such as identifiers or numbers) must be named, and have structures that are specified by regular expressions, defined in EBNF. There are usually many different instances of a token class (e.g., many different identifiers) which are all recognized as the same kind of token.

A scanner specification consists of six optional parts that may be written in arbitrary order.

   ScannerSpecification =
     { CharacterSets
     | Tokens
     | Ignorable
     | Comments
     | Pragmas
     | UserNames
     } .

Character sets. The CharacterSets component allows for the declaration of names for character sets such as letters or digits, and defines the characters that may occur as member of these sets. These names may be used in the other sections of the scanner specification (but not in the parser specification).

   CharacterSets = "CHARACTERS" {SetDecl} .
   SetDecl       = SetIdent "=" CharacterSet .
   CharacterSet  = SimpleSet { ("+" | "-") SimpleSet } .
   SimpleSet     = SetIdent | string | "ANY" | SingleChar [".." SingleChar ] .
   SingleChar    = "CHR" "(" number ")" | character .
   SetIdent      = ident .

Simple character sets are denoted by one of:

string a set consisting of all characters in the string
SetIdent the previously declared character set with this name
CHR(i) a character set consisting of a single element with the ordinal value i (i <=255)
'x' a character set consisting of the single character "x"
CHR(i) .. CHR(j) a character set consisting of all characters whose ordinal values are in the range i ... j
'x' .. 'y' a character set consisting of all characters in the range 'x' .. 'y'
ANY the set of all acceptable (8 bit ASCII) characters (CHR(1) .. CHR(255)).

Simple sets may then be combined by the union (+) and difference (-) operators.

Examples

  CHARACTERS
    digit = "0123456789" .            /* the set of all digits */
    hexdigit = digit + "ABCDEF" .     /* the set of all hexadecimal digits */
    eol = '\r' .                      /* the end-of-line character */
    noDigit = ANY - digit .           /* any character that is not a digit */
    ctrlChars = '\u0001' .. CHR(31) . /* ascii control characters */

Tokens. A token is a terminal symbol for the parser but a syntactically structured symbol for the scanner. The token structure has to be described by a regular expression in EBNF:

   Tokens      = "TOKENS" { Token } .
   Token       = TokenSymbol [ "=" TokenExpr "." ] .
   TokenExpr   = TokenTerm { "|" TokenTerm } .
   TokenTerm   = TokenFactor {TokenFactor} [ "CONTEXT" "(" TokenExpr ")" ] .
   TokenFactor =   SetIdent | string
                 | "(" TokenExpr ")" |  "[" TokenExpr "]" | "{" TokenExpr "}" .
   TokenSymbol = TokenIdent | string .
   TokenIdent  = ident .

Tokens may be declared in any order. A token declaration defines a TokenSymbol together with its structure. Usually the symbol on the left-hand side of the declaration is an identifier, which is then used in other parts of the grammar to denote the structure described on the right-hand side of the declaration by a regular expression (expressed in EBNF). This TokenExpr may contain literals denoting themselves (for example "END"), or the names of character sets (for example digit), denoting an arbitrary character from such sets. The restriction to regular expressions means that it may not contain the names of any other tokens.

While token specification is usually straightforward, there are a number of subtleties that may need emphasizing. For example, since spaces are deemed to be irrelevant when they come between tokens in the input for most languages, one should not attempt to declare literal tokens that have spaces within them.

The grammar for tokens allows for empty right-hand sides. This may seem strange, especially as no scanner is generated if the right-hand side of a declaration is missing. This facility is used if the user wishes to supply a hand-crafted scanner, rather than the one generated by Coco/R (see Section 4). In this case, the symbol on the left-hand side of a token declaration may also simply be specified by a string, with no right-hand side. Tokens specified without right-hand sides are numbered consecutively starting from 0, and the hand-crafted scanner has to return token codes according to this numbering scheme.

There is one predeclared token EOF that can be used in productions where it is necessary to check explicitly that the end of the source has been reached. When the Scanner detects that the end of the source has been reached further attempts to obtain a token return only this one.

The CONTEXT phrase in a TokenTerm means that the term is only recognized when its right hand context in the input stream (i.e. the characters following in the input stream) matches the TokenExpr specified in brackets. Note that the context phrase is not part of the token.

Examples

   TOKENS
     ident  = letter {letter | digit} .
     real   = digit {digit} "." {digit} ["E" ["+"|"-"] digit {digit}] .
     number = digit {digit} | digit {digit} CONTEXT ("..") .

The CONTEXT phrase in the above example allows the scanner to distinguish between reals (e.g., 1.23) and range constructs (e.g., 1..2) that could otherwise not be scanned with a single character lookahead. After reading a "1" and a ".", the scanner still works on both alternatives. If the next character is again a "." the ".." phrase is pushed back to the input stream and a number is returned to the parser. If the next character is not a ".", the scanner continues with the recognition of a real number.

Comments and ignorable characters. Usually spaces within the source text of a program are irrelevant, and in scanning for the start of a token, a Coco/R generated scanner will simply ignore them. Other separators like tabs, line ends, and form feeds may also be declared irrelevant, and some applications may prefer to ignore the distinction between upper and lower case input.

Comments are difficult to specify with the regular expressions used to denote tokens - indeed, nested comments may not be specified at all in this way. Since comments are usually discarded by a parsing process, and may typically appear in arbitrary places in source code, it makes sense to have a special construct to express their structure.

Ignorable aspects of the scanning process are defined in Cocol by

   Comments  = "COMMENTS" "FROM" TokenExpr "TO" TokenExpr [ "NESTED" ] .
   Ignorable = "IGNORE" ( "CASE" | CharacterSet ) .

where the optional keyword NESTED should have an obvious meaning. A practical restriction is that comment brackets must not be longer than 2 characters. It is possible to declare several kinds of comments within a single grammar, for example:

   COMMENTS FROM "/*" TO "*/" NESTED
   COMMENTS FROM "//" TO eol
   IGNORE '\r' + '\t' + '\n'

The set of ignorable characters in this example is that which includes the standard white space separators in ASCII files. The null character CHR(0) should not be included in any ignorable set. It is used internally by Coco/R to mark the end of the input file.

Pragmas. Pragmas, like comments, are tokens that may occur anywhere in the input stream, but, unlike comments, cannot be ignored. Pragmas are often used to allow programmers to select compiler switches dynamically. Since it becomes impractical to modify the phrase structure grammar to handle this, a special mechanism is provided for the recognition and treatment of pragmas. In Cocol they are declared like tokens, but may have an associated semantic action that is executed whenever they are recognized by the scanner.

   Pragmas     = "PRAGMAS" { Pragma } .
   Pragma      = Token [ SemAction ] .
   SemAction   = "(." ArbitraryCSharpCode ".)" .

Example

   PRAGMAS
     option = "$" {letter} .  (. i := 1;
                                 while (i < t.val.length()) {
                                   if (t.val.charAt(i) == 'A') ...;
                                   else if (t.val.charAt(i) == 'B') ...;
                                   i++;
                                 } .)

Note: The next token to be delivered to the parser is available in the field t of the generated parser (see also Section 3.1). It holds the token code (t.kind), the token text (lexeme) (t.val) and the token position (t.pos, t.line, t.col). Thus, t.val is the string of the recognized pragma.

User names. Normally the generated scanner and parser use integer literals to denote the symbols and tokens. This makes for unreadable parsers, in some estimations. If a NAMES part is added to the scanner specification, or the pragma $N is used, or the command line directive -N, Coco/R will generate code that uses names for the symbols. By default these names have a rather stereotyped form, but preferred user-defined names may be specified - this may sometimes be needed to help resolve name clashes (for example, between the default names that would be chosen for "." and "point").

   UserNames  = "NAMES" { UserName } .
   UserName   = TokenIdent  "=" ( identifier | string ) "." .

Example

   NAMES
     period   = "." .
     ellipsis = "..." .

The ability to name terminals is an extension over the original implementation.

2.4 Parser specification

The parser specification is the main part of the compiler description. It contains the productions of an attributed grammar specifying the syntax of the language to be recognized, as well as the action to be taken as each phrase or token is recognized. The form of the parser specification may itself be described in EBNF as follows:

   ParserSpecification = "PRODUCTIONS" { Production } .
   Production          = NonTerminal [ FormalAttributes ] [ LocalDeclarations ]
                         '=' Expression "." .
   FormalAttributes    =   '<' ArbitraryCSharpParameterDeclarations '>'
                         | '<.' ArbitraryCSharpParameterDeclarations '.>' .
   LocalDeclarations   = "(." ArbitraryCSharpDeclarations ".)" .
   NonTerminal         = ident .
Any name appearing in a Production that has not earlier been declared as a terminal token is considered to be the name of a NonTerminal. There must be exactly one production for every NonTerminal (this production may, of course, specify a list of alternative right-hand sides).

Productions. A production is considered as a "procedure" that parses a nonterminal. It has its own scope for attributes and local names, and is made up of a left-hand side and a right-hand side which are separated by an equal sign. The left-hand side specifies the name of the nonterminal, together with its formal attributes and local declarations. The right-hand side consists of an EBNF expression that specifies the structure of the nonterminal as well as its translation.

As in the case of tokens, some subtleties in the specification of productions should be emphasized. Firstly, the productions may be given in any order, but a production must be given for a GoalIdentifier that matches the name used for the grammar.

The formal attributes enclosed in angle brackets "<" and ">" or "<." and ".>" essentially consist of parameter declarations in C#. For example, the declaration

   SomeSymbol <out int n, string name> = ... .
is translated into
   static void SomeSymbol (out int n, string name) {
     ...
   }

The local declarations are arbitrary C# declarations enclosed in "(." and ".)". A production constitutes a scope for its formal attributes and its locally declared names. Terminals and nonterminals, globally declared fields and methods, as well as public classes, are visible in any production.

The syntax of attributes and local declarations is not checked by Coco/R; this is left to the responsibility of the C# compiler that will actually compile the generated application.

The goal symbol may not have any FormalAttributes. Any information that the parser is required to pass back to the calling driver program must be handled in other ways. At times this may prove slightly awkward.

It may happen that an identifier chosen as the name of a NonTerminal may clash with one of the internal names used in the rest of the system. Such clashes will only become apparent when the application is compiled and linked, and may require the user to redefine the grammar to use other identifiers.

Expressions. An EBNF expression defines the context-free structure of some part of the source language, together with the attributes and semantic actions that specify how the parser must react to the recognition of each component.

   Expression = Term { '|' Term } .
   Term       = Factor { Factor } .
   Factor     =   [ "WEAK" ] TokenSymbol
                | NonTerminal [ Attributes ]
                | SemAction
                | "ANY"
                | "SYNC"
                | '(' Expression ')'
                | '[' Expression ']'
                | '{' Expression '}' .
   Attributes =   '<' ArbitraryCSharpParameters '>'
                | '<.' ArbitraryCSharpParameters '.>'.
   SemAction  = "(." ArbitraryCSharpStatements ".)" .
   Symbol     = ident | string .

The Attributes enclosed in angle brackets that may follow a NonTerminal in the context of a Factor effectively denote the actual parameters that will be used in calling the corresponding routine. If a NonTerminal has been defined on the left-hand side of a Production to have FormalAttributes, then every occurrence of that NonTerminal in the right-hand side of a Production must have a list of actual attributes that correspond to the FormalAttributes according to the parameter compatibility rules of C#. However, the conformance is only checked when the generated parser class is compiled.

If the attributes contain the symbol ">" they must be enclosed in brackets of the form "<." and ".>", for example

  <. a>b .>

A semantic action is an arbitrary sequence of C# statements enclosed in "(." and ".)". These are simply incorporated into the generated parser; their syntax is only checked when the generated parser is compiled. The digraph "(." is not allowed within this text, nor are incomplete strings, as these are symptomatic of mismatched text.

The symbol ANY denotes any terminal that cannot follow ANY in that context. It can conveniently be used to parse structures that contain arbitrary text. For example, the translation of a Cocol/R attribute list incorporates actions that look as follows:

   Attributes <out int len>
             (. int pos; .)
   = '<'     (. pos := token.pos + 1; .)
     {ANY}
     '>'     (. len := token.pos - pos; .) .

In this example the closing angle bracket is an implicit alternative of the ANY symbol in curly brackets. The meaning is that ANY matches any terminal except '>'. token.pos is the source text position of the most recently recognized token (token is a field of the generated parser; see Section 31).

Note that it is possible to write a production in which an action appears to be associated with an alternative for an expression that contains no terminals or nonterminals. This feature is often useful. For example, we might have

   Option =  "push" (. stack[++top] = item; .)
           | "pop"  (. item = stack[top--]; .)
           |        (. MonitorStack(); .) .

Syntax error handling. The programmer has to give some hints in order to allow Coco/R to generate good and efficient error handling. Firstly, synchronization points have to be specified. A synchronization point is a location in the grammar where particularly "safe" symbols are expected in the sense that they are hardly ever missing or mistyped, nor do they appear very often in the grammar.

In most languages, good candidates for synchronization points are the beginning of a statement (where if, while, etc. are expected), or the beginning of a declaration sequence (where static, public, etc. are expected). A synchronization point is specified by the symbol SYNC. When the generated parser reaches such a point, it skips all input until a symbol occurs that is expected at that point. The end-of-file symbol is always included amongst the synchronization symbols. This guarantees that synchronization terminates at least at the end of the source text.

The union of all synchronization sets we shall denote by AllSyncs. Error-handling can be further improved by specifying which terminals are weak in a certain context. A "weak terminal" is a symbol that is often mistyped or missing, such as the semicolon between statements, and is denoted by preceding it in the grammar with the keyword WEAK. When the parser expects a weak symbol, but does not find it in the input stream, it adjusts the input to the next symbol that is either a legal successor of the weak symbol or a member of AllSyncs (symbols expected at synchronization points are considered to be particularly "strong", so that it makes sense that they are never skipped).

Examples

   StatementSeq = Statement {WEAK ";" Statement} .
   Declaration = SYNC ("CONST" | "TYPE" | "VAR" | ...) .

As in the first of the above examples, iterations frequently start with a weak terminal, in situations that can be generally described by

   Sequence = FirstPart { WEAK ExpectedTerminal IteratedPart } LastPart .

Such weak separators are handled in a special way: if ExpectedTerminal is not recognized, source tokens are consumed until a token is found that is either a member of FIRST(IteratedPart) or of FIRST(LastPart) or of AllSyncs.

LL(1) requirements. Recursive descent parsing requires that the grammar of the parsed language satisfies the LL(1) property. This means that at any point in the grammar the parser must be able to decide on the basis of a single lookahead symbol which of several possible alternatives have to be selected. For example, the following production is not LL(1):

   Statement =   ident ':=' Expression
               | ident ['(' ExpressionList ')'] .

Both alternatives start with the symbol ident. When the parser comes to a statement and ident is the next input symbol, it cannot distinguish between these alternatives. However, the production can easily be transformed into

   Statement = ident ( ":=" Expression |  ['(' ExpressionList ')'] ) .

where all alternatives start with distinct symbols. There are LL(1) conflicts that are not as easy to detect as in the above example. It can be hard for programmers to detect these without a tool like Coco/R that checks the grammar automatically; Coco/R gives appropriate error messages that suggest how to correct any LL(1) conflicts.

3. User Guide

In order to generate an application (like a compiler) using Coco/R, a user has to

Coco/R processes an attributed grammar to generate the following files:

All generated classes belong to a namespace, whose name is the name of the start symbol of the attributed grammar. These classes are generated by inserting appropriate C# source into so-called frame files:

It is suggested that the frame and support files be copied to the same directory as the attributed grammar, and then the output files will be generated in that same directory. Alternatively, set the environment variable CRFRAMES to point to a directory where Parser.frame, Scanner.frame and Driver.frame are to be found, for example

        SET CRFRAMES=C:\COCO\FRAMES

This version of Coco/R is run in "command-line" mode, with a parameter specifying the name of the attribute grammar to be processed, for example:

   Coco MyGrammar.atg

Optional extra command line parameters may be added in a familiar way to control the generation and trace output, for example

   Coco -FX MyGrammar.atg
Of course, the executable COCO.EXE must be found along the system PATH. Permissible command line parameters include

  -A   trace automaton   -C   generate compiler driver
  -F   list first/follow sets   -G   print syntax graph
  -I   trace first sets   -J   display ANY and SYNC sets
  -M   merge error messages with trace output   -N   generate symbol names
  -P   display statistics   -S   list symbol table
  -T   test grammar only   -X   list cross reference table

By default, syntactic and semantic error messages and warnings are sent to the standard output stream (System.out) in a format that several editors can use to highlight the errors. If the -M option is exercised the error messages are merged with a source listing and sent to the trace output file listing.txt. It is possible to develop yet other ways of dealing with error output (see section 3.4).

3.1 The Parser interface

A parser generated by this version of Coco/R has the following interface:

   class Parser {
     static Token token;              // last recognized token
     static Token t;                  // lookahead token
     static void Parse();             // instigate parse
     static void Error(int n);        // report syntax error n
     static void SemError(int n);     // report semantic error n
     static void SemError(string s);  // report semantic error using string s
     static bool Successful();        // return true if no errors reported
     static string LexString();       // return exact text of parsed token
     static string LexName();         // return text of parsed token (uppercase?)
     static string LookAheadString(); // return exact text of lookahead token
     static string LookAheadName();   // return text of lookahead token (uppercase?)
   }

The parser is started by calling the method Parse(). This is done in the main method of the compiler (see Section 3.5). Parse and other internal private methods of the class repeatedly call the scanner to get tokens, and execute semantic actions at the appropriate places.

The field token holds the most recently parsed token. The field t is the lookahead token that has already been obtained from the scanner but not yet parsed. In a semantic action, token refers to the token immediately before the action commences, and t to the token immediately visible after the action.

The other methods represent extensions to the original implementation, and have been provided for convenience, particularly for programmers familiar with the C, Pascal and Modula versions of Coco/R. These methods provide hooks into the parser and error reporters which may be useful in developing semantic actions within the parser, as well as in developing methods in supporting classes like table handlers.

Calls to Error(n) are automatically inserted into the generated parser for the purposes of reporting syntax errors. The error numbers n are generated by Coco/R along with appropriate error messages that can be displayed and related to the position of the lookahead token. There may be situations where supporting classes also find it convenient to invoke this method. However, these are more likely to wish to invoke the SemError(n) method with an appropriately chosen error number, which will report a semantic error message related to the position of the most recently parsed token. In order to obtain meaningful messages, a programmer should arrange for these to be incorporated into a method such as the SemErr method found in the specimen driver classes in the distribution, which can be installed into the Errors class.

The latest versions of Coco/R allow for users to invoke a method SemError(string) giving an approriate message in the form of a string instead. Both methods of error reporting may be used and intermingled; it is thought that the string version will prove far simpler to coordinate.

A call to the method Successful() will return true if parsing has revealed no syntactic or semantic errors up to the time when the call is made.

The remaining methods return a string that represent a lexeme for one of the tokens scanned. LexString() and LexName() return identical strings unless the scanner has been built with the IGNORE CASE option selected; in this case LexName returns the string converted to upper case.

3.2 The Token interface

A token obtained from the scanner has the following structure:

   class Token {
     int kind;    // token kind
     int pos;     // token position in the source file (starting at 0)
     int line;    // token line (starting at 1)
     int col;     // token column (starting at 1)
     string str;  // token text (exactly as found in the source)
     string val;  // token text (converted to upper case if IGNORE CASE used)
   }

Knowledge of this structure may be of use in the construction of semantic actions and support methods. Values of kind are assigned by Coco/R dynamically. Values of pos, line and col may be of use in error reporting. However, in many applications the inner details of a token may not be needed, as the parser interface provides methods for dealing with error reporting.

3.3 The Scanner interface

A scanner generated by Coco/R has the following interface:

   class Scanner {
     static void Init(String file);                  // constructor
     static Token Scan();                            // scan for next token
   }

The actual scanner is provided by the method Scan() which returns a Token object every time it is called by the parser (these calls are generated by Coco/R; users should not call Scan directly). When the input is exhausted, it returns a special end-of-file token (with the token code 0) that is known to the parser.

The scanner has to be initialized by calling the method Init, passing it the name of the source file to be scanned.

The scanner makes use of a generated Buffer class with the interface

   class Buffer {
     static void Fill(string fileName)  // fills scanner internal buffer with contents of file
     static int Read();                 // reads the next source character
     static int Pos;                    // Property gets/sets the reading position
   }
but this is of little direct interest to the programmer.

3.4 The Errors interface

Coco/R generates an Errors class that can be used for reporting errors. This class has the following interface:

     delegate void ErrorProc(int n, int line, int col);
     delegate void StoreProc(int n, int line, int col, string s);
     delegate void SummaryProc(string s);

     class Errors {
     static int count;                // number of errors detected
     static ErrorProc SynErr;         // syntactic errors
     static ErrorProc SemErr;         // semantic errors
     static StoreProc StoreError;     // store error for future treatment
     static SummaryProc Summarize;    // summarize number of errors reported
     static string fileName;          // the source file name

     static void Exception(string s); // report fatal error and stop
   }

Coco generates and installs into SynErr a method that stores appropriate syntax error messages. This method is automatically called by the parser through the Parser.Error method, in contrast to SemErr which is explicitly called by the programmer through the Parser.SemError(integer> method. By default, SemErr contains a procedure that stores just the line and colum of a semantic error as well as the error number. Programmers who wish to generate more readable error messages have to install their own procedure in SemErr as in the following code fragment:

   Errors.SemErr = new ErrorProc(MySemErr);

   static void MySemErr(int n, int line, int col) {
     Errors.count++;
     string s;
     switch (n) {
       case 1: s = "Name not found"; break;
       case 2: s = "Name already declared"; break;
       ...
       default: s = "Semantic error " + n; break;
     }
     Errors.StoreError(n, line, col, s);                                        /* PDT 40 */
   }

Alternatively, and more simply, programmers can make use of the Parser.SemError method.

After parsing, the number of detected errors can be obtained from count.

The default StoreError simply prints the error messages to System.out. A user can install a more sophisticated system, as exemplified in Coco itself, where the $M pragma or -M compile line option installs a method that merges the messages into a highlighted source listing as part of the trace output file listing.txt.

The method Exception(s) can be called by the programmer when a serious error occurs which makes any continuation impossible. After printing the parameter s the compilation is aborted.

3.5 The driver program

The main class must be provided by the programmer. It has to initialize the scanner, possibly initialize Errors.SemErr, and call the parser. In its simplest form it looks as straightforward as this:

   public class MainProgram {

     public static void Main(string[] arg) {
       Scanner.Init(arg[0]);
       Parser.Parse();
       Console.WriteLine("{0} errors detected", Errors.count);
     }
   }

In this version of the system provision has been made for the generation of a driver program if the $C pragma or -C command line option is exercised. In its simplest form this driver class is generated from a frame file Driver.frame. The generic driver might suffice for simple applications, but is likely to require some modification, and in particular to

The driver generation system looks for a file Grammar.frame (where Grammar matches the goal symbol of the parser) to use in preference to the generic file Driver.frame. It is suggested that if users wish to make use of this feature they should use Driver.frame as a guide towards developing a frame which can be used for their specific application.

3.6 Grammar tests

Coco/R performs several tests to check that the grammar, besides being syntactically correct, is also semantically well-formed. If one of the following error messages is produced, no compiler parts are generated.

No production for X
The nonterminal X has been used but there is no production for it.
X cannot be reached
There is a production for nonterminal X, but X cannot be derived from the start symbol.
X cannot be derived to terminals
For example, if there is a production X = "(" X ")" .
X --> Y, Y --> X
X and Y are nonterminals with circular derivations.
Tokens X and Y cannot be distinguished
The terminal symbols X and Y are declared to have the same structure, e.g.,
    integer = digit {digit} .
    real = digit {digit} ["." {digit}] .
In this example, a digit string could be recognized either as an integer or as a real.

The following messages are warnings. They may indicate an error but they may also describe desired effects. The generated compiler parts are valid, although if an LL(1) error is reported for a construct X one must be aware that the generated parser will choose the first of several possible alternatives for X.

X deletable
X can be derived to the empty string, e.g., X = {Y} .
LL(1) error in X: Y is the start of more than one alternative
Several alternatives in the production of X start with the terminal Y, e.g.,
  Statement = ident ":=" Expression | ident [ActualParameters] .
LL(1) error in X: Y is the start and successor of a deletable structure
Deletable structures are usually denoted by [...] and {...}, e.g.,
    qualident = [ident "."] ident .
    Statement = "IF" Expression "THEN" Statement ["ELSE" Statement] .
The ELSE at the start of the else-part may also be a successor of a statement. This LL(1) conflict is known under the name "dangling else".

3.7 Trace output

Coco/R can produce various output that can help in spotting LL(1) errors or in understanding the generated parts. Trace output can be switched on with command line options, or with the pragma

   '$' {digit | letter}

at the beginning of the attributed grammar (for example $FS). The output goes to the file listing.txt. The effect of the switches is as follows:

   0 or A:
   1 or F:
   2 or G:
   3 or I:
   4 or J:
   5 or T:
   6 or S:
   7 or X:
   8 or P:
   9 or M:
prints the states of the scanner automaton
prints the First and Follow sets of all nonterminals
prints the syntax graph of the productions
traces the computation of the First sets
prints the sets associated with ANY and SYNC
tests the grammar only; generate no scanner or parser
prints the symbol table (terminals, nonterminals, pragmas)
prints a cross reference list of all syntax symbols
prints statistics about the Coco/R run
prints error messages merged with a source listing

4. Hints for Advanced Users of Coco/R

Providing a hand-written scanner

Scanning is a time-consuming task. The scanner generated by Coco/R is optimized, but it is implemented as a deterministic finite automaton, which introduces some overhead. A manual implementation of the scanner can be slightly more efficient. For time-critical applications a programmer may wish to generate a parser but provide a hand-written scanner. This can be done by declaring all terminal symbols (including literals) as tokens, but without defining their structures by EBNF expressions, e.g.,

   TOKENS
     ident
     number
     "if"
     ...

If a named token is declared without structure, no scanner is generated. Tokens are assigned numbers in the order of their declaration; i.e., the first token gets the number 1, the second the number 2, etc. The number 0 is reserved for the end-of-file symbol. The hand-written scanner has to return token numbers according to this convention. It must have the interface described in Section 3.3.

Tailoring the generated compiler parts to specific needs

Using a generator usually increases productivity but decreases flexibility. There are always special cases that can be handled more efficiently in a hand-written implementation. A good tool handles routine matters in a standard way, but gives a user the chance to change the standard policy if this seems appropriate. Coco/R generates the Scanner, Parser and ErrorStream classes from source texts (so-called frames) stored as the files Scanner.frame and Parser.frame. It does so by inserting grammar-specific parts into these frames at the places marked with distinctive -->keys . A user may edit the frames (within reason!) and thereby change the internally used algorithms. For example, a user can implement a different buffering scheme for input characters.

Accessing the lookahead token

The generated parser offers the lookahead token t that can be used to access the next token in the input stream (i.e. the one that has already been scanned but not yet parsed). The following example shows one way of computing the start and the end position of a sequence of tokens enclosed in curly brackets:

   "{"     (. start = t.pos; .)   // start of the first ANY
   {ANY}   (. end = token.pos; .) // start of the last ANY
   "}"

Controlling the Parser by semantic information

Ideally, syntax analysis should be "context-free", that is, independent of semantic analysis (symbol table handling, type checking, etc.). However, many languages have constructs that can only be distinguished if one also considers semantic information (e.g., the type) associated with various identifiers. In the language Oberon, for example, a Designator is defined as

   Designator = Qualident
                {"." ident | "^" | "[" ExprList "]" | "(" Qualident ")" } .

where in the context of an Expression, code like x(T) means a type guard (i.e., x is asserted to be of type T). A Designator may also be used in a Statement

   Statement = ... | Designator ["(" ExprList ")"] | ...  .

but in this context x(T) can be interpreted as a regular procedure name x followed by a parameter T. The two interpretations of x(T) can only be distinguished by looking at the type of x. If x is a regular procedure then the opening bracket is the start of a parameter list, otherwise the bracket belongs to a type guard.

Coco/R allows control of the parser from within semantic actions to a certain degree. A Designator, for example, can be processed in the following way:

   Designator <^Item x> =
     Qualident <^x>
     {                        (. if (...x is a procedure...) return; .)
       "(" Qualident <^y> ")" (. ... process type guard ... .)
     | ...
     } .

When an opening bracket is seen after a Qualident, the alternative starting with an opening bracket is selected. The first semantic action of this alternative checks for the type of x. If x is a regular procedure, the parser returns from the production (and presumably continues in the Statement production whence it was invoked).

5. A Sample Compiler

This section shows how to use Coco/R for building a compiler for a tiny programming language called "Taste". Taste bears some resemblance to Modula-2 or Oberon. It has variables of type INTEGER and BOOLEAN, and regular procedures without parameters. It allows assignments, procedure calls, and IF- and WHILE-statements. Integers may be read from an input stream and written to an output stream, one to a line. Expressions may incorporate arithmetic operators (+, -,* , /) and relational operators (=, <, >).

Here is an example of a Taste program, which can be found in the file Test.TAS:

  MODULE Example;
    VAR n: INTEGER;

    PROCEDURE SumUp; (*build the sum of all integers from 1 to n*)
      VAR sum: INTEGER;
    BEGIN
      sum:=0;
      WHILE n>0 DO sum:=sum+n; n:=n-1 END;
      WRITE sum
    END SumUp;

  BEGIN
    READ n;
    WHILE n>0 DO SumUp; READ n END
  END Example.

The full grammar of Taste can be found here. Of course Taste is too restrictive to be used as a real programming language. Its purpose is just to give one a taste of how to write a compiler with Coco/R.

Although the grammar of the Taste language is identical with the Linz release, some of the internal details have been altered in this release.

Execution

The Taste compiler is a compile-and-go compiler, which means that it reads a source program and translates it into a target program which is executed (i.e. interpreted) immediately after the compilation. Once it has been built, the Taste system can be run in "command line mode", for example

   Taste Test.TAS

After a succesful compilation the program is interpreted, and the user is prompted for the source of data and the sink for the output.

The Target Code

We define an abstract stack machine as the target for the translation of Taste programs. The compiler translates a source program into instructions for this machine, which will later be interpreted. The machine uses the following data structures (the code array is filled by the compiler):

   char code[];   code memory
   int stack[];   stack with frames for local variables
   int top;       stack pointer (points to next free stack element)
   int pc;        program counter
   int base;      base address of current frame

The instructions have variable length and are described by the following table (the initial values of the registers are: base=0; top=3;):

LOAD l,a  Load value          Push(stack[Frame(l)+a]);
LIT i     Load literal        Push(i);
STO l,a   Store               stack[Frame(l)+a]=Pop();
ADD       Add                 j=Pop(); i=Pop(); Push(i+j);
SUB       Subtract            j=Pop(); i=Pop(); Push(i-j);
DIV       Divide              j=Pop(); i=Pop(); Push(i/j);
MUL       Multiply            j=Pop(); i=Pop(); Push(i*j);
EQU       Compare if equal    j=Pop(); i=Pop(); if (i==j) Push(1); else Push(0);
LSS       Compare if less     j=Pop(); i=Pop(); if (i<j) Push(1); else Push(0);
GTR       Compare if greater  j=Pop(); i=Pop(); if (i>j) Push(1); else Push(0);
CALL l,a  Call procedure      slink=Frame(l); Push(slink); Push(base); Push(pc+2);
                              pc=a; base=top-3;
RET       Return from proc.   top=base+3; pc=Pop(); base=Pop(); dummy=Pop();
RES i     Reserve frame space top=top+i;
JMP a     Jump                pc=a;
FJMP a    False jump          if (Pop()==1) pc=adr;
HALT      End of the program  Halt;
NEG       Negation            Push(-Pop());
READ l,a  Read integer        stack[Frame(l)+a]=Read();
WRITE     Write integer       Writeln(Pop());

The function Frame(l) returns the base address of the stack frame which is (statically) l levels up in the stack. l=0 means the current frame, l=1 means the statically surrounding frame, etc. The main program is also considered as a procedure with a stack frame. Figure 2 shows the format of a stack frame.



Fig.2 Format of a stack frame

For example, the source code instruction

   i := i + 1

is translated into the code

   LOAD 0,3  load value of i (at address 3 in current frame)
   LIT 1     load constant 1
   ADD
   STO 0,3   store result to i

Classes

Figure 3 shows the primary classes of the compiler (an arrow from A to B means that class A is used by class B).



Fig.3 classes of the Taste compiler

The classes in the rectangle are generated by Coco/R, the others are written by hand. The classes have the following responsibilities:

Taste Driver class. This obtains the name of the source file, initializes the scanner and calls the parser and the interpreter. This file also includes the SemErr method for reporting meaningful semantic error messages.
Parser Recursive descent parser generated from the attributed grammar Taste.ATG. The parser contains the semantic actions from the attributed grammar. They are called and executed during parsing and do the actual compilation work.
Scanner Scanner generated from Taste.ATG.
TL Symbol table with methods to handle scopes and to store and retrieve semantic information associated with identifiers.
TC Code generator with methods for emitting instructions and storing these in the simulated memory for later interpretation. This class also contains the interpreter method and its related data structures.

For instructions on where to download all these components together see Section 7.

Summary - how to build your own compiler

If you want to build a similar compiler, but for a source language of your own, proceed as follows:

  1. Describe the structure of the lexical tokens (identifiers, numbers, etc) as exemplified in the CHARACTERS and TOKENS sections of the file Taste.ATG.
  2. Describe the phrase structure and the translation process of the compiler in the form of an attributed grammar, as exemplified in the PRODUCTIONS section of the file Taste.ATG.
  3. Feed the complete ATG file to Coco/R to get the scanner and parser of the compiler (in the form of C# classes named Scanner.cs and Parser.cs).
  4. Write a driver program with a main method. This has to initialize the scanner and call the parser. Preferably also provide and install an application-specific SemErr method that will handle numbered semantic error messages in a meaningful way. These classes are exemplified in the files Driver.frame and Taste.cs
  5. Write any other classes needed by the compilation process. These classes typically contain methods that are called from the semantic actions of the attributed grammar. For example, the Taste compiler has a symbol table class (TL.cs) and a code generator (TC.cs).
  6. Compile the generated parts and the hand-written parts to get a running compiler.

6. Useful applications of Coco/R

Coco/R has been used for various tasks (not only for writing compilers). The following list should give you some ideas about useful applications:

7. Sources of Coco/R

We emphasize again that this version of Coco/R for C# is different from the official Linz release.

The sources of this version of Coco/R for C# are as follows:

Coco.ATG The attributed grammar. Describes the processing of a compiler description.
Coco.cs Driver class generated from Coco.frame. Initializes the scanner and calls the parser. This file also contains the custom semantic error message handler SemErr.
Scanner.cs Scanner generated from Coco.ATG.
Parser.cs Parser generated from Coco.ATG.
Tab.cs Symbol table of Coco. Stores information about terminals and nonterminals.
DFA.cs This class builds the scanner automaton and generates the scanner source file.
ParserGen.cs This class builds a syntax graph of the grammar rules and generates the parser source file from it.
Scanner.frame This is the frame file from which the scanner is generated. Coco/R inserts compiler-specific parts into this frame file.
Parser.frame This is the frame file from which the parser is generated. Coco/R inserts compiler-specific parts into this frame file.
Coco.frame This is the frame file from which the driver class is generated. Coco/R inserts compiler-specific parts into this frame file.

If you would prefer to download a ZIP file with the latest .cs source files and the .frame and .ATG files for Coco/R and Taste, you can get it from here in Win95 ZIP format, or from here in Unix ZIP format, or from here in Unix .tgz format, or from here in Macintosh ZIP format (these formats differ only in the way in which line marks appear in the source files).

The ZIP files must be decompressed with a tool that will retain long file names and directory structures.

8. Portability issues

Coco/R has been developed so that it can accept input and frame files in any of the three common text file formats (Unix, MS-DOS, Macintosh); it generates output in the format of the host system.

9. Literature

CocoDataStructures.pdf provides a short description of the major data structures of the system.

Coco.Report.ps.Z
The original report on Coco/R, including an implementation description.

Coco.Paper.ps.Z
H.Mössenböck: A Generator for Production Quality Compilers. Lecture Notes in Computer Science 477, Springer-Verlag, 1990.

Compiler book using Coco/R
P.D.Terry: Compilers and Compiler Generators - An Introduction With C++. International Thomson Computer Press, 1997
(This book uses the C++ version of Coco/R, and also supplies case studies for the Modula-2 and Pascal versions.)

The book is, unfortunately, now out of print. International Thomson decided to drop their Computer Press division, and this title was one of those to be dropped after copies ran out. The copyright was returned to the author, who has now created an online version at http://www.scifac.ru.ac.za/compilers. This site also contains compressed downloadable files of the online edition, and files from which a paper copy may be produced on an HP Laserjet compatible printer.