
Package com.goldencode.p2j.preproc

Package com.goldencode.p2j.preproc Description

Implements a full set of classes including lexers, parsers, utility classes and command line driver code that constitute a stand alone Progress 4GL compatible language preprocessor.

Author
Nick Saxon
Greg Shah
Date
September 10, 2007
Access Control
CONFIDENTIAL

Contents

Introduction
Lexers, Parsers and ANTLR
Progress Preprocessor ANTLR grammars
Preprocessor Tasks
Package Class Hierarchy
Conceptual Description
Special Character Support
Operator and Function Support
What Is Not Implemented
Useful References
Preprocessor Entry Point
Options
ClearStream Class
TextLexer
TextParser
StatLexer
StatParser
BracesLexer
BracesParser
Command Line Driver
Developer Technical Notebook
Known Issues
Tab Expansion Issue

Introduction

Lexers, Parsers and ANTLR

A good introduction to the field of lexical analysis and parsing can be found here. The lexers and parsers of the Progress preprocessor are built with ANTLR, a parser generator.

Progress Preprocessor ANTLR grammars

A preprocessor for a programming language deals with quite different tasks than the language itself, and its view of the input is different too. In fact, the two views have only a few things in common. Therefore, a separate grammar describing the preprocessor's view of a Progress source file makes good sense.

Further analysis shows that even within the preprocessor there are multiple views of the input, which are naturally described by more than one grammar.

There is a main grammar that starts preprocessing. It works as a filter and transforms the input on the fly, without creating a syntax tree. All regular Progress language constructs as well as conditional statements are processed there.

As soon as this grammar recognizes any other preprocessor statement, it activates a second grammar specific to preprocessor statements. That grammar is designed to handle all line-oriented preprocessor statements.

Finally, a third grammar exclusively processes preprocessor references of all kinds, which are written as text enclosed in braces.

Preprocessor Tasks

The preprocessor itself is designed to handle a single external procedure (and an arbitrary number of included files).  To feasibly preprocess the thousands of 4GL source files in a given project, a Preprocessor Harness will be created.

The Preprocessor is built from grammars which ANTLR uses to generate a set of lexers and parsers. These grammars are implemented with all of the preprocessor logic embedded in Java based actions attached to specific rules as necessary. This use of ANTLR is appropriate for a filter, and a preprocessor is simple enough to be implemented as one.


The Preprocessor conceptually does the following:
  1. Uses an ANTLR generated set of 3 Lexers (to tokenize) and 3 Parsers (to syntactically check and structure the input).
  2. Implements a filter approach where the input stream (a text file) is preprocessed and written to an output stream (stdout or an output file).
  3. At a high level, it is byte for byte compatible with the following Progress Preprocessor features:
    • special character combinations are expanded wherever they appear in the source files
      • each one converts into a replacement symbol
      • these are also referred to as alternative codings
      • see this table for a list of the supported chars
    • comments
      • nested comments are supported
      • comments are passed through to the output unchanged
      • comments can be placed inside preprocessor directives without causing a failure
    • string literals
      • preprocessor references (using curly braces) are expanded
      • tilde prefixed escape sequences are left behind
      • the resulting string is passed through to the output
    • preprocessor variables are created and destroyed using:
      • &GLOBAL-DEFINE name replacement_string
        • can be abbreviated down to &GLOB
        • handles line continuations using tildes
        • creates a variable in the global preprocessor namespace
      • &SCOPED-DEFINE name replacement_string
        • can be abbreviated down to &SCOP
        • handles line continuations using tildes
        • creates a variable in the scoped preprocessor namespace
      • &UNDEFINE name
      • include file or run statement arguments are added to the namespace but are only accessible from the current scope (as in Progress)
      • a scoped symbol dictionary is maintained with these names
    • argument references are expanded
      • named
        • {&name}
        • {&*}
      • positional
        • {0}
        • {n}
        • {*}
      • null
        • {}
    • special preprocessor symbols are replaced
      • these are essentially like global preprocessor variables that always exist
      • the following are supported:
        • BATCH-MODE (boolean value can be controlled via hints)
        • FILE-NAME
        • LINE-NUMBER
        • OPSYS (string value can be controlled via hints)
        • SEQUENCE
        • WINDOW-SYSTEM (string value can be controlled via hints)
      • SpeedScript / WebSpeed related "built-in" variables are supported (but may be redefined which is different from how real built-ins work):
        • DISPLAY
        • END
        • OUT
        • OUT-FMT
        • OUT-LONG
        • OUT-FILE
        • OUT-STREAM
        • WEBSTREAM
        • AMP
        • LT
        • GT
        • QUOT
    • preprocessor name references are expanded
      • the form they use is {&name}
      • The only difference between a named argument reference and a preprocessor name reference is the origin of the name. The origin of a name is returned by the DEFINED(name) preprocessor builtin function. Apart from the origin, these two kinds of references work identically.
    • deferred expansion is fully implemented using ~{&name} constructs
    • include file references
      • are expanded into the contents of the include file (if it exists)
      • the contents are preprocessed recursively, including any includes or expansions it may have and so on
      • a space character is appended to the contents of the include file
      • the following forms are recognized:
        • {filename}
        • {filename arg1 arg2}
        • {filename &arg1=val1 &arg2=val2}
        • any other form that cannot be taken as a syntactically correct argument reference
    • &MESSAGE and other preprocessor directives are processed (including abbreviation support)
    • tabs are expanded into spaces on output
    • all name/variable/argument references are implemented using the dictionary, while honoring the global and non-global scope of each reference.
    • conditional inclusion directives are supported
      • &IF expression &THEN..block..[&ELSEIF expression &THEN..block..&ELSE..block..]&ENDIF
      • conditionals can be nested and can cross newline boundaries
      • expressions using the listed operators and functions are fully supported
      • decimal, integer, character, date and logical data types are all supported
      • all forms of constants (literals) are supported, including unknown value
      • the only known feature of the preprocessor that is not implemented is this list of unsupported functions
      • all valid preprocessor expression types are supported
        • numeric (integer or decimal) which is always cast to an integer on return and compared to 0 (0 is false, non-zero is true)
        • string which is not empty (true) or empty (false)
        • boolean which is true or false
      • based on the result of the expression evaluation, the blocks of code are either passed through to the output or not
    • line continuations are processed so that the preprocessed output appears to consist of single lines
      • documented combinations like '~', newline
      • undocumented combinations like '\', newline
    • all documented escape sequences with the tilde '~' character as well as all undocumented escape sequences with the backslash '\' character are processed.
  4. Everything else is unrecognized and is assumed to be the Progress 4GL language.
  5. This unrecognized input is supposed to be unchanged and is copied directly to output.
  6. The preprocessor optionally produces a hints file. The hints file is an XML file that records information about included files, their arguments, argument usage and reference substitutions.
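
As an illustration of the line continuation handling listed above, the following minimal Java sketch joins physical lines that end in a tilde or a backslash. The class and method names are hypothetical and are not part of the preproc package. The sketch assumes that both the continuation character and the newline are dropped, and it ignores escaped tildes and string context, which the real lexers have to honor.

```java
// Hypothetical helper for illustration only; not part of com.goldencode.p2j.preproc.
final class LineContinuationSketch {

    /** Joins lines that end with '~' or '\' immediately followed by a newline. */
    static String joinContinuations(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            boolean marker = (c == '~' || c == '\\');
            if (marker && text.startsWith("\n", i + 1)) {
                i += 2;                 // drop the marker and the LF
            } else if (marker && text.startsWith("\r\n", i + 1)) {
                i += 3;                 // drop the marker and the CR LF pair
            } else {
                out.append(c);          // ordinary character, copied unchanged
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(joinContinuations("&GLOBAL-DEFINE greeting hello ~\nworld"));
    }
}
```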
Special Character support:

Input    Output
;&       @
;<       [
;>       ]
;*       ^
;'       '
;(       {
;%       |
;)       }
;?       ~
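
The table above amounts to a fixed two-character lookup. A minimal Java sketch of such an expansion is shown below; the class name and method are hypothetical, the map simply mirrors the table as printed here, and the real preprocessor performs this step inside its stream and lexer code rather than on a whole string.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of com.goldencode.p2j.preproc.
final class AlternativeCodingSketch {

    // Two-character input sequences and their replacements, copied from the table.
    private static final Map<String, String> CODINGS = new HashMap<>();
    static {
        CODINGS.put(";&", "@");
        CODINGS.put(";<", "[");
        CODINGS.put(";>", "]");
        CODINGS.put(";*", "^");
        CODINGS.put(";'", "'");
        CODINGS.put(";(", "{");
        CODINGS.put(";%", "|");
        CODINGS.put(";)", "}");
        CODINGS.put(";?", "~");
    }

    /** Replaces every known two-character coding; all other text is copied unchanged. */
    static String expand(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            String replacement =
                (i + 2 <= text.length()) ? CODINGS.get(text.substring(i, i + 2)) : null;
            if (replacement != null) {
                out.append(replacement);
                i += 2;
            } else {
                out.append(text.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("x = a;<1;>."));
    }
}
```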

Supported operators:

Operator/Function    Type
+                    operator
-                    operator
*                    operator
/                    operator
= or EQ              operator
<> or NE             operator
< or LT              operator
> or GT              operator
<= or LE             operator
>= or GE             operator
AND                  operator
OR                   operator
NOT                  operator
BEGINS               operator
MATCHES              operator
MODULO               operator
( )                  operator
DEFINED()            Preprocessor symbol dictionary function

Supported functions:
The following functions are NOT IMPLEMENTED. These are the only known features of the Preprocessor that are not supported.

These tasks are distributed between three grammars text.g, stat.g and braces.g and a number of manually written Java classes.

Useful References to the Progress Language Reference Manual can be found below:

Reference Manual Page #    Contents
9                          argument reference
12                         include file reference
17                         preprocessor name reference
22                         &GLOBAL-DEFINE directive
24                         &IF-&THEN-&ELSEIF-&ENDIF directive
26                         expression operators
27                         preprocessor functions
28                         &MESSAGE directive
29                         &SCOPED-DEFINE directive
30                         &UNDEFINE directive
393                        DEFINED() function

Package Class Hierarchy

The following is a high level UML class diagram of the preproc package:

[Figure: preproc Class Diagram]

For a more detailed diagram, use this link.

Preprocessor Entry Point

The main entry point into the preprocessor is its only constructor. The preprocessor is invoked by creating an instance of the class and does its work while in the constructor.

The I/O subsystem of the preprocessor builds on the MultiStream and the ClearStream classes. The MultiStream class provides a generic switchable input stream implementation. This functionality is required for the transparent include files support. The ClearStream class functionality is discussed later.

To keep references to multiple objects the preprocessor needs during its lifetime, an instance of the Environment class is created.

The preprocessor constructor takes a FileScope instance as a parameter. This object represents the top level source file to be preprocessed. The constructor also creates the dictionary with the global scope and associates the given FileScope with it.

An instance of the BuiltinVariable class is created. This automatically populates the global scope of the dictionary with all known builtin variables.

Finally, the parsers and lexers are created and the TextParser is called to start the preprocessing.

See Preprocessor class description for more details.

Options

Preprocessor options are represented by the Options class. An instance of Options has to be created and initialized before the preprocessor is called. See the Options class description for details.

ClearStream

Substitutions for preprocessor references of all kinds must be rescanned, which makes it impossible to run ANTLR generated lexers directly on the input files. ANTLR generated lexers simply cannot rescan dynamic input. This is where the ClearStream class helps. It presents a static input stream which needs no rescanning. This can only be done by performing transparent substitutions for the references inside the ClearStream and producing another stream in which there are no references.

This approach makes it possible to apply the ANTLR tool to the preprocessor. A separate grammar, braces.g, is created to process the references from inside the ClearStream class. It is applied recursively, once for each level of nested braces in the input.

On output from the ClearStream, a regular lexer and parser defined in text.g do their work as if there were no references at all.

This is a rough description of the process. The ClearStream class does more than eliminate references. For full details, see the ClearStream class description.

TextLexer

TextLexer tokenizes the input for the purpose of preprocessing it. This requirement is much weaker than the full tokenization of the source for compilation etc.

This lexer recognizes:

The ability to handle arbitrary 4GL input makes it a filter. Tabs detected on input are expanded using a variable number of spaces to advance the column number to the next tab stop.

TextLexer maintains state information of two kinds: whether it is currently inside a comment and whether it is currently inside a string.

TextLexer communicates this state information to ClearStream every time a change is detected. In general, sharing state information between the lexer and the ClearStream code is complicated by the lookahead processing that the lexer is based on. The lookahead buffers (k=2, meaning two buffers in the case of TextLexer) are filled with characters from the input stream (provided by ClearStream in this case) before the recognition is done. If the first character changes the state of the lexer, and that change has to be communicated to the ClearStream, the signal arrives only after the second character has been read into lookahead buffer 2. This is a generic problem when using lookahead buffers, not an implementation issue.

To make sure the shared state information works properly despite this generic problem, every use of this information has to be verified for correctness. The following paragraphs do that for all uses of the shared state information in the TextLexer and ClearStream.

4GL comments generate two signals: entering a comment and leaving a comment. TextLexer sends these signals using setInComment(true) and setInComment(false) calls. Both opening and closing sequences are exactly two characters: "/*" and "*/". The rule for comments does the recognition using the lookahead buffers 1 and 2. When the positive recognition is made, the signal is sent to the ClearStream. In both cases, this happens just in time - no character following comment opening/closing sequence has been read yet.
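
Since nested comments are supported, recognizing the end of a comment means tracking the nesting depth of the two-character "/*" and "*/" sequences. The sketch below illustrates only that depth counting; the class and method names are hypothetical, and it does not model the lookahead buffers or the setInComment() signalling described above.

```java
// Hypothetical helper for illustration only; not part of com.goldencode.p2j.preproc.
final class NestedCommentSketch {

    /**
     * Returns the index just past the "*&#47;" that closes the comment opening at
     * 'start', or -1 if the comment is not terminated.
     */
    static int endOfComment(String text, int start) {
        if (!text.startsWith("/*", start)) {
            throw new IllegalArgumentException("no comment opens at index " + start);
        }
        int depth = 0;
        int i = start;
        while (i + 1 < text.length()) {
            if (text.startsWith("/*", i)) {
                depth++;                    // entering a (possibly nested) comment
                i += 2;
            } else if (text.startsWith("*/", i)) {
                depth--;                    // leaving one nesting level
                i += 2;
                if (depth == 0) {
                    return i;               // outermost comment is closed
                }
            } else {
                i++;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String text = "/* a /* b */ c */x";
        System.out.println(text.substring(endOfComment(text, 0)));
    }
}
```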

4GL strings also generate two signals: entering a string and leaving a string. TextLexer sends these signals using setInString(true) and setInString(false). With strings the situation is different. The opening character is either an apostrophe or a double quote, but always a single character. The recognition is made based on lookahead buffer 1 only. So the "entering a string" signal is sent in time: no character following the opening character has been read yet.

String termination needs both lookahead buffers, however. This is because strings allow a doubled opening character as part of their contents. Lookahead buffer 2 is used to check whether the character is a doubled opening character (meaning no string closure) or the closing character. Thus, when the recognition is made, both lookahead buffers are used, but lookahead buffer 2 contains a character which does not belong to the string. When the "leaving a string" signal is sent, it reaches the ClearStream exactly one character late. This constitutes a late signal problem. The problem, however, can be solved through some additional cooperation between the TextLexer and the ClearStream.

The way this shared state signal is used in the ClearStream limits the scope of the problem to just one case: if a Progress 4GL escape sequence which encodes either CR (carriage return) or NL (new line) character immediately follows the string closing character. This is the only troublesome combination, because the ClearStream normally inserts another CR character in front of every converted CR and NL while inside a string. In this case, a CR is inserted due to the late signal problem. But the TextLexer can easily detect this failure in the mSTRING method and fix it right away by consuming this extra character.
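
The doubled quote rule that forces the second lookahead character can be sketched as follows. This is an illustrative Java fragment with hypothetical names, not the generated mSTRING method; it ignores tilde and backslash escape sequences, and it only shows why the character after a quote must be inspected before the string can be declared closed.

```java
// Hypothetical helper for illustration only; not part of com.goldencode.p2j.preproc.
final class StringScanSketch {

    /**
     * Returns the index just past the closing quote of the literal that opens
     * at 'start', or -1 if the literal is not terminated.
     */
    static int endOfString(String text, int start) {
        char quote = text.charAt(start);
        if (quote != '"' && quote != '\'') {
            throw new IllegalArgumentException("no string opens at index " + start);
        }
        int i = start + 1;
        while (i < text.length()) {
            if (text.charAt(i) == quote) {
                // One character of extra lookahead: a doubled quote is content.
                boolean doubled = i + 1 < text.length() && text.charAt(i + 1) == quote;
                if (doubled) {
                    i += 2;
                    continue;
                }
                return i + 1;               // a single quote closes the string
            }
            i++;
        }
        return -1;
    }

    public static void main(String[] args) {
        String text = "\"say \"\"hi\"\"\" rest";
        System.out.println(text.substring(0, endOfString(text, 0)));
    }
}
```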

See TextLexer, ClearStream, mSTRING for details.

TextParser

TextParser is the top level recognizer of the preprocessor input. Its primary function is to process the conditional blocks of input interpreting only those blocks which have to be interpreted and skipping those which do not.

It starts looking at the tokens on input assuming they belong to a block which has to be interpreted. Occasionally it may see &IF - &THEN - &ELSEIF - &THEN - &ELSE - &ENDIF constructs. Those require their expressions to be evaluated. 

Expression evaluation is done in three steps:
  1. The parser simply gathers all relevant tokens into a string and calls the Preprocessor.evaluate() method.
  2. The latter instantiates the objects needed to interface with the real expression evaluator, registers callbacks that resolve any function names and variables referenced in the expression, and calls the ProgressExpressionEvaluator.evaluateExpression() method.
  3. ProgressExpressionEvaluator.evaluateExpression() does all the evaluation work.
The result of the expression evaluation affects the way the following blocks of tokens are processed: they are either interpreted or skipped.
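
How an evaluation result decides between interpreting and skipping a block follows the rules given earlier: numeric results are compared to 0, strings are true when non-empty, and booleans are used as they are. A minimal sketch of that mapping is shown below. The class and method are hypothetical, the numeric conversion uses Java truncation as an assumption (the source only says the value is cast to an integer), and the unknown value is not covered.

```java
// Hypothetical helper for illustration only; not part of com.goldencode.p2j.preproc.
final class ConditionResultSketch {

    /** Maps an expression result to the interpret (true) or skip (false) decision. */
    static boolean isTrue(Object result) {
        if (result instanceof Boolean) {
            return (Boolean) result;
        }
        if (result instanceof Number) {
            // Assumption: conversion to integer truncates; 0 is false, non-zero is true.
            return ((Number) result).longValue() != 0;
        }
        if (result instanceof String) {
            return !((String) result).isEmpty();
        }
        throw new IllegalArgumentException("unsupported result type: " + result);
    }

    public static void main(String[] args) {
        System.out.println(isTrue(3) + " " + isTrue("") + " " + isTrue(Boolean.TRUE));
    }
}
```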

Skipped blocks are ignored completely leaving no trace on the preprocessor output. Interpreted blocks are made of either 4GL code tokens or the preprocessor statements. TextParser copies the 4GL tokens to the output and activates another grammar, stat.g, to conclude parsing and interpretation of the preprocessor statements.

For details, see TextParser, CallbackResolver, Keyword, ProgressExpressionEvaluator.

StatLexer

StatLexer tokenizes the portion of the input that makes up a preprocessor statement, extending through the end of the line. This makes it a very simple lexer.

This lexer recognizes:

Tabs detected on input are expanded using a variable number of spaces to advance the column number to the next tab stop.

See StatLexer for details.

StatParser

StatParser is the recognizer and the executor of the preprocessor statements. This parser recognizes and properly handles the following statements:

The TextParser calls the StatParser upon recognition of one of the listed preprocessor statements. The StatParser ends as soon as it sees and consumes the newline token on input. The TextParser then resumes.

The &GLOBAL-DEFINE and &SCOPED-DEFINE statements create a new preprocessor variable and put it in the appropriate scope. For a discussion of the scope implementation, see the ScopedSymbolDictionary class description. The defined variable may have an empty string as a value.

The &UNDEFINE statement removes the nearest definition of the variable.
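
The effect of these three statements on a scoped dictionary can be sketched with a stack of maps. This is not the ScopedSymbolDictionary API; every name below is hypothetical. The sketch assumes that lookups search from the innermost scope outward and that names are matched without regard to case.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration only; not the ScopedSymbolDictionary class.
final class ScopeSketch {

    // Head of the deque is the innermost (current) scope; the tail is the global scope.
    private final Deque<Map<String, String>> scopes = new ArrayDeque<>();

    ScopeSketch() {
        scopes.push(new HashMap<>());       // global scope always exists
    }

    void enterScope() { scopes.push(new HashMap<>()); }

    void leaveScope() {
        if (scopes.size() > 1) {
            scopes.pop();                   // never drop the global scope
        }
    }

    /** Models &GLOBAL-DEFINE: the name goes into the global scope. */
    void defineGlobal(String name, String value) { scopes.peekLast().put(key(name), value); }

    /** Models &SCOPED-DEFINE: the name goes into the current scope. */
    void defineScoped(String name, String value) { scopes.peekFirst().put(key(name), value); }

    /** Models &UNDEFINE: removes only the nearest definition, if there is one. */
    boolean undefine(String name) {
        for (Map<String, String> scope : scopes) {
            if (scope.containsKey(key(name))) {
                scope.remove(key(name));
                return true;
            }
        }
        return false;
    }

    /** Returns the nearest definition, or null if the name is not defined. */
    String lookup(String name) {
        for (Map<String, String> scope : scopes) {
            if (scope.containsKey(key(name))) {
                return scope.get(key(name));
            }
        }
        return null;
    }

    private static String key(String name) { return name.toLowerCase(Locale.ROOT); }

    public static void main(String[] args) {
        ScopeSketch dict = new ScopeSketch();
        dict.defineGlobal("mode", "global");
        dict.enterScope();
        dict.defineScoped("mode", "scoped");
        System.out.println(dict.lookup("MODE"));
        dict.undefine("mode");
        System.out.println(dict.lookup("MODE"));
    }
}
```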

The &MESSAGE statement outputs the given text to the error stream.

The &ANALYZE-SUSPEND and &ANALYZE-RESUME statements are simply ignored.

The processed statement is then deleted from the input. Comments within the statement, if any, are copied to the output stream.

See StatParser for details.

BracesLexer

BracesLexer is special. It is called (indirectly) from the ClearStream class rather than from the TextParser. Together with the ClearStream driver code, it provides a transparent, non-dynamic input stream implementation for the rest of the preprocessor.

BracesLexer tokenizes a portion of the input within braces. These are preprocessor references.

This lexer recognizes:

See BracesLexer for details.

BracesParser

The only task of the BracesParser is to recognize the type of reference coded in braces and replace it with its evaluation. Once the original reference is evaluated, it is deleted from the input stream and the evaluation is pushed back into the input stream in its place. That is the reason why the ClearStream class builds on the PushbackInputStream. The next read() call rereads the pushed back data. This is how rescanning of the substitutions is implemented.
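
The push back and reread cycle can be demonstrated with the standard java.io.PushbackInputStream. The sketch below is not the ClearStream implementation: the class name, the variable map and the rule that undefined names expand to nothing are all assumptions, and it only handles flat {&name} references (no nested braces, include files or arguments). It does show how a substitution that itself contains a reference gets expanded simply because it is read again.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not the ClearStream class.
final class RescanSketch {

    /** Expands flat {&name} references, rescanning every substitution. */
    static String expand(String source, Map<String, String> vars) {
        byte[] bytes = source.getBytes(StandardCharsets.ISO_8859_1);
        StringBuilder out = new StringBuilder();
        // 1024 matches the default pushback buffer size mentioned for -pushback.
        try (PushbackInputStream in =
                 new PushbackInputStream(new ByteArrayInputStream(bytes), 1024)) {
            int c;
            while ((c = in.read()) != -1) {
                if (c != '{') {
                    out.append((char) c);
                    continue;
                }
                // Collect the text between the braces.
                StringBuilder ref = new StringBuilder();
                while ((c = in.read()) != -1 && c != '}') {
                    ref.append((char) c);
                }
                String name = ref.toString();
                if (name.startsWith("&")) {
                    name = name.substring(1);
                }
                // Assumption: an undefined name expands to the empty string.
                String value = vars.getOrDefault(name, "");
                // Push the substitution back so the next read() rescans it.
                in.unread(value.getBytes(StandardCharsets.ISO_8859_1));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> vars = new HashMap<>();
        vars.put("greeting", "hello {&who}");
        vars.put("who", "world");
        System.out.println(expand("say {&greeting}!", vars));
    }
}
```

Note that a variable whose value refers to itself would make this loop run forever; the sketch makes no attempt to detect that.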

This parser recognizes and properly handles the following constructs:

To produce evaluations of the references, the parser calls helper methods of the Preprocessor class. Evaluation of the argument references is backed up with the FileScope class. Both the defined and builtin variables are evaluated through the access to the scoped symbol dictionary, an instance of the ScopedSymbolDictionary class. Additionally, an instance of the BuiltinVariable class is used to provide the right output for the builtins.
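
To make the argument reference forms more concrete, here is a small sketch of how positional and named include file arguments might be resolved. It is not the FileScope API; the class and its methods are hypothetical. It covers only {n} for n of 1 or more, {*} and {&name}, and it assumes that {*} joins the positional arguments with single spaces and that a missing argument resolves to an empty string.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration only; not the FileScope class.
final class IncludeArgsSketch {

    private final List<String> positional;
    private final Map<String, String> named = new HashMap<>();

    IncludeArgsSketch(List<String> positional, Map<String, String> named) {
        this.positional = new ArrayList<>(positional);
        for (Map.Entry<String, String> e : named.entrySet()) {
            // Assumption: argument names are matched without regard to case.
            this.named.put(e.getKey().toLowerCase(Locale.ROOT), e.getValue());
        }
    }

    /** Resolves the text found between the braces, for example "1", "*" or "&name". */
    String resolve(String ref) {
        if (ref.equals("*")) {
            return String.join(" ", positional);        // all positional arguments
        }
        if (ref.startsWith("&")) {
            String name = ref.substring(1).toLowerCase(Locale.ROOT);
            return named.getOrDefault(name, "");        // named argument
        }
        int n = Integer.parseInt(ref);                  // throws if not a number
        boolean present = n >= 1 && n <= positional.size();
        return present ? positional.get(n - 1) : "";    // positional argument
    }

    public static void main(String[] args) {
        IncludeArgsSketch scope = new IncludeArgsSketch(
            Arrays.asList("customer", "name"),
            Collections.singletonMap("mode", "fast"));
        System.out.println(scope.resolve("1") + " / " + scope.resolve("*")
            + " / " + scope.resolve("&mode"));
    }
}
```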

Parsing an include file reference produces no immediate substitute string. Rather, another instance of FileScope is created and chained to the new dictionary scope. For details, see Preprocessor.includeFile.

Command Line Driver

The Preprocessor class includes the main method so it can be run from the command line. The syntax to run the preprocessor is:
   java com.goldencode.p2j.preproc.Preprocessor  [-option ...] filename [ file-arguments ]
where
   -option
is one or more of the following:

All options have to precede the source filename. File arguments, if any, must follow the usual Progress rules for coding positional or named arguments (the operating system or shell may require specific escape sequences in place of characters like '&', '=', '"' etc).

The -debug option turns on the debug output.
The -keeptildes option preserves the tildes used in escape sequences on the output.
The -out option directs the preprocessor output to a specific file instead of stdout.
The -propath option is the equivalent of the Progress PROPATH string.
The -pushback option specifies a different pushback buffer size for the input stream. The default is 1024, which should be sufficient for most runs of the preprocessor.
The -lexonly option selects a special mode in which the preprocessor only tokenizes the input file and outputs the stream of tokens without any modification. It is useful for debugging the lexer.
The -marker option tells the Preprocessor which byte is used as a special internal marker. The default value of 1 is likely to be appropriate in most cases. In general, any value in the ranges 1 to 7, 16 to 32 and 128 to 255 can be used as long as the input file does not produce this byte. If that happens, however, an exception will be raised.
The -hints option specifies the filename for the hints file.

This is a sample hints file:
<?xml version="1.0"?>

<!--Preprocessor hints-->
<!--Created Mon May 16 16:22:07 EDT 2005-->
<hints>
  <preprocessor-output>
    <include file="g-permit.i" start-line="57" start-column="1" end-line="99" end-column="2">
      <reference name="{1}" type="argument" value="&quot;General Ledger Financial Rpts&quot;" line="85" column="20" />
      <reference name="{1}" type="argument" value="&quot;General Ledger Financial Rpts&quot;" line="95" column="53" />
      <argument name="{1}" position="1" value="&quot;General Ledger Financial Rpts&quot;" used="yes" />
    </include>
    <include file="can-run.i" start-line="100" start-column="1" end-line="127" end-column="2" />
    <include file="can-lupd.i" start-line="128" start-column="1" end-line="154" end-column="2" />
  </preprocessor-output>
</hints>
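
A hints file of this shape can be read with the standard JAXP DOM parser. The following sketch lists the file attribute of every include element. The class and method names are hypothetical and only the element and attribute names visible in the sample above are relied upon.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only; not part of com.goldencode.p2j.preproc.
final class HintsReaderSketch {

    /** Returns the "file" attribute of every include element, in document order. */
    static List<String> includedFiles(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("include");
            List<String> files = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                files.add(((Element) nodes.item(i)).getAttribute("file"));
            }
            return files;
        } catch (Exception e) {
            // Parser configuration, I/O and XML syntax errors all end up here.
            throw new IllegalStateException("cannot read hints", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<hints><preprocessor-output>"
            + "<include file=\"can-run.i\" start-line=\"100\" end-line=\"127\" />"
            + "<include file=\"can-lupd.i\" start-line=\"128\" end-line=\"154\" />"
            + "</preprocessor-output></hints>";
        System.out.println(includedFiles(xml));
    }
}
```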



Developer Technical Notebook

The data in this "Developer's Notebook" represent a running log of various investigations which went on during the development of the Golden Code implementation of its preprocessor for Progress source code. Please note that this document is relatively unstructured, since items were appended as they were encountered during development. It is not intended to be formal documentation, but may be helpful for anyone trying to understand specific details about the implementation.

To see the notebook, use this link.

Known Issues

Due to the undocumented nature of the Progress internals, it is generally impossible to guess everything right. That leaves room for some subtle issues. They will be mostly related to the order of actions that Progress takes internally to preprocess a source file.

Tab Expansion Issue

Tab expansion is particularly sensitive to the internal order of actions. If a line of a source file contains both a Progress reference {...} of any kind and tabs, then the result varies depending on the order of tab expansion and reference resolution. Here is a sample line:
            {11}                      /* comment */
This line has three tabs following the reference {11}. Let's assume the positional argument 11 has the value of '.'. First, let's resolve the reference:
            .                         /* comment */
Now, let's expand tabs:
            .                    /* comment */
Now let's try the same two actions in the reverse order. First, let's expand the tabs:
            {11}                         /* comment */
Now let's resolve the reference:
            .                         /* comment */
If we compare the results side by side, we will find a different number of spaces that replaced tabs:
            .                    /* comment */
            .                         /* comment */
Currently, the P2J Preprocessor does all the processing in one pass, producing the first line. Progress apparently does it in multiple passes, expanding tabs in a pass before resolving references. This produces the second line.

Another known case where the tab expansion issue can be seen is preprocessor variables if their definitions contain tabs.
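
The order dependence is easy to reproduce. The sketch below applies the two steps from the {11} example in both orders. The class and method names are hypothetical, the tab width of 8 is an assumption, and the reference is resolved with a plain string replacement, so the exact column numbers will not necessarily match the sample lines above; only the fact that the two orders disagree is the point.

```java
// Hypothetical helper for illustration only; not part of com.goldencode.p2j.preproc.
final class TabOrderSketch {

    static final int TAB_WIDTH = 8;     // assumption: tab stops every 8 columns

    /** Replaces each tab with spaces up to the next tab stop (always at least one). */
    static String expandTabs(String line) {
        StringBuilder out = new StringBuilder();
        for (char c : line.toCharArray()) {
            if (c == '\t') {
                do {
                    out.append(' ');
                } while (out.length() % TAB_WIDTH != 0);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    /** Resolves the positional argument reference {11} to its value ".". */
    static String resolve(String line) {
        return line.replace("{11}", ".");
    }

    public static void main(String[] args) {
        // 12 leading spaces, the reference, three tabs, then a comment.
        String line = String.format("%12s", "") + "{11}\t\t\t/* comment */";
        String resolveFirst = expandTabs(resolve(line));
        String expandFirst = resolve(expandTabs(line));
        System.out.println(resolveFirst);
        System.out.println(expandFirst);
        System.out.println("same result: " + resolveFirst.equals(expandFirst));
    }
}
```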



Copyright (c) 2004-2007, Golden Code Development Corporation.
ALL RIGHTS RESERVED. Use is subject to license terms.


Copyright (c) 2004-2017, Golden Code Development Corporation.
ALL RIGHTS RESERVED. Use is subject to license terms.