Resolving Parsing Issues

After the ConversionDriver runs the preprocessor for each file, it passes the expanded source file (the preprocessor cache) to the step that parses the 4GL source code to create a tree-structured representation called an Abstract Syntax Tree (AST). This is the first step where the inputs are assumed to be valid 4GL source code. The preprocessor had requirements for its inputs, including syntax requirements for preprocessor directives, but it did not expect valid Progress 4GL on input. Only the output of the preprocessor is expected to be valid 4GL.

This next step in the conversion front end is made up of two major components: the lexer and the parser.

The lexer reads the 4GL source code on a character-by-character basis and converts it into a stream of tokens. This means that characters that are related to each other (e.g. all the characters of the same symbolic name) will be grouped together as a single token. Some languages are easily tokenized (lexed), but Progress 4GL is not one of them. This is mostly due to some unusual behavior of Progress with regard to direct database schema references in source code, the use of abbreviations for keywords and some user-defined symbols, a very large number of keywords (many thousands) and the fact that keywords can be either reserved or unreserved (usable as a user-defined symbol). This means that the lexer must have access to the database schema dictionary in order to differentiate between database references and other symbols. It also means that the lexer must have a great deal of awareness of the complexities of keyword processing.
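As a rough illustration of the lexer's job, the following sketch groups related characters into tokens and classifies each one against a keyword table and a schema dictionary. The schema entries, keyword table and token type names here are invented for illustration; the real FWD lexer is vastly more elaborate (abbreviations, reserved vs. unreserved keywords, context sensitivity).

```python
import re

# Hypothetical schema dictionary and keyword table (NOT the real FWD data).
SCHEMA = {"customer", "customer.name"}
KEYWORDS = {"display": "KW_DISPLAY", "define": "KW_DEF"}

# A 4GL-style name: letters/digits/hyphens, optionally dot-qualified.
TOKEN_RE = re.compile(r"[A-Za-z][A-Za-z0-9_-]*(?:\.[A-Za-z][A-Za-z0-9_-]*)*")

def classify(text):
    """Decide whether a token is a keyword, a schema reference or a plain symbol."""
    lowered = text.lower()
    if lowered in KEYWORDS:
        return KEYWORDS[lowered]
    if lowered in SCHEMA:
        return "SCHEMA_REF"
    return "SYMBOL"

def lex(source):
    """Group related characters into (type, text) tokens; ends with an EOF token,
    mirroring how the lexer signals end of file to the parser."""
    tokens = [(classify(t), t) for t in TOKEN_RE.findall(source)]
    tokens.append(("EOF", ""))
    return tokens

print(lex("display customer.name."))
```

Note that classifying `customer.name` correctly requires the schema dictionary; without it, the lexer could not tell a database field reference from any other dotted symbol.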

The parser reads the stream of tokens from the lexer and structures related tokens together into tree structures. Each 4GL program is made up of language statements, assignments and other features organized into blocks that can be nested or in some cases called by name or in response to events. Each feature of the program has its syntactic “sugar” removed and the tokens are converted into a tree form. These trees are interlinked into a larger tree which represents the structure of the blocks including nesting. Each 4GL source program yields a single tree (an AST) that is designed for much easier automated processing than could be easily done on the source text alone. That AST is then stored in the file system as an XML file, which makes it easy to read, search, analyze and modify in an editor or via program.
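Because the persisted AST is plain XML, it can be searched and analyzed programmatically with any XML tooling. The following sketch (standard-library Python, not part of FWD) finds nodes by token type in a small fragment modeled on the persisted AST format described later in this chapter; the fragment's values are illustrative.

```python
import xml.etree.ElementTree as ET

# A minimal persisted AST fragment in the XML form this chapter describes
# (attribute names follow the persisted format; the values are made up).
AST_XML = """<ast-root ast_class="com.goldencode.p2j.uast.ProgressAst">
  <ast col="0" id="253403070465" line="0" text="block" type="BLOCK">
    <ast col="1" id="253403070467" line="1" text="def" type="DEFINE_VARIABLE">
      <annotation datatype="java.lang.String" key="name" value="txt"/>
    </ast>
  </ast>
</ast-root>"""

root = ET.fromstring(AST_XML)

# Find every node of a given token type anywhere in the tree.
defines = root.findall(".//ast[@type='DEFINE_VARIABLE']")
for node in defines:
    name = node.find("annotation[@key='name']").get("value")
    print(f"DEFINE_VARIABLE at line {node.get('line')}: variable '{name}'")
```

The same approach (XPath-style queries over type, text and annotation attributes) works for ad hoc analysis of any persisted AST file.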

This step in the conversion process is made more complex by the fact that the lexer and parser run simultaneously. As the parser requests more tokens, the lexer reads further in the file and creates those tokens. When the lexer reaches the end of the file, it notifies the parser via a special end of file (EOF) token. At that point the parser finishes the tree and can return that tree to the ConversionDriver for further processing.

The simultaneous processing makes debugging problems more difficult, since a problem could be in either the lexer or the parser, both of which are non-trivial components.

This chapter provides guidance on how to resolve problems that occur when running the 4GL compatible lexer and parser in FWD. The most common issues will be discussed. More importantly, some useful techniques will be described that can be used in general to investigate and resolve lexer and parser issues.

Common Issues

Schema References

The most common problem seen in lexing or parsing relates to database schema references which cannot be resolved, or which match more than one database schema definition. With schema references, there must be exactly one possible match for each reference. In other words, for each database reference (to a database, a table, a field or an index) there must be a single unique definition which matches that symbol. If there are zero matches or more than one match, there will be a problem.

Given a file named missing_schema.p with this code:

display non-existent.field.

Something like this output will be seen:

missing_schema.p
line 1:9: unexpected token: non-existent.field
        at com.goldencode.p2j.uast.ProgressParser.lvalue(ProgressParser.java:10602)
        at com.goldencode.p2j.uast.ProgressParser.primary_expr(ProgressParser.java:34744)
        at com.goldencode.p2j.uast.ProgressParser.chained_object_members(ProgressParser.java:48537)
        at com.goldencode.p2j.uast.ProgressParser.un_type(ProgressParser.java:48498)
        at com.goldencode.p2j.uast.ProgressParser.prod_expr(ProgressParser.java:48379)
        at com.goldencode.p2j.uast.ProgressParser.sum_expr(ProgressParser.java:33342)
        at com.goldencode.p2j.uast.ProgressParser.compare_expr(ProgressParser.java:47985)
        at com.goldencode.p2j.uast.ProgressParser.log_not_expr(ProgressParser.java:47861)
        at com.goldencode.p2j.uast.ProgressParser.log_and_expr(ProgressParser.java:47802)
        at com.goldencode.p2j.uast.ProgressParser.expr(ProgressParser.java:14498)
        at com.goldencode.p2j.uast.ProgressParser.display_stmt(ProgressParser.java:24204)
        at com.goldencode.p2j.uast.ProgressParser.stmt_list(ProgressParser.java:19815)
        at com.goldencode.p2j.uast.ProgressParser.statement(ProgressParser.java:4461)
        at com.goldencode.p2j.uast.ProgressParser.single_block(ProgressParser.java:3439)
        at com.goldencode.p2j.uast.ProgressParser.block(ProgressParser.java:3202)
        at com.goldencode.p2j.uast.ProgressParser.external_proc(ProgressParser.java:3131)
        at com.goldencode.p2j.uast.AstGenerator.parse(AstGenerator.java:1182)
        at com.goldencode.p2j.uast.AstGenerator.processFile(AstGenerator.java:686)
        at com.goldencode.p2j.uast.ScanDriver.scan(ScanDriver.java:203)
        at com.goldencode.p2j.uast.ScanDriver.scan(ScanDriver.java:122)
        at com.goldencode.p2j.convert.ConversionDriver.runScanDriver(ConversionDriver.java:373)
        at com.goldencode.p2j.convert.ConversionDriver.front(ConversionDriver.java:270)
        at com.goldencode.p2j.convert.ConversionDriver.main(ConversionDriver.java:1515)

The line and column numbers actually refer to the preprocessor cache file instead of the nominal source file. The key is that the lexer and parser operate on the preprocessed output, not directly on the input source files. This example output tells the user to look inside missing_schema.p.cache at line 1, column 9 for a reference to a symbol named non-existent.field which is unknown. An unexpected token is a token which does not match any known valid syntactic construct of the Progress 4GL. In this case, the text of the token itself suggests that it is a field reference. The parser stack trace is also helpful in this regard, since the failure occurred in the method ProgressParser.lvalue(). That method is a very complex location in the parser, but essentially it resolves writable memory locations (such as variables and field references). In language design, such “left values” are commonly called lvalues; they are entities that can be placed on the left side of an assignment operator, meaning they are writable. In this case, the implication is that there is a missing definition for a variable, a database table or a database field. Since variables can't be qualified using the dot operator (.), the assumption is that this is a missing database schema definition. It might be in a temp-table, but more frequently it is a missing permanent database definition.

As another example, given a file named ambiguous-table.p with this content:

def temp-table tt1 field num as int.
def temp-table tt2 field num as int.

find first tt.

The following error is displayed:

ambiguous-table.p
line 4:12: unexpected token: tt
        at com.goldencode.p2j.uast.ProgressParser.record(ProgressParser.java:9862)
        at com.goldencode.p2j.uast.ProgressParser.record_phrase(ProgressParser.java:30878)
        at com.goldencode.p2j.uast.ProgressParser.find_stmt(ProgressParser.java:24666)
        at com.goldencode.p2j.uast.ProgressParser.stmt_list(ProgressParser.java:19839)
        at com.goldencode.p2j.uast.ProgressParser.statement(ProgressParser.java:4461)
        at com.goldencode.p2j.uast.ProgressParser.single_block(ProgressParser.java:3439)
        at com.goldencode.p2j.uast.ProgressParser.block(ProgressParser.java:3202)
        at com.goldencode.p2j.uast.ProgressParser.external_proc(ProgressParser.java:3131)
        at com.goldencode.p2j.uast.AstGenerator.parse(AstGenerator.java:1182)
        at com.goldencode.p2j.uast.AstGenerator.processFile(AstGenerator.java:686)
        at com.goldencode.p2j.uast.ScanDriver.scan(ScanDriver.java:203)
        at com.goldencode.p2j.uast.ScanDriver.scan(ScanDriver.java:122)
        at com.goldencode.p2j.convert.ConversionDriver.runScanDriver(ConversionDriver.java:373)
        at com.goldencode.p2j.convert.ConversionDriver.front(ConversionDriver.java:270)
        at com.goldencode.p2j.convert.ConversionDriver.main(ConversionDriver.java:1515)

In this case, on line 4, column 12 (in the file ambiguous-table.p.cache) there is the text “tt” which is unexpected. The stack trace shows the failure is in the ProgressParser.record() method, which attempts to match table names. This suggests that the table name is unrecognized or ambiguous.

The following lists the possible causes and their matching solutions. No matter what the cause, it is important to note that if the failing database references are invalid or inappropriate, then the 4GL source code can be modified to remove those references. Assuming the database references are in fact valid ones, then the following is relevant:

Cause: Completely missing schema definition for a permanent database. The database schema being referenced does not exist in the conversion project.

Solution: Add the .df file to the data/ directory and modify the global configuration to make the project aware of that schema. See the Project Setup chapter for more details. If the missing schema is not a database that would be loaded by default when the application runs, then hints will be necessary to activate that schema for the source file in question. See the Conversion Hints chapter for details.

Cause: Completely missing temp-table or work-table definition.

Solution: Find the reason for the missing definition. For example, it is possible that there was a preprocessor issue and the code that would have defined the temp-table or work-table was not included. Or the simple reason may be that the 4GL source code is broken. Resolve the issue and ensure that the temp-table or work-table definition is present in the code.

Cause: Missing portions of the schema definition. Either the database schema (.df file) or the temp-table/work-table definition in 4GL source is missing required definition(s). Perhaps there are missing table, field or index definitions. This can happen when the definitions are out of date or mismatched with the code being parsed.

Solution: Modify the .df file or the 4GL source code to add or fix the definitions.

Cause: The source file relies upon a CONNECT statement having been executed by calling code.

Solution: In FWD, this is simulated by the database conversion hint, which can be specified at a directory or a file level. See the Conversion Hints chapter for details.

Cause: The source file relies upon an ALIAS statement having been executed by calling code.

Solution: In FWD, this is simulated by the alias conversion hint, which can be specified at a directory or a file level. See the Conversion Hints chapter for details.

Cause: The table or field reference is ambiguous (it matches more than one possible schema definition).

Solution: If the 4GL source code is broken, it needs to be fixed. This can be checked by running or compiling the code in the Progress 4GL to compare. If there is an error like:

Unknown or ambiguous table tt. (725)

then the code is broken and must be fixed. Preprocessing issues can also produce malformed code that only occurs in FWD and is not seen in the original 4GL environment.

Assuming the code is valid, then there is an overabundance of schema definitions for this file. Perhaps there are databases connected or aliased that would not be available in the 4GL environment. The most likely cause is setting databases as default="true" in the global configuration when those databases are not normally always connected in the 4GL environment. In such cases the database should not always be loaded for the project; instead, conversion hints should be used to load the database as needed for specific files or directories. See the Conversion Hints chapter for more details. Of course, excess conversion hints (either database or alias) can also cause ambiguous name problems.

If the code is valid, the available databases are correct and there should not be an ambiguous name, then one would expect a bug in FWD, which would need investigation and resolution.

In some cases, an instance of AmbiguousSchemaNameException may be thrown and displayed in the error output. In such cases, there is more information regarding the exact list of matching names (there will always be at least two).

Preprocessing Problem

Sometimes the preprocessor will yield unexpected or broken results, but the preprocessor will not detect an issue. In such cases, the problem will often become visible during lexing or parsing, since it will often manifest itself as malformed 4GL source code.

It is possible that the preprocessor is doing exactly what it is supposed to be doing, but that the preprocessed results will result in malformed source code. This would be the case if the input files were improperly coded and/or untested.

If the lexer or parser reports some unexpected conditions that are due to malformed source code, and the origins of the malformed source code are caused by preprocessing, then please refer to the Resolving Preprocessing Issues chapter for details on resolution.

Broken 4GL Source

When the lexer or parser encounters broken 4GL source code, the problem will often be reported as an “unexpected token”. Given the following source code in a file named broken_4gl_def_var.p:

def var i as .

This is the result:

broken_4gl_def_var.p
line 1:14: unexpected token: .
        at com.goldencode.p2j.uast.ProgressParser.var_type(ProgressParser.java:41779)
        at com.goldencode.p2j.uast.ProgressParser.as_clause(ProgressParser.java:16403)
        at com.goldencode.p2j.uast.ProgressParser.def_var_stmt(ProgressParser.java:8328)
        at com.goldencode.p2j.uast.ProgressParser.define_stmt(ProgressParser.java:7097)
        at com.goldencode.p2j.uast.ProgressParser.stmt_list(ProgressParser.java:19794)
        at com.goldencode.p2j.uast.ProgressParser.statement(ProgressParser.java:4461)
        at com.goldencode.p2j.uast.ProgressParser.single_block(ProgressParser.java:3439)
        at com.goldencode.p2j.uast.ProgressParser.block(ProgressParser.java:3202)
        at com.goldencode.p2j.uast.ProgressParser.external_proc(ProgressParser.java:3131)
        at com.goldencode.p2j.uast.AstGenerator.parse(AstGenerator.java:1182)
        at com.goldencode.p2j.uast.AstGenerator.processFile(AstGenerator.java:686)
        at com.goldencode.p2j.uast.ScanDriver.scan(ScanDriver.java:203)
        at com.goldencode.p2j.uast.ScanDriver.scan(ScanDriver.java:122)
        at com.goldencode.p2j.convert.ConversionDriver.runScanDriver(ConversionDriver.java:373)
        at com.goldencode.p2j.convert.ConversionDriver.front(ConversionDriver.java:270)
        at com.goldencode.p2j.convert.ConversionDriver.main(ConversionDriver.java:1515)

In this case, the cause is obvious: a missing type name causes the DEFINE VARIABLE statement to be syntactically invalid. If the cause were not as obvious, it would be important to use this error output to help resolve the issue. The nominal source file name, line and column numbers are included, which allows one to examine the code in question. The line and column numbers actually refer to the preprocessor cache file instead of the nominal source file; in this example, the file to examine is actually named broken_4gl_def_var.p.cache. The key is that the lexer and parser operate on the preprocessed output, not directly on the input source files.

The stack trace is of great importance since it “tells the story” of what the parser was doing at the time of the failure. In this case, one can see that a def_var_stmt (DEFINE VARIABLE statement) is being parsed and in particular, the var_type portion of the as_clause was being processed at the moment of the failure. As with all stack traces, reading down from the top (most recent method calls) to the bottom (oldest method calls) is the best approach.

It is possible that the parser error may be reported with limited information and no stack trace. Here is an example, in a file called broken_4gl_def_var3.p with the contents:

def/ne var i as int.

This is the output:

broken_4gl_def_var3.p
WARNING: parser did NOT process to EOF (no match with token immediately BEFORE 1:11 token text ' ').

In this example, the parser detected a very low level problem and could only report that whatever immediately preceded the text at line 1, column 11 of the broken_4gl_def_var3.p.cache file was the cause of the problem. Here we see that when the parser got to the “i” in column 11, it realized that it could not recognize the prior tokens (due to a typographical error in this case).

Generally, all problems will be reported by the parser, even though the cause of the issue could be in the lexer. The reason is that the lexer was designed to create tokens even for completely unknown and invalid 4GL constructs and/or characters. It is the parser's responsibility to determine under which conditions those unknown tokens really represent a failure. This strange design is dictated by unusual design decisions made in the Progress 4GL, where there are many areas of the parser and lexer which radically change their approach based on context. For example, there are language statements such as OS-COMMAND which accept unquoted arbitrary content. This content is expected to be meaningful to the operating system shell, but it has no meaning in the 4GL. Unfortunately, this content is directly coded into the 4GL source file and can be parsed only based on the context provided by the parser's knowledge of the 4GL language. In most programming languages, some non-trivial amount of syntax behavior can be implemented in the lexer. In Progress 4GL, this is not practical.

There are many potential causes of broken 4GL source code other than preprocessing problems. Since many Progress 4GL projects are not fully compiled in advance, there can be code included in a project which is syntactically broken but which will not cause a problem unless a user executes that code. It is common to find old/unused code, code that is rarely used, untested code, or uncompleted code in 4GL projects. The approach to resolving the problem with broken code is simple: remove it or fix it.

Parsing Failures with Object Oriented Classes or References

There are two common problems that can be encountered:

  • If a .cls file in the application cannot be found, make sure that it exists and the package can be found via a search of the PROPATH. If that is the case, then there is most likely a problem with the class_map.xml.
  • If there is a failure referencing built-in 4GL OO classes (e.g. Progress.Lang.Object) or a .NET class (in an assembly or a built-in .NET class like System.ComponentModel.Component), then it is likely a problem with a missing, incomplete or incorrect skeleton class.

To resolve both of these issues, please see the chapter entitled Object Oriented Classes and References.

Conflicting Keywords

Progress 4GL has thousands of keywords (some reserved and some unreserved). Reserved keywords cannot be used as the name of a user-defined symbol (such as a variable name or a user-defined function name). They are important to enable parsing to be done on a non-ambiguous basis. Based on this keyword-heavy design, whenever Progress 4GL adds new language features, it generally adds new syntax to the language as well. New syntax in the 4GL means new keywords, some of which are reserved keywords, to make it possible to disambiguate the syntax of the language.

When each new version of the Progress 4GL is released, any new reserved keywords have the potential to conflict with user-defined symbols created in 4GL code that worked in prior versions, but which now will no longer work. Progress 4GL provides a keyword ignore list to drop such conflicting keywords from their special meaning and allow them to be used as user-defined symbols. FWD can honor this keyword ignore list. Please see the keyword-ignore parameter in the section on global configuration in the chapter on Project Setup for more details on how to enable this.

If the file named broken_4gl_def_var4.p has this code:

def var call as int.

Prior to Progress OpenEdge version 10 (OpenEdge is just another name for the more recent versions of the Progress 4GL), the text “call” was not a reserved keyword. The FWD parser implements a version that is at v10.1B. Here is the resulting output (if no keyword ignore list was configured or if CALL was not added to the keyword ignore list):

broken_4gl_def_var4.p
line 1:9: unexpected token: call
        at com.goldencode.p2j.uast.ProgressParser.symbol(ProgressParser.java:6140)
        at com.goldencode.p2j.uast.ProgressParser.def_var_stmt(ProgressParser.java:8318)
        at com.goldencode.p2j.uast.ProgressParser.define_stmt(ProgressParser.java:7097)
        at com.goldencode.p2j.uast.ProgressParser.stmt_list(ProgressParser.java:19794)
        at com.goldencode.p2j.uast.ProgressParser.statement(ProgressParser.java:4461)
        at com.goldencode.p2j.uast.ProgressParser.single_block(ProgressParser.java:3439)
        at com.goldencode.p2j.uast.ProgressParser.block(ProgressParser.java:3202)
        at com.goldencode.p2j.uast.ProgressParser.external_proc(ProgressParser.java:3131)
        at com.goldencode.p2j.uast.AstGenerator.parse(AstGenerator.java:1182)
        at com.goldencode.p2j.uast.AstGenerator.processFile(AstGenerator.java:686)
        at com.goldencode.p2j.uast.ScanDriver.scan(ScanDriver.java:203)
        at com.goldencode.p2j.uast.ScanDriver.scan(ScanDriver.java:122)
        at com.goldencode.p2j.convert.ConversionDriver.runScanDriver(ConversionDriver.java:373)
        at com.goldencode.p2j.convert.ConversionDriver.front(ConversionDriver.java:270)
        at com.goldencode.p2j.convert.ConversionDriver.main(ConversionDriver.java:1515)

If you find a user-defined symbol which in a later version of Progress is a reserved keyword, either add it to the keyword ignore list or change all instances of that user-defined symbol to resolve the issue.

Parser Failure

If the 4GL source code is valid and there are no problems with preprocessing, keyword ignore lists, database schema references or the like, then there may be a problem with the FWD parser itself. Failures can be caused by processing 4GL source code which is designed for a later Progress version than that which the FWD parser supports. Another area of problems can be caused by use of Progress 4GL language features that are unsupported in FWD or which are undocumented and for that reason previously unknown (and unimplemented). Lastly, the parser's internal logic could simply be broken or defective.

No matter what the cause, the easiest solution is to modify or remove problematic 4GL source code from the project. If that is not easily feasible, then the alternative is to fix or enhance the parser to resolve the issue.

Lexer Failure

Less likely than a parser failure, but still theoretically possible, is a lexer failure. Lexer failures can be caused by missing or invalid keyword dictionaries, often from processing 4GL source code which is designed for a later Progress version than that which the FWD lexer supports. Another area of problems can be caused by use of Progress 4GL language features that are unsupported in FWD or which are undocumented and for that reason previously unknown (and unimplemented). Lastly, the lexer's internal logic could simply be broken or defective.

As with the parser, the easiest solution is to modify or remove problematic 4GL source code from the project. If that is not easily feasible, then the alternative is to fix or enhance the lexer to resolve the issue.

Techniques

Address the First Problem First

A source file with many errors is not necessarily one that has multiple problems. Even if the errors appear to be different or unrelated, they can often be downstream breakage from a prior preprocessing/lexing/parsing issue.

The best approach is to look carefully at the first error in a given file. Resolve that one and then re-parse. Very often you will find the other issues are gone as well. If not, you've not lost much time determining this fact.

Reviewing the AST

In order to understand how the conversion process is working with a particular 4GL source file, it is often valuable to look at the abstract syntax tree (AST) that is created during the front end of the conversion. Reviewing this AST can also be useful in debugging problems that occur during lexing and parsing.

Assume there is a file named hello.p with the following content:

def var txt as char init "World".
message "Hello " + txt + "!".

Running the ConversionDriver will create an AST that is the tree structured representation of the 4GL source code. That AST will be stored in an XML file that is named hello.p.ast in the same directory as the original source file. The contents of the AST file would look something like this:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!--Persisted AST-->
<ast-root ast_class="com.goldencode.p2j.uast.ProgressAst">
  <ast col="0" id="253403070465" line="0" text="block" type="BLOCK">
    <ast col="0" id="253403070466" line="0" text="statement" type="STATEMENT">
      <ast col="1" id="253403070467" line="1" text="def" type="DEFINE_VARIABLE">
        <annotation datatype="java.lang.String" key="name" value="txt"/>
        <annotation datatype="java.lang.Long" key="type" value="316"/>
        <annotation datatype="java.util.ArrayList" key="initial">
          <listitem datatype="java.lang.String" value=""World""/>
        </annotation>
        <annotation datatype="java.lang.Boolean" key="vardef" value="true"/>
        <ast col="9" id="253403070468" line="1" text="txt" type="SYMBOL"/>
        <ast col="13" id="253403070469" line="1" text="as" type="KW_AS">
          <ast col="16" id="253403070470" line="1" text="char" type="KW_CHAR"/>
        </ast>
        <ast col="21" id="253403070471" line="1" text="init" type="KW_INIT">
          <ast col="26" id="253403070472" line="1" text=""World"" type="STRING"/>
        </ast>
      </ast>
    </ast>
    <ast col="0" id="253403070473" line="0" text="statement" type="STATEMENT">
      <ast col="1" id="253403070474" line="2" text="message" type="KW_MSG">
        <ast col="0" id="253403070475" line="0" text="" type="CONTENT_ARRAY">
          <ast col="0" id="253403070476" line="0" text="expression" type="EXPRESSION">
            <ast col="24" id="253403070477" line="2" text="+" type="PLUS">
              <ast col="18" id="253403070478" line="2" text="+" type="PLUS">
                <ast col="9" id="253403070479" line="2" text=""Hello "" type="STRING"/>
                <ast col="20" id="253403070480" line="2" text="txt" type="VAR_CHAR">
                  <annotation datatype="java.lang.Long" key="oldtype" value="2257"/>
                  <annotation datatype="java.lang.Long" key="refid" value="253403070467"/>
                </ast>
              </ast>
              <ast col="26" id="253403070481" line="2" text=""!"" type="STRING"/>
            </ast>
          </ast>
        </ast>
      </ast>
    </ast>
  </ast>
</ast-root>

All nodes are contained in the root element ast-root. This element is unique (there is only one in the XML file). It contains an ast_class attribute which defines the fully qualified Java class name of the class that implements each AST node in memory. In this example, the tree is made up of ProgressAst nodes. In a Progress to Java project, there are also ASTs for Java, XML and other intermediate or output forms. This allows the TRPL conversion rule-sets to easily create tree-structured data to represent replacement functionality, lookup data or other purposes. At the end of the conversion process, the output ASTs are anti-parsed (rendered) into their correct file formats (like a .java file for a JavaAst). Since this is the persisted form of an in-memory AST representation, the code that recreates the in-memory AST from this file must know the proper backing class for each node. The ast_class attribute is mandatory for the ast-root element.

Contained within the ast-root are ast elements, which can be arbitrarily nested within themselves. Each ast element is a node in the abstract syntax tree, and the structure of the tree is represented by the nested tree structure of the XML. Only the first ast child element of the ast-root element will be honored. All other ast elements must be children of this first child element.

Each ast element will have attributes for type, text and id. There may also be line and col attributes, which associate the node with a specific location in the input or output file which this AST represents. There is an optional terse attribute on the ast-root element. If present with the value true, there will be no line and col attributes in each ast element. This is useful when the AST represents a template or other data that does not map directly to a specific input or output file (which would have line and column data).

Each lexer token that is created has a token type which is a unique integer which represents a category of related tokens. For example, the token type for a double quoted or single quoted 4GL string is a number that is represented by the symbolic constant STRING. Lexer tokens also have text which is the set of one or more characters read from the input file and aggregated into the given token. Finally, each lexer token has a source line and column number. Since lexer tokens become the basis for AST nodes in the parser, the type, text, line and column data associated with the lexer token forms the basis for these same attributes in the XML ast elements.

The symbolic constant representing the AST node's token type is rendered into the XML as the value of the type attribute. For example, the double quoted string "Hello " from line 2, column 9 of the example program has the attribute type="STRING". The text from the lexer token is copied into the AST node's text attribute. In the STRING case above, the result is the attribute text="&quot;Hello &quot;". Since the double quote character cannot be emitted literally inside a double-quoted XML attribute value, it is converted to the entity &quot;.

Each node in an AST has a 64-bit integer which is rendered into the XML as the id attribute of the ast element. The value of that attribute is the decimal representation of the 64-bit value (a Java long). Please see the section below on Understanding AST Identifiers for details.

There is an optional hidden attribute for an ast element. If present with a value of true, that AST node is marked as hidden. Tools which honor the hidden attribute will not visit or process hidden nodes. Only some of the FWD conversion tools honor the hidden attribute.

All AST nodes (ast elements in the XML) may optionally contain arbitrary data called annotations. Annotations are stored in an annotation element which will have 3 attributes: datatype, key and value. The datatype attribute defines the Java class with which to parse and represent the value in memory. The key attribute defines the name of the annotation and the value attribute defines the data associated with that named key. So each annotation is a key/value pair of a particular data type. If the annotation type is an array or collection of values, then the annotation element will contain one or more listitem elements, each of which must have datatype and value attributes. List items have no key since they are accessed via the name of the containing annotation. The order of the listitem elements defines each item's 0-based index position in the array or list; the topmost listitem element is index 0.
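The annotation structure described above can be read with straightforward XML processing. This sketch (plain Python, not FWD code) reads a scalar annotation and a list annotation from a node fragment modeled on the hello.p.ast example; the fragment itself is illustrative.

```python
import xml.etree.ElementTree as ET

# An ast node fragment in the persisted form described above; a scalar
# annotation plus a list-valued one holding ordered listitem children.
NODE = ET.fromstring("""
<ast col="1" id="253403070467" line="1" text="def" type="DEFINE_VARIABLE">
  <annotation datatype="java.lang.String" key="name" value="txt"/>
  <annotation datatype="java.util.ArrayList" key="initial">
    <listitem datatype="java.lang.String" value="&quot;World&quot;"/>
  </annotation>
</ast>""")

def read_annotation(node, key):
    """Return an annotation's value: a string for scalars, a Python list for
    list-valued annotations (listitem order gives the 0-based index)."""
    ann = node.find(f"annotation[@key='{key}']")
    if ann is None:
        return None
    if ann.get("value") is not None:        # scalar key/value annotation
        return ann.get("value")
    return [item.get("value") for item in ann.findall("listitem")]

print(read_annotation(NODE, "name"))       # scalar annotation
print(read_annotation(NODE, "initial"))    # list annotation, index order kept
```

Note how the &quot; entities in the listitem decode back to literal double quotes, so the initial value round-trips as the 4GL string literal "World".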

Understanding AST Identifiers

Each node in an AST has a 64-bit integer identifier which uniquely indexes that node across the entire project. The upper 32 bits hold a unique number assigned to the tree itself. The lower 32 bits hold a number that is unique within that tree. The combination makes the node unique project-wide.

AST identifiers are managed in a registry, which is persistently stored in an XML file specified by the registry global parameter. Normally this is a file named $P2J_HOME/cfg/registry.xml. Please see the Global Configuration section of the chapter entitled Project Setup for details.

The registry XML file is essentially a flat list of mapping elements which look like this:

<mapping id="253403070464" next="253403070482" treename="./hello.p"/>

Each mapping element represents the identifier data for a single AST, whose name is specified in the treename attribute. For input source files, the .ast file extension is not added to that name. This allows the input files to be easily obtained from that attribute. Other AST files (intermediate or output ASTs) may have the full filename listed.

The id attribute of the mapping element contains the unique 32-bit number that is specific to the AST referenced in this element. This is referred to as the file or AST ID. That number is rendered in the XML as the decimal representation of a 64-bit integer value in which the lower (least significant) 32-bits are 0x00000000. In a situation where only the identifier of an AST node is known, the lower 32-bits can be masked off (set to zero) and the resulting number can be used to find the AST filename in the registry.

The next attribute specifies the number of the next available node-level identifier for this AST. Every time a new AST node has its identifier assigned, this number is incremented.
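The split between the file-level and node-level bits can be sketched with a little arithmetic. This is not FWD code, just an illustration of the masking described above, using the registry entry for ./hello.p shown earlier (the node offset of 17 is invented for the example):

```python
FILE_ID_MASK = 0xFFFFFFFF00000000  # upper 32 bits: the AST (file) ID
NODE_ID_MASK = 0x00000000FFFFFFFF  # lower 32 bits: unique within one tree

def file_id_of(node_id):
    """Mask off the lower 32 bits to recover the AST (file) ID."""
    return node_id & FILE_ID_MASK

def local_id_of(node_id):
    """Keep only the lower 32 bits, the node's index within its tree."""
    return node_id & NODE_ID_MASK

ast_id = 253403070464      # the id attribute from the mapping example above
node_id = ast_id + 17      # a hypothetical node within that tree

print(file_id_of(node_id))   # recovers the registry id: 253403070464
print(local_id_of(node_id))  # recovers the tree-local part: 17
```

Given any node identifier pulled from an AST XML file, the file_id_of result can be matched against the id attributes in registry.xml to locate the tree that contains the node.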

Reviewing Lexer Output

When debugging a lexing or parsing problem, it is very useful to see the decisions that were made by the lexer. This is possible by reviewing a human readable form of the lexer output.

Running the ConversionDriver front end runs the lexer in a special mode that stores a human readable copy of the token stream in a text file. Assume that a file named hello.p is being processed with the same contents as shown in the previous section (of this same chapter) called Reviewing the AST. The human readable lexer output will be stored in a file that is named hello.p.lexer in the same directory as the original source file. The contents of the file would look something like this:

[00001:001] <KW_DEFINE>                     def
[00001:005] <KW_VAR>                        var
[00001:009] <SYMBOL>                        txt
[00001:013] <KW_AS>                         as
[00001:016] <KW_CHAR>                       char
[00001:021] <KW_INIT>                       init
[00001:026] <STRING>                        "World" 
[00001:033] <DOT>                           .
[00002:001] <KW_MSG>                        message
[00002:009] <STRING>                        "Hello " 
[00002:018] <PLUS>                          +
[00002:020] <SYMBOL>                        txt
[00002:024] <PLUS>                          +
[00002:026] <STRING>                        "!" 
[00002:029] <DOT>                           .
[00003:001] <EOF>                           null

Each line in the file represents a single token and the file in total can be read from top to bottom as the stream of all lexed tokens from the input provided.

The first column of the output is interpreted as "[" line number ":" column number "]". Since the input for the lexer is the preprocessor cache file (hello.p.cache in this example), all line and column numbers are references to that cache file.

The second column is the symbolic representation of the integer token type of that token. In the source code of the FWD lexer and parser, each token is represented by a symbolic name and an associated integer constant. The symbolic name is displayed in angle brackets in the second column; the angle brackets are not part of the symbolic name. Any name that starts with KW_ is a keyword, with up to 8 characters after the KW_ prefix used to describe the keyword. For example, KW_MSG represents the text MESSAGE. Sometimes this symbolic representation is also referred to as the “token type” or just the “type” of a token.

The third column is the actual text from the file that corresponds to that token.

The FWD lexer drops whitespace and comments, but all other tokens are represented in this recorded token stream.
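When scripting over a .lexer file (for example, to count token types across a project), each record can be split apart with a simple pattern. This is an illustrative sketch, not part of FWD; it assumes only the "[line:column] <TYPE> text" layout described above:

```python
import re

# Matches one record of a .lexer file: "[line:column] <TYPE>   token text"
TOKEN_LINE = re.compile(r'^\[(\d+):(\d+)\]\s+<(\w+)>\s*(.*)$')

def parse_token_line(record):
    """Split one line of a .lexer file into (line, column, type, text)."""
    m = TOKEN_LINE.match(record)
    if m is None:
        raise ValueError('not a token record: ' + record)
    line, col, tok_type, text = m.groups()
    return int(line), int(col), tok_type, text.rstrip()

print(parse_token_line('[00002:009] <STRING>   "Hello " '))
# → (2, 9, 'STRING', '"Hello "')
```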

If a problem occurred that caused parsing to halt, the stream will be truncated very close to where the problem occurred. There may be more tokens in this output than might be expected since the parser makes decisions using “lookahead”. Lookahead is the technique of examining one or more of the next tokens in the stream in order to make a decision about how to parse the current token. Since the lexer must supply the lookahead tokens to the parser, it generally will be multiple tokens ahead of the parser at any point in time, including at the moment of any failure in the parser.

Any token of type UNKNOWN_TOKEN would be considered suspicious, though it is not conclusively a problem.

For more details on the meaning of tokens and how the parser uses them, please see the book FWD Internals.

Reviewing Parser Output

When debugging a parsing problem, it is very useful to see the decisions that were made by the parser. The XML file that persistently encodes the AST is one way to see this. However, there is a more condensed, human readable form of the parser output which can also be helpful.

Running the ConversionDriver front end runs the parser in a special mode that stores a human readable copy of the parsed tree in a text file. Assume that a file named hello.p is being processed with the same contents as shown in the previous section (of this same chapter) called Reviewing the AST. The human readable parser output will be stored in a file that is named hello.p.parser in the same directory as the original source file. The contents of the file would look something like this:

block [BLOCK] @0:0
   statement [STATEMENT] @0:0
      def [DEFINE_VARIABLE] @1:1
         txt [SYMBOL] @1:9
         as [KW_AS] @1:13
            char [KW_CHAR] @1:16
         init [KW_INIT] @1:21
            "World" [STRING] @1:26
   statement [STATEMENT] @0:0
      message [KW_MSG] @2:1
          [CONTENT_ARRAY] @0:0
            expression [EXPRESSION] @0:0
               + [PLUS] @2:24
                  + [PLUS] @2:18
                     "Hello " [STRING] @2:9
                     txt [VAR_CHAR] @2:20
                  "!" [STRING] @2:26

Each line is a different node in the AST. The first line is the root node for the tree (the same node as the first child ast element of the ast-root element in the AST XML file).

Each 3 space indent signifies a new level of parent/child relationships in the tree. The parent of a given node is the most recent line in the file with an indentation level 1 less than the node itself. In the example output above, the BLOCK node has 2 immediate child STATEMENT nodes. The first STATEMENT node has a single immediate child which is a DEFINE_VARIABLE. The second STATEMENT node has a single KW_MSG immediate child. The DEFINE_VARIABLE node has 3 immediate children.

The contents of each line can be interpreted as follows:

text [type] @line:column

Each of these fields has the same meaning as the correspondingly named attributes in the XML AST. See the Reviewing the AST section above for more details.
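The indentation rule described above is easy to exploit when one wants to recover parent/child links from a .parser listing in a script. This is an illustrative sketch (not FWD code) that assumes only the 3-space-indent convention:

```python
def parent_links(listing):
    """Return (node, parent) pairs from the lines of a .parser listing.

    Each 3 space indent is one level deeper; a node's parent is the
    most recent earlier line that is exactly one level shallower.
    """
    stack = []   # stack[d] = most recent node text seen at depth d
    pairs = []
    for line in listing:
        text = line.lstrip(' ')
        depth = (len(line) - len(text)) // 3
        stack[depth:] = [text]   # drop deeper levels, record this node
        pairs.append((text, stack[depth - 1] if depth else None))
    return pairs

listing = [
    'block [BLOCK] @0:0',
    '   statement [STATEMENT] @0:0',
    '      def [DEFINE_VARIABLE] @1:1',
    '   statement [STATEMENT] @0:0',
]
for node, parent in parent_links(listing):
    print(node, '<-', parent)
```

The root line has no shallower predecessor, so its parent is reported as None; every other line attaches to the nearest line one indent level up, mirroring the BLOCK/STATEMENT structure walked through above.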

Querying the Schema Dictionary

When debugging potential problems with database schema references, it is very helpful to be able to interactively manipulate and query the schema dictionary for the project. FWD provides an interactive menu system for this purpose. To run this schema dictionary menu program, make your current directory the same as the project root directory ($P2J_HOME) and then use this command:

java -classpath $P2J_HOME/p2j/build/lib/p2j.jar com.goldencode.p2j.schema.SchemaDictionary

There are no parameters to this command. The following menu will be displayed:

---------------------------------
Enter command (case insensitive):
  D for database search
  T for table search
  F for field search
  A to create a database alias
  L to load a database
  U to dump dictionary to ./schema.dmp

  Q to quit
Command:

The first 3 options (D, T and F) allow one to search the currently loaded database schemas for a match to a given fully qualified, partially qualified or unqualified schema name. This search generally honors abbreviations and all the normal behaviors of schema name lookups. The only exception is that in the 4GL (and duplicated in FWD), usage of a table within a given scope can have the effect of making that table preferred for the purposes of schema name lookup. This can make lookups that would normally be ambiguous resolve properly, based implicitly on the 4GL code that preceded them. This behavior is duplicated in the FWD parsing tools but it is not available from this interactive menu.

When it is first started, the only database schemas that will be loaded are those marked as default="true" in the global configuration (see the Schema Loading section of the Project Setup chapter). If there are additional database schemas that should be loaded before your interactive searches, then use the L option to load those databases dynamically. There is no unload option, so the program will have to be exited and restarted to unload dynamically loaded database schemas from the dictionary.

In the case where a currently loaded database should be aliased to one or more other alias names, use the A option to create aliases. As with the database load option, there is no option to remove an alias. Exit and restart to clear the aliases from memory.

Each option will present sub-menus as needed to execute. Any problems in searches (no matches or more than one match) will be presented on the screen.

Use Q to exit the program.

Running the Lexer Directly

It is possible to run the FWD lexer directly (outside of the ConversionDriver). This allows the user to isolate the lexer and may make it easier to diagnose, debug or test specific inputs. The output is written to the terminal or console and looks the same as the human readable lexer output shown in the section above on Reviewing the Lexer Output. Please see that section for details on how to interpret this output.

To run the lexer, make your current directory the same as the project root directory ($P2J_HOME) and then use a command such as this:

java -classpath $P2J_HOME/p2j/build/lib/p2j.jar com.goldencode.p2j.uast.ProgressLexer hello.p

There is one mandatory parameter to this command: the relative or absolute filename to a file that is valid Progress 4GL source code (hello.p in the example above).

Since the ConversionDriver is bypassed, the preprocessor is not run and there is no honoring of any of the global configuration (see the Project Setup chapter) or conversion hints (see the chapter on Conversion Hints). This means that important configuration that would normally be present will not be active.

Running the Parser Directly

It is possible to run the FWD parser directly (outside of the ConversionDriver). This allows the user to isolate the lexer and parser combination and may make it easier to diagnose, debug or test specific inputs. The output is written to the terminal or console and looks the same as the human readable parser output shown in the section above on Reviewing the Parser Output. Please see that section for details on how to interpret this output.

To run the parser, make your current directory the same as the project root directory ($P2J_HOME) and then use a command such as this:

java -classpath $P2J_HOME/p2j/build/lib/p2j.jar com.goldencode.p2j.uast.ProgressParser hello.p

There is one mandatory parameter to this command: the relative or absolute filename to a file that is valid Progress 4GL source code (hello.p in the example above).

There is a single option (-v) that may be specified. If used, it must be provided before the required source file name. If passed, the output will be displayed inside a graphical user interface (GUI) browser that allows one to interactively look at the resulting AST. This GUI is in addition to the human readable parser output shown on the console.

Since the ConversionDriver is bypassed, the preprocessor is not run and there is no honoring of any of the global configuration (see the Project Setup chapter) or conversion hints (see the chapter on Conversion Hints). This means that important configuration that would normally be present will not be active.

Creating a Simplified, Standalone Testcase

Diagnosing and debugging problems that are caused by subtle, hidden or undocumented behavior can be difficult. Likewise, low level problems in the FWD tools can manifest in a manner that is hard to diagnose or debug. A massively useful technique in such cases is the simple, standalone testcase. The idea is to cut down the problematic code into an ever smaller testcase (or the smallest number of source files that can reproduce a problem if a single file cannot do it). Attempt to remove all non-essential code and constructs to get down to a very simple testcase. The simpler the code, the easier the diagnosis and debugging. One way to simplify the code is to remove or minimize that code's dependencies on other code or on database schema definitions. If a problem can be reproduced without any database schema definitions at all, this makes it significantly simpler. This idea of limiting or removing dependencies is described as making the testcase “standalone”.

The best way to do this is to remove code in progressively larger amounts, starting small, testing after each removal to see if the problem still occurs. As long as the problem can be recreated, more code should be removed and the result retested. This iterative process continues until the testcase is as small, simple and standalone as possible while still recreating the issue.

From there, the code can be run through the FWD tools and the diagnosis and debugging can be more productive.


© 2004-2017 Golden Code Development Corporation. ALL RIGHTS RESERVED.