Chapter 6 Parser - FWD - Golden Code Redmine

Parser¶

Overview¶

This chapter provides an overview of the parser. It includes a high-level description of the Progress 4GL source code that the parser can process, as well as a list of known features that are still problematic. The details of each construct are not documented here for two reasons: first, the parser is massive, since the Progress 4GL language is extremely syntax heavy; and second, Parts 4 through 6 of this book provide detailed syntax diagrams for the syntax that is matched, as well as the mapping of the transformation to the resulting converted code. While those parts of the book do not document the syntax of features that are not yet converted properly, this overview should be sufficient to answer the question of what syntax can be parsed.

The parser is implemented in a Java class named com.goldencode.p2j.uast.ProgressParser. It creates an Abstract Syntax Tree (AST) representation of a Progress 4GL source file from an input stream of tokens (provided by the ProgressLexer from that same package). The parser provides a set of methods designed as entry points. The entry points control the type of parsing that will occur. There are 2 valid entry points: external_proc() which is designed to parse an entire (already preprocessed) Progress 4GL source file; and expr() which is designed to parse any valid Progress 4GL expression. In each of these methods, the parser drives the lexer in a loop, obtaining each token in the stream, in order. As it obtains the next token it will make decisions that allow it to match the source code to the valid syntax of the language. Everything that is matched will be converted into a tree-structured data form called an AST. Once it completes parsing the entire program or expression, the entry point returns and the AST is complete. From there the AST can be used for analysis, inspection, modification or transformation purposes. The FWD conversion is dependent upon the source code ASTs to do all of its work.

Please note that the parser class is generated by ANTLR from a grammar specified in com/goldencode/p2j/uast/progress.g. The generated parser is an LL type parser. This means that it is a top-down parser that recursively descends from top-level rules through more and more specific rules until matches are found. When a match is found, the token is consumed from the input stream and the rule returns up one level to its calling rule, and so on, until processing completes a full top-level match and returns all the way up to the top-level entry point.
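The recursive descent described above can be illustrated with a small hand-written sketch. This is not the ANTLR-generated ProgressParser and it does not parse Progress 4GL; the class ToyExprParser, its Node type and its three-rule arithmetic grammar are hypothetical stand-ins chosen only to show how an entry-point rule calls more specific rules, consumes tokens on a match and returns a tree.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy LL(1) recursive-descent parser; a hypothetical stand-in, NOT ProgressParser. */
class ToyExprParser {
    /** Minimal tree node: a token text plus ordered children. */
    static final class Node {
        final String text;
        final List<Node> children = new ArrayList<>();

        Node(String text, Node... kids) {
            this.text = text;
            for (Node k : kids) children.add(k);
        }

        /** Lisp-style rendering: operator first, then operands. */
        @Override public String toString() {
            if (children.isEmpty()) return text;
            StringBuilder sb = new StringBuilder("(").append(text);
            for (Node c : children) sb.append(' ').append(c);
            return sb.append(')').toString();
        }
    }

    private final String[] tokens;
    private int pos = 0;

    ToyExprParser(String source) {
        // Stand-in for a lexer: tokens are separated by whitespace.
        this.tokens = source.trim().split("\\s+");
    }

    private String peek() { return pos < tokens.length ? tokens[pos] : null; }
    private String consume() { return tokens[pos++]; }

    /** Entry point (top-level rule): expr := term ( '+' term )* */
    Node expr() {
        Node left = term();
        while ("+".equals(peek())) {
            String op = consume();
            left = new Node(op, left, term());
        }
        return left;
    }

    /** More specific rule: term := atom ( '*' atom )* */
    private Node term() {
        Node left = atom();
        while ("*".equals(peek())) {
            String op = consume();
            left = new Node(op, left, atom());
        }
        return left;
    }

    /** Most specific rule: atom := '(' expr ')' | any other single token */
    private Node atom() {
        String t = peek();
        if (t == null) throw new IllegalStateException("unexpected end of input");
        if ("(".equals(t)) {
            consume();
            Node inner = expr();
            if (!")".equals(peek())) throw new IllegalStateException("expected )");
            consume();
            return inner;
        }
        return new Node(consume());
    }
}
```

Calling new ToyExprParser("1 + 2 * 3").expr() descends from expr() through term() to atom() and returns a tree in which the multiplication is nested under the addition, because the more specific rule binds more tightly.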

The parser is compatible with and is almost feature complete for OpenEdge v10.1B except for those features listed in the Unsupported Features section below.

Supported Features¶

Summary¶

Parts 4 through 6 of this book provide detailed documentation on the supported 4GL input syntax and the transformations supported on output for the conversion. For this reason, this chapter will not provide detailed listings of the supported syntax. As a general rule, if the feature is supported in OpenEdge v10.1B, then this parser will support it. That does not mean that the conversion supports all of those features. The parser was designed to support as much 4GL syntax as possible so that the FWD tools can be used on those language features (for example, to inspect and analyze the source code for reporting purposes) even before the full conversion to Java is available.

Progress 4GL does have a very large number of undocumented features, many of which are already supported by the FWD parser. By its nature, it cannot be known in advance that all of the undocumented features are supported. However, the parser has been used on more than a few large scale applications, which means in all likelihood that many (if not all) of the most common undocumented features are covered.

Block Structure¶

At the top-most level, a Progress 4GL program consists of one or more of the following:

internal procedure definition
class definition
interface definition
trigger definition
function (forward declaration or a definition)
external procedure contents
standalone expression
assignment
non-block language statement
block structure (DO, REPEAT, FOR) language statement (referred to as inner blocks since these must exist inside some other top-level block and they can be nested)

These top-level constructs can be considered the possible top-level blocks of which a 4GL program is comprised.

All other block types are also supported, but these other blocks can only be encountered inside one of the top-level blocks. Object oriented blocks (constructor definition, destructor definition, user-defined method prototype, user-defined method definition or property definition with getter/setter blocks) can be encountered inside class definitions (but not within interface definitions). Trigger definitions can be nested, but other top-level blocks cannot be nested.

All other block types that are possible in Progress 4GL are embedded blocks. These are blocks that are attached or otherwise defined as part of another language statement. Usually, the execution of these blocks is done during the processing of the language statement itself, but sometimes this is in response to some event. EDITING blocks can be defined as part of a PROMPT-FOR, SET or UPDATE language statement. Trigger phrases define a trigger that is naturally scoped to a specific widget. Blocks are also associated with a THEN clause (part of an IF statement or of a larger WHEN clause), an ELSE clause or an OTHERWISE clause.

The normal Progress 4GL language features that can be referenced directly within the external procedure (see above) can also be placed inside any of the block definitions (with the exception of class and interface definitions). The language features referenced include expression processing, assignments, language statements and all of the inner blocks.

All of the naming, labeling, parameterization and options that can be associated with these blocks are also fully supported.

Expressions and Assignments¶

The parser supports the full range of all expression processing and assignments that are possible.

All data types are supported. Variable types include CHARACTER, LONGCHAR, LOGICAL, DATE, DATETIME, DATETIME-TZ, INTEGER, DECIMAL, INT64, HANDLE, WIDGET-HANDLE (this type is actually rewritten as a HANDLE), RAW, MEMPTR, RECID, ROWID, COM-HANDLE and CLASS. Temp-table and database field types include CHARACTER, CLOB, BLOB, LOGICAL, DATE, DATETIME, DATETIME-TZ, INTEGER, DECIMAL, INT64, HANDLE, RAW, RECID, ROWID, COM-HANDLE and CLASS. All forms of variable and field references are supported. User-defined object property references are supported.

All forms of array definition and references are supported.

All forms of named variable definitions are supported. This includes DEFINE VARIABLE, DEFINE PARAMETER, the AS or LIKE clause in the format phrase, function parameters, the AS or LIKE in a MESSAGE SET or MESSAGE UPDATE, ASSIGN database triggers (via a trigger procedure statement or the ON statement database event), user-defined method parameters, constructor parameters or a property setter parameter.

All operators are supported.

All assignment syntax, including the ASSIGN statement with all of its quirky (and undocumented) behavior, is available.

Built-in attributes and built-in methods are fully supported for handles and for implicit handles. Implicit handles are objects that can be dereferenced using the COLON operator to use the built-in attributes and methods. This includes variables that are also widgets, widgets that are defined explicitly (e.g. buttons, menus...) and then referenced by name, widgets referenced via qualifier (e.g. BROWSE or FRAME), streams, queries, menus, sub-menus, menu-items, table handles, dataset handles, buffers, tables, temp-tables, datasets, data-sources and data-relations.

All system handles are supported. Other “special” references such as THIS or SUPER are supported.

Invocations of all built-in functions are supported. This includes those built-ins that act like global variables, which have unusual syntax like postfixed function names or which do not use parenthesized parameter lists.

Invocations of all forms of user-defined named blocks (functions, methods) are supported. All forms of object references and chaining are supported, including the NEW phrase.

COM (Active-X) property and method support is fully available.

Language Statements¶

All language statements are fully supported except for the limits noted in the Unsupported Features section below.

Database¶

Schema structure and references are fully supported. This includes all forms of database and temp-tables/work-tables. All query types and other database access and manipulation features are supported.

ProDataSets (datasets, data-sources, data-relations) are fully supported.

Directly embedded SQL is supported.

Object Oriented¶

All 10.1B features for classes, interfaces, methods, properties, packages and data members are supported.

WebSpeed¶

The parser includes "fake" interfaces to duplicate the features that WebSpeed supports. WebSpeed encodes the documented API for WebSpeed as 4GL source code which is included in WebSpeed applications. By pre-loading the definitions for these features directly in the parser, they are treated as built-ins which eliminates the need for the WebSpeed 4GL source code to be present in order to parse and convert WebSpeed applications. Although the WebSpeed code was open sourced and it is safe to convert, this approach avoids the need to do so.

Tree Building¶

The parser builds an abstract syntax tree or AST ("intermediate form representation") of all the recognized tokens and syntax. This AST is designed for optimal subsequent processing during the conversion or with the FWD tree processing tools. The resulting AST matches the block structured nature of the Progress 4GL, including the provision of nesting and the refusal to nest in certain cases (e.g. Progress 4GL does not allow nesting of procedure or function definitions, but it does allow for the nesting of blocks such as do or repeat).

The parser handles all context sensitive aspects of symbol resolution and cross-referencing by name spaces. The AST nodes are annotated with these cross-references to greatly enhance the representation of the code.
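As an illustration of a nested, annotated tree, the following minimal sketch models a node that carries children plus a map of annotations. It is not the FWD AST class: the class name AnnotatedNode, the node type names (DO_BLOCK, REPEAT_BLOCK, ASSIGN, VAR_REF, NUM_LITERAL) and the "datatype" annotation key are all hypothetical and chosen only for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical annotated tree node; not the actual FWD AST implementation. */
class AnnotatedNode {
    final String type;
    final String text;
    final List<AnnotatedNode> children = new ArrayList<>();
    final Map<String, Object> annotations = new LinkedHashMap<>();

    AnnotatedNode(String type, String text) {
        this.type = type;
        this.text = text;
    }

    /** Appends a child and returns this node so calls can be chained. */
    AnnotatedNode add(AnnotatedNode child) {
        children.add(child);
        return this;
    }

    /** Attaches a cross-reference or other derived fact to this node. */
    AnnotatedNode annotate(String key, Object value) {
        annotations.put(key, value);
        return this;
    }

    /** Number of levels in the subtree rooted here, counting this node as 1. */
    int depth() {
        int deepest = 0;
        for (AnnotatedNode c : children) deepest = Math.max(deepest, c.depth());
        return 1 + deepest;
    }

    /** Builds a DO block containing a REPEAT block containing the assignment i = 1. */
    static AnnotatedNode sample() {
        AnnotatedNode ref = new AnnotatedNode("VAR_REF", "i")
            .annotate("datatype", "integer"); // assumed result of symbol resolution
        AnnotatedNode assign = new AnnotatedNode("ASSIGN", "=")
            .add(ref)
            .add(new AnnotatedNode("NUM_LITERAL", "1"));
        AnnotatedNode inner = new AnnotatedNode("REPEAT_BLOCK", "repeat").add(assign);
        return new AnnotatedNode("DO_BLOCK", "do").add(inner);
    }
}
```

The nesting of the inner block under the outer block mirrors the block structure of the source, and the annotation on the variable reference stands in for the kind of cross-reference that later processing can read without repeating the lookup.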

Unsupported Features¶

Well Formed 4GL Source Code¶

To the extent that is possible while still generating a well structured tree, the parser has been structured to enforce syntactic correctness. This is typically well implemented in specific language statements (in most cases their structure enforces 100% syntactic correctness, and otherwise it enforces nearly 100% correctness). Due to ambiguities in the Progress 4GL language and constraints in the ANTLR parser generation, it is not always possible to make the parser's structure (i.e. the manner in which its rules nest and how the matching of each rule proceeds) enforce syntax checks. For some language features, these checks are implemented using ANTLR's semantic or syntactic predicates and in other cases ANTLR's actions must be used for a completely custom set of checks.

There is a critical assumption currently implemented in the parser: it expects to see well formed (syntactically correct) Progress 4GL source code. While many of the checks that can easily be made are in fact handled, not all syntactical rules are implemented in the parser. So long as the 4GL source code is valid, the parser will work as expected. However, the parser may not fail or report a problem for some invalid code that would cause a failure in the 4GL.

Dropped Input¶

The parser is designed to build a tree-structured representation (AST) of the structure and semantics of each 4GL program. But the parser is not designed to maintain comments, whitespace or “syntactic sugar” which does not contribute to the structure or semantics of the program. This means that there is non-meaningful loss of data between the input (the preprocessed cache file) and the AST. Some tokens are dropped, like the terminating DOT token after a language statement. Some content is reorganized, sorted or structured in a manner that makes it easier or more regular to process, but may no longer represent the input ordering or positioning of the original source code.

For these reasons, the AST cannot be anti-parsed back to the exact original 4GL source code without a loss of non-meaningful elements. The parser could be modified to enable this, but at present this capability is not available.
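The effect of dropping non-meaningful input can be shown with a toy model. The sketch below is an assumption-laden illustration, not FWD code: the class DroppedInputDemo, its Kind enumeration and the sample statement are hypothetical, and the real lexer and parser divide this work differently. It only demonstrates why text regenerated from the kept tokens cannot match the original input exactly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of token dropping; hypothetical names, not the FWD lexer or parser. */
class DroppedInputDemo {
    enum Kind { KEYWORD, NUMBER, WHITESPACE, COMMENT, DOT }

    static final class Token {
        final Kind kind;
        final String text;

        Token(Kind kind, String text) {
            this.kind = kind;
            this.text = text;
        }
    }

    /** Keeps only tokens that carry structure or meaning in this toy model. */
    static List<Token> meaningful(List<Token> input) {
        List<Token> kept = new ArrayList<>();
        for (Token t : input) {
            boolean dropped = t.kind == Kind.WHITESPACE
                           || t.kind == Kind.COMMENT
                           || t.kind == Kind.DOT;
            if (!dropped) kept.add(t);
        }
        return kept;
    }

    /** Concatenates token texts with the given separator. */
    static String join(List<Token> tokens, String sep) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            if (i > 0) sb.append(sep);
            sb.append(tokens.get(i).text);
        }
        return sb.toString();
    }

    /** A single statement with a comment, whitespace and a terminating DOT. */
    static List<Token> sample() {
        return Arrays.asList(
            new Token(Kind.KEYWORD, "DISPLAY"),
            new Token(Kind.WHITESPACE, " "),
            new Token(Kind.COMMENT, "/* total */"),
            new Token(Kind.WHITESPACE, " "),
            new Token(Kind.NUMBER, "42"),
            new Token(Kind.DOT, "."));
    }
}
```

Joining the full token list reproduces the input text, while joining the kept tokens yields only the keyword and the literal. The comment, the original spacing and the DOT are gone, which is the loss the section describes.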

Stored Procedures¶

This support is present but it is not complete.

Stored procedures are schema objects that act like tables in some ways, but have additional behavior as well. The schema support (in the schema dictionary) for such procedures does not exist.

The stored procedure rules are missing the dynamic creation of the special stored procedure names when needed. Likewise, there are buffer and field definitions that are not dynamically created during parsing (this would depend upon the schema functionality noted above).

Progress 4GL has the notion that a stored procedure is like a special kind of table type. The stored procedure is a schema object created in the database supported by a 4GL data server. That schema object is known to the 4GL and it is called with optional parameters (and must be closed when the caller is done using it). To get the data out of the results, Progress creates a buffer of the same name as the procedure, which can be used in record phrases to read the data. Likewise, there are properties defined for the results which are treated like fields. The schema dictionary, symbol resolver and parser all need updates to fully support this new resource. It is possible that 4GL code using these constructs will not parse properly at this time.

There are special purpose names proc-text and proc-text-buffer which can be used as a field (actually it is a text string of the entire current record's contents) and as a buffer respectively. That means that definitions of these need to be created during parsing of the RUN STORED-PROCEDURE statement. Any code following that statement may rely upon that buffer definition to parse properly.

A matching call to CLOSE STORED-PROCEDURE would remove that buffer definition and retire the ability for that definition to be used. There may also be a special closeallprocs procedure that is the equivalent to calling CLOSE STORED-PROCEDURE on all currently open stored procedures.

The parser does not support the above described features.

Database Table vs Temp-Table Name Precedence¶

Schema name lookups are documented to give preference to tables over temp-tables in the case where both share the same name. The support for this has not been tested.

In newer versions of OpenEdge, the DEFINE BUFFER statement now allows the specification of the TEMP-TABLE keyword to force a temp-table name to be preferred over a table of the same name. This is not honored.
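The documented lookup rule can be sketched as follows. This is a model of the rule as described in this section, not of what the FWD SchemaDictionary currently does (as noted, the default preference is untested and the TEMP-TABLE keyword is not honored). The class TableNamePrecedence, its methods and the fallback behavior when a forced temp-table name is not found are assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical model of table vs temp-table precedence; not the SchemaDictionary. */
class TableNamePrecedence {
    enum Source { DATABASE_TABLE, TEMP_TABLE, UNKNOWN }

    private final Set<String> tables = new HashSet<>();
    private final Set<String> tempTables = new HashSet<>();

    // Names are stored in lowercase because 4GL names are case-insensitive.
    private static String key(String name) {
        return name.toLowerCase(Locale.ROOT);
    }

    void addTable(String name) { tables.add(key(name)); }

    void addTempTable(String name) { tempTables.add(key(name)); }

    /**
     * Resolves a name. When preferTempTable is true (modelling the TEMP-TABLE
     * keyword of DEFINE BUFFER), a matching temp-table wins. Otherwise, or if
     * no such temp-table exists (an assumed fallback), database tables win.
     */
    Source resolve(String name, boolean preferTempTable) {
        String k = key(name);
        if (preferTempTable && tempTables.contains(k)) return Source.TEMP_TABLE;
        if (tables.contains(k)) return Source.DATABASE_TABLE;
        if (tempTables.contains(k)) return Source.TEMP_TABLE;
        return Source.UNKNOWN;
    }
}
```

With a database table and a temp-table both named customer, resolve("customer", false) selects the database table and resolve("customer", true) selects the temp-table.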

Explicit Schema Namespace Support¶

The schema dictionary does not honor the specification of the NAMESPACE-URI or NAMESPACE-PREFIX options in the DEFINE BUFFER or DEFINE DATASET statements. The options will parse, but the schema names will be placed without regard to a specified namespace.

Schema Namespace Buffer Interactions¶

The com.goldencode.p2j.schema.SchemaDictionary is the class that is used to dynamically manage and resolve schema name references that are fully qualified, partially qualified or completely unqualified. Such references to databases, tables and fields must be matched with the proper backing schema definition. This defines data types and many other parser-relevant factors. In addition, the resolved schema element is used to query data that is left behind in AST node annotations, which makes conversion processing easier downstream.

A discussion of the internals of schema name processing is outside the scope of this book. For more details, please see FWD Internals.

Buffer definitions in 4GL cause the namespace to be modified, since a newly created buffer creates a new set of names that can be resolved. Certain references to table names can also affect the namespace in a way that gives priority to recently used tables. These namespaces are maintained using a scoped dictionary, such that more nested scopes will have different name resolution results based on the buffer creation and table references in the current and containing scopes.
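A scoped dictionary of this kind can be sketched as a stack of maps that is searched from the innermost scope outward. The class ScopedNames and its methods are hypothetical; the real SchemaDictionary is considerably more involved (qualified names, abbreviations and recently-used table priority are all omitted here).

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Minimal scoped dictionary sketch; hypothetical, not the FWD SchemaDictionary. */
class ScopedNames {
    // Head of the deque is the innermost (most recently opened) scope.
    private final Deque<Map<String, String>> scopes = new ArrayDeque<>();

    ScopedNames() {
        scopes.push(new HashMap<>()); // outermost (global) scope
    }

    /** Opens a nested scope, for example on entry to a block. */
    void pushScope() {
        scopes.push(new HashMap<>());
    }

    /** Closes the innermost scope, discarding the names defined in it. */
    void popScope() {
        if (scopes.size() == 1) {
            throw new IllegalStateException("cannot pop the global scope");
        }
        scopes.pop();
    }

    /** Defines a name (such as a buffer) in the innermost scope. */
    void define(String name, String target) {
        scopes.peek().put(name.toLowerCase(Locale.ROOT), target);
    }

    /** Searches from the innermost scope outward; returns null if not found. */
    String resolve(String name) {
        String key = name.toLowerCase(Locale.ROOT);
        for (Map<String, String> scope : scopes) { // iterates head (innermost) first
            String hit = scope.get(key);
            if (hit != null) return hit;
        }
        return null;
    }
}
```

A buffer defined after pushScope() resolves only until the matching popScope(), and a name redefined in a nested scope shadows the outer definition while that scope is open. This is why the same reference can resolve differently at different nesting levels.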

The SchemaDictionary currently treats buffer references the same way as a table reference. This usually will work. There is a deviation in some cases where buffer references would not have the same effect as table references. It is suspected that any "no reference" (in buffer scoping terms) would also be a case where there would be a difference. Since there is no documentation on this arcane topic, more investigation is needed to make a determination. If this is the case, there would be a potential issue for the parser's schema name lookups.

As a related but separate issue, if name resolution in Progress is dependent upon buffer scoping, then the SchemaDictionary will not work properly. Buffer scoping is not currently handled at parsing time. If requirements are found that make buffer scoping necessary in order to properly resolve all names, then a major rewrite will be needed both to move buffer scoping to a parse-time activity and to use it properly for schema name resolution.

Direct Aggregate Variable Usage¶

Progress 4GL aggregate support seems to be implemented by using hidden variables that are created at the moment of first reference. The hidden or implicit variables can be referenced and assigned directly from user code (without any explicit variable definition code). For example, an integer variable named COUNT can be directly referenced by user code when the COUNT aggregate is in use. These variables, their types and the exact moment of creation need to be investigated and added. This is completely undocumented in Progress, but real application code has been seen that uses the “feature”.

Incomplete AS and LIKE Clause in FORM Items¶

The processing of form items has an incomplete implementation to support inline AS or LIKE clauses. The implementation will not match all possible cases where AS or LIKE appears later in the format phrase (since they can be arbitrarily ordered).

DEFINE VARIABLE Widget Namespace Maintenance¶

The DEFINE VARIABLE statement adds variable names to the widget namespace when it creates new variables. This is separate from the variable namespace that is also maintained. This widget namespace is needed for method and attribute support to ensure that variables can implicitly be referenced as widgets when needed. If there is no VIEW-AS clause in the DEFINE VARIABLE statement, a default widget type of fill-in is mapped to the variable's name in the widget namespace. Otherwise the widget type specified in the VIEW-AS clause will be used as the mapping to the variable name.

Since RAW and MEMPTR fields cannot be displayed in a frame, it seems that they should be excluded from adding a fill-in definition to the namespace. That is what the parser does for handle types and it may need to be extended if that namespace exhibits a problem due to this usage.
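The mapping rule described above can be sketched as a small lookup table. The class WidgetNamespaceSketch is hypothetical, and the exact set of data types treated as "handle types" here (handle and widget-handle) is an assumption; RAW and MEMPTR are deliberately left in the default path, matching the current behavior that this section suggests may need to change.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of the widget namespace rule; not the FWD implementation. */
class WidgetNamespaceSketch {
    // Assumed set of types excluded from the default fill-in mapping.
    private static final Set<String> HANDLE_TYPES =
        new HashSet<>(Arrays.asList("handle", "widget-handle"));

    private final Map<String, String> widgets = new HashMap<>();

    /**
     * Records a DEFINE VARIABLE. Pass viewAs as null when the statement has
     * no VIEW-AS clause; the variable then maps to fill-in unless its data
     * type is one of the excluded handle types.
     */
    void defineVariable(String name, String dataType, String viewAs) {
        String key = name.toLowerCase(Locale.ROOT);
        if (viewAs != null) {
            widgets.put(key, viewAs.toLowerCase(Locale.ROOT));
        } else if (!HANDLE_TYPES.contains(dataType.toLowerCase(Locale.ROOT))) {
            widgets.put(key, "fill-in");
        }
    }

    /** Returns the widget type mapped to the variable, or null if none. */
    String widgetType(String name) {
        return widgets.get(name.toLowerCase(Locale.ROOT));
    }
}
```

Extending the exclusion to RAW and MEMPTR would, in this model, amount to adding those two type names to the excluded set.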

INPUT-VALUE¶

INPUT-VALUE is currently implemented as a synonym for SCREEN-VALUE but this is known to be incorrect. INPUT-VALUE is a widget attribute that basically behaves like the INPUT built-in function, but is specified via the widget:input-value syntax.

KEYS Clause of CHOOSE Language Statement¶

It is not known if the LONGCHAR and/or CLOB types should be allowed in a CHOOSE statement's KEYS clause. At this time these other data types are not supported.

Properties and Data Member Namespace Sharing¶

The Progress 4GL documentation suggests that class properties and class data members (variables) share the same namespace. This is implemented but the behavior has not been confirmed with sample code.

Protected Database Class Members¶

Buffers, queries, temp-tables, prodatasets and data-sources can be protected resources of a parent class that are accessed from a child class. This is not implemented at this time and some example code is needed to fully demonstrate the requirements before implementation.

WebSpeed Variables¶

The parser now provides "fake" interfaces to duplicate the features that WebSpeed supports. Those features may not have been completely documented in the WebSpeed references, and as such the parser's support may not be complete.

There are references in the WebSpeed documentation to variable names like SelfURL, AppURL and CGI environment variables but there are no details on the types or usage of these, so at this time these are not yet added but may subsequently be found in real source code.

It is likely that the features of HTML mapping and session management may provide procedures and functions that are not documented but which are used in real source code.
