Project

General

Profile

Schema Loader

Loading Process Overview

A major feature of the Progress 4GL language is its tight integration between the language syntax and the structure and usage of one or more databases. Database fields can be accessed (read or assigned) directly from expressions and the core language provides features to directly query, iterate and manipulate the database contents. The transaction semantics of the database are tightly coupled with the block structure and processing of the 4GL language. As a result, both the Progress 4GL compiler and the Progress 4GL interpreter are highly coupled to the schema of any database accessed by a given procedure being compiled and/or executed.

For FWD to be able to provide equivalent support for 4GL procedures, it must similarly be just as aware of each database schema being used. Progress provides a mechanism to export a database schema into a text file format. Such schema files normally have a filename that ends in a .df extension. The first step of the conversion front end is to convert the schema files into a form that can be used as a reference or dictionary to the structure, types, names and other configuration of each database.

This .df conversion is done in two steps by a class called com.goldencode.p2j.schema.SchemaLoader. The first step is to parse the .df file into an abstract syntax tree (AST). At its core, a schema AST is a tree data structure that organizes the essential schema structure and information in a manner that is more easily utilized by the conversion process. The SchemaLoader reads the conversion project's global configuration to determine the list of databases that are associated with the project. Each database specified must have a corresponding .df file. One .df file must be designated in the FWD configuration as the metadata file. The metadata is a schema with elements which when used at runtime describes the structure and contents of databases. The metadata .df file is parsed into an AST. Then, each other database's schema is parsed into its own AST.

These ASTs are mostly complete, except that certain directives in the .df file contain information which can only be fully parsed once the structure of the database's schema is available. These are left as unprocessed strings in the first-pass AST.

The second step is to post-process the AST. This post-processing is a second pass to cleanup some parts of the the AST for which parsing was deferred. This deferred parsing is necessary since these parts of the AST can reference schema definitions which may exist later in the .df file. In other words, to properly parse some of the contents of the .df file, the all structural elements of the database must already have been processed. After post-processing, the fixed up AST is saved as the final result of the loading process.

Step 1 - Schema Parsing

The SchemaLoader uses two other classes to do the parsing.

First, the .df file is read as a stream of characters by the com.goldencode.p2j.uast.ProgressLexer class. This lexer converts the stream of characters into tokens. Tokens are language-specific "words" comprised of one or more related characters. For example, the following line from a .df file:

UPDATE DATABASE "?" 

Is converted to this list of tokens:

[00001:001] <KW_UPDATE>                     UPDATE
[00001:008] <KW_DATABASE>                   DATABASE
[00001:017] <STRING>                        "?" 

The ProgressLexer is the same Java class which is used for tokenizing Progress 4GL source code. The .df format is undocumented, but it turns out to be almost a complete subset of the lexing rules as supported by the Progress 4GL. When used for schema processing, the ProgressLexer class enables extra keywords that are not normally valid in 4GL source code. In addition, this schema processing mode causes newline characters embedded in strings to be converted to a space character, which is not what normally occurs in 4GL source. Otherwise the behavior of the lexer is identical and generally the .df file input is a subset of the normal progress 4GL tokens.

Second, a class named com.goldencode.p2j.schema.SchemaParser is used to read the stream of tokens created by the ProgressLexer and to structure related tokens into language-specific sentences in a tree structure. For each .df file that is processed, the output of the SchemaParser is an AST. The logic and grammar for the SchemaParser is generated from an ANTLR 2.7.4 input grammar in the file com/goldencode/p2j/schema/schema.g. The Input Format section of this chapter documents the processing of that parser as implemented in that ANTRL grammar.

Step 2 - AST Post-Processing

In this step, the directives left unprocessed in the previous step are processed into their own ASTs, which are then grafted onto the tree, so as to replace the AST node which previously contained the unprocessed string version. In the first-pass AST, these unprocessed directives are stored as strings. The strings are actually snippets of Progress 4GL source code, which must be run through the Progress preprocessor (to expand any preprocessor statements), and then through the Progress parser (to create a true AST from the snippet of code). However, as the snippets often contain references to schema entities, such as tables and fields, the Progress parser requires a minimally working schema dictionary to resolve these symbols. This is the reason for handling this parsing in a second pass. A copy of the AST for the current logical database (created in the previous step) is merged with the metadata AST, such that the metadata schema elements appear to be elements of the logical database itself (they are, in fact, common elements of every logical database and are referenced as such in Progress 4GL source code). The merged AST is then loaded into a schema dictionary instance, which is embedded into a symbol resolver, which is in turn used by the Progress parser when parsing the code snippets. If a code snippet cannot be preprocessed or parsed successfully, the original string AST node is left in place; otherwise, it is replaced by the expanded AST. These changes are made to the original, logical database AST, not to the copy which was merged with the schema metadata AST (the latter is needed only to initialize the temporary schema dictionary).

The directives deferred to the AST post-processing pass are:

  • VALEXP - field and table validation expressions
  • VIEW-AS - view-as phrases which determine a field's default representation in the UI

In Progress, it is legal to have invalid (i.e., uncompilable) Progress 4GL source code in these directives. Thus, during post-processing of these directives, the Progress preprocessor may encounter includes it cannot resolve, and the Progress parser may report unexpected tokens. Not all of these errors result in an exception, so to be aware of problems during post-processing, it is critical to review the console output generated by this class. These will normally be reported as warnings and it is possible they can be ignored. However, it is also possible that some of these warnings are real features of the schema which are being improperly handled by the FWD schema loader. When the warnings are caused by invalid expressions in the .df (which Progress silently ignores), these same constructs are broken in Progress 4GL too.

After the two-step import is complete, the fixed-up, logical database AST is persisted to XML in a terse format using the general purpose AstManager class.

Input Format

Compatibility

The .df format is completely undocumented. The rest of this chapter maps out the specific inputs that are accepted by the FWD SchemaParser class. Those inputs are expected to be correct or generally correct for most schema dump files originating in Progress v9.1C or earlier. Since there is no authoritative reference to the valid schema dump file format, it is possible that the FWD implementation is incorrect or incomplete.

Later versions of the Progress OpenEdge environment, including v10 are known to have added syntax to this format, which the FWD SchemaParser class does not yet support.

File Encoding

The .df file is a text file with all characters encoded with a code page as defined during the Progress data dictionary schema export. The most common encoding is ISO-8859-1 (Latin 1). Most importantly, this character encoding must match the encoding used in the Java environment during the conversion process. All strings and other character data will be interpreted using this specified code page.

To see the code page used by Progress during export, see the trailer Rule section below.

Top Level File Structure

At its highest level, the .df file is organized into 4 sections, some of which are optional. The following diagram is of the parser's schema rule, which defines the top level parsing of the .df file:

This rule expects to match, in order:

  1. zero or one database construct;
  2. zero or more sequence constructs;
  3. zero or more table constructs;
  4. one trailer construct;

The idea that the database section of the file is optional may seem strange, but in fact, that exact case has been found in valid schema files.

The root of the schema AST will be an artificial DATABASE node (not to be confused with the KW_DATABASE in the database rule). All sub-trees created by the rule references will be attached as child nodes, in the order shown above.

database Rule

This matches a database construct, which specifies the database name. Curiously, the database name seems to always show up as unknown ("?").

The sub-tree created by this rule is rooted at the KW_DATABASE node and has a single child which is the STRING that is the database name. Any KW_ADD or KW_UPDATE token is dropped.

For example:

UPDATE DATABASE "?" 

sequence Rule

Matches a sequence definition, which consists of the ADD SEQUENCE phrase followed by sequence property definitions. The STRING is the sequence name.

The AST is produced rooted at an artificial SEQUENCE node (not to be confused with the KW_SEQUENCE). The KW_ADD, KW_SEQUENCE and the STRING tokens are all dropped from the tree, and sequence name (from the STRING node) will be set as the text for the SEQUENCE node. The children of this node will be the sub-trees created by the rule references for the sequence properties.

For example:

ADD SEQUENCE "sequence-name" 
  INITIAL 0
  INCREMENT 1
  CYCLE-ON-LIMIT no
  MIN-VAL 0
  MAX-VAL 1000000

table Rule

Matches a table definition, which consists of the ADD TABLE phrase followed by table property definitions, some of which are optional, then some number of field and/or index definitions.

The AST is produced rooted at an artificial TABLE node (not to be confused with the KW_TABLE). The KW_ADD, KW_TABLE, STRING, KW_TYPE and SYMBOL tokens are all dropped from the tree, and table name (from the STRING node) will be set as the text for the TABLE node. The children of this node will be the sub-trees created by the rule references for the table properties, fields and indexes.

For example:

ADD TABLE "table-name" 
  AREA "Schema Area" 
  DESCRIPTION "Detailed text about this table." 
  DUMP-NAME "tabname" 

trailer Rule

Match the trailer section of the file. The data contained in this section is not used for FWD conversion purposes, but provides a handy mechanism to detect that the parser has successfully reached the end of the database schema definitions.

The sub-tree created by this rule will be rooted at an artificial TRAILER node. The first child will be the sub-tree from the code_page rule and the second child will be the NUM_LITERAL which represents the byte count for the file.

For example:

.
PSC
cpstream=ISO8859-1
.
0000424350

code_page Rule

Code-page related keywords and a code-page name as a SYMBOL.

The sub-tree created by this rule will be rooted at the KW_PSC node and all following nodes will be it's direct children.

See the trailer Rule for an example.

initial Rule

Match a set of tokens assumed from the caller's context to be the value of the INITIAL property of a sequence or field.

The sub-tree created by this rule is rooted at a KW_INIT and that node will have a single child which is the literal initial value.

If this construct is encountered while parsing the definition for a field of type RAW, the tokens will be discarded and no sub-tree will be created. This is done since it is invalid to initialize a field of that type.

The schema dump file can enclose non-string data values inside double quotes and the ProgressLexer doesn't know when to remove the quotes and when to leave them, since the necessary context is only known to the SchemaParser. The parser must determine this fact based on the data type of the field or the fact that this is a sequence. This results in the INITIAL construct having children that are of the STRING type even when they should be some other kind of literal. To resolve this issue, a second pass at lexing occurs after the parser makes this determination. A helper method is used and the offending token is overwritten. Then the matching continues as it should with no problems and having the correct types. This rewriting does not handle the situation where multiple tokens exist *inside* a single STRING token!

In normal Progress 4GL source, when extent > 0, this same construct can have a list of comma-separated values. No examples have been encountered yet and the code in this rule won't match such conditions.

For an example, see the sequence Rule.

increment Rule

Match a set of tokens assumed from the caller's context to be the value of the INCREMENT property of a sequence.

The sub-tree produced as a result consists of a INCREMENT node representing the property key (the KW_INCR type is replaced) and a single child node with a type of NUM_LITERAL.

For an example, see the sequence Rule.

cycleOnLimit Rule

Match a set of tokens assumed from the caller's context to be the value of the CYCLE-ON-LIMIT property of a sequence.

The sub-tree produced as a result consists of a CYCLE_ON_LIMIT node representing the property key (this replaces the KW_CYCLE token) and a single child node with the value as a literal.

For an example, see the sequence Rule.

minVal Rule

Match a set of tokens assumed from the caller's context to be the value of the MIN-VAL property of a sequence.

The sub-tree produced as a result consists of a KW_MIN_VAL node representing the property key and a single child node with a type of NUM_LITERAL.

For an example, see the sequence Rule.

maxVal Rule

Match a set of tokens assumed from the caller's context to be the value of the MAX-VAL property of a sequence.

The sub-tree produced as a result consists of a KW_MAX_VAL node representing the property key and a single child node with a type of NUM_LITERAL.

For an example, see the sequence Rule.

table_props Rule

Matches all possible table property definitions.

The sub-tree created is rooted at an artificial PROPERTIES node. Each property is parsed in its own rule. All of the sub-trees for those properties are attached as children of the root node, in the same order they are encountered in the file. If no properties are found for a table, this rule doesn't create a sub-tree.

For an example, see the table Rule.

field Rule

This rule matches a field definition, which consists of the ADD FIELD phrase followed by field property definitions, some of which are optional.

An AST is produced as a result, consisting of an artificial FIELD_* node, with its type assigned based on the data type of the field as returned by the as_clause rule (see that rule for the specific types that can be returned). The text of that root node is the field name as parsed from the text of the first STRING token. The KW_ADD, KW_FIELD, STRING, KW_OF and STRING tokens are all dropped from the tree. Likewise, the sub-tree for as_clause is dropped, but as noted above, the data type is used as the root node type. The children of this node will be the sub-trees created by the rule references for the field properties. In the case of the KW_MAND (mandatory) and KW_CASE_SEN (case-sensitive) tokens, the node is directly added as a child of the root (but in the case of KW_MAND it is converted to artificial type MANDATORY). In the case of KW_NOT_CS (not-case-sensitive) and KW_NULLALWD (null-allowed), the token is dropped from the tree.

For example:

ADD FIELD "field-name" OF "table-name" AS character
  DESCRIPTION "A description of the field is here." 
  FORMAT "x(8)" 
  INITIAL "" 
  LABEL "Field Label" 
  HELP "Some user interface help text is here." 
  POSITION 2
  SQL-WIDTH 16
  COLUMN-LABEL "Field Label" 
  ORDER 50
  MANDATORY

index Rule

Matches an index definition, which consists of the ADD INDEX phrase followed by index property definitions, some of which are optional, followed by some number of child entity definitions (field references). The index name is found in the first STRING token and the table name on which the index is based is found in the second STRING token (after the KW_ON).

A sub-tree is produced as a result, consisting of an artificial INDEX node, with an artificial PROPERTIES child node containing the index properties, and an artificial INDEX_FIELD child node for each field of this index. The text of the INDEX node will be set as the index name read from the first STRING token. The KW_ADD, KW_INDEX, STRING, KW_ON and STRING tokens are all dropped from the tree.

For example:

ADD INDEX "index-name" ON "table-name" 
  AREA "Schema Area" 
  UNIQUE
  PRIMARY
  INDEX-FIELD "field-one" ASCENDING
  INDEX-FIELD "field-whatever" ASCENDING

index_props Rule

Index properties, aggregated as children of an artificial PROPERTIES node, which is itself a direct child of the INDEX node. Each matched keyword is a direct child as are the sub-trees created in the referenced rules.

For an example, see the index Rule.

indexField Rule

Matches an index field definition, which consists of the INDEX-FIELD keyword followed by index field property definitions, some of which are optional.

An AST is produced as a result, consisting of an imaginary INDEX_FIELD token node, with child nodes for each property encountered.

For an example, see the index Rule.

literal Rule

Matches a single instance of any and all legal Progress 4GL literals (hard-coded instances of data types, otherwise called a constant). Such constants are properly tokenized by the lexer and this rule simply allows any of the following tokens to be matched:

NUM_LITERAL
DEC_LITERAL
STRING
BOOL_TRUE
BOOL_FALSE
DATE_LITERAL
DATETIME_TZ_LITERAL
UNKNOWN_VAL

Note that the documentation states that there is no such thing as a date constant in Progress 4GL, however experience shows that this is not true.

There are no literal representations for RECID, ROWID or HANDLE data types since these are internal data structures of the Progress language. One can only obtain an instance of such a value from Progress itself so there is no way to hard-code such a reference as a constant.

There is no literal representation for RAW data type.

WARNING: this rule is simply copied from the grammar for the ProgressParser!

The sub-tree created is a single node of the type and text as created by the lexer.

For an example, see the INITIAL property in the sequence Rule.

area Rule

This rule matches a set of tokens assumed from the caller's context to be the value of the AREA property of a table or index.

The resulting sub-tree will be a single artificial node of type AREA with it's text set as the text contents of the STRING node. The actual tokens read by the rule will be dropped.

For an example, see the table Rule.

label Rule

Match a set of tokens assumed from the caller's context to be the value of the LABEL property of a table or field.

The resulting sub-tree will be rooted at the KW_LABEL node and the STRING will be its only child node.

For an example, see the field Rule.

description Rule

Match a set of tokens assumed from the caller's context to be the value of the DESCRIPTION property of a table, field, or index.

The resulting sub-tree will be rooted at an artificial DESCRIPTION node and the text of that node will be set as the text contents of the STRING node. Both tokens being matched will be discarded.

For an example, see the field Rule.

valExp Rule

Match a set of tokens assumed from the caller's context to be the value of the VALEXP property of a Progress field. This results in a VALEXP AST with a single child of type STRING. The string as read from the .df file is enclosed within double quotes; these are removed. The embedded double quotes in this string are escaped by doubling them up. All doubled-up double quotes are replaced with single, double quotes (e.g., ...""some text""... becomes ..."some text"...).

The child node is post-processed if it contains a real Progress 4GL expression.

For example:

  VALEXP "input field-one <> """" and input field-one <> ?" 
  VALMSG "Please enter data into field-one." 

valMsg Rule

Match a set of tokens assumed from the caller's context to be the value of the VALMSG property of a table or field.

The resulting AST will be rooted at an artificial VALMSG node, with the STRING as a child of an artificial EXPRESSION node (which makes the STRING the grand-child node to the VALMSG.

For an example, see the valExp Rule.

as_clause Rule

Match a set of tokens for the AS clause, which defines the data type of a field.

The resulting sub-tree is rooted at the KW_AS node and it will have a single child node which is the data type keyword matched. More importantly, the data type keyword will set the type for the field as follows:

Type Keyword Field Data Type
KW_BLOB FIELD_BLOB
KW_CHAR FIELD_CHAR
KW_CLOB FIELD_CLOB
KW_COM_HNDL FIELD_COM_HANDLE
KW_DATE FIELD_DATE
KW_DATETIME FIELD_DATETIME
KW_DATE_TZ FIELD_DATETIME_TZ
KW_DEC FIELD_DEC
KW_INT FIELD_INT
KW_INT64 FIELD_INT64
KW_LOGICAL FIELD_LOGICAL
KW_RECID FIELD_RECID
KW_ROWID FIELD_ROWID
KW_RAW FIELD_RAW
KW_HANDLE FIELD_HANDLE

This parser is used for .df files which are only used with permanent databases. Since CLASS fields are for temp-tables only, the KW_CLASS cannot be used in a .df file and FIELD_CLASS will not be ever used as a field type in a permanent database.

The following can also be matched, but they are only used in metadata schemas, normal Progress databases don't provide these field types.

Type Keyword Metadata Field Data Type
KW_BIGINT FIELD_BIGINT
KW_BYTE FIELD_BYTE
KW_DOUBLE FIELD_DOUBLE
KW_FIXCHAR FIELD_FIXCHAR
KW_FLOAT FIELD_FLOAT
KW_SHORT FIELD_SHORT
KW_TIMESTMP FIELD_TIMESTAMP
KW_TIME FIELD_TIME

For an example, see the field Rule.

format Rule

Match a set of tokens which is the value of the FORMAT property of a field.

The resulting sub-tree is rooted at a KW_FORMAT node with an artificial EXPRESSION child node and the STRING node is a child of the EXPRESSION.

For an example, see the field Rule.

cruft Rule

Match options that make no sense. This may be related to newer features of v10 which may be forcing all possible options to be output, even those that make no sense for the given type and/or are undefined. The calling code will actually drop the matched subtree.

unknownOption Rule

Match any other unrecognized option and warn that it is being dropped. Everything matched will be dropped. This is used in table, field and index properties to allow a newer schema format to be read without a failure. The dropping of these tokens has no lasting effect, the warning is advisory only.

help Rule

Match a set of tokens as the value of the HELP property of a field.

The resulting sub-tree is the KW_HELP with a child of STRING.

For an example, see the field Rule.

extent Rule

Match a set of tokens as the value of the EXTENT property of a field.

The resulting sub-tree is the KW_EXTENT with a child of NUM_LITERAL.

For example:

EXTENT 10

decimals Rule

Match a set of tokens as the value of the DECIMALS property of a field.

The resulting sub-tree is the KW_DECIMALS with a child of NUM_LITERAL.

For example:

DECIMALS 2

length Rule

Match a set of tokens as the value of the LENGTH property of a field.

The resulting sub-tree is the KW_LENGTH with a child of NUM_LITERAL.

For example:

LENGTH 10

viewAs Rule

Match a set of tokens as the value of the VIEW-AS property of a field. Strangely enough, the KW_VIEW_AS keyword is repeated inside the STRING text. That is just a Progress 4GL quirk.

The resulting sub-tree is the KW_VIEW_AS with a child of STRING.

The string literal child is post-processed to convert it into the full view-as phrase which is compatible with the Progress 4GL source code.

For example:

VIEW-AS "view-as editor size 60 by 10" 

order Rule

Match a set of tokens as the value of the ORDER property of a field.

The resulting sub-tree is an artificial ORDER with a child of NUM_LITERAL. The KW_ORDER is dropped.

For an example, see the field Rule.

maxWidth

Match MAX-WIDTH keyword and a following NUM_LITERAL for a field.

The sub-tree produced as a result consists of a KW_MAX_WID node with a single child node with a type of NUM_LITERAL.

For example:

MAX-WIDTH 10

position Rule

Match a set of tokens as the value of the POSITION property of a field.

The resulting sub-tree has an artificial POSITION node with a child node of NUM_LITERAL. The KW_POS is dropped.

For an example, see the field Rule.

sqlWidth Rule

Match a set of tokens assumed from the caller's context to be the value of the SQL-WIDTH property of a field.

The resulting sub-tree has an artificial SQL_WIDTH node with a child node of NUM_LITERAL. The KW_SQL_WID is dropped.

For an example, see the field Rule.

columnLabel Rule

Match a set of tokens as the value of the COLUMN-LABEL property of a field.

The resulting sub-tree is KW_COL_LAB node with a child node of STRING.

For an example, see the field Rule.

frozen Rule

Match a token assumed from the caller's context to be the value of the FROZEN property of a table.

The resulting AST will be a single node of artificial type FROZEN.

For example:

FROZEN

hidden Rule

Match a token assumed from the caller's context to be the value of the HIDDEN property of a table.

The resulting AST will be a single node of artificial type HIDDEN.

For example:

HIDDEN

dumpName Rule

Match a set of tokens assumed from the caller's context to be the value of the DUMP-NAME property of a table.

The resulting sub-tree is rooted at an artificial DUMP_NAME node and the text contents of the STRING are set as the text of the root node. The KW_DMP_NAME and STRING tokens are discarded.

For an example, see the table Rule.

permissions Rule

Match permissions clause as a property of a table or field.

The resulting sub-tree will have the permission keyword (e.g. KW_CAN_READ as the root node and the STRING as the child node.

For example:

CAN-READ "!,*" 

table_trigger Rule

Match a TABLE-TRIGGER clause in the table_props.

The sub-tree will have the KW_TAB_TRG as the root and the sub-tree created by trigger_parms will be the child.

For example:

TABLE-TRIGGER "Write" OVERRIDE PROCEDURE "write-trigger-procedure.p" CRC "?" 

field_trigger Rule

Match a FIELD-TRIGGER clause in the field rule.

The sub-tree will have the KW_FLD_TRG as the root and the sub-tree created by trigger_parms will be the child.

No example is available at this time.

trigger_parms Rule

Match parameters for a table or field trigger in table_trigger and field_trigger respectively.

Each of the matched tokens will be attached (in order of occurrence) to the tree of the calling rule.

For an example, see the table_trigger Rule.

proc_clause Rule

Matches the PROCEDURE keyword followed by a string literal.

The sub-tree will have the KW_PROC as the root and the STRING will be the child.

For an example, see the table_trigger Rule.

crc_clause Rule

Matches the CRC keyword followed by a string literal.

The sub-tree will have the KW_CRC as the root and the STRING will be the child.

For an example, see the table_trigger Rule.

lobArea Rule

Match LOB-AREA keyword and a following STRING for a field.

The sub-tree produced as a result consists of a KW_LOB_AREA node with a single child node with a type of STRING.

For example:

LOB-AREA "schema-lob" 

lobBytes Rule

Match LOB-BYTES keyword and a following NUM_LITERAL for a field.

The sub-tree produced as a result consists of a KW_LOB_BYTE node with a single child node with a type of NUM_LITERAL.

For example:

LOB-BYTES 10485760

lobSize Rule

Match LOB-SIZE keyword and a following size specification for a field. The size specification is a integral literal with a single letter at the end to specify the units (e.g. 100M). This will appear as a SYMBOL in the resulting tree, although it doesn't naturally lex that way.

The sub-tree produced as a result consists of a KW_LOB_SIZE node with a single child node with a type of SYMBOL. The matched NUM_LITERAL is dropped but the text of that node is prefixed to the text of the SYMBOL node.

For example:

LOB-SIZE 10M

© 2004-2017 Golden Code Development Corporation. ALL RIGHTS RESERVED.