Provides services to extract database schema information from a
Progress schema dump file (*.df) for the purposes of inspection and
analysis; combines this data with programmatically and manually
generated conversion hints in order to create an analogous relational
database schema and a skeleton mapping of data-aware Java objects to
the resulting schema.
Author(s)
|
Eric Faulhaber
Greg Shah
Rod Gaither
|
Date
|
October 21, 2005
|
Access Control
|
CONFIDENTIAL
|
Contents
Introduction
Package
Class Hierarchy
High Level Flow
Inputs
Outputs
Schema Name Symbol Resolution
Conversion Considerations
Introduction
The purpose of this package is to inspect and analyze a Progress
database schema in order to manage its conversion to an analogous
relational database schema. The ultimate result of this process is a
working database schema in a specific vendor's relational database.
However, since relational database schema implementations and creation
techniques vary from vendor to vendor, a design principle is to produce
an intermediate, logical mapping which is agnostic to the particular
implementation of the target database. The mapping must be applicable
to
a variety of commercial and free relational databases. It comprises a
primary input into the data conversion process, implemented separately
(package(s) TBD).
This package produces a skeleton, logical mapping of Java objects to
the tables and columns of the resulting relational schema (an Object to
Relational Mapping or "ORM"). This logical mapping is an input to a
separate process (package(s) TBD) whose purpose is to fully implement
the ORM data model by creating data-aware Java objects and Data Access
Objects (DAOs). It is the DAO's role to encapsulate the details of data
access behind a natural object interface.
TBD Note: At the time of writing, this package is
partially implemented. Currently, it provides services to inspect a
Progress schema dump file and to generate a summary report and
namespace
dictionary document. In addition, it provides schema symbol name
resolution services used by the
uast
package. However, the assimilation of hints with this data to produce
a functional relational schema and ORM is not yet implemented.
Package Class Hierarchy
(TBD: update - obsolete)
The following is a high level UML class diagram of the
schema
package in its current implementation:
There are three primary areas of interest:
- The lexing/parsing/tree-walking classes, which all implement
SchemaParserTokenTypes, are the ANTLR-generated classes responsible
for lexing, parsing, and understanding the Progress schema dump file.
- The set of ProgressEntity subclasses is used to represent the
Progress schema in a more accessible form for further inspection and
analysis.
- The
SchemaInspector
interface and its various implementations are used to inspect and
analyze the schema for the purpose of conversion.
TBD: update above diagram to
reflect new namespace resolution classes.
These functional areas are discussed in more detail below.
High Level Flow (TBD: update -
obsolete)
The primary inputs to the schema conversion process are:
- the Progress schema dump file;
- an XML document containing conversion hints.
The recursive descent lexical analyzer and parsing technology known as
ANTLR (
www.antlr.org)
is used to analyze the contents of the Progress schema dump file, based
upon grammar rules which understand the file's syntax. See the
uast
package summary for a useful primer on this technology.
The lexing/parsing/tree-walking pipeline operates as follows:
- The Progress schema dump file is fed to the SchemaLexer, which
generates on output a stream of tokens specific to the dump file
format.
- This stream of tokens is input to the next stage in the pipeline,
the SchemaParser. This class subjects the stream of tokens to a set of
grammar rules which are used to generate an Abstract Syntax Tree
(AST). This is a tree of token-based nodes which abstracts away the
dump file syntax and produces a normalized, hierarchical
representation of the dump file contents.
- The AST is input into the next stage of the pipeline, the
SchemaTreeWalker. This object walks the tree of nodes, and as it
encounters tokens which represent logical schema entities (e.g.,
database, table, field, etc.), it constructs Java objects which
represent those entities in their proper context with respect to one
another (e.g., database contains tables, table contains fields, etc.).
The end result of the parsing pipeline is a hierarchy of Java objects
representing the database and each of its sequences, tables, fields and
indexes. This hierarchy is well suited to be inspected generically by
classes which implement the
SchemaInspector
interface.
The primary outputs of the schema conversion process are:
- a cross-referenced Progress to Relational ("P2R") mapping
document;
- a dictionary of the Progress schema namespace (see
NamespaceInspector
);
- a set of XML mapping documents used by the open source ORM
technology Hibernate (www.hibernate.org)
to generate and apply the resulting schema to an existing, relational
database;
- a functional schema applied to a relational database.
Secondary outputs are:
- various reports;
- a log file (for debugging purposes).
Inputs
Progress Schema Dump File
This file is programmatically generated in a well-structured and
consistent format by the Progress data dictionary. Its syntax consists
of a series of phrases which describe various, logical schema
constructs, organized hierarchically: the database, its sequences, its
tables, their fields and indexes. A snippet of a hypothetical customer
service database follows:
UPDATE DATABASE "?"

ADD SEQUENCE "next-cust-num"
  INITIAL 0
  INCREMENT 1
  CYCLE-ON-LIMIT no
  MIN-VAL 0

ADD TABLE "customer"
  AREA "Schema Area"
  DESCRIPTION "Customer Data"
  VALMSG "SERVICE APP 1.0"
  DUMP-NAME "customer"

ADD FIELD "acct-num" OF "customer" AS integer
  FORMAT "9"
  INITIAL "0"
  LABEL "Account Number"
  POSITION 2
  SQL-WIDTH 4
  ORDER 10

ADD FIELD "first-name" OF "customer" AS character
  FORMAT "X(12)"
  INITIAL ""
  LABEL "First Name"
  POSITION 10
  SQL-WIDTH 130
  ORDER 20

ADD FIELD "last-name" OF "customer" AS character
  FORMAT "X(12)"
  INITIAL ""
  LABEL "Last Name"
  POSITION 10
  SQL-WIDTH 130
  ORDER 30

ADD FIELD "since" OF "customer" AS date
  FORMAT "99/99/9999"
  INITIAL ?
  LABEL "Customer Since"
  POSITION 5
  SQL-WIDTH 4
  ORDER 40

ADD INDEX "idx-acct" ON "customer"
  AREA "Schema Area"
  UNIQUE
  PRIMARY
  INDEX-FIELD "acct-num" ASCENDING
Each phrase provides the name of the entity it describes, as well as
the properties, if any, which have been defined for it. Entities are
always defined after the entity which logically contains them. For
example, table fields and indexes are always defined after the table
which contains them.
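The containment rule above (each entity follows the entity which logically contains it) is what lets the tree-walking stage build its object hierarchy in a single pass. The following self-contained sketch illustrates that idea on the ADD TABLE / ADD FIELD phrases only; it is not the ANTLR pipeline itself, and the class name is invented for illustration:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified illustration of what the lexing/parsing/tree-walking
// pipeline ultimately produces: a containment hierarchy built from
// "ADD ..." phrases. The real pipeline is ANTLR-generated and
// grammar-driven; this hypothetical sketch handles tables and
// fields only.
public class DfSketch
{
   static final Pattern TABLE = Pattern.compile("ADD TABLE \"([^\"]+)\"");
   static final Pattern FIELD =
      Pattern.compile("ADD FIELD \"([^\"]+)\" OF \"([^\"]+)\" AS (\\w+)");

   /** Maps table name to (field name to Progress type). */
   public static Map<String, Map<String, String>> parse(String df)
   {
      Map<String, Map<String, String>> tables = new LinkedHashMap<>();
      for (String line : df.split("\n"))
      {
         line = line.trim();
         Matcher t = TABLE.matcher(line);
         Matcher f = FIELD.matcher(line);
         if (t.matches())
         {
            tables.put(t.group(1), new LinkedHashMap<>());
         }
         else if (f.matches())
         {
            // entities always follow their container, so the table
            // named in the OF clause has already been registered
            tables.get(f.group(2)).put(f.group(1), f.group(3));
         }
      }
      return tables;
   }
}
```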
Hint Document(s)
Hints are an important input into the database conversion process,
since the information provided by the Progress schema dump file is not
a one-for-one match with the information needed to produce a
relational schema. In particular, critical information regarding the
data integrity relations between tables and subtle cues needed for
proper conversion of types is missing. This information must be
gathered by automated and/or manual code analysis, and possibly by an
analysis of the actual data residing in the Progress database, to
determine where standard type conversion may not apply.
Hints are structured in an XML grammar (TBD). Hints are loaded by
the
SchemaHintEngine
(TBD). This engine makes hints accessible to the classes which require
them in order to produce their respective
output files.
Outputs
Progress Schema
Namespace Dictionary (TBD: update - obsolete)
This document is used by the code conversion process to enable a schema
namespace lookup to resolve unknown variables to database field
references. It is produced by the
NamespaceInspector
and
requires no hints, since it is based completely on the Progress schema
dump file.
<?xml version="1.0"?>
<!--Progress schema namespace dictionary-->
<!--Generated Tue Nov 30 10:51:29 EST 2004-->
<schema source="junk.df">
<database name="custsvc">
<table name="customer">
<field name="acct-num" type="integer" />
<field name="first-name" type="character" />
<field name="last-name" type="character" />
<field name="since" type="date" />
</table>
</database>
</schema>
This information is used by the NamespaceLoader to load entries into
the SchemaDictionary, in order to enable symbol resolution of schema
entity names during Progress source code parsing.
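The kind of work the NamespaceLoader performs can be sketched with a small standalone program; the real class loads entries into the SchemaDictionary, while this hypothetical version merely flattens the dictionary document into fully qualified names:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch: reads a namespace dictionary document (the
// format shown above) and emits database.table.field entries. The
// real NamespaceLoader populates the SchemaDictionary instead.
public class DictionarySketch
{
   public static List<String> load(String xml)
   {
      try
      {
         Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(
               xml.getBytes(StandardCharsets.UTF_8)));
         List<String> entries = new ArrayList<>();
         NodeList dbs = doc.getElementsByTagName("database");
         for (int i = 0; i < dbs.getLength(); i++)
         {
            Element db = (Element) dbs.item(i);
            NodeList fields = db.getElementsByTagName("field");
            for (int j = 0; j < fields.getLength(); j++)
            {
               Element f = (Element) fields.item(j);
               // a field's parent element is its containing table
               Element tbl = (Element) f.getParentNode();
               entries.add(db.getAttribute("name") + "."
                  + tbl.getAttribute("name") + "."
                  + f.getAttribute("name"));
            }
         }
         return entries;
      }
      catch (Exception exc)
      {
         throw new RuntimeException(exc);
      }
   }
}
```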
Hibernate Mapping
Documents
These XML files are produced by the
HibernateMappingInspector
(TBD), based upon the Progress schema, with a strong reliance upon
hints
about field type conversions and table relations. Alternately, they may
be produced by a set of classes which uses the P2R mapping document as
an input (TBD).
These documents are inputs to the
org.hibernate.tool.hbm2ddl.SchemaExport
tool provided with the Hibernate toolset. This tool produces data
definition language ("DDL"), which is applied against the target
relational database to create the schema. The DDL optionally can be
saved off separately for reference and debugging purposes, batch use or
database recovery.
By using Hibernate's mapping document format and syntax, we gain the
advantage of a set of abstract data types and an XML grammar which can
be applied across a wide range of relational databases. Since Hibernate
is aware of numerous database "dialects", the same mapping documents
can
be used regardless of the target database.
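Since the HibernateMappingInspector is not yet implemented, the fragment below is only a hypothetical sketch of a Hibernate mapping document for the "customer" table from the earlier DF snippet. The class and property names are invented for illustration, and the surrogate integer primary key reflects the type mapping table later in this document:

```xml
<?xml version="1.0"?>
<!DOCTYPE hibernate-mapping PUBLIC
   "-//Hibernate/Hibernate Mapping DTD 3.0//EN"
   "http://hibernate.sourceforge.net/hibernate-mapping-3.0.dtd">
<hibernate-mapping>
   <!-- Hypothetical mapping of the "customer" table -->
   <class name="Customer" table="CUSTOMERS">
      <!-- surrogate integer primary key (no Progress analog) -->
      <id name="id" type="integer" column="ID">
         <generator class="native"/>
      </id>
      <property name="acctNum"   column="ACCT_NUM"   type="integer"/>
      <property name="firstName" column="FIRST_NAME" type="string"/>
      <property name="lastName"  column="LAST_NAME"  type="string"/>
      <property name="since"     column="SINCE"      type="date"/>
   </class>
</hibernate-mapping>
```

Running such a document through the SchemaExport tool would emit the corresponding CREATE TABLE DDL in the dialect of the configured target database.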
P2R Mapping Document (TBD:
update - obsolete)
This XML document describes how the Progress schema is mapped to a
relational schema. It is used to create a cross-linked reference
document for end-user use and may provide an input into the creation of
Hibernate mapping documents if this information is not compiled
directly
by an inspector implementation (TBD).
The grammar is essentially divided into two sections: one which
describes a "forward" mapping of Progress schema constructs onto
relational constructs, and the other which provides an inverse view of
this mapping. An early prototype of this grammar follows (whitespace
added for clarity):
<?xml version="1.0"?>
<!-- Progress/Relational schema cross-referenced mapping -->
<schema>
<!--Progress schema-->
<progress source="custsvc.df">
<database name="custsvc" ref="CUSTSVC">
<sequence name="next-cust-num" ref="nextCustNum">
<property name="INITIAL" value="0" />
<property name="INCREMENT" value="1" />
<property name="CYCLE-ON-LIMIT" value="no" />
<property name="MIN-VAL" value="0" />
</sequence>
<table name="customers" ref="CUSTOMERS">
<property name="AREA" value="Schema Area" />
<property name="DESCRIPTION" value="Customer service database" />
<property name="VALMSG" value="SERVICE APP 1.0" />
<property name="DUMP-NAME" value="customer" />
<field name="acct-num" ref="ACCT_NUM" type="integer">
<property name="FORMAT" value="999999" />
<property name="INITIAL" value="0" />
<property name="LABEL" value="Account Number" />
<property name="POSITION" value="2" />
<property name="SQL-WIDTH" value="4" />
<property name="ORDER" value="10" />
</field>
<field name="first-name" ref="FIRST_NAME" type="character">
<property name="FORMAT" value="X(12)" />
<property name="INITIAL" value="" />
<property name="LABEL" value="First Name" />
<property name="POSITION" value="10" />
<property name="SQL-WIDTH" value="130" />
<property name="ORDER" value="20" />
</field>
<field name="last-name" ref="LAST_NAME" type="character">
<property name="FORMAT" value="X(12)" />
<property name="INITIAL" value="" />
<property name="LABEL" value="Last Name" />
<property name="POSITION" value="10" />
<property name="SQL-WIDTH" value="130" />
<property name="ORDER" value="30" />
</field>
<field name="since" ref="SINCE" type="date">
<property name="FORMAT" value="99/99/9999" />
<property name="INITIAL" value="?" />
<property name="LABEL" value="Customer Since" />
<property name="POSITION" value="5" />
<property name="SQL-WIDTH" value="4" />
<property name="ORDER" value="40" />
</field>
<primary-key>
<field name="acct-num" />
</primary-key>
</table>
<table name="invoices" ref="INVOICES">
<property name="AREA" value="Schema Area" />
<property name="DESCRIPTION" value="Invoices for customer orders" />
<property name="VALMSG" value="SERVICE APP 1.0" />
<property name="DUMP-NAME" value="invoices" />
<field name="invoice-num" ref="INVOICE_NUM" type="integer">
<property name="FORMAT" value="999999" />
<property name="INITIAL" value="0" />
<property name="LABEL" value="Invoice Number" />
<property name="POSITION" value="2" />
<property name="SQL-WIDTH" value="4" />
<property name="ORDER" value="10" />
</field>
<field name="cust-num" ref="CUST_NUM" type="integer">
<property name="FORMAT" value="999999" />
<property name="INITIAL" value="0" />
<property name="LABEL" value="Customer Account Number" />
<property name="POSITION" value="2" />
<property name="SQL-WIDTH" value="4" />
<property name="ORDER" value="20" />
</field>
<primary-key>
<field name="invoice-num" />
</primary-key>
<foreign-key table="customers">
<field local="cust-num" foreign="acct-num" />
</foreign-key>
</table>
</database>
</progress>
<!--Relational schema-->
<relational>
<database name="CUSTSVC" ref="custsvc">
<table name="CUSTOMERS" ref="customers">
<column name="ACCT_NUM" ref="acct-num" type="integer">
<property name="format" value="000000" />
</column>
<column name="FIRST_NAME" ref="first-name" type="string" />
<column name="LAST_NAME" ref="last-name" type="string" />
<column name="SINCE" ref="since" type="date">
<property name="format" value="MM/dd/yyyy" />
</column>
<primary-key>
<column name="ACCT_NUM" />
</primary-key>
</table>
<table name="INVOICES" ref="invoices">
<column name="INVOICE_NUM" ref="invoice-num" type="integer" />
<column name="CUST_NUM" ref="cust-num" type="integer" />
<primary-key>
<column name="INVOICE_NUM" />
</primary-key>
<foreign-key table="CUSTOMERS">
<column local="CUST_NUM" foreign="ACCT_NUM" />
</foreign-key>
</table>
</database>
</relational>
</schema>
Each database, table, and field entity within the
<progress>
element has an analog within the <relational> element. The
ref
attribute in each of these tags refers to the name of that entity's
analog. Progress properties which do not change in the conversion are
not represented on the relational side, as these values can be looked
up
as necessary. Properties that do change (e.g., format clauses) or that
are necessarily different on the relational side (e.g., data type) are
represented by an appropriate tag or attribute.
TBD Note: the above grammar
needs to be extended to capture additional information regarding the
specific nature of relations between tables, and to accommodate
specific
refactoring scenarios, as discussed in the
Hint
Document(s) section.
Secondary Outputs
Summary Report. This is a
very simple overview report generated by the
SummaryInspector
. Its
primary purpose is to give a more compact overview of the Progress
schema than is easily gleaned from reading the dump file. This report
is
useful for early analysis of a Progress schema.
Incident Report (TBD). An
output from the conversion process which reports any problems or
oddities encountered during conversion. This might include invalid type
conversions, duplication of entity names or other naming conflicts,
orphaned hints (hints which could not be applied, indicating possible
mismatches and invalid results).
Log File (TBD). A
detailed activity log for debugging purposes.
Schema Name Symbol
Resolution
As the UAST parses Progress source code, it will encounter
references to Progress database schema entities, such as databases,
tables (records), and fields. Progress is extremely flexible in the way
these references can be written, in terms of allowing names to be
abbreviated in certain cases, and requiring qualifiers to "parent"
entities only when necessary to avoid ambiguity. This flexibility
requires a somewhat complex mechanism for symbol resolution of
database, table, and field references. The following naming conventions
for schema entities exist in Progress, and must be supported in P2J's
symbol resolution facility:
<database_name>.<table_name>.<field_name>
(a fully qualified field reference)
<database_name>.<table_name>
(a fully qualified table reference)
<table_name>.<field_name>
(a partially qualified field reference)
<database_name>
(a unique reference to a database)
<table_name>
(an unqualified reference to a table)
<field_name>
(an unqualified reference to a field)
To further complicate matters, table names and field names may be
abbreviated, so long as the abbreviated form of the name does not cause
ambiguity between two or more possible name matches. Additionally,
multiple block scopes must be considered. A more complete discussion of
the rules of schema name references in Progress source code as it
relates to schema entity symbol resolution is available in the
uast
package summary.
Scopes and the Schema
Namespace Hierarchy
The SchemaDictionary is responsible for loading schema namespace
information, for managing multiple scopes, and for providing an API to
resolve schema names. A default lookup scope is created upon
construction of this object. It relies upon the SchemaConfig to
determine which, if any, databases it should load by default into this
outermost scope. Calling code may create additional scopes "on top" of
this scope and load arbitrary database names, table names, and field
names into these scopes to augment the default lookup behavior, as
blocks are traversed during source code parsing.
In a namespace lookup, more recently added scopes will always be
searched before older scopes. Thus, the default scope added by the
SchemaDictionary constructor will always be searched last.
The schema namespace hierarchy is organized as
follows:
Each database, table, and field depicted in the diagram above is
represented as a NameNode object. Each NameNode contains a Namespace
object representing that entity's private namespace for its child
entities. For instance, a database NameNode has a private namespace of
table nodes; a table NameNode has a private namespace of field nodes;
a field NameNode has an empty namespace.
There are three layers of global namespaces:
- the Global Database Namespace containing all databases;
- the Global Table Namespace containing all tables across all
databases;
- the Global Field Namespace containing all fields across all
tables across all databases.
A combination of global and private namespaces are used to perform a
name symbol search.
The Search Algorithm
A schema namespace hierarchy such as the one depicted above exists in
each scope defined within the
SchemaDictionary
. These
scopes are implemented as a stack; calling code can push a new logical
scope onto the top of the stack with
SchemaDictionary
's
addScope
method as it enters a new scope during parsing. Later, it should pop
the logical scope off the top of the stack with the
removeScope
method, as it exits the corresponding scope in Progress source code.
Scopes are searched from the top of the stack downward. Either one or
two complete passes through all scopes are performed. The first pass
looks only for exact matches. If and only if all scopes have been
searched and no exact match has been found, then the second pass is
performed. The second pass permits abbreviated names to match.
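The scope stack and two-pass search can be sketched as follows. This is a deliberately simplified, hypothetical model using flat maps; the real SchemaDictionary manages a full namespace hierarchy per scope, and this sketch also omits ambiguity detection:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of SchemaDictionary's stack of scopes and its
// two-pass (exact, then abbreviated) lookup order.
public class ScopeSketch
{
   private final Deque<Map<String, String>> scopes = new ArrayDeque<>();

   public ScopeSketch() { addScope(); }   // default, outermost scope
   public void addScope() { scopes.push(new HashMap<>()); }
   public void removeScope() { scopes.pop(); }
   public void define(String name, String value)
   {
      scopes.peek().put(name, value);
   }

   /**
    * Pass 1 searches every scope (newest first) for exact matches.
    * Pass 2 runs only if pass 1 fails everywhere, and permits an
    * abbreviated (prefix) name to match.
    */
   public String lookup(String name)
   {
      for (Map<String, String> scope : scopes)       // pass 1: exact
      {
         String v = scope.get(name);
         if (v != null) return v;
      }
      for (Map<String, String> scope : scopes)       // pass 2: abbreviations
      {
         for (Map.Entry<String, String> e : scope.entrySet())
         {
            if (e.getKey().startsWith(name)) return e.getValue();
         }
      }
      return null;
   }
}
```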
Given a name reference using one of the naming conventions listed
above, symbol resolution begins at the topmost (i.e., most recently
added) logical scope with a search at the first global namespace for
which a component exists in the name. If no match is found at this
level, the lookup within the current scope has failed. If a matching
node is found, however, a search for the next component of the name in
the matching node's private namespace is performed, and so on, until
either a namespace level search fails, or the entire name symbol has
been resolved. If no match is found at all within the current scope,
the next scope in the stack is searched, and so on, until the name is
resolved, or there are no more scopes to search.
Within each scope, then, suppose we are searching for a name in the
form <database_name>.<table_name>.<field_name>, for instance (using
names from the diagram above), DB_2.Table_1.F2. We first search for
DB_2 in the Global Database Namespace. Upon finding the name DB_2, we
look next for the name Table_1 in the DB_2 Private Table Namespace.
Upon finding the name Table_1, we look next for the name F2 in the
DB_2.Table_1 Private Field Namespace. Upon finding the name F2, we
return the NameNode found. If the DB_2 qualifier component of the name
were missing, the search would begin at the Global Table Namespace
level. If both the DB_2 and the Table_1 qualifier components of the
name were missing, the search would begin (and end) at the Global
Field Namespace level. This portion of the algorithm is implemented in
the SchemaDictionary.Scope.findNode method.
The search at each namespace level takes place in Namespace's find
method. Namespace contains a java.util.List of NameNode objects which
is sorted lazily when a search is requested. The binary search
implemented by java.util.Collections is used. For database name
searches, this binary search uses a comparator which requires an exact
match of all characters in the name component being resolved (since
database symbol names may not be abbreviated). For table and field
name searches (for which abbreviated symbol names are allowed), both
exact matches and abbreviated name matches are permitted. The
abbreviated name binary search is performed using a comparator which
is more lenient: it permits comparisons between a full name and a
matching abbreviation to test as equivalent. Only one type of search
-- exact match or abbreviated match -- is performed at a time by
Namespace's find method.
If no match is found in a namespace search, the symbol resolution has
failed for the current Namespace. If a single match is found, it has
succeeded. However, ambiguous matches must also be considered. To
handle this, a successful match is followed up by a test of the
NameNodes which immediately surround the matching node in the
namespace's NameNode pool. If more than one match is found, the
resolution is ambiguous, and an AmbiguousSchemaNameException is
thrown. This exception contains a list of all the matching NameNode
objects.
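The lenient-comparator binary search and the neighbor-based ambiguity test can be sketched as follows. This is hypothetical, simplified code operating on plain strings; the real implementation works with NameNode objects, supports exact-only searches for database names, and throws AmbiguousSchemaNameException:

```java
import java.util.Collections;
import java.util.List;
import java.util.NoSuchElementException;

// Sketch of the abbreviated-name search: binary search over a
// lexicographically sorted list with a comparator lenient enough to
// treat a full name as equal to any abbreviation (prefix) of itself,
// followed by a check of the neighboring entries for ambiguity.
public class AbbrevSearch
{
   // Compares only the first key.length() characters of the
   // candidate, so "customer" tests equal to the abbreviation "cust".
   // This ordering is consistent with plain lexicographic sorting.
   static int abbrevCompare(String candidate, String key)
   {
      String head = candidate.length() > key.length()
         ? candidate.substring(0, key.length()) : candidate;
      return head.compareTo(key);
   }

   /** Returns the unique match; throws if none is found or if the
    *  abbreviation matches more than one entry. */
   public static String find(List<String> sorted, String abbrev)
   {
      int i = Collections.binarySearch(
         sorted, abbrev, AbbrevSearch::abbrevCompare);
      if (i < 0) throw new NoSuchElementException(abbrev);
      // ambiguity test: do the surrounding entries also match?
      boolean ambiguous =
         (i > 0 && abbrevCompare(sorted.get(i - 1), abbrev) == 0)
         || (i + 1 < sorted.size()
             && abbrevCompare(sorted.get(i + 1), abbrev) == 0);
      if (ambiguous)
      {
         throw new IllegalStateException("ambiguous: " + abbrev);
      }
      return sorted.get(i);
   }
}
```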
Conversion Considerations
Schema Property Mappings
Many properties defined in a Progress schema have a fairly clean
mapping to a similar construct in a relational schema. Others do
not. The following table summarizes the disposition of these
properties with regard to schema conversion. Italicized Progress
property names are implicit; others are keywords from the DF
export file.
Progress
Property
|
Context
|
Purpose
|
P2J
Use
|
NAME
|
database
|
Logical name of the database.
|
Always appears as "?" in the DF
file, so this value is provided in the P2J configuration.
|
NAME |
sequence
|
Logical name of the sequence.
|
TBD
|
INITIAL
|
sequence
|
Initial value of the sequence.
|
TBD
|
INCREMENT |
sequence
|
Amount by which sequence is
incremented to generate a new, unique value.
|
TBD
|
CYCLE-ON-LIMIT |
sequence
|
Whether sequence wraps when it
reaches its limit.
|
TBD
|
MIN-VAL |
sequence
|
Minimum sequence value.
|
TBD
|
NAME
|
table
|
Name of the table.
|
Translated to replace invalid
characters. Becomes basis for relational table name and Java data
model class name. Value may be overridden by hints.
|
AREA
|
table
|
Progress housekeeping - table's
storage area.
|
Discarded.
|
DESCRIPTION
|
table
|
Table description. This
value does not appear to be used at runtime, but is rather for the
convenience of developers and administrators.
|
Comment in the hibernate
configuration files where table is defined.
|
VALEXP
|
table
|
Validation expression to be
applied against record deletions for this table. This expression
must ultimately evaluate to true or false, but it can
contain any valid Progress code, including preprocessor file includes. |
Converted into valid Java
code and triggered on record deletes, possibly via a stored procedure.
Implementation TBD.
Note: There are no instances of table-level validation
expressions in any of the schema dumps we have processed.
|
VALMSG |
table
|
Message to display to user if
validation expression fails.
|
Converted to a Java string
resource and displayed to the user upon a failure of the converted
(Java) validation expression.
Implementation TBD.
|
DUMP-NAME |
table
|
Progress housekeeping - name of
file to which table definition is dumped on export.
|
Discarded. It is assumed
third party tools will be used to manage the relational database after
conversion.
|
FROZEN |
table
|
Progress housekeeping -
indicates table definition is read-only in data dictionary.
|
Discarded. |
HIDDEN |
table
|
Indicates whether table is
hidden. The implications of this are currently unknown.
|
TBD
|
NAME
|
field
|
Name of the field.
|
Translated to replace invalid
characters. Becomes basis for relational column name, Java data
model instance variable name, and names of accessor/mutator methods in
data model class. Value may be overridden by hints.
If no field LABEL property is provided, the unmodified Progress field
name is stored as this column's label (see below).
|
OF |
field
|
Name of table for which this
field is defined; a housekeeping vestige of the DF file format.
|
Used as a reality check during
DF file parsing only. Discarded thereafter, since a column's
containing table is implicit.
|
AS |
field
|
Data type of field.
|
Converted to the appropriate
P2J data wrapper type (see com.goldencode.p2j.util )
or Hibernate type, from which the appropriate SQL type and Java data
type
are inferred. The mapping is as follows:
Progress
Type
|
P2J Type
|
Hibernate
Type
|
integer
|
integer
|
|
decimal
|
decimal
|
|
character
|
character
|
|
logical
|
logical
|
|
date
|
date
|
|
raw
|
|
binary
|
recid - TBD
|
|
|
rowid - TBD
|
|
|
handle - TBD
|
|
|
N/A
|
|
integer (surrogate primary
key)
|
|
FORMAT |
field
|
Default format string used to
display field's value in UI.
|
Translated to a compatible Java
format string and used when updating the UI. Implementation TBD.
|
LABEL |
field
|
Label which appears by default
when this field appears in UI and in printed reports.
|
Stored verbatim and used as the
field's default label in the UI and in printed reports.
Implementation TBD.
This information is used to help generate XML comments in the Hibernate
mapping documents and JavaDoc comments in the Java Data Model Object
class definitions.
If no label is provided, the Progress field's name (unmodified) must
be used as its label, to maintain a compatible look and feel.
|
POSITION |
field
|
Unknown. Possibly an
indicator of the order in which fields are displayed in the table
editor of the data dictionary.
|
TBD
|
SQL-WIDTH |
field
|
Indicates width for visual
representation of a field; closely related to the format
phrase. For fields with extent > 0, this value must first be
divided by the extent size.
|
Stored and used in calculating
the length in bytes of a character or binary column, but only if larger than the length
calculated using a statistical analysis of existing data for a given
column.
|
ORDER |
field
|
Default order in which the UI
widget which displays this field's value is drawn in relation to other
fields' widgets, when multiple fields from a table are displayed.
Also the order in which table fields are exported to Progress export
files. The order assigned to a given field is compared with that
assigned to other fields in the table to determine the current field's
relative position (in ascending order).
|
Stored and used to reorder
properties within a class in the P2O schema document. The order
is critically important when importing data from Progress export files,
as the data must be read from these files in this order.
|
EXTENT |
field
|
If greater than 0, indicates the
field holds an array of values of the field's data type, rather than a
single value of that type. The value of the extent is the size of
the array.
|
Array values are stored in a
secondary table. The secondary table contains a foreign key
column related to the primary key of the table containing the array
column. An index column in the secondary table maintains the
appropriate indexing of the array values. If multiple columns in
the same table use the same extent value, they will all be stored in
the same secondary table, each in a separate column. Otherwise, a
separate secondary table exists for each occurrence of a new extent
value within the same Progress table.
The number of rows in a secondary table equals the number of rows in
the primary table times the size of the extent which the secondary
table manages. For example, a table containing 1000 rows which
manages four fields, each with an extent of 12, will be related to a
single secondary table of four value columns containing (1000 * 12),
or 12,000 rows.
The data model object which maps to a table containing array columns
provides access methods to make this backing implementation transparent.
See also Converted Table Structure.
|
DECIMALS |
field
|
|
|
COLUMN-LABEL |
field
|
Text which overrides LABEL when
the field is displayed in columns.
|
Stored verbatim and used as the
field's default column label in the UI and in printed reports.
Implementation TBD. |
INITIAL |
field
|
Initial value of the field.
|
Translated into a value
appropriate for use in Java. Used to initialize the instance
variable which corresponds with this field upon construction of the
appropriate Java data model class. If no initial value is
defined, use the default for the field's data type.
|
HELP |
field
|
Help text to display to user
when the UI widget which represents this control gains input focus.
|
TBD.
|
MANDATORY |
field
|
Indicates whether the field may
contain the unknown value. If set, the unknown value is
disallowed.
|
Maps to NOT NULL or the
appropriate equivalent.
|
CASE-SENS
|
field
|
Indicates the field is case
sensitive (it is not, by default). This refers to the field's
data, not its name.
|
TBD.
|
VALEXP |
field
|
Validation expression to be
applied against record-level inserts and updates when the current
field's value is saved from the record buffer to the database.
This expression
must ultimately evaluate to true or false, but it can
contain any valid Progress code, including preprocessor file includes. |
TBD.
|
VALMSG
|
field
|
Message to display to user if
field's validation expression fails.
|
TBD.
|
VIEW-AS
|
field
|
Overrides the default widget
type used to display the current field in the UI.
|
TBD.
|
NAME
|
index
|
Logical name of the index.
|
Translated into an
SQL-compatible name and used as the name of an analogous index defined
on the relational table associated with the Progress index's table.
|
ON
|
index
|
Name of the table for which the
index is defined.
|
Used as a reality check during
DF file parsing only. Discarded thereafter, since an index's
containing table is implicit. |
UNIQUE
|
index
|
Indicates that the field or
fields comprising the index must be unique for each row. In the
case of multiple field indexes, the combination of all fields in the
index must be unique.
|
Translated to a unique
constraint which is applied to the table for which the index is defined.
|
PRIMARY
|
index
|
Indicates that the index is used
as the default for query purposes.
|
TBD.
|
WORD
|
index
|
Indicates that the index is a
lexical index on the individual words in the values of a character
field.
|
Not used in initial
project; implementation will be deferred.
|
DIRECTION |
index field
|
Indicates sort direction for the
index: ascending or descending.
|
TBD.
|
ABBREVIATED
|
index field
|
TBD.
|
TBD.
|
Data Types (TBD - obsolete)
Hibernate defines a set of database vendor-neutral data types, which
are mapped to specific database dialects transparently by the
Hibernate framework. Most P2R type mappings can be handled by default
conversions (e.g., character becomes the Hibernate string type,
integer remains integer, etc.).
Some type mappings may require non-default conversions. For instance, a
field of type
character
which in practice contains
very
lengthy data must be mapped to type
text
(which
ultimately
is mapped by Hibernate to a database vendor-specific type such as
LONGVARCHAR
).
In the event of refactoring a single field to multiple columns, or multiple columns to a single field, type is represented in the hints document by a non-standard identifier. This causes the conversion logic to look up a class to handle the mapping in a custom way (TBD).
For instance, there is no time type in Progress, but there is a Hibernate time data type. For consistent use of time information across tables, it may make sense to refactor all multi-field time representations from a Progress schema into a single relational database column defined as Hibernate's time type. A hypothetical set of four Progress integer fields representing hours, minutes, seconds, and milliseconds would in this case be mapped into a single column of type time. This would result in a special type descriptor (perhaps an XML attribute of type="specialtime1") being defined in a hint and backed by a custom converter class.
Converted Table Structure
The conversion of a Progress database table to a relational database table is not necessarily a one-to-one affair. While simple fields map directly to a counterpart of similar data type on the other side, the Progress feature of extent fields (i.e., data columns which actually represent an array of data values of any of the supported data types) calls for some refactoring during conversion for certain tables.
A single Progress table converts to a primary table in the target database, with zero or more secondary tables in which the extent fields are mapped. Fields of like extents in the Progress table are removed from the primary table and are re-mapped to a common secondary table. A separate secondary table is defined for each grouping of like extents. The secondary table is associated with the primary table by a foreign key which references the primary table's surrogate primary key column. The naming convention for such a secondary table is <converted table name>__<extent size>.
For example, consider a Progress table vendor which contains several simple fields (name and address), as well as several fields (order_number, invoice, and widget) which have extents greater than 0.
Table vendor
character name | character address | integer[5] order_number | character[5] invoice | character[3] widget
This table is refactored during conversion into a primary table and two secondary tables as follows.
Table vendor (primary)
integer primary_key | varchar name | varchar address
Table vendor__3 (secondary)
integer parent__id (references vendor.primary_key) | varchar widget | integer list__index
Table vendor__5 (secondary)
integer parent__id (references vendor.primary_key) | integer order_number | varchar invoice | integer list__index
Note: the data types indicated in the converted tables are representative only, as data type names vary among database vendors. Widths are intentionally omitted to simplify the example.
Each secondary table contains a multiple of the number of rows in table vendor, where the multiplier is the extent size represented by that secondary table. So, for instance, if table vendor has 50 rows, table vendor__3 has 150 (50 * 3) rows, and table vendor__5 has 250 (50 * 5) rows. Each secondary table has a foreign key parent__id which references primary_key in table vendor.
A secondary table's primary key is a composite of parent__id and list__index, where the value of the latter represents a record's zero-based index of its position in the original Progress extent field's array. In other words, the data at each index in the original extent field is distributed among <extent size> records in the new, secondary table, as the following example indicates. The order of the data values in the original Progress extent field is strictly maintained.
Record in Original Progress Table vendor
name | address | order_number | invoice | widget
Vendor ABC | 123 Main Street, Anytown, USA, etc... | 5, 22, 86, 87, 190 | 65-ABGD-001, 78-FFFE-005, 12-UIOP-002, 56-B8TY-099, 72-MNQW-62 | W8, W3, W6
Corresponding Records in Converted Tables
Table vendor
primary_key | name | address
199 | Vendor ABC | 123 Main Street, Anytown, USA, etc...

Table vendor__3
parent__id | widget | list__index
199 | W8 | 0
199 | W3 | 1
199 | W6 | 2

Table vendor__5
parent__id | order_number | invoice | list__index
199 | 5 | 65-ABGD-001 | 0
199 | 22 | 78-FFFE-005 | 1
199 | 86 | 12-UIOP-002 | 2
199 | 87 | 56-B8TY-099 | 3
199 | 190 | 72-MNQW-62 | 4
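The distribution illustrated above can be sketched in Java. This is a minimal illustration under stated assumptions, not the actual conversion code: the ExtentFlattener class, its method names, and the row layout are hypothetical.

```java
import java.util.*;

// Illustrative sketch: flatten a group of like-extent fields from one
// parent record into secondary table rows of the form
// [parent__id, field1[i], ..., fieldN[i], list__index].
public class ExtentFlattener
{
   public static List<List<Object>> flatten(int parentId, Object[]... fields)
   {
      int extent = fields[0].length;  // all arrays in the group share one extent
      List<List<Object>> rows = new ArrayList<>();
      for (int i = 0; i < extent; i++)
      {
         List<Object> row = new ArrayList<>();
         row.add(parentId);
         for (Object[] field : fields)
         {
            row.add(field[i]);
         }
         row.add(i);  // zero-based list__index preserves the original order
         rows.add(row);
      }
      return rows;
   }

   public static void main(String[] args)
   {
      // Rows of table vendor__5 for the vendor record with primary key 199.
      List<List<Object>> rows = flatten(
         199,
         new Object[] { 5, 22, 86, 87, 190 },
         new Object[] { "65-ABGD-001", "78-FFFE-005", "12-UIOP-002",
                        "56-B8TY-099", "72-MNQW-62" });
      rows.forEach(System.out::println);
   }
}
```

Running the sketch emits one row per extent slot, matching the vendor__5 rows in the example tables above.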
Data Model Objects
Converted tables are mapped via Hibernate to Data Model Object (DMO) classes. DMO classes are relatively simple, Java bean-like constructs which generally have:
- instance variables of a P2J data wrapper type (i.e., from com.goldencode.p2j.util) to represent each simple field;
- Lists of these wrappers to represent an extent field with an extent greater than 0 (in the case there is only one such field); and
- static, composite inner classes to represent multiple columns with like extents greater than 0.
The complexity of these internals (and of their mapping to the corresponding database table) is generally hidden by Hibernate and the public API of the DMO class. Simple getter and setter methods are provided for each simple instance variable; indexed getter and setter methods are provided to access data corresponding with extent fields.
The following example, which builds off the vendor table in the previous section, illustrates the core of the DMO class definition (sans JavaDoc comments). This example assumes the initial value is specified as the unknown value (?) in the Progress schema for each of the table's fields.
import java.util.*;
import com.goldencode.p2j.util.*;

public class Vendor
{
   private Integer primaryKey = null;
   private character name = new character();
   private character address = new character();
   private List widget = new ArrayList(3);
   private List composite5 = new ArrayList(5);

   public Vendor()
   {
      for (int i = 0; i < 3; i++)
      {
         widget.add(new character());
      }
      for (int i = 0; i < 5; i++)
      {
         composite5.add(new Composite5());
      }
   }

   public character getName() { return name; }
   public void setName(character name) { this.name = name; }

   public character getAddress() { return address; }
   public void setAddress(character address) { this.address = address; }

   public character getWidget(int index) { return (character) widget.get(index); }
   public void setWidget(int index, character element) { widget.set(index, element); }

   public integer getOrderNumber(int index)
   {
      return ((Composite5) composite5.get(index)).getOrderNumber();
   }

   public void setOrderNumber(int index, integer element)
   {
      ((Composite5) composite5.get(index)).setOrderNumber(element);
   }

   public character getInvoice(int index)
   {
      return ((Composite5) composite5.get(index)).getInvoice();
   }

   public void setInvoice(int index, character element)
   {
      ((Composite5) composite5.get(index)).setInvoice(element);
   }

   static class Composite5
   {
      private integer orderNumber = new integer();
      private character invoice = new character();

      public integer getOrderNumber() { return orderNumber; }
      public void setOrderNumber(integer orderNumber) { this.orderNumber = orderNumber; }

      public character getInvoice() { return invoice; }
      public void setInvoice(character invoice) { this.invoice = invoice; }
   }
}
|
Table vendor maps to a DMO class named Vendor. The vendor's name and address properties are accessible via simple getter and setter methods. The properties associated with extent fields (widget, orderNumber, and invoice) are accessible via specialized getter and setter methods which accept an index parameter. The index parameter is zero-based; therefore it must fall within the range 0 <= index < extent size. An ArrayIndexOutOfBoundsException will be thrown for index values outside of this range.
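A hypothetical usage of the indexed accessor pattern follows; to keep the sketch self-contained, plain String stands in for the P2J character wrapper and only the extent-3 widget property is modeled.

```java
import java.util.*;

// Self-contained stand-in for the DMO pattern: String replaces the P2J
// character wrapper; only the extent-3 widget property is modeled.
public class VendorSketch
{
   private final List<String> widget = new ArrayList<>(3);

   public VendorSketch()
   {
      // extent 3; each element starts as the unknown value (?)
      for (int i = 0; i < 3; i++)
      {
         widget.add(null);
      }
   }

   public String getWidget(int index) { return widget.get(index); }
   public void setWidget(int index, String element) { widget.set(index, element); }

   public static void main(String[] args)
   {
      VendorSketch v = new VendorSketch();
      v.setWidget(0, "W8");
      System.out.println(v.getWidget(0));   // prints W8
      try
      {
         v.getWidget(3);                    // outside 0 <= index < 3
      }
      catch (IndexOutOfBoundsException oob)
      {
         System.out.println("index out of range");
      }
   }
}
```

Note that a List-backed implementation such as this throws the more general IndexOutOfBoundsException, which the sketch catches.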
In this case, there is only one field of extent 3 (widget) in the original Progress table. This maps to a simple List of character objects in the DMO. However, there are two fields of extent 5 (order_number and invoice) in the Progress table. These are grouped together into a static inner class (Composite5) of the top level class Vendor.
The mapping of fields with a common extent as properties of an inner class allows Hibernate to manage these properties as a single, secondary table (vendor__5 -- see previous section) mapped to a single class (inner class Vendor$Composite5). The naming convention for such inner classes is always: <converted table name>$Composite<extent size>
Generally, it is not necessary for application code to use these inner classes directly, as public access to their data is provided by the indexed getter and setter methods of the top level class. Thus, these inner classes have package private access. Hibernate accesses them directly through reflection to do its work.
Hibernate Mappings (TBD)
A top level class such as Vendor in the above sections is mapped by
Hibernate to one or more database tables using an XML mapping document.
Key Constraints
Primary and foreign keys constrain the relationship between database tables to enforce data integrity in a relational database. In general, tables can be related in the following ways:
- not at all;
- one-to-one relationship: one record in table A corresponds to exactly one record in table B (example: a "home-address" field in a customer profile record in one table might correspond to a single record in a table of mailing addresses);
- one-to-many relationship: one record in table A corresponds to zero or more records in table B (example: a single customer profile record in one table might correspond to many records in a table of customer orders);
- many-to-one relationship: the inverse perspective on a one-to-many relationship;
- many-to-many relationship: one record in table A corresponds to zero or more records in table B and vice versa. Note: conversion currently does not support this type of relation at the schema level.
To the extent possible, this type of information is gathered through analysis of a Progress schema dump and through analysis of data access logic throughout a Progress source codebase.
Progress has no concept of an explicit primary key to foreign key
relationship defined at the schema level. All table relations are
defined in Progress source code, rather than in the database
schema. A relational database, on the other hand, typically
defines these relationships in the schema, by assigning a primary key
in one table and a corresponding foreign key in another. Since a
relational database relies heavily on such relations, it is important
to determine, to the degree possible, how to infer such relations from a
Progress schema.
Nevertheless,
the introduction of key constraints between tables where there are none
defined today in the Progress schema may introduce unwanted
restrictions on data insert and update in the resulting
application. As a result, we introduce
only those key constraints which can be positively inferred from the
existing Progress schema and application codebase. The
following techniques are used for this analysis:
- Analyzing Progress table indexes to detect fields which are
primary key candidates
- Analyzing natural joins
between Progress tables
These techniques are discussed in more detail below.
A goal of conversion is to define a primary
key for every table in the
relational database schema. A primary key must be non-nullable
and unique. In the converted schema, the primary key will be a
single-column,
surrogate
key; that is, a column which has no business meaning, but is used
only for database purposes. Where possible, the same name is used
across tables to indicate the primary key column. A new column is
introduced for this purpose.
Where a foreign key relationship has been established, the Progress
field(s) which enabled the association in the referencing table are
dropped. These are duplicative with the corresponding fields in
the referenced table. The association which they enable with the
referenced table is now maintained with a single column foreign key
which references the surrogate, primary key in the referenced
table. In the referenced table, these corresponding fields
remain, and the referencing table must now access these columns via a
join. In this case, data integrity is maintained during data
import, by ensuring
that records in referencing tables are correctly fixed up to use the
new
surrogate primary key instead of the existing, referencing
field(s). An illustration of this concept follows.
In the old relation, columns 1, 2, and 3 represent a multi-column primary key in Table A, and columns 1', 2', and 3' in Table B a multi-column foreign key:
Referenced Table (A) | Referencing Table (B)
column 1 | column 1'
column 2 | column 2'
column 3 | column 3'
additional columns... | additional columns...
Post-conversion, a surrogate primary key has been added to Table A; a referencing foreign key column has been added to Table B. Columns 1', 2', and 3' have been dropped from Table B. Access to this data is now achieved via a join between the tables:
Referenced Table (A) | Referencing Table (B)
surrogate primary key | surrogate primary key
column 1 | foreign key to table A
column 2 | additional columns...
column 3 | additional columns...
Progress Table Indexes
The term index in Progress does not have exactly the same meaning as it does in a relational database context. In Progress, an index is used for multiple purposes:
1. to aid in performance during data access, including in joins between tables;
2. to enforce uniqueness constraints;
3. to allow the database to perform natural joins between tables;
4. to determine the sort order to be applied to the results of a query.
In an RDBMS like Postgres, the primary purpose of an index is (1). A secondary purpose is (2), though this is better handled through use of a unique constraint. Indexes are not intended to address (3) and (4); other mechanisms exist for these purposes. We leverage the information stored in a Progress schema about indexes (points 2 and 3 above) to infer which fields in a Progress table behave like primary keys would in a relational database.
If one or more indexes are defined for a Progress table, one of them
must be designated as the
primary
index. This designation does not connote an analog for a primary
key; instead, it simply defines the index to use as the default
index for purposes of retrieving and sorting records in a query.
The designation of a Progress index as
unique is more useful when
attempting to determine primary key candidates. If only one
unique index is defined for a Progress table, the fields in that index
essentially define a primary key. If a table defines multiple,
unique indexes, it is ambiguous from this analysis alone as to which
index should be used as a primary key. However, since we are
introducing surrogate primary keys, this point is largely moot, except
in cases where we can infer a foreign key relation from another
table.
Additional analysis
must be performed to make this determination. In such cases, the
foreign key column introduced into the "foreign" table must contain the
correct surrogate key from the "primary" table. It is necessary
when importing Progress data into the new relational schema, to
correctly match records using values found in the field or fields which
define the join between the two tables involved in such a relation.
Because of the significant difference in purpose, the conversion does
not carry over Progress indexes per se. New indexes are created
for the relational schema as follows:
- Foreign key columns are indexed at schema conversion time (these
are always single column indexes, since we use a single column
surrogate primary key). Primary key columns are not explicitly
indexed, as these are implicitly indexed already by the RDBMS.
- The first column of a Progress index on a large table is indexed when that index is referenced by the Progress use-index directive. This assists the performance of the SQL order by phrase when sorting results.
- Simple indexes are created at code conversion time based upon
actual data access logic and statistics gathered during data
conversion/import. The following rules of thumb apply:
- small tables are not indexed, since the RDBMS query planner
generally will employ table scans in this case anyway;
- boolean columns are not indexed, since the range of possible
values is so low;
- large text columns are not indexed, since most data access
consists of unanchored text searches which do not benefit from indexes;
- given a data access search condition detected in Progress code,
the first column to pass the above screening tests is indexed.
Note that step (1) above represents a hard and fast rule; the
indexes created for that step will only need to change if the relations
between tables are changed (by schema conversion hints, for
instance). Steps (2) and (3) are much more fuzzy, however, and it
is likely that these decisions could be overridden based upon
production statistics and DBA experience. In general, conversion
strives to generate only single-column indexes.
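The rules of thumb above could be expressed as a screening function, sketched below. The Column record, the numeric thresholds, and the type names are illustrative assumptions, not values taken from the conversion code.

```java
import java.util.*;

// Sketch of the indexing rules of thumb. Thresholds and the Column
// model are illustrative assumptions only.
public class IndexScreen
{
   static final long SMALL_TABLE_ROWS = 1000;  // assumed "small table" cutoff
   static final int LARGE_TEXT_WIDTH = 255;    // assumed "large text" cutoff

   record Column(String name, String type, int width, long tableRows) {}

   static boolean indexable(Column c)
   {
      if (c.tableRows() < SMALL_TABLE_ROWS) return false; // table scans win anyway
      if (c.type().equals("boolean")) return false;       // too few distinct values
      if (c.type().equals("character") && c.width() > LARGE_TEXT_WIDTH)
      {
         return false;                                    // unanchored text searches
      }
      return true;
   }

   // Given the columns of a detected search condition, index the first
   // column which passes the screening tests above.
   static Optional<Column> pickIndexColumn(List<Column> searchColumns)
   {
      return searchColumns.stream().filter(IndexScreen::indexable).findFirst();
   }
}
```

A real implementation would draw row counts and widths from the statistics gathered during data conversion/import.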
The conversion also considers index redundancy. Two Progress
indexes are considered redundant with one another if they:
- both have the same uniqueness characteristic (i.e., they are both
unique or both
non-unique); and
- both define the same leading components.
Given two redundant indexes, one is dropped during conversion according
to the following rules:
- if they differ in component count, the less specific index (i.e.,
the one containing fewer
elements) is dropped; otherwise
- if one is the primary index, the other is dropped; otherwise
- if one is referenced by use-index and the other is not, the other
is dropped; otherwise
- if neither is referenced by use-index, one is dropped
arbitrarily.
In any event, if a dropped index is referenced by
use-index
or is the primary index, the dropped index is flagged to delegate
references (including default references to the primary index) to the
surviving index.
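The redundancy test and drop rules above might be sketched as follows; the Index model is a simplification for illustration and carries less state (e.g., no sort direction) than the real conversion metadata would.

```java
import java.util.*;

// Sketch of the index redundancy rules. The Index model is simplified.
public class IndexRedundancy
{
   record Index(String name, boolean unique, boolean primary,
                boolean usedByUseIndex, List<String> components) {}

   // Redundant: same uniqueness characteristic and same leading components.
   static boolean redundant(Index a, Index b)
   {
      if (a.unique() != b.unique()) return false;
      int n = Math.min(a.components().size(), b.components().size());
      return a.components().subList(0, n).equals(b.components().subList(0, n));
   }

   // Returns the index to drop, applying the tie-breaking rules in order.
   static Index pickDrop(Index a, Index b)
   {
      if (a.components().size() != b.components().size())            // rule 1
      {
         return a.components().size() < b.components().size() ? a : b;
      }
      if (a.primary()) return b;                                     // rule 2
      if (b.primary()) return a;
      if (a.usedByUseIndex() && !b.usedByUseIndex()) return b;       // rule 3
      if (b.usedByUseIndex() && !a.usedByUseIndex()) return a;
      return b;                                                      // rule 4: arbitrary
   }
}
```

For example, a unique index on (f1) is redundant with a unique index on (f1, f2), and the single-component index is the one dropped.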
Once redundant indexes have been dropped, the final disposition of
Progress indexes is one of the
following:
- they are converted to unique constraints (unique indexes) - only
fields surviving the conversion are considered for the constraint;
- they are used for composing SQL
order by
clauses
(primary indexes and those referenced in source code by a use-index
statement) - all fields are considered for sorting;
- they are dropped as dead entities.
Natural Join Analysis
A Progress schema can be analyzed to detect the potential existence of what we will term natural joins between tables. However, these are only potential relations, which do not really exist unless realized in the source code through actual join operations. Thus, an analysis of the schema alone yields many false positives. Natural joins are realized in the source code using the of language keyword within a record phrase, as in:
...tableA [outer-join] of tableB...
For a natural join to occur, the following conditions must be met:
- Both tables must contain at least one field with the same name
and data type. Additional common fields are permitted.
- There must exist a unique index in at least one of the two tables
whose index fields are the common fields of condition 1, or some subset
thereof. The index must not contain any fields which are not held
in common between the tables.
Given the above join between tableA and tableB, the following information can be gleaned from the outcome of this test. Assuming condition 1 is met...
- If tableA meets condition 2, but tableB does not, a one-to-many relation between tableA and tableB is implied. Furthermore, a primary key to foreign key relation between tableA and tableB is implied. The field (or composite of fields) defined for the index which meets condition 2 is essentially the equivalent of a primary key. The matching field (or composite of fields) in the other table is essentially the equivalent of a foreign key.
- If both tableA and tableB contain a unique index which meets condition 2, and both indexes define the identical set of index fields, the relation between the tables is one-to-one. The set of fields in the common, unique index form the primary/foreign key relation. Essentially, this set acts as both a primary and foreign key in both tables.
- If both tableA and tableB contain a unique index which meets condition 2, but the two indexes do not define the identical set of index fields, it is ambiguous as to which end represents the primary and which the foreign end of the relation. In this case, an error is reported (with the intent of triggering a manual review), but no join is created.
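The outcomes above can be summarized in a small classification sketch. Names and the index representation are illustrative assumptions; fields are matched by name only here, whereas a real implementation would also compare data types.

```java
import java.util.*;

// Sketch of the natural join classification described above.
public class NaturalJoin
{
   enum Relation { NONE, ONE_TO_MANY, ONE_TO_ONE, AMBIGUOUS }

   // Condition 2: find a unique index whose fields are all common fields.
   static Optional<Set<String>> qualifying(Set<String> common,
                                           List<Set<String>> uniqueIndexes)
   {
      return uniqueIndexes.stream().filter(common::containsAll).findFirst();
   }

   // Classify the potential natural join, assuming condition 1 (at least
   // one common field of matching name and type) was verified by the caller.
   static Relation classify(Set<String> common,
                            List<Set<String>> uniqueA,
                            List<Set<String>> uniqueB)
   {
      Optional<Set<String>> a = qualifying(common, uniqueA);
      Optional<Set<String>> b = qualifying(common, uniqueB);
      if (a.isEmpty() && b.isEmpty()) return Relation.NONE;
      if (a.isPresent() && b.isPresent())
      {
         return a.get().equals(b.get()) ? Relation.ONE_TO_ONE
                                        : Relation.AMBIGUOUS;  // report an error
      }
      return Relation.ONE_TO_MANY;  // the table meeting condition 2 is the "one" side
   }
}
```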
Uniqueness and the Unknown Value
For purposes of the uniqueness constraint imposed by a Progress table index, two instances of the unknown value (?) are not considered equal to one another. That is, for a unique index containing a field fieldA, which is not declared mandatory, two records containing identical data in all other fields and the unknown value in fieldA are each considered unique, and are both permitted in the table.
This has an interesting implication for schema conversion, in that not all relational databases behave the same way. The value NULL in SQL is the equivalent of Progress' unknown value. Some relational databases (e.g., PostgreSQL) consider two NULL values unequal, which lends itself to a clean mapping between the unknown value and NULL. Other databases allow only a single NULL value for a column with a unique constraint.
In the case where the backing database supports multiple NULL values past a unique constraint, the conversion approach is to leverage the unique constraint capability of the backing database to enforce the constraints imposed by Progress' unique index. In the case where the backing database does not support multiple NULL values past a unique constraint, the conversion approach is to use a check constraint with an SQL expression to enforce the required uniqueness. Support for the latter case will not be implemented in the first release.
Determining Query Sorting
Introduction
In order to accurately duplicate the behavior of the Progress 4GL
single record reading approach, one must dynamically select each record
of the specified table in the same traversal order which Progress would
generate. This problem is complicated by the feature provided by
Progress where the same record may be visited more than once if its
data is changed in a manner that reorders it later (or earlier if the
walk is backwards) in the sequence of records. This is very
different from SQL's set oriented approach where one can be sure that
the set returned is of a static size. Besides the obvious danger
of creating infinite loops or otherwise non-intuitive behavior, this
requirement means that each "next" or "previous" record retrieval must
be based on a relative movement between records instead of a walk of a
static list.
In order to duplicate the traversal order of Progress, the relative
movement from one record to the next/previous must be based on
knowledge of the exact sorting criteria used by Progress. As with
many of the behaviors of the Progress language, there is a mixture of
explicit and implicit factors that determine the exact sorting criteria
used for each query.
The Progress database accesses all records via an index, via the combined use of multiple indexes, or via a table scan. When a single index is used, to the degree that the requested sorting criteria do not exist (or, in some cases, where the sort criteria do not fully specify an unambiguous ordering but are compatible with the index), the index used determines the order in which the records are returned. In other words, the index in use can and often does determine (or partially determine) the real sort order.
In the multiple index and table scan approaches, if no explicit sort
order is specified, the sort order is undefined. In addition, even in
cases where the sort order is explicitly specified, it is possible that
records can be ambiguously ordered. When this occurs, the
"secondary" sort order is undefined.
The BY phrase is used to explicitly specify a sort order. It is
especially important to note that the sort order can be
completely implicit (if no BY clause is provided), partially implicit
(if the BY clause does not unambiguously specify an exact ordering) or
completely explicit (if the BY clause results in an unambiguous
order). For this reason, it is possible (and even likely) that a
given Progress application will have ordering dependencies built in,
possibly without the specific intention of the original author.
This may appear as the order of records in a report or the order in
which records are displayed in the user interface.
Implicit sort behavior in the single index case is dependent upon the
selected index. For this reason, one must duplicate the implicit
index selection that the Progress compiler makes. This is driven
by analysis of query predicates (WHERE, OF and USING clauses), sort
specification (BY clause) and explicit index specifications (USE-INDEX
clause). These rules may be used
for both FIND and FOR EACH type queries. Since there are
differences in what can be done with these two query methods, some
rules may not apply to both types. Of most importance, BY clauses
can only occur in a FOR statement (FIND statement sort order is always
based only on the selected index).
The following text describes the rules by
which the sorting criteria are determined in Progress.
Single versus Multiple Index
Selection
As noted above, some ambiguity in sorting is resolved using the index selected for record reading. This point really assumes that a single index is in use; however, multiple indexes can be selected and used together to read the specified records. In the multiple index case, all non-explicit sorting behavior is undefined. It is documented that in the single index cases, the 4GL developer can sometimes rely upon secondary sort behavior based on the index.
The FIND statement only ever uses a single index. The FOR EACH
statement, GET statement and usage of PRESELECT all use multiple
indexes whenever possible (assuming the -v6q command line option is not
active). This multiple index support appeared in Progress v7 and
applications that had a dependency on single index selection (an
implicit sorting dependency) needed the v6 query compatibility mode
(the -v6q command line option).
Even though certain language constructs are enabled for multiple
indexes, it is the query predicate which determines if a single index
or multiple indexes are chosen. In particular:
- Having OR criteria makes multiple index selection much more likely. For block queries, if an OR operator is involved, multiple indexes will be selected even if only a subset of the index components is involved in equality or range matches.
- With AND criteria, an index must have all of its components referenced in the predicate in equality matches (see below). This means that indexes that have only one component, and predicates that reference a superset of fields in multiple indexes, are more likely to trigger this condition.
- This distinction is mostly important when there is no BY clause.
Sorting Behavior
When a single index is used and there is no explicit sort order, the
sort order is defined by the index. This is something that is
relied upon by 4GL developers.
An important point is that when multiple indexes are selected any
implicit sort order is undefined. It is documented that the 4GL
developer should not rely upon any particular sort order that may be
implicitly provided in the multiple index case since this may change in
future releases. One is expected to explicitly specify a BY clause
in this case, OR the code should be independent of the resulting sort
order.
To the extent that a BY clause does not specify an unambiguous sort
order, the sorting within the "buckets" defined in the BY is usually
undefined. In other words, if multiple
records have the same data in the given set of fields specified in the
BY clause, the sort order of these records will be undefined even
though the overall sort order of the query is defined. This is
called the secondary sort order.
The exception to this is the case where there is a single index used
for record access AND there is an explicit BY clause whose fields, the
order of those fields and the direction of each field's sort (ascending
or descending) all match the leading subset of the index in use.
In this case, the index and the sort criteria are compatible and the
sorting is completely specified (and implemented) by the index on the
Progress database server. This means that in such cases, there
will be a reliable secondary sort ordering based on the additional
(more specific) fields in the index that were not part of the BY clause.
Otherwise, all secondary sorting behavior (for both single and multiple
index cases) is undefined.
Some simple tests of explicit sorting have shown the following behavior:
- The BY clause is honored as the primary sort criteria.
- The selected indexes can affect the secondary sorts. Other
than the exception noted above, the rules by which this occurs are
undefined.
- In some cases a secondary sort is based on the record creation or recid/rowid order. Without further tests, the actual criteria here are not visible, since the current tests used records that had been added sequentially. This means that the creation order is the same as the rowid order in the current tests. A more active set of data with volumes of deletes and adds might be able to differentiate this last grey area.
These rules hold for both single and multiple index cases. It is
not known in which cases the indexes affect the sort nor is it known
the extent to which the create/rowid order affects the sort. It
is also unknown how these two methods may be chosen separately or used
together. All this undefined behavior has not been documented nor
determined through testing.
It is possible to imagine a situation where the sort
order in the multiple index case makes an important difference in an
application:
- Report sort or record navigation visible to end user.
- A situation where which record gets processed first is important.
One may delete a record and then create the same record and the
resulting record may have a different rowid. Since rowid is a
physical pointer to the location of the record in the database,
anything that changes the physical order of records in the database may
change the secondary sort order in these undefined cases. This
means that reloading the database, exporting/importing, compacting,
and similar activities can change sorting. Any applications that rely upon
this (at best) have a latent bug waiting to occur.
Since the resulting data imported into the P2J system will be based on
the .d export files (which are in primary index order), all physical
location information is lost in this export/import. To the extent
that this is important, the current approach cannot guarantee this
order.
None of the undefined cases will be supported at this time, but all the
defined cases will be supported. The following are the defined
cases (which can be relied upon by the 4GL programmer):
Sort Case | Result
explicit sort using BY + incompatible index selection | The sort will be completely defined by the BY criteria. The index has no effect.
explicit sort using BY + compatible single index selection cases (implicit or explicit) for secondary sort | Index driven sort. The BY clause is only a subset, so the index itself specifies the complete sort.
explicit single index choice using USE-INDEX (with no explicit sort) | Index driven sort.
implicit single index choice (with no explicit sort) | Index driven sort.
Due to this set of rules, the single index used to select records must
be calculated. If, in Progress, the result would have been
multiple indexes, in P2J it will be calculated as a single index.
This may approximate some of the behavior of Progress, though it is not
expected to completely match the undefined behavior. The rest of
this document describes this index selection process.
Implicit WHERE Clause Components
As with many things in Progress, a query predicate (i.e. record
selection criteria) can be specified explicitly (WHERE clause) or
implicitly (through OF or USING clauses and through the special
unqualified constant form of FIND). Before analyzing the query
predicate, one must add any
implicit portions to get the complete expression used to select records.
Since it is the WHERE clause that largely drives the selection of the
index, all implicit portions of the WHERE must be made explicit before
the index selection process occurs.
Unqualified Constant:
A FIND with a single constant (literal value) instead of a standard
record phrase is the same as making an addition to the WHERE clause
referencing a primary index (which must have only a single field) in an
equality match. Thus:
FIND customer 1 WHERE country = "USA".
Assuming the customer table has a primary index with the single
component cust-num, this is actually the same as:
FIND customer WHERE (country = "USA") AND cust-num = 1.
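The rewrite above can be sketched as a simple transformation. This is an illustrative sketch only; the function name and string-based representation are hypothetical, not the actual P2J conversion API:

```python
# Hypothetical sketch: rewriting a FIND with an unqualified constant into an
# explicit WHERE clause referencing the (single-component) primary index.

def rewrite_unqualified_constant(table, constant, where=None,
                                 primary_index_fields=None):
    """Rewrite FIND <table> <constant> [WHERE ...] as an explicit predicate.

    The primary index must have exactly one component for this form to be
    legal in the 4GL.
    """
    if primary_index_fields is None or len(primary_index_fields) != 1:
        raise ValueError("unqualified constant requires a single-component "
                         "primary index")
    equality = "%s = %s" % (primary_index_fields[0], constant)
    if where:
        return "FIND %s WHERE (%s) AND %s." % (table, where, equality)
    return "FIND %s WHERE %s." % (table, equality)
```

For example, `rewrite_unqualified_constant("customer", "1", 'country = "USA"', ["cust-num"])` produces the expanded statement shown above.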
OF clause:
The usage of the OF clause causes an *addition* of an equality match
for each field in the common index (an index that is common between the
two tables and unique in at least one of them). These equality
matches are connected via an AND operator. If the common index
between two tables is comprised of fields A, B and C:
FIND customer OF order WHERE name = "Bogus"
is converted into:
FIND customer WHERE (name = "Bogus") AND customer.A = order.A AND
customer.B = order.B AND customer.C = order.C
Note that due to the join-like nature of the OF clause, the OF itself
implicitly defines the index which is used to select records in the
target table based on equality matches with those common fields in the
preceding table. This means that this case actually does not
require any WHERE clause analysis since the index is
predetermined. This case is documented here for completeness but
the WHERE clause will not be rewritten.
Direct Logical Field References
A logical field
that is directly referenced as the entire predicate or as a direct
operand of a logical operator is actually considered to be the same as
an equality comparison with the literal true (thus it will affect the
index selection process described below). This means that:
WHERE logicalField
is the same as
WHERE logicalField = true
Interestingly enough, the use of the NOT operator in this case does not
get translated into the corresponding equality comparison (so it does
not affect index selection as described below):
WHERE NOT logicalField
is not the same as
WHERE logicalField = false
USING clause:
The usage of USING causes an *addition* of an equality match
for each field referenced, comparing it with the instance of that field
in the default or specified screen buffer (FRAME). These equality
matches are connected via an AND operator. So:
PROMPT-FOR customer.cust-num.
FIND customer USING cust-num WHERE country = "finland".
is converted into:
PROMPT-FOR customer.cust-num.
FIND customer WHERE (country = "finland") AND cust-num = INPUT
customer.cust-num.
An alternate form of USING usage exists for any field that is defined
as an abbreviated field in an index. In this case:
PROMPT-FOR customer.cust-num.
FIND customer USING cust-num WHERE country = "finland".
is converted into:
PROMPT-FOR customer.cust-num.
FIND customer WHERE (country = "finland") AND cust-num BEGINS INPUT
customer.cust-num.
The difference is that the "=" operator is converted into a "BEGINS"
operator.
From the Progress documentation:
"Abbreviate" is an index option that lets you conveniently search for a partial match based on the first few characters of a field (like using BEGINS) in the FIND ... USING statement. This option is only available on indexes that have a character field as their last index component.
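The USING rewrites above (equality for ordinary fields, BEGINS for abbreviated index fields) can be sketched as follows. The function and its parameters are hypothetical, illustrating the transformation rather than the actual P2J implementation:

```python
# Hypothetical sketch: expanding a USING clause into explicit matches
# against the screen buffer (INPUT) value of each referenced field.

def rewrite_using(table, using_fields, where=None, abbreviated=()):
    """Expand FIND <table> USING <fields> [WHERE ...] into an explicit
    predicate. Fields defined as abbreviated in an index use BEGINS
    instead of the '=' operator."""
    terms = []
    for f in using_fields:
        op = "BEGINS" if f in abbreviated else "="
        terms.append("%s %s INPUT %s.%s" % (f, op, table, f))
    addition = " AND ".join(terms)
    if where:
        return "FIND %s WHERE (%s) AND %s." % (table, where, addition)
    return "FIND %s WHERE %s." % (table, addition)
```

With `abbreviated={"cust-num"}`, the same source statement yields the BEGINS form shown above.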
Single Index Selection Rules
Indexes are selected based on explicit and implicit criteria. In
the absence of an explicit index specification (USE-INDEX), the
Progress compiler makes an implicit index selection. This
selection is based on an analysis of the query predicate. The
idea is that the Progress database server is completely driven by index
processing (or table scans if no index can be used). There are a
very limited set of operations that can be specified in a predicate
which can be translated into index operations. Anything that can
be translated into an index operation can be resolved by the database
server without the Progress client evaluating expressions to determine
if a given record matches.
In other words, a single query predicate can be processed completely on
the client, completely on the database server or in both places.
The degree to which the predicate can be processed on the server has
a great impact on the performance of the database access since it is
highly inefficient to read many records from the server only to reject
them at the client when they do not match the predicate. This is
a statement of data transfer volume (a less visible factor in shared
memory implementations than on network implementations). More
importantly it is a statement of the efficiency of general purpose
interpreted 4GL expression performance compared to a database server
that is tuned for such processing.
Client-side operations are anything:
- with no functional counterpart on the server (e.g. arithmetic
operators like +)
- which operates on a database field that is not an index component
(only index components can be used to traverse a table, otherwise the
table must be sequentially scanned)
- which accesses data/state that only exists on the client
This analysis must identify all those operations that are not client
side operations. This set is then the valid conditions that can
impact index selection (since they can be used on the server).
The rules below specify the relative weight of these conditions
(probably due to research showing the performance value of these
different types of index traversals).
An operation is specified by a logical comparison operator. The
set of operators known by the database server is limited:
Operator | Match Type
---------|-----------
EQ, =    | Equality
LT, <    | Range
GT, >    | Range
LE, <=   | Range
GE, >=   | Range
BEGINS   | Begins (this plays a dual role in both rules 4 and 5 below)
CONTAINS | Word
Any comparison operator not noted in this list is unknown to
the database server. That means that NE (<>), NOT and MATCHES
cannot be processed on the database server.
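A minimal sketch of this operator classification, assuming a simple table-driven lookup (the names are illustrative, not the P2J API):

```python
# Sketch: classifying predicate operators into server-side match types.
# Operators outside this table (NE/<>, NOT, MATCHES) are client-side only.
MATCH_TYPE = {
    "EQ": "equality", "=": "equality",
    "LT": "range", "<": "range",
    "GT": "range", ">": "range",
    "LE": "range", "<=": "range",
    "GE": "range", ">=": "range",
    "BEGINS": "begins",
    "CONTAINS": "word",
}

def classify(op):
    """Return the match type for op, or None when the database server
    cannot process the operator (forcing client-side evaluation)."""
    return MATCH_TYPE.get(op.upper())
```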
This means that when any of the above operators are encountered, the
compiler is allowed to examine the two operands to determine if one of
them is an index component. If either of the operands (it does
not have to be on the left side only) is an index component, then this
may modify the index selected (based on the rules below). For
example, if one encountered a sub-expression
indexField = 5, then this would
be considered an "equality match" on the indexField field.
The only other operators that are recognized during this analysis are
the conjunction and precedence operators. In the single index
case, the AND
operator allows the compiler to
analyze both operands for index selection purposes, but the OR operator
does not allow operand analysis. This makes sense when one
considers that in a single index case, the best fit index will
naturally be able to reduce the result set based on an AND
condition. But the OR operator makes it significantly harder (or
sometimes impossible) to ensure that the results of a single index can
fulfill all possible conditions.
In the multiple index case, the AND and OR operators may both have some
support however this has not been explored or documented.
Parentheses (the precedence operator) neither hinder nor help the
analysis process since they merely ensure that the structure/precedence
of the expression is forced into a known pattern. This resulting
structure is what is analyzed, after the parentheses have already been
applied.
Any operand of one of the
above operators that is not a recognized operator or an index component
(field) will be ignored and it (and its children) will not have any
impact on the index selection. This means that all other
expression features (and their operands) are ignored. Another way
of saying this is that these are not matches as termed above, they have
no impact on the index selection. This is the list:
- literals
- arithmetic operators
- builtin functions
- user-defined functions
- variable references
- methods
- attributes
- field references to fields external to the current table
Since an operand can be a sub-expression, expressions can be
arbitrarily complicated (nested). As soon as one encounters one
of these "unknown" features the analysis skips over that entire
sub-portion of the expression and moves on to the next operator or
operand until the entire expression has been processed and/or
ignored. This is true even if one of the components of that
sub-expression is a field reference in the current table. For
example, price + 5 = 50 will not affect index selection even if
price is a field in the current table that is also a leading component
in an index. This disqualification applies to that part
(operator and operands) only, not to all selection criteria.
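The skipping behavior described above can be sketched as a recursive walk over a tuple-based expression tree. The data representation is an assumption made for illustration; only direct field operands of supported operators (reachable through AND) contribute matches:

```python
# Sketch of the predicate walk: an expression is a tuple (op, left, right),
# a field reference is ("field", name), and anything else is an ignorable
# feature (literal, function call, variable, ...). Illustrative only.

COMPARISON = {"=": "equality", "<": "range", ">": "range",
              "<=": "range", ">=": "range",
              "BEGINS": "begins", "CONTAINS": "word"}

def is_field(node):
    return isinstance(node, tuple) and node[0] == "field"

def contains_field(node):
    """True if any field reference (to the current table) is nested inside."""
    if is_field(node):
        return True
    if isinstance(node, tuple):
        return any(contains_field(c) for c in node[1:])
    return False

def collect_matches(node, matches):
    """Collect (field, match_type) pairs that can drive index selection."""
    if not isinstance(node, tuple):
        return
    op = node[0]
    if op == "AND":                      # both operands remain analyzable
        collect_matches(node[1], matches)
        collect_matches(node[2], matches)
    elif op in COMPARISON:
        left, right = node[1], node[2]
        # Both sides referencing fields of the current table: ignored.
        if contains_field(left) and contains_field(right):
            return
        if is_field(left):
            matches.append((left[1], COMPARISON[op]))
        elif is_field(right):
            matches.append((right[1], COMPARISON[op]))
        # A field nested inside a sub-expression (e.g. price + 5 = 50)
        # is not a direct operand and is ignored.
    # OR and all other features end the analysis for this sub-tree.
```

Note how `price + 5 = 50` contributes nothing: the field sits inside an arithmetic sub-expression, so the whole comparison is skipped.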
A sort match is a special type of match that
is determined based on analysis of a BY clause rather than by analyzing
the WHERE predicate. Identifying sort matches using BY clause analysis
is complicated by 2 factors:
- There is a specificity to the order in which the BY clauses
occur. The first BY clause is the least specific component of the
sort order (the least granular which forces the overall ordering of the
results). Each subsequent BY clause only changes the sort order
within the larger/broader categories created by the previous sort
specification. Thus subsequent BY clauses are increasingly
specific sort criteria. This order must be preserved in the list
of sort matches and any comparisons must take this order into account.
- There is a direction component to the BY clause (implicitly this
is ASCENDING but it can be explicitly specified as DESCENDING). One
would think that if a BY clause references an index component field but
the sorting direction is not compatible with the index component's
definition, then that field reference would not be a sort match. In
fact, Progress still does count this as a sort match (based on index
selections reported in the compiler's XREF output). The exact meaning
here is unknown and this index selection is likely to be useless, since
it seems that the Progress client would be forced to ignore the
ordering provided by the server index and handle sorting itself in this
case.
- The expression in a BY clause can be more complex than a simple
field reference. Only a simple field reference can be counted as a
sort match, as more complicated expressions must be resolved on the
client at runtime (and thus they cannot be used with an index, so they
do not affect index selection). This idiom means that the *result*
of the BY expression is sorted. In other words, one can use an
arbitrary expression, such as the built-in function substring(...), to
extract, derive or otherwise transform the data and get a string,
integer or other value that then defines the sort order.
For each of the match types (equality match, range match, begins match,
word match and sort match), a set of statistics is recorded per index
component that was matched. This information is then used with the
following rules to pick the best index.
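One possible shape for these per-index statistics, assumed for illustration (field names are hypothetical):

```python
# Sketch: per-index match statistics, as consulted by the selection rules
# that follow. Gathered from the WHERE/BY analysis per index.
from dataclasses import dataclass

@dataclass
class IndexStats:
    name: str
    unique: bool
    primary: bool
    leading_equality: int    # sequential leading components in equality matches
    next_is_begins: bool     # component after the equalities has a BEGINS match
    next_is_range: bool      # component after the equalities has a range match
    first_range_or_begins: bool  # first component in a range/begins match
    leading_sort: int        # sequential leading components in sort matches
    word_match: bool         # CONTAINS against a word index

def leading_count(components, matched_fields):
    """Count sequential leading components present in matched_fields;
    the first unmatched component ends the calculation."""
    n = 0
    for comp in components:
        if comp not in matched_fields:
            break
        n += 1
    return n
```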
This list of rules is specified in priority order (1 is the highest
priority). Each priority level is checked in turn, from highest
to lowest. At the start of processing, there is a "selectable"
list that includes all possible indexes for the given table. As
each priority level is processed, indexes that are selectable based on
the criteria at that level will remain in the selectable list and all
other indexes will be removed. If the selectable list contains a
single matching index after a given priority
level is evaluated, that index will be the selected index. If
there are multiple matching
indexes at a given priority level, then the reduced selectable list is
used in subsequent priority levels to provide a "tie breaking" process
to determine which index is selected. This general rule does have
some exceptions which are noted below.
The selection rules:
- An index that was specified in
USE-INDEX. The use of the
USE-INDEX phrase overrides any compiler efforts at
selecting the index(es) to use and limits the selection to that single
index. This is the (one and only) explicit index case. Only
one USE-INDEX
clause can be present in any record phrase, so there is never a need
for a tie
breaker for this priority level.
- A unique index with all
components in equality matches. When each and every index
component of a unique index is involved
in an
equality match that index will be selectable. Since there can be
more than one unique index and the predicate could cause the selection
of multiple
unique indexes, the unique index that has the most components will be
the one chosen. If there are multiple unique indexes that are
selectable with the same number of components, then this list of
selectable choices will be further processed starting at rule 7 below
(this is a tie breaker). In other words, if there is more than 1
unique index that is selectable, rules 3 - 6 are bypassed. If
only one unique index is selectable, that index is used. If no
unique indexes are selectable, then processing continues with rule 3.
- A word index referenced through
contains. When a word index is referenced through a
contains operator, that
index is selectable. Since there can be more than one word index
and the predicate could
select multiple word indexes, then this list of selectable choices will
be further processed starting
at rule 7 below (this is a tie breaker). In other words, if there
is
more than 1 word index that is selectable, rules 4 - 6 are
bypassed.
If
only one word index is selectable, that index is used. If no word
indexes are selectable, then processing continues with rule 4. A
word index cannot be used in FIND statements.
- Most sequential leading equality
matches. Due to the special nature of rules 2 and 3, when
this rule is evaluated, it always processes starting with the full list
of all possible indexes as the potential indexes. The indexes that have
the most sequential leading components with
equality
matches are selectable. Most is interpreted in an absolute sense,
not a relative or percentage sense. For example, if 1 index has
the first 4 index components all involved in equality matches, and no
other index has 4 or more leading index components in equality matches
then this index is chosen. If no
indexes were selectable via this rule, then the full list of all
possible indexes will be checked using rule 5. If more than 1
index has the same number of leading index components in equality
matches, then this reduced list of selectable indexes is further
refined:
- The
next index component (after those already known to be involved in an
equality match) of each selectable index must be evaluated to see if it
has a BEGINS match in this component. For example, an index with
its first 3
components with equality matches and the 4th component as a BEGINS
match will be selected
over an index with 3 equality matches (only).
- The equality matches must all occur
before the begins matches and all equality/begins matches must still be
sequential (the first index component without any match ends the
calculation). Only one component after the equality matches
is checked for a begins match.
- If only 1 index is selectable after the begins match check,
this index is chosen.
- If more than 1 selectable index is still selectable after the
begins test (i.e. if multiple indexes have the next component as a
begins match), then this
reduced list of selectable indexes is evaluated by rule 6.
- If none of the selectable indexes were matched via this begins
tie breaker, processing continues with the next sub-rule (the range
match check).
- The
next index component (after those already known to be involved in an
equality match) of each selectable index must be evaluated to see if it
has a range match in this component.
- There is no concept of "most
additional sequential range
matches" to which the PSC documentation (the Monographs at least) are
referring when they list "most range matches" as an index selection
criteria. Only 1 range match is ever considered.
- If only 1 index is selectable after the range match check, this
index is chosen.
- If more than 1 index is still selectable with the next
component in a range match, then this reduced list of selectable
indexes is evaluated by rule 6.
- If none of the selectable
indexes were matched via this range tie breaker, processing of the
selected indexes (all with the same number of equality matches)
continues in rule 6.
- An index with its first
component involved in a range or begins match. All indexes
that have
their leading component in a range or begins
match are selectable. There is no concept of "most sequential
leading" matches when it comes to range/begins matches (contrary to the PSC
documentation/Monographs).
Range/begins matches to fields other than the first index component are
ignored for index selection purposes. If more than 1 index is
selectable, then the reduced list of selectable indexes is evaluated by
rule 6.
- Most sequential leading sort
matches. The indexes that have the most sequential leading
components with sort
matches (an index component directly
specified in a BY clause rather than the WHERE) are
selectable. For a sort match to be valid, the index components
must be matched in index component name, specificity and sort
direction. Most is interpreted in an absolute sense, not
a relative or percentage sense. For example, if 1 index has the
first
4 index components all involved in sort matches, and no other index
has 4 or more leading index components in sort matches then this
index is chosen. If more than 1 index has the same number of
leading
index components in sort matches, then this reduced list of selectable
indexes is evaluated by rule 7.
- The primary index.
This rule is special because it can be "directly" reached from multiple
rules above. When this rule is evaluated, the selectable list may
include all possible indexes (if no rule above was matched) or it may
include a reduced set of selectable indexes (if one or more of the
rules above selected indexes). Whichever is the case, this list
is searched for the primary index. If the primary index is
selectable, then this index is selected. There can be only one
primary index and even if there are no indexes defined for a table
there is an implicit primary index that is the same as the rowid
order. There can never be a "tie" here but in the case where the
primary index is not selectable, rule 8 is used.
- The index with the
alphabetically first name. In general, the index name
(from the selectable
list) that is lexicographically first
(case
insensitive) is the index chosen. The only exception is if this
is being used as a tie breaker for unique indexes (rule 2), where
the lexicographically last
(case
insensitive) index name is the index chosen. Since all
index names must be different (in more than just case) within a given
table, this rule
will always result in a single index choice.
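The cascade of rules 2 through 8 can be sketched as follows, assuming per-index statistics have already been gathered (the dictionary keys are hypothetical; rule 1, USE-INDEX, is assumed to have been handled beforehand):

```python
# Sketch of the priority cascade. Each stats entry is a dict with keys:
# name, unique, primary, ncomps, leading_eq, next_begins, next_range,
# first_range_begins, leading_sort, word_match. Illustrative only.

def tie_break(cands, last=False):
    """Rules 7 and 8: primary index if selectable, else alphabetical name.
    Unique-index ties (rule 2) use the lexicographically LAST name."""
    for s in cands:
        if s["primary"]:
            return s["name"]
    ordered = sorted(cands, key=lambda s: s["name"].lower())
    return (ordered[-1] if last else ordered[0])["name"]

def by_sort_or_tie(cands):
    """Rule 6 (most sequential leading sort matches), then rules 7/8."""
    best = max(s["leading_sort"] for s in cands)
    if best > 0:
        cands = [s for s in cands if s["leading_sort"] == best]
        if len(cands) == 1:
            return cands[0]["name"]
    return tie_break(cands)

def select_index(stats):
    """Pick the single index from a non-empty list of per-index stats."""
    # Rule 2: unique index with all components in equality matches; ties on
    # the component count bypass rules 3-6 and go straight to rules 7/8.
    full = [s for s in stats if s["unique"] and s["leading_eq"] == s["ncomps"]]
    if full:
        most = max(s["ncomps"] for s in full)
        full = [s for s in full if s["ncomps"] == most]
        return full[0]["name"] if len(full) == 1 else tie_break(full, last=True)
    # Rule 3: word index referenced through CONTAINS.
    word = [s for s in stats if s["word_match"]]
    if word:
        return word[0]["name"] if len(word) == 1 else tie_break(word)
    # Rule 4: most sequential leading equality matches (absolute count).
    most = max(s["leading_eq"] for s in stats)
    if most > 0:
        cands = [s for s in stats if s["leading_eq"] == most]
        if len(cands) == 1:
            return cands[0]["name"]
        # Tie breakers on the next component after the equality run:
        # a BEGINS match, then a range match; a single survivor wins,
        # multiple survivors drop to rule 6.
        for key in ("next_begins", "next_range"):
            sub = [s for s in cands if s[key]]
            if len(sub) == 1:
                return sub[0]["name"]
            if sub:
                cands = sub
                break
        return by_sort_or_tie(cands)
    # Rule 5: first component involved in a range or begins match.
    rb = [s for s in stats if s["first_range_begins"]]
    if rb:
        return rb[0]["name"] if len(rb) == 1 else by_sort_or_tie(rb)
    # Rules 6-8 on the full list.
    return by_sort_or_tie(stats)
```

Note that the fall-through at the end implements the OR/ignored-expression case described below rule 8: with no matches at all, the primary index (rule 7) is chosen.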
Because an OR operator (and its operands) is ignored in single index
selection, any expression using an OR operator as its controlling
decision will default to the primary index, since a "best" single index
cannot be chosen. The same conclusion results if the WHERE expression
is entirely comprised of the ignored expression features defined above.
For example, a WHERE clause comprised entirely of an expression such as
IF indexField THEN sub-expression ELSE other-sub-expression would
ignore indexField, since it is an operand of the IF function. If this
was the entire WHERE clause, the primary index would be selected.
The order of index components in a predicate makes no difference in
index
selection. This means that the following two constructs are
identical in result.
A = ___ AND B = ___
B = ___ AND A = ___
Assuming that there is an index with sequential leading components A
followed by B, this index would be selected in either case.
Likewise, a field reference as an operand can appear on the left or
right side of a supported operator and it is still honored EXCEPT in
the case where both operands of a supported operator are direct field
references (to the given table) or contain
nested field references (to
the given table). In this case, both field references are ignored
(do
not affect index selection).
So of the following examples:
field = 4
4 = field
field = field2
field = entry(1, field2)
field = field2 + 5
field + 5 = field2
The first two examples both result in field being used in index
selection. In the remaining examples, both field and field2 are
ignored even though one or both field references are direct operands
of a supported operator!
Field references (which are index components that would normally affect
the index selection) will be ignored if they are operands of
unsupported operators. Likewise, such field references that occur
inside an ignorable sub-expression are also ignored. Only field
references that are direct operands of a supported operator will be
considered.
A supported operator is only
honored if it is the deciding/controlling operator for the expression
or if it is an operand of an AND conjunction which is accessible.
To be accessible, the AND operator cannot be contained within or an
operand of an ignorable sub-expression or ignorable operator (including
OR).
This means that in the following:
(field = 64) = (field2 < 100)
both field and field2 are ignored since their respective operators are
not operands of an AND operator nor are they the controlling decision
of the expression.
The following is a summary of the interactions between index component
references on both sides of each supported operator:
Operand Reference Summary

Left Operand     | Right Operand    | Match       | Evaluation
-----------------|------------------|-------------|-----------
no reference     | no reference     | no match    | the sub-expression rooted at the operator is evaluated and substituted into the query predicate before the query is executed
no reference     | direct reference | match right | the sub-expression rooted at the left operand is evaluated and substituted into the query predicate before the query is executed
no reference     | nested reference | no match    | the sub-expression rooted at the operator is evaluated on the Progress client after the query is executed
direct reference | no reference     | match left  | the sub-expression rooted at the right operand is evaluated and substituted into the query predicate before the query is executed
direct reference | direct reference | no match    | the sub-expression rooted at the operator is evaluated on the Progress client after the query is executed
direct reference | nested reference | no match    | the sub-expression rooted at the operator is evaluated on the Progress client after the query is executed
nested reference | no reference     | no match    | the sub-expression rooted at the operator is evaluated on the Progress client after the query is executed
nested reference | direct reference | no match    | the sub-expression rooted at the operator is evaluated on the Progress client after the query is executed
nested reference | nested reference | no match    | the sub-expression rooted at the operator is evaluated on the Progress client after the query is executed
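The match outcomes in the summary table can be sketched as a small decision function (an illustration of the table, not the P2J implementation):

```python
# Sketch of the operand reference summary: given how each operand of a
# supported operator relates to the current table's index components,
# decide which side (if any) contributes a match for index selection.

def operand_outcome(left, right):
    """left/right are each one of "none", "direct" or "nested"."""
    if left == "direct" and right == "none":
        return "match left"
    if right == "direct" and left == "none":
        return "match right"
    return "no match"   # both-sided or nested references are ignored
```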
Index Selection Special Case
A FIND lookup by recid/rowid does not use an index.