Progress 4GL to Java (P2J)
Modification Date: January 21, 2004
Access Control: CONFIDENTIAL
Contents
Introduction
High Level Design
Common Infrastructure
Application Client/Server
Components
Data and Database Conversion
Code Conversion
Miscellaneous Issues
References
Trademarks
Introduction
By converting Progress 4GL source code into native Java source code, a
customer can eliminate the disadvantages of the Progress 4GL environment
while retaining the significant investment in the custom application
itself. It is possible to create a set of runtime libraries and
conversion tools to automatically convert an application's source code
from Progress 4GL to Java. These tools will themselves be written in
Java, which means that after conversion the customer will have a pure
Java version of the application with all of the same function as the
current application.
Of equal importance, the user interface will remain identical between
the Progress version and the Java version. This eliminates any need to
modify business processes or retrain users. From the user's
perspective, the application will be the same.
All the advantages of Java will accrue to the new Java version.
Any ongoing costs from Progress Software Corp. will be eliminated, as
none of their technology will remain in use. With no dependency upon
Progress Software Corp., the vendor-related problems are eliminated.
From a technology perspective, Java is one of the most popular and
capable application environments. It provides complete portability
across all hardware and operating system platforms (from mainframes to
minis to PCs to handhelds). There are more than 4 million (and growing)
Java developers available worldwide, along with countless books, sample
code and other technical resources. This eliminates the artificial
limitations due to the proprietary nature of the Progress environment.
Finally, an entire industry is investing in the Java environment, and
there is a virtually unlimited pool of Java-based technology available
which the new Java version of the application will be able to leverage.
The Java environment has extremely strong technological potential, and
thus the Progress 4GL issues in this regard are also eliminated.
Progress 4GL is based on a traditional interpreter plus runtime library
model. While it is highly procedural in nature, from a
functionality perspective the Java platform can be made into a complete
replacement. P2J provides an application environment and a set of
conversion tools which, when added to the capabilities of the J2SE 1.4.x
platform, allow this conversion to be accomplished. This document
describes the P2J design at a high level.
High Level Design
The following diagram shows the high level components of the P2J
application environment (it does not show any of the conversion tools
or processes):

Common Infrastructure
Secure Transport
Java systems use TCP/IP sockets as the premier inter-process
communication (IPC) mechanism, for the following reasons:
- Sockets support is built into the J2SE platform.
- Using the sockets support is very easy.
- It is the only IPC that is 100% portable across operating systems.
- It provides an IPC that is useful between processes on the local
system or separated by an IP network (Internet or intranet). Thus
the natural ability to distribute processing is built in: by using this
IPC mechanism, the decision of where to run a particular process can be
made at implementation time rather than when the system is programmed.
These advantages are significant and any secure transport layer will be
based upon sockets.
Requirements for a secure transport:
- Built on TCP sockets - provides the benefits of a connection
oriented, session protocol and wide interoperability on public and
private IP networks.
- Privacy - the payload of every packet must be encrypted using a
strong encryption algorithm. In this way, it is impractical to
read the data even if the packets are intercepted.
- Integrity - the payload of every packet must be protected from
modification. This way if the packets are compromised, it will be
detected at the receiving end. No hijacking of a session is
possible.
- Authentication - both ends of a session can determine the
identity of the other end point. This is important to ensure that
a secure session is not established with an impostor.
- Platform and language independence - while pure Java will be used to
implement all P2J services and clients for the foreseeable future, the
design ensures that a different language or platform could be chosen
later without precluding the use or interoperability of the secure
transport.
Transport Layer Security (TLS) is a standard for creating secure
sockets. This standard meets all of the requirements stated above
and has been selected as the basis for the secure transport. For
a primer regarding this technology:
TLS Primer
The key idea here is that every P2J module in the system, whether it is
a client, a server or a service, will have a unique identity in the
distributed system. This identity will be composed of the following:
- name
- IP address and TCP port
- public/private key pair and any associated certificate that has
been issued which authenticates our public key based on some trusted
3rd party
Most important in this case is the certificate which is how TLS
implements authentication. Between server/service modules (where
there is no "logged on" user), it is this certificate that is checked
before a session is established. A key assumption here is that the
keys and certificates are strongly secured at the operating system
level, such that no unauthorized individual has access to these
credentials. Otherwise an attacker could impersonate one of the
trusted P2J services.
Starting with J2SE 1.4.x, the Java platform has TLS support built
in. The toolset is called Java Secure Socket Extensions (JSSE)
and it provides a very simple interface for establishing TLS sessions
between two arbitrary processes. Using TLS from Java requires only
slightly more effort than using an unsecured socket. There is a bit
more work to include the authentication capability, though this is
still quite modest. The most significant open item is key management,
which has not yet been investigated.
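As an illustration of how little code is involved, the following minimal
sketch opens a TLS client socket using JSSE. The host name, port and the
use of the standard javax.net.ssl system properties for key and trust
store configuration are assumptions for illustration only.

   import java.io.IOException;
   import java.io.OutputStreamWriter;
   import java.io.Writer;
   import javax.net.ssl.SSLSocket;
   import javax.net.ssl.SSLSocketFactory;

   public class TlsClientSketch
   {
      public static void main(String[] args) throws IOException
      {
         // Obtain the default TLS socket factory; in this sketch the key and
         // trust stores are assumed to be supplied via the javax.net.ssl.*
         // system properties on the command line.
         SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
         SSLSocket socket = (SSLSocket) factory.createSocket("p2j-server.example.com", 9443);
         socket.startHandshake();

         Writer out = new OutputStreamWriter(socket.getOutputStream());
         out.write("hello\n");
         out.flush();
         socket.close();
      }
   }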
Detailed Design/Implementation Process:
- Investigate the options for key management and distribution.
- Is it useful to implement key management and distribution?
- If useful, is it feasible to implement key management and
distribution in the first pass?
- If it is feasible, what options exist to implement this?
- Prototype a set of classes to implement a simple, abstracted way
of creating and using a TCP session between two arbitrary Java
processes. Define the user's identity (noted above) in a
configuration file and allow options to set the target IP address, TCP
port and whether the session should be secured or not. This will
allow us to easily substitute a non-TLS socket for testing the
performance overhead of the secure transport compared to a standard TCP
socket. Implement full TLS authentication on a bidirectional
basis, although ensure that authentication of a client can be disabled.
- Set up a set of test cases to simulate transaction traffic and bulk
data transfer. Use the prototype to measure performance metrics
for both the secured and unsecured sockets.
- Document the process for creating keys and certificates (a
hypothetical keytool sketch follows this list).
- Document the configuration and operation of the secure transport.
- Fully JavaDoc the API.
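As referenced above, the key and certificate creation process could be
documented around the JDK's keytool utility, roughly as in the following
sketch. The aliases, file names and trust relationships shown here are
illustrative assumptions only.

   # Generate a key pair and self-signed certificate for a P2J server
   keytool -genkey -alias p2jserver -keyalg RSA -keystore p2j-server.jks

   # Export the server certificate so that clients can trust it
   keytool -export -alias p2jserver -file p2jserver.cer -keystore p2j-server.jks

   # Import the server certificate into a client trust store
   keytool -import -alias p2jserver -file p2jserver.cer -keystore p2j-client-trust.jks

A CA-issued certificate could replace the self-signed one by generating a
certificate signing request (keytool -certreq) and importing the CA's reply.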
P2J Protocol
The primary focus of the P2J application environment is the execution
of transactions between a client and a server (synchronous
request/response). However, there is also a requirement for the
protocol to support forms of asynchronous messaging. A protocol
must be selected or created which can provide the following:
- Synchronous request/response transactions where both sides can
pass transaction specific data. These transactions can be defined
by a well known identifier and each one can have its own structure for
both the request and the response.
- Asynchronous events where either partner in a session can send a
"one-way" event which is received by the other end point but which does
not generate a synchronous response. This is trivial to mimic
using a request/response transaction that has an immediate, empty
response. The part that is not as trivial is when the server wishes
to send an event notification to a client; since the client is not
normally listening for unsolicited events, this must be planned for.
- All communication must be via the Secure
Transport.
At this time a publish/subscribe model is probably not necessary.
P2J protocol can be split into two logical parts:
- Low-Level communication protocol responsible for establishing
connection, serialization/deserialization, data formats, etc.
- High-Level application protocol consisting of a set of
application specific commands and/or API's and objects which are used
to wrap and hold data during exchange and processing inside the
application.
Low-Level Protocol alternatives:
- RMI
- Pros
- This is the native RPC of Java.
- It is reasonably simple to program because it makes remote
Java objects appear like local equivalents.
- It handles all the marshaling/unmarshaling of data (it uses
serialization under the covers).
- Sun has a sample (referenced in the JSSE guide) showing how to
run RMI over TLS. This would be quite important for this to be a
real option.
- Cons
- It is Java only. There is no good interoperability with
other languages or platforms.
- Performance is an unknown. We have built other systems
using RMI and it has not been a problem, but we have not done specific
testing.
- An additional process is needed (the RMI Registry) on the
server side. This dependency may have negative implications for
load balancing/failover/redundancy.
- As a protocol, we have encountered situations where it can be
sensitive to running through a firewall (you can see different behavior
from the same client when it is running behind a firewall versus
running on the same network as the server).
- The local and remote objects are very tightly coupled, which
increases the effort required when making any change on the
remote side.
- Due to the transparency of remote calls (from a programmer's
perspective), it is easy to build applications where many more calls
are made to the server than is necessary or wanted. In other words,
because these objects are used just like local objects, one may easily
build a system where the remote nature of some objects is not
considered. If these objects are treated as true local objects,
it is easy to create performance bottlenecks. Detection of such
situations is complicated, so these kinds of problems tend to linger.
- IIOP
- Pros
- This solves the Java-only issue of RMI.
- Cons
- Not as simple as RMI.
- Not as natively well supported as RMI.
- May have more performance overhead than RMI because it is
designed to be language/platform neutral.
- RMI-IIOP
- Pros
- Solves the Java-only problem of RMI without adding the full
complexity of IIOP.
- Compatible with J2EE EJB. Distributed transactions
are handled via the Java Transaction Services (JTS).
- Cons
- Requires a CORBA ORB (Object Request Broker), see this link for
more details.
This can be extra cost and extra administrative overhead that could be
significant.
- All constant definitions must be of primitive types or String.
- The same tight coupling between local and remote objects and the
same risk of excessive, unnoticed remote calls described above for RMI
apply equally to RMI-IIOP.
- Web Services (e.g. JAX-RPC)
- Pros
- Popular so this has a marketing advantage.
- Completely language and platform neutral.
- Cons
- Roll Your Own
- Pros
- Designed for exactly what we need from a functional and
performance perspective.
- Can be designed to allow minimal latency for asynchronous
events.
- No risk for redundant or unnecessary remote calls.
- Performance tuning/bottleneck detection can be easily done.
- Even complex changes in the protocol can be made in a
controlled way, with minimal effort and risk. Most likely such
changes can be made without touching other parts of the application.
- An extensible protocol with backward and forward compatibility
can be implemented without significant effort.
- Classes which are used to hold and process application data
can be shared between client and server code or independently changed
if necessary.
- Asynchronous notifications can be an integral part of the
protocol. For example, replies can be split into fixed size chunks
which can be mixed with notifications. This approach can provide
predictable and controlled latencies.
- Cons
- Proprietary.
- Requires time to implement.
High-Level Protocol considerations:
Possible choices depend on the low level protocol. Most low level
protocols listed above are designed with the assumption that objects
are specially designed for the distributed environment. For example,
objects implement a specified interface and/or have limitations on
data types. In this case the high level protocol logic is hidden
inside Java "client" code. Also, all objects used for building the
high level part of the protocol must be implemented in both a
server-side and a client-side version. In some cases additional
classes are also required for registering objects in a Naming
Service. The following problems may arise from this approach:
- Support is complicated.
- Extensions and/or changes to the protocol will require significant
effort.
- Whether an extensible protocol with good backward (and probably
forward) compatibility can be built is unknown.
- Large data object transfers are hard to interrupt or mix with
asynchronous notifications.
Approach:
From the information provided above, the conclusion is that two
protocols may satisfy our needs: RMI and our own protocol.
Selecting between them comes down to two critical points:
language/platform neutrality and any issues or limitations due to the
design or implementation of RMI.
An effort was made to use RMI, since it exists in a working, well
known form without further development being necessary. However,
during the course of the detailed design work many limitations were
found and documented. The result is that RMI is not suitable for the
P2J low level protocol. Please see the following link to understand the limitations of RMI.
Instead a custom protocol has been designed to meet the project's
exact requirements.
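As a rough illustration of what the low level framing of such a custom
protocol might look like, the following sketch defines a simple message
frame. The class name, type constants and field layout are assumptions
only, not the actual P2J protocol.

   import java.io.*;

   // Hypothetical sketch of a low-level message frame for a custom protocol.
   // A fixed header identifies the message type and a correlation id so that
   // synchronous replies and asynchronous notifications can share one session.
   public class MessageFrame
   {
      public static final int TYPE_REQUEST      = 1;
      public static final int TYPE_RESPONSE     = 2;
      public static final int TYPE_NOTIFICATION = 3;

      private int    type;          // one of the TYPE_* constants
      private long   correlationId; // matches a response to its request
      private byte[] payload;       // serialized application data

      public MessageFrame(int type, long correlationId, byte[] payload)
      {
         this.type          = type;
         this.correlationId = correlationId;
         this.payload       = payload;
      }

      /** Write this frame to the (secure transport) stream. */
      public void write(DataOutputStream out) throws IOException
      {
         out.writeInt(type);
         out.writeLong(correlationId);
         out.writeInt(payload.length);
         out.write(payload);
         out.flush();
      }

      /** Read the next frame from the stream. */
      public static MessageFrame read(DataInputStream in) throws IOException
      {
         int    type = in.readInt();
         long   id   = in.readLong();
         byte[] data = new byte[in.readInt()];
         in.readFully(data);
         return new MessageFrame(type, id, data);
      }
   }

A fixed header of this kind makes it straightforward to interleave fixed
size response chunks with asynchronous notifications on the same session,
as described above.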
Transaction Requester
The Transaction Requester provides the client side of transaction
processing. It hides details such as how to contact the transaction
server and how to establish the secure session (including user level
authentication via userid/password). It exposes a set of transactions
to the client application (usually the presentation layer) whose
implementation and location are hidden. From the application's point
of view, the Transaction Requester is an API (or Java interface) or a
set of APIs (Java interfaces).
The Transaction Requester consists of two main parts:
- Initialization Module
- Protocol Module
The Initialization Module reads the configuration, establishes the TLS
connection with the server and performs other relevant startup tasks
(for example, it starts the name service, if necessary). If
initialization is successful, the Transaction Requester is ready and
can perform other tasks. The main problem for the initialization
module is storing the client configuration. The configuration contains
security-sensitive data, in particular certificates, so it must be
protected from outside access. On the other hand, the client
configuration should be manageable and maintainable in a convenient way.
The Protocol Module implements the High-Level protocol (see P2J
Protocol), provides the API for the client side
and uses the connection established by the Initialization Module.
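A hypothetical sketch of how the Transaction Requester might be exposed
to the presentation layer follows; the interface name, method signatures
and exception type are illustrative assumptions, not the actual P2J API.

   /** Hypothetical exception reported when a transaction cannot be completed. */
   class TransactionException extends Exception
   {
      public TransactionException(String msg) { super(msg); }
   }

   /** Hypothetical client-side view of the Transaction Requester. */
   interface TransactionRequester
   {
      /**
       * Execute a synchronous request/response transaction.
       *
       * @param transactionId well known identifier of the transaction
       * @param request       transaction specific request data
       *
       * @return transaction specific response data
       */
      Object execute(String transactionId, Object request) throws TransactionException;

      /** Send a one-way asynchronous event to the server (no response expected). */
      void sendEvent(String eventId, Object data) throws TransactionException;
   }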
Transaction Server
The Transaction Server provides the server side of transaction
processing, including security management and transaction routing. To
perform this task, the server manages "sessions". A session is opened
at the moment of successful login and its state is managed by the
Transaction Server until the user explicitly closes it by logging off.
The session holds all state variables, including the execution point in
the application logic, active windows, etc. The session state can be
read and changed by client requests, by passing request data to the
runtime environment and processing the responses. This approach
enables reconnection to an existing session, for example after a client
crash or long term link downtime. Another interesting feature is
session state logging, which is very useful for problem determination.
The Transaction Server consists of the following modules:
- Security Manager: implements Security
Model and provides API for all security related needs. All
security context is stored inside the Security Manager and cannot be
directly accessed or modified.
- Protocol Module: implements server side P2J
Protocol,
including server-specific parts. It interacts with each of the other
modules as needed:
- It interacts with the Security Manager for
purposes of authentication and userid/groups/password management.
- For session creation/closing it interacts with the Session
Manager. It also uses the Session Manager for access to the
session state.
- It makes the Transaction/Event Router visible to the client
side.
- Transaction/Event Router: provides connection to the Runtime
Environment. It enables location transparency for all
transactions or events. It provides queuing, potentially
including prioritization. It processes these queues and routes
the transactions and events to the correct destination.
- Session Manager: responsible for holding and controlling the
state of all open sessions.
A key point: the security context will *never* be passed from the
client to the server. To do so would be unnecessary and would
also cause a
security risk. It is unnecessary since there is a dedicated TLS
connection that is associated with a security context on the server
side. This means that every packet can be associated with the
right
context on the server without any additional information.
Secondly, if the server were to accept a context from the client and
assign this, then a
modified/compromised client application could escalate its own
rights... thus the server cannot honor any security context passed from
the
clients. The protocol will be designed
to remove any such possibility.
This security context will be made available to the entire server-side
P2J environment using a static ThreadLocal object. This will provide
for fast, simple access to the session object that is specific to the
current thread. This allows different threads to each have a separate
context without turning access to this context into a reference that
must be passed through all levels of method calls.
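A minimal sketch of this pattern follows; the class name, the type of the
stored context and the points at which it is set and cleared are
assumptions for illustration only.

   // Minimal sketch of exposing the per-session security context to
   // server-side code via a static ThreadLocal.
   public class SecurityContext
   {
      private static final ThreadLocal CONTEXT = new ThreadLocal();

      /** Called by the protocol layer when a request thread starts work. */
      public static void set(Object sessionContext)
      {
         CONTEXT.set(sessionContext);
      }

      /** Called from anywhere in the server-side environment. */
      public static Object get()
      {
         return CONTEXT.get();
      }

      /** Called when the request thread finishes, to avoid stale references. */
      public static void clear()
      {
         CONTEXT.set(null);
      }
   }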
After establishing a connection
and completing a successful authentication there is no need for an
explicit security context
on the client. Implicitly it will exist in the TLS but we don't
need to control
it or even have access to it from
the client.
It should be noted that "transaction" in the descriptions of the
Transaction Requester and Transaction Server should not be confused
with the concept of database or data model transactions performed by
the Runtime Environment. The Transaction Requester and Transaction
Server work with "transactions" as a mechanism for sending requests
from client to server, processing them and sending results back (with
or without success). Such behavior is well described as a
"transaction" (a sequence of call-reply), and for this reason we call
the Requester and Server the Transaction Requester and Transaction
Server respectively. These transactions may or may not change the
state of transactions in the Runtime Environment, but there is no
direct link between these concepts.
Security Model
The Security Model is the approach used in P2J to provide valid
users with their assigned access to resources and to prohibit valid or
invalid users from accessing resources to which they have no rights.
At the most basic level, there are 3 important concepts:
- subjects (users, groups, clients, servers)
- resources (application function or data which subjects can use)
- access rights (the mapping of which resource access is allowed
for valid subjects)
A database of subjects, resources and access rights (also known as
Access Control Lists or ACLs) is maintained as part of the Security
Model. A unified API provides a
transparent way to perform authentication, password, group and ACL
management
and access checking for other parts of the application. Access
checking is performed by a generic, pluggable security decision engine.
For the purposes of P2J, standardized access control
points in various places of the application represent resources.
A user can be assigned to one or more groups for the purposes of access
rights management.
Subjects
All users of the system will be represented by an account
definition in the P2J Directory. This account definition will
store information specific to the associated user, including
identifying information (e.g. passwords or passphrases, certificates),
policy rules (e.g. may only logon between 8am and 5pm on non-holiday
weekdays), status information (e.g. last logon time, account is
enabled/disabled) or other security relevant data. This data will
be accessible to the other components of P2J.
All servers and client programs will also be represented in the P2J
Directory. These are also considered subjects, although they do
not have a traditional password based authentication sequence. It
is important to associate rights with programmatic subjects just as
with users. This enhances security as it may disallow certain
information or function access from specific P2J components. For
example, by removing the ability to edit the P2J Directory from a
standard client, even highly privileged users (that might normally have
access to directory editing) would be required to handle directory
maintenance from a known special terminal. This approach of
applying security even to system components is also critical to
limiting the damage that can occur by any security flaw found in one of
those system components. This enables a layered, orthogonal
approach to security that greatly strengthens the system.
P2J will provide group support. This is the concept that
access
rights (ACLs) can be managed and assigned to an arbitrary entity called
a group. Users can then be included in one or more groups, allowing a
single policy definition to apply to multiple users. A group can
be considered as a class or type of user, and this greatly eases
administration effort. From the point of view of ACL management,
groups look like ordinary users, i.e. an ACL can be assigned, changed
or retrieved for a group just as for a user. Groups do not have
assigned passwords and it is impossible to log into the system using
the group name as a userid. Any user that is included in multiple
groups will obtain the union of all possible access for all groups of
which they are a member, with the caveat that the first explicit
refusal of access (a negative ACL - see below) will halt further
access checks.
Various systems define different approaches to group management.
Some allow nesting groups within groups. Other systems allow just one
level of groups and do not allow a group to be included in another
group. The ability to nest groups provides more flexibility for access
rights management, but requires more attention (or a clearly defined
policy) because of possible loops and unwanted rights assignments, and
such systems are more complex to implement. Single level systems are
less flexible but do not have the issues described above. Since there
are no compelling reasons for implementing groups within groups, a
single level system will be used in P2J.
Authentication
Authentication is the process of associating a user or system
component with a system recognized identity (an account in the P2J
Directory).
Users are authenticated using a userid and password.
Authentication is performed by a pluggable component which implements a
standard interface. For example, the following components can be
implemented:
- A component which stores userids and passwords locally (in the
filesystem) and uses this data for authentication. Depending on
the authentication sequence passwords can be reversibly encrypted or
one-way encrypted.
- A component that authenticates a subject using the standardized
interface of an external directory service (e.g. LDAP).
- A component which uses an external directory service as the
storage/access mechanism for the authentication data, but handles the
authentication sequence itself.
- An implementation-specific authentication component that accesses
a specific set of devices, data or services that are customized to a
specific installation (e.g. biometric devices such as fingerprint
scanners).
Authentication components may:
- allow optional password strength checks and password aging
- integrate ACL management (and export this through an ACL
access/management API)
- implement the following authentication schemes:
- Ordinary userid/password authentication. In this case the
userid and a one-way encrypted password are sent via the Secure Transport.
- Private certificates are stored in an encrypted file on the
client side. The user provides a password which decrypts the
private certificate, which is then used to establish a TLS session with
the server. The server has a record of the matching public
certificate for that client (using the userid to match the
certificates). If the digital signatures of the session
establishment match, then the session is allowed to continue and the
authentication succeeds. No authentication password is actually used
(no password checking is done on the server side).
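As a sketch of what the pluggable authentication interface described
above might look like (the interface name and signature are assumptions,
not the actual P2J API):

   /** Hypothetical pluggable authentication component interface. */
   public interface Authenticator
   {
      /**
       * Authenticate a subject and return true on success.
       *
       * @param userid      account name from the P2J Directory
       * @param credentials password, passphrase or certificate reference
       */
      boolean authenticate(String userid, Object credentials);
   }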
Based on analysis regarding the Secure Transport (TLS), it is clear
that compromising a client side certificate does not allow an attacker
to read data transferred through that TLS connection. This makes
it unnecessary to implement any form of challenge-response
authentication scheme. Challenge-response is a scheme in which,
after connection, the server sends the client a random challenge string
which is then merged with the password. A hash function is calculated
for the resulting string. The result returned by the hash function is
then passed to the server along with the userid. The server side
performs the identical calculation for the reversibly encrypted
password stored in module storage. The server's result is compared
with the data provided by the client. If these values match then
authentication succeeds. The advantage of this approach is that the
password is not sent over the connection, so it cannot be breached in
this manner. However, TLS provides sufficient safeguards by using a
symmetric session key that is never sent over the connection (it is
calculated based on data that is sent such that only the server can
decrypt it). In other words, a TLS session is only vulnerable when the
server side certificates are compromised.
All schemes have strengths and weaknesses. In all cases the server
side is assumed to be secure, i.e. all security measures are correctly
implemented and maintained. Examples:
- All software must be properly installed and configured.
- Minimum privilege and limited access rights are assigned to all
files.
- Physical access to the server is allowed
only for authorized personnel.
Ordinary userid/password authentication is simple to implement and
reasonably safe as long as the server is secure from the
point of view of third party access. The problem is that
passwords are generally weak and often are written down or are easily
guessed.
The individual certificates scheme does not have the problems of the
scheme described above, since the user must have something unique (not
just know something) in order to access the server. It is the most
secure of the options. However, it does require storing personal data
on the client side, which makes maintenance more complicated. Another
problem is that this personal data may also be accessed by unauthorized
subjects. Encryption and/or storing of this personal data on
individual removable devices can address this problem. In addition, in
a terminal oriented system the "client" side is really the server
side. If the server can be adequately secured, the maintenance and
security issues of the individual certificates approach can be
minimized.
It should be noted that it is of greatest importance to ensure that
the TLS
certificates are secured, since these may be the sole basis for
authentication in some implementations.
The ordinary userid/password
approach and the individual
certificates approach will both be implemented in the first pass of
P2J. In all cases, the P2J Directory will be the storage
mechanism for the authentication data.
Server (and possibly client) components will be authenticated using TLS
certificates. As a result, it is critical to separate and
isolate different components into different filesystem paths.
Then, by associating each process with a specific operating system
account and only allowing that account access to the filesystem
locations defined for it, the certificates can be reasonably secured
and a breach of one server or service will not necessarily represent a
security breach of the entire system. As noted above, this approach requires that the hardware
and operating system security is implemented properly. P2J has no
control over this level of security and is completely dependent upon it.
Communication Session Integration
All P2J clients must process authentication as part of setting up
the
secure transport and connecting to the server. Nothing is accessible
or visible before authentication succeeds.
Once authentication succeeds, the user's identity is established and is
associated with the secure transport session in use. This
identity is used as the subject in security decisions. When
a request for access is made from a non-client module, then the
identity is established via the key based authentication of the secure
transport itself.
Resources and ACLs
Standardized access control points (resources) will be established
in:
- Transaction Server
- session establishment
- application transaction access
- routing decisions
- menu processing and navigation in the UI
- Data Model
- database/table access
- data filtering at the row level
- P2J Directory
- elsewhere in the code as is found based on the source application
Each such location is considered a different resource. A
resource type is defined as any unique decision hook (access control
point) that implements the unique backing logic. Where multiple
access control points implement the same backing logic (usually through
shared code), all such resources would be considered to be of the same
resource type. Each resource type has a naming convention for
how it differentiates between which specific resource is being
requested. In addition, each resource type has a set of valid
ACLs by which it both defines the possible access methods and also
defines the rights (or lack thereof) assigned to specific
subjects. For example, in Unix systems it is common for there to
be the access rights RWX (read, write and execute). In
traditional SMB file/print servers (e.g. IBM's LAN Server) one can see
CDRWXAPN (create, delete, read, write, execute, change attributes,
change permissions, no access) as the list of possible
permissions. In both cases, these are normally defined as a set
of boolean values (a bit set) that can be easily defined and quickly
tested. All resource types will define their own set of possible
permissions along with the representation of those permissions in the
ACL itself.
Note that it is useful in some cases to have an explicit "no access"
bit (a "negative ACL"), which can override access that a subject might
otherwise have. By default, all hooks implement no access in the
case where no positive access right is defined.
All resources will be "invisible" when access is not granted.
There must be no indication to the subject as to whether or not a
resource actually exists if no access is allowed. This is an
important factor in limiting attacks that are designed to learn about
the resources of the system as a preface to making additional attacks
through other means. It is much harder to attack a system if one
can not even determine that a particular resource exists. In
addition, sometimes knowing that a resource exists is a security breach
itself if the resource is unique enough (e.g. through its name) to tell
the attacker something sensitive about the contents. For example,
if one resource type is a database and a database exists for each
customer, then being able to see the list of database names that are
disallowed still allows the attacker to get a customer list from the
system. For these reasons, no access always means invisibility.
Each access control point has a unique type and name, and implements a
hook that calls the security decision engine for service. This hook is
responsible for reading the correct ACL data from the P2J Directory and
properly invoking the security decision engine. The names of resources
are used as identifiers for the purposes of ACL set management
(storing/retrieval/changing), and the subject is known from the context
of the current session. All resources are named in a unified way.
It is possible that some resource types may implement their own
hierarchical structure of related resources. A traditional
filesystem is a classic example of this. In these cases, access
to one resource can be made dependent upon prior access to higher nodes
in the tree. Resources should be able to implement inheritance
schemes and hierarchy traversal in a manner that makes hierarchical
resources possible.
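To make the hook mechanism described above more concrete, the following
sketch shows how an access control point might call a pluggable decision
engine; every name here is an illustrative assumption, not the actual
P2J API.

   // Hypothetical sketch of an access control point (a "hook") calling a
   // pluggable security decision engine.

   /** Pluggable decision engine interface. */
   interface DecisionEngine
   {
      /** Evaluate the ACL rules for a subject, resource and rights bitmask. */
      boolean evaluate(String subject, String resourceName, int requestedRights);
   }

   /** An access control point guarding access to a database table. */
   class TableAccessHook
   {
      static final int READ  = 0x01;
      static final int WRITE = 0x02;

      private final DecisionEngine engine;

      TableAccessHook(DecisionEngine engine)
      {
         this.engine = engine;
      }

      boolean canRead(String subject, String tableName)
      {
         // No positive access right means no access (and invisibility).
         return engine.evaluate(subject, "datamodel/table/" + tableName, READ);
      }
   }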
Security Decision Engine
The Security Decision Engine processes generic security rules (ACLs)
using the Expression Engine.
The rules specify a logical expression that defines whether a
particular subject (a user, or the groups to which the user belongs)
has specific access rights to a particular resource. Each resource has
its own set of possible access rights. This may be as simple as a
boolean permit/deny or as complex as a bitmask of possible rights. The
interpretation of the rights is up to the particular code that provides
access to the resource. This code implements the hook that calls the
security decision engine with enough information regarding the
resource, the rights requested and the subject. The security decision
engine then reads the related security rules and processes these rules
to determine the outcome. The result is a boolean outcome. The caller
of the Security Decision Engine is responsible for enforcing the
resulting outcome.
The ACLs need to be generically stored and accessed, even though the
resources and access rights that may be referenced are arbitrary and
application specific. The security decision engine should not
need to be aware of the meaning of such values, but it should still be
able to properly evaluate the ACL. Besides the standardized data
(like resource name and the access rights bitmask), there may be an
arbitrary amount of application specific data that is associated with
an ACL. There may also be security relevant data associated with
other parts of the P2J Directory database such as a user's account.
This approach allows a generic security infrastructure to secure a
customized set of resources and implement the corresponding custom
access rights.
Directory Server
With the distributed design of P2J, it is critical that all clients and
servers share a common set of security and configuration data.
This will be accomplished by maintaining a centralized directory
service. This service will be responsible for the user
authentication, the storage and retrieval of security and configuration
data and access control over that data. It will provide this
information via the secure transport to properly authorized entities.
The stored data can be classified in 2 categories (the following are
only examples and are subject to change):
- Security Related
- System Entities
- devices
- users
- groups
- resources
- data model objects
- menus/screens
- transactions
- Attributes
- access control lists (ACLs)
- public keys/certificates
- authentication secrets
- Other Configuration
- starting/top level menu or screen
The following design decisions have been made:
- The directory service is logically separate from the rest of the
application server processes. It may reside on the same system or
it may reside on a separate system which is accessible via the secured
transport.
- The secured transport will be the only method of communication
with this service.
- Each client and server in the system will have the following
"bootstrap" configuration:
- public/private key pair
- name
- hostname/IP address and port of the directory service
- With the bootstrap configuration, any client or server has enough
information to contact the directory server and authenticate (a
hypothetical example of a bootstrap file is sketched after this list).
Once this occurs, the rest of its authorized configuration is
accessible.
- Lightweight Directory Access Protocol (LDAP) will be used as the
access mechanism for the directory data.
- A directory service that is specific to P2J will be the front-end
to the LDAP accessible directory server. Only this directory
service will be coded to LDAP. This abstraction layer will
provide a P2J interface internally and will use LDAP to access the data.
- The directory service will provide a configurable mapping between
P2J entities/attributes and the specific customer implemented LDAP
schema.
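As referenced above, a bootstrap configuration might look roughly like
the following properties-style file; every name, path and value here is
an assumption for illustration only.

   # Hypothetical bootstrap configuration for a P2J client or server
   name=order-entry-client-01
   keystore=/etc/p2j/order-entry-client-01.jks
   directory.host=directory.example.com
   directory.port=9636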
The specific design above provides the following benefits:
- Common directory storage between multiple applications, at least
one of which will be P2J based and others of which may be written in
any arbitrary language and running on arbitrary operating
systems. This is enabled by the logical separation of the
directory server as well as the choice of LDAP as the access
protocol. Because LDAP is so commonly available, data exposed via LDAP
is easily accessed from virtually any environment and OS platform.
- Logical separation of the directory service from the rest of the
P2J architecture allows the customer to make the implementation choice
of where to run the service (physical co-location with the other P2J
services or not) to maximize reliability, availability, security and
performance.
- The abstraction of the directory service from the backing LDAP
server means the following:
- Other non-LDAP back ends can be implemented in the future.
- In the future a pluggable interface could be created to allow
implementation choice of different back ends by the installer.
- It will facilitate user hooks for authentication.
- LDAP specific code will be centralized and minimized rather
than being spread throughout the project.
- Customers can implement an arbitrary hierarchy and we can map
ourselves into the customer's tree as needed. This allows the
LDAP server to be used for a wide variety of applications rather than
forcing the customer into using an LDAP server for only P2J based
applications.
Please see the LDAP
Primer for more information on LDAP.
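For orientation, directory data can be read over LDAP from Java using
JNDI (part of J2SE), roughly as in the following sketch; the host, port
and distinguished names are assumptions only.

   import java.util.Hashtable;
   import javax.naming.Context;
   import javax.naming.NamingException;
   import javax.naming.directory.Attributes;
   import javax.naming.directory.DirContext;
   import javax.naming.directory.InitialDirContext;

   // Minimal sketch of reading directory data over LDAP using JNDI.
   public class LdapReadSketch
   {
      public static void main(String[] args) throws NamingException
      {
         Hashtable env = new Hashtable();
         env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
         env.put(Context.PROVIDER_URL, "ldap://directory.example.com:389");

         DirContext ctx = new InitialDirContext(env);

         // Read the attributes of a hypothetical P2J user entry.
         Attributes attrs = ctx.getAttributes("uid=jsmith,ou=users,dc=example,dc=com");
         System.out.println(attrs);

         ctx.close();
      }
   }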
Detailed Design Process:
- Make a detailed list of requirements for the LDAP server software.
- Review the open source options for LDAP servers. The
preference would be for a pure Java LDAP server, if one is available
with the right function, reliability, performance and
scalability. Choose the best match.
- Once the security
model detailed design
has been completed, it must be converted into a specific set of LDAP
schema files suitable for implementation. These may be product
specific (e.g. OpenLDAP format) unless there is a well supported
standard for this. Note that we must map the security model into
the standard LDAP attributes that are available. Any additional
data in the security model which cannot be mapped into the standard
attributes must be mapped into custom attributes that are specific to
P2J. Make every effort to use the standard attributes if at all
possible.
- Design the directory service interface as presented to the P2J
application environment. Make sure to take into account the
requirement that an administration toolset will need add/edit/delete
access to maintain the directory database. In addition, the
security model must be handled cleanly and without hard coding to
LDAP-only security features.
- Investigate the effort to design a pluggable interface to
decouple the directory service front-end from the specific back-end
LDAP client. This would allow other (non-LDAP) back-ends to be
installed at implementation time, based on customer requirements.
- Design the format for the mapping file in which the P2J
configuration variables are associated with the implementation specific
LDAP schema information for proper retrieval.
- Document the requirements for the core directory service itself
(other than the public interface).
- Question: do we need to provide event notification (e.g. notify
clients and servers when a change is made to a particular value)?
- Question: what facilities would be needed to support a
redundant directory service with replication? Would this solution
also provide failover and load balancing?
- Design the administrative interface (requirements, user interface
and flow) for the directory service.
- Key question: should the admin interface be separate from any
other P2J administrative functions (dedicated to the directory service
only) or should there be a unified/centralized P2J administration
service?
- Should we do a web interface or can we save time by doing a
Swing interface?
- Do we need any form of scripting/automation or command line
interface to allow customers to more easily manage P2J on a
batch/program basis (the answer is probably yes, but do we need it in
the 1st pass)?
- Identify the source data tables and fields in the current
Progress database from which we can populate the directory server (at
least in regards to the user list, access rights and
menuing/navigation).
- Design the process for converting this data into the proper
format and loading it into the directory server.
Expression Engine
All general purpose languages provide support for expressions.
Logical expressions are those that generate a single boolean
result. Arithmetic expressions generate a numeric result (which
could be floating point or integer).
Many areas of the P2J environment require the ability to evaluate
logical or arithmetic expressions at runtime. These complex
algebraic relationships can be represented in custom written Java
source code. This has the advantage of doing exactly what a
particular P2J component needs with a minimum of code. The
downside is that such code needs to be written for every place in P2J
that needs similar functionality. Thus a great deal of redundancy
exists and/or there are many separate places in the code that will have
related processing. Testing, debugging and maintaining these
separate areas of the code will take significant effort.
The alternative approach is to create an engine that provides a generic
expression service. The expression language is common to all
locations in the code that need such processing. All expressions
are represented as data using this language. A common,
well-tested, well-debugged code base is used to evaluate the
expressions, regardless of the component in P2J that needs such
processing. This means the expression processing can be done
right, once and then leveraged in many places.
Golden Code has a Java Trace Analyzer (JTA) that includes a generic
expression engine. It is designed to be loosely coupled with its
client code, allowing different applications to provide expression
support using a common language and a well tested and debugged
technology.
Please see the javadoc for package
com.goldencode.expr for details on the API.
This expression engine may require some modifications to support all
functions necessary across P2J. The current engine accepts an
expression which may contain user defined functions and user
resolved variables. The client of the engine submits an in-fix
expression as either a logical or arithmetic type. The expression
engine parses the expression and converts it into a post-fix
format. It then generates Java byte code for each element of the
expression, calling back to the client to obtain service for the user
defined functions or for resolution of variable values. This byte
code is written in a proper class format and is loaded and run.
This Just In Time (JIT) compiling of expressions means that subsequent
runs of the same expression (with different variable values) bypass
the compilation step and run immediately from the expression cache.
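The caching behavior described above can be illustrated with the
following simplified sketch; this is not the actual com.goldencode.expr
API, and every name in it is an assumption.

   import java.util.HashMap;
   import java.util.Map;

   /** A compiled, reusable form of an expression. */
   interface CompiledExpression
   {
      Object evaluate(Map variables);
   }

   /** Illustrates compiling an expression once and reusing the result. */
   abstract class CachingExpressionEngine
   {
      private final Map cache = new HashMap();   // expression text -> compiled form

      /** Parse, convert to post-fix and generate byte code (engine specific). */
      protected abstract CompiledExpression compile(String expression);

      public Object evaluate(String expression, Map variables)
      {
         CompiledExpression compiled = (CompiledExpression) cache.get(expression);
         if (compiled == null)
         {
            // First evaluation of this expression pays the compilation cost.
            compiled = compile(expression);
            cache.put(expression, compiled);
         }
         return compiled.evaluate(variables);
      }
   }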
Expression Language Overview:
An expression consists of one or more operations.
An operation may consist of one or two operands and an operator,
such as
count = 105
where count and 105
are the operands, and the equals symbol (=) is the operator, which
indicates the purpose of the operation (to compare the variable count
to the constant 105 for equality).
When evaluated using actual data to substitute for the variable
operand,
this expression will evaluate to true or false.
Alternatively, an operation may be self contained, as in the case of
a user
function. For example,
@pattern('Some text to look for', A)
A user function is evaluated as an atomic operation which returns true
or false. The value returned depends upon the purpose of the user
function. In this example, the function will return true if the
data being tested contains the ASCII byte pattern 'Some text to
look for' (not including the single quotation marks).
An expression may be composed of the following elements:
- variables that reference data being tested or modified;
- numeric or string constant operands, such as 0xFF or 'SERVER1';
- logical comparison operators, such as =, >, <, etc.;
- logical conditional operators, such as and (&&), or (||) and not (!);
- arithmetic operators, such as +, -, *, etc.;
- bitwise operators, such as |, & and ^;
- user functions, such as @pattern(), @in(), etc.;
- precedence (grouping) operators ( and ).
Operators
Operators within filter expressions may be logical or arithmetic,
binary
or unary. In order to
construct valid expressions, it is important to understand the relative
precedence
of the available operators, and to use parentheses to group
subexpressions
properly.
The set of operators which may be used within filter expressions is
listed in the following table. Operators in this table
are listed in order of their precedence, from those evaluated first to
those evaluated last. Operators which have the same precedence
are
grouped together. When evaluating operations whose operators have
the same precedence, the program will evaluate the operations in the
order
in which they appear, from left to right. Parentheses (())
may be used to group operations which must be evaluated in a different
order.
Precedence   Symbol      Type         Unary/Binary   Operation Performed
1            ! or not    Logical      Unary          Logical complement
2            ~           Bitwise      Unary          Bitwise complement
3            *           Arithmetic   Binary         Multiplication
             /           Arithmetic   Binary         Division
             %           Arithmetic   Binary         Remainder
4            +           Arithmetic   Binary         Addition
             -           Arithmetic   Binary         Subtraction
5            <<          Bitwise      Binary         Left shift
             >>          Bitwise      Binary         Right shift w/ sign extension
             >>>         Bitwise      Binary         Right shift w/ zero extension
6            <           Logical      Binary         Is less than
             <=          Logical      Binary         Is less than or equal to
             >           Logical      Binary         Is greater than
             >=          Logical      Binary         Is greater than or equal to
7            = or ==     Logical      Binary         Is equal to
             !=          Logical      Binary         Is not equal to
8            &           Bitwise      Binary         Bitwise AND
9            ^           Bitwise      Binary         Bitwise XOR
10           |           Bitwise      Binary         Bitwise OR
11           && or and   Logical      Binary         Conditional AND
12           || or or    Logical      Binary         Conditional OR

Operators by Precedence
A number of special purpose user functions exist to simplify the
encoding
of operations which would be cumbersome or impossible to represent in a
conventional expression. User functions may be embedded in
an expression, or may comprise the
entire expression. A user function is evaluated as an atomic
unit.
The syntax of a user function varies from function to function, but
is always of the general form
@function_name(arg_1[, arg_2...arg_N])
where:
- the @ symbol
- Designates the following text as a user function. Every
user function
must begin with this symbol.
- function_name
- The name of the user function. For example, protocol,
pattern, in,
etc.
- arg_X
- An argument required by the specific user function being
used. This
may be a variable or a numeric or string constant. Each user
function takes at least one argument; some require additional
arguments, and some user functions accept a variable number of
arguments.
The user functions listed in the following table are examples:
Function Name   Purpose
in              Test the data within a protocol field for a match against a list of constants
isnull          Test for the presence of a specific data field within a record
pattern         Test for a specific byte pattern within a record

User Functions
Manageability Service
Warning: these features are desirable
and Golden Code does intend to implement them if at all possible in the
first release. However, these features are not critical to the
implementation and will be dropped if necessary to meet the final
milestone.
This is a logically separate process running on the same or different
physical hardware from the other P2J application modules. It uses the
secure transport to authenticate and communicate with an in-process
agent in each of the other P2J modules. For the purposes of this
document, the term "manager" is another name for the manageability
service.
The following lists the purposes of the manageability service:
- It is the central location where all P2J agents register at
startup and where periodic "heartbeat" notifications are sent.
The manageability server thus maintains a registry of all running P2J
modules and their high level status.
- It provides a central control point for sending management
commands to P2J modules. These commands can be sent on a
scheduled or ad-hoc basis.
- It provides the central collection point for arbitrary data
records relating to the specific status, security, integrity or
execution of P2J modules.
- Data will be forwarded by the agent running in-process in all
P2J processes.
- These records can be generated asynchronously by P2J modules
when an event occurs that has a severity level greater than the
currently set threshold for notification.
- The records can be generated as a result of a specific request
made by the manageability service to a P2J module:
- attributes
- a one-time query of an attribute (an exported variable)
value
- a request for a periodic report of an attribute value
- a request for notification of an attribute value when it
changes
- events
- events are generated by specific tracepoints which create a
predefined data record associated with every execution of a specific
point in the code path of a P2J module
- the manageability service can dynamically enable/disable a
specific list of tracepoints at runtime; this chooses the set of events
that can be generated
At a minimum, each heartbeat (#1 above) should contain the
identification of the P2J module, the identification of any
authenticated user that may be logged on (this would be done on client
UI modules and possibly other modules like administrative UIs) and any
other general purpose status information such as a timestamp, memory
usage/heap size or other process specific status information. The
first heartbeat may be a special one that includes more static
information that will not change during runtime, such as system
information or JVM specific values (e.g. Java properties). At the
exit of a P2J process, a final heartbeat should be sent which includes
a
notification that the P2J module is exiting as well as any return code
or condition codes that indicate the reason for the exit. This
should be sent whether the exit is caused by an abnormal or normal end.
Commands (#2 above) that could be sent to the agent for execution:
- display a message to the user or on the console
- enable/disable specific tracepoints (to forward data records to
the manageability server)
- query or set the value of specific attributes
- enable periodic or change notification monitoring of specific
attributes
- set the alert level for asynchronous events (the threshold at
which an event is important enough to forward to the manageability
server)
- run user specified Java code
- halt the process
There should be a mechanism for broadcasting a command or list of
commands to an arbitrary set of agents (based on the registry).
As a central collection point (#3 above), once the data is collected,
it will be processed by a set of filters/rules and based on the outcome
the data record can be routed to one or more particular output targets
such as:
- write it to a log file (probably using log4j)
- write it to a trace file
- send an email (probably using Java Mail)
- set the value of a local variable, increment a counter or
accumulate the value (to gather statistics/trending and/or implement
monitors)
- forward to another "upstream" P2J manageability service
- forward to another "upstream" management system (e.g. via SNMP)
- run an external program
- run a user defined hook (Java based Interface)
Each record received by the manageability service will be queued.
A daemon thread will read the head of the queue when the queue is
non-empty; otherwise it blocks. Once it finds a record, it
will dequeue the record and traverse the list of defined rules to
determine the disposition of the record. Each rule is made up of
an expression (written in a modified
form of the JTA filter language)
and a chain of targets that will receive the record if the expression
evaluates true. There will need to be a default disposition in
the case that no rule matches.
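A simplified sketch of this rule/target dispatch follows; the interface
and class names are assumptions for illustration, and whether a record
continues to later rules after a match is an open design decision.

   import java.util.Iterator;
   import java.util.List;

   /** One routing rule: a filter expression plus a chain of targets. */
   interface Rule
   {
      /** Evaluate the rule's filter expression against a data record. */
      boolean matches(Object record);

      /** Deliver the record to this rule's chain of targets. */
      void dispatch(Object record);
   }

   /** Routes each dequeued record through the defined rules. */
   class RecordRouter
   {
      private final List rules;                // ordered list of Rule instances
      private final Rule defaultDisposition;   // used when no rule matches

      RecordRouter(List rules, Rule defaultDisposition)
      {
         this.rules = rules;
         this.defaultDisposition = defaultDisposition;
      }

      void route(Object record)
      {
         boolean matched = false;
         for (Iterator it = rules.iterator(); it.hasNext(); )
         {
            Rule rule = (Rule) it.next();
            if (rule.matches(record))
            {
               rule.dispatch(record);
               matched = true;
            }
         }
         if (!matched)
         {
            defaultDisposition.dispatch(record);
         }
      }
   }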
Detailed Design Process:
- Review the JMX specifications (such as they are). Make sure that
we understand (to the extent possible) the approach to the JMX Manager
so that we design our manageability service with the future goal of
conforming to JMX. Note that even if the JMX Manager
specification were available now, it is likely that we would not
implement it because of the expected TCK requirement. The TCK
requires a license agreement (and fees) with Sun and greatly limits
distribution because every change has to be tested for compliance
before we ship. This burden makes sense when shipping a JVM but
it doesn't make sense when shipping an application environment that is
designed completely by Golden Code.
- Design the wire-level protocol to be used between the agent and
manager. It must support the heartbeat, asynchronous events,
asynchronous attribute values/change notifications and a
request/response transaction interface from the manager to the
agents. All functions defined above need to be possible via this
protocol.
- Evaluate and document the requirements for the incoming queuing
of data records. In particular, should there be any
prioritization of the records or should it be a simple FIFO?
- Document any unique requirements for the JTA expression language
and the
expected changes. This should include a mechanism for user
defined variables (referencing the event record being processed) and
any necessary user defined functions that need to be provided.
- Define the format for the rules definitions including providing a
chain of targets and any specific parameters that must be passed for
each target in the chain. For example, if one of the targets is
to log the record to file using log4j, then the log file name may need
to be parameterized.
- Define how much of the configuration of the manager is stored in
the directory versus provided ad-hoc at runtime.
- Define whether the input queue is persistent (survives a restart
of the manager).
- Design the registry and heartbeat processing. Include the
ability to define arbitrary sets of agents in the registry which can be
referred to by a group name. This would be used in broadcasting
commands.
- Define the Interface by which the registry and heartbeat status
information can be accessed.
- Define the Interface by which command processing will occur (by
which a command or list of commands can be submitted, the status of the
commands can be monitored and the results of the command(s) can be
obtained).
- Define the Interface by which events and attributes are
enabled/disabled and accessed in specific agents or sets of agents for
forwarding to the manager. This must provide for query and set
operations as well as the enablement of asynchronous forwarding.
- Define the Interface through which a new target can be added to
the list of possible targets. This will need to include a method
of passing the record data.
Manageability Agent and
Instrumentation
Warning: these features are desirable
and Golden Code does intend to implement them if at all possible in the
first release. However, these features are not critical to the
implementation and will be dropped if necessary to meet the final
milestone.
To the extent possible, the Java Management Extensions (JMX) will be
used to implement the P2J manageability agent. The JMX
architecture is not yet complete. At the time of this writing,
some of it is standardized and some of it is still in the JSR
process. The JMX instrumentation architecture and the JMX agent
architecture are the 2 parts of the specification that are currently
available. Sun provides reference implementations of these.
It is not known how mature and production-ready these reference
implementations are. If for some reason the reference
implementations are not suitable for production, then we will need to
implement our own minimum function with the design goal that moving to
a JMX environment in the future would be possible without a huge
rewrite. Note that Sun's approach to JMX requires that any
supplier making a JMX compliant agent must license the JMX TCK and
execute it to prove compatibility. For this reason, if the reference
implementation is not sufficient, it is critical that we not implement
any agent based on the JMX APIs or the JMX Agent
specification. This way there is no Sun
certification or licensing requirement that inhibits the P2J project.
The functions of the P2J manageability agent are the following:
- Register the P2J module (in which the agent is running) with the manageability
service that is
defined as this agent's manager. For the lifetime of the process,
the agent will provide a periodic "heartbeat" that includes specific
predefined status data. At process exit, the agent will provide a
heartbeat that includes a notification of termination and the
reason/return code.
- Provide an access point from which the manageability service can
query/set and control the P2J module. This will be provided
through the secure
transport.
- Provide a local proxy for all local Java objects on behalf of the
manageability service. This will allow any local Java object to
send an event or provide an attribute value/notification.
- Provide an adapter to translate and forward logging from 3rd
party modules into the P2J manageability service. An important
example is the need to create an adapter for LOG4J (used by Hibernate).
Java objects in the P2J environment will be optionally instrumented to
support manageability. To the degree that is possible, this
instrumentation will be in accordance with JMX specifications (note
that this assumes a JMX compliant agent is in use). The JMX
specification does provide the ability to expose events and attributes
to the agent. The agent then becomes the gateway for these to be
accessible from the manageability service.
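If a JMX compliant agent does prove usable, instrumentation of a P2J object
might follow the standard MBean pattern from the JMX instrumentation
specification, roughly as sketched below. The ConnectionPoolStats name, its
attributes and the object name domain are illustrative only.

import javax.management.MBeanServer;
import javax.management.MBeanServerFactory;
import javax.management.ObjectName;

// Standard MBean pattern: a management interface named <Class>MBean exposes
// attributes (getters/setters) and operations; the agent registers the
// implementation with an MBeanServer.
public interface ConnectionPoolStatsMBean {
    int getActiveConnections();     // attribute exposed to the agent/manager
    void reset();                   // operation invokable from the manager
}

// implementation plus registration (names are illustrative only)
class ConnectionPoolStats implements ConnectionPoolStatsMBean {
    private int active;

    public int getActiveConnections() { return active; }
    public void reset() { active = 0; }

    public static void register(ConnectionPoolStats stats) throws Exception {
        MBeanServer server = MBeanServerFactory.createMBeanServer();
        server.registerMBean(stats, new ObjectName("p2j:type=ConnectionPoolStats"));
    }
}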
For more information on JMX, please see:
- Sun JMX Home
- JMX Instrumentation and Agent Specification, v1.2
- JMX Remote API 1.0
- JMX White Paper
Detailed Design Process:
- Review the JMX specifications (such as they are). Make sure that
we understand (to the extent possible) the approach to the JMX
Agent and JMX Instrumentation so that we design our code with the
goal of
conforming to JMX (subject to the limitations noted above).
- Define the requirements for the P2J agent (it will try to use the
Sun reference JMX Agent and perform the functions of a connector as
well as some of the proxy functions for the manager).
- Prototype the solution using the Sun JMX reference
implementation. The main objectives are to determine if the Sun
JMX is ready for production and if it will support the function we are
trying to implement.
Application
Client/Server Components
Runtime Environment - Execution
Threading and
Multiprocessing
Progress 4GL has no threading or multiprocessing. Everything is
single threaded. Even features like persistent procedures are
implemented as a one time execution of the external block on the main
thread. The internal procedures, functions and triggers are
compiled,
loaded and made available to subsequent procedures, but at no time does
any of the code in a persistent procedure ever run in a separate
thread. Likewise, although event processing (e.g. user interface
triggers that can be executed from wait-for or other statements that
block waiting for user input) has the appearance of threading, all
event processing is done on the main thread in a synchronous
manner. Even the "process events" feature is just a cooperative
yield to the input processing to give it a synchronous opportunity to
handle any pending input.
This execution model makes the P2J implementation straightforward in
its architecture. For each service (a single JVM process) that is
running on the server, multiple client applications may be concurrently
executing at any given time. Each client represents a single user
and a single thread of execution. This exactly matches the
semantics of the single threaded Progress 4GL runtime environment.
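A minimal sketch of this one-thread-per-user model on the server side follows.
The SessionHandler class and the port number are hypothetical placeholders;
the point is only that all converted code for one user executes on a single
thread, mirroring the 4GL runtime semantics.

import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

// Illustrative sketch: each accepted client connection gets its own thread.
public class SessionDispatcher {
    public static void main(String[] args) throws IOException {
        ServerSocket listener = new ServerSocket(3333);   // port is arbitrary
        while (true) {
            final Socket client = listener.accept();
            Thread worker = new Thread(new Runnable() {
                public void run() {
                    // all work for this user happens on this single thread,
                    // just as all 4GL code for a user runs on one thread
                    new SessionHandler(client).serve();
                }
            });
            worker.start();
        }
    }
}

// Placeholder for the converted application logic executing for one user.
class SessionHandler {
    private final Socket socket;
    SessionHandler(Socket socket) { this.socket = socket; }
    void serve() {
        try { socket.close(); } catch (IOException e) { /* ignore in sketch */ }
    }
}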
Shared Variable Support
Shared variables are used heavily in Progress 4GL applications.
This is a consequence of the
fact that early in the Progress 4GL lifecycle, there was no such thing
as procedure parameters. Thus most code uses shared variables
(with newer functions using parameters). Unfortunately, the
definition and use of these is not as consistent as one might wish, so
it
can be difficult to find where something is defined or to find all the
places that use a given variable. One good practice (that is not
used as frequently as is optimal) is to use include files to centrally
define access to shared variables defined elsewhere. Note that
there are efficiency implications of using parameters (which presumably
require space on the stack) versus shared variables which are probably
handled internally with pointers. This would be especially
pronounced with
anything that is large in size, like an array.
Alternatives for converting global and scoped shared variables:
- a user-specific instance of a shared variable manager that is
centrally accessible and provides lookup facilities
- pass references to scoped shared vars as parameters to method
calls (if the points of access are few)
- provide access via a common ancestor (possibly by passing it to
the constructor of the object and then referencing it as needed)
- group related shared variables in classes, create instances at
the proper level of scoping and then access specific data in place
using a reference to the properly scoped instance and the associated
accessor methods (getters and setters)
When using the terms "shared" and "global" it is important to note that
only the Java code executing on behalf of the current user can ever
access shared and/or global variables. Other users run in a
different context on the server and have no access to these
variables. For this reason, P2J cannot use techniques such as
static methods to access such variables.
The choice of approach will be made at conversion time, but some
runtime infrastructure may exist to support these alternatives.
Detailed Design/Implementation Process:
- Design and implement the shared variable manager.
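A minimal sketch of the first alternative (a user-specific shared variable
manager) follows. The class name and methods are illustrative rather than a
committed API; one instance would exist per user context and would never be
reachable through static state.

import java.util.HashMap;
import java.util.Map;

// Sketch of a per-user shared variable manager (names are illustrative).
public class SharedVariableManager {
    private final Map variables = new HashMap();          // name -> value

    /** DEFINE NEW SHARED VARIABLE would map to a call like this. */
    public void define(String name, Object initialValue) {
        variables.put(name, initialValue);
    }

    /** DEFINE SHARED VARIABLE (a reference to an existing definition). */
    public Object lookup(String name) {
        if (!variables.containsKey(name)) {
            throw new IllegalStateException("shared variable not defined: " + name);
        }
        return variables.get(name);
    }

    public void set(String name, Object value) {
        variables.put(name, value);
    }
}

A converted procedure would receive a reference to the current user's manager
(for example through its constructor) rather than relying on static access.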
Procedure Handles
and Persistent Procedures
All procedures that are currently on the call stack in Progress can be
referenced using a unique "handle" that is associated with that
instance of the procedure.
The Progress 4GL also allows a procedure to be defined as
"persistent". The
addition of this keyword to a "run" statement causes the external
procedure to be run once and its context remains permanently in memory
for future
use (instead of on the stack). This means that any internal
procedures and triggers it defines
can be executed at any later time as long as one has the associated
procedure handle.
To use a procedure handle, one can store it in a variable and then
use it with the "in" clause of a subsequent "run" statement.
Alternatively, one can use the "session manager" to walk the list of
procedures and based on attributes, the specific (persistent or not
persistent) procedure
can be found and executed.
In any case where the code explicitly uses a handle to reference a
loaded procedure (persistent or not), this code will very naturally be
represented in Java. This corresponds to the standard case where
a
Java object is instantiated and then a reference to this object is
contained and used as needed. One can just consider a procedure
handle
to be the equivalent of an object reference.
The status of a procedure as persistent or not is determined by the run
statement rather than the procedure statement (don't confuse this with
the "persistent" option on the procedure statement, which relates to
shared libraries only). This means that the caller of a procedure
determines its context (global or stack based). Thus the same
procedure can be run both ways and this has to be possible in the
target P2J environment.
In addition, it is possible to run multiple instances of a persistent
procedure with different state. By accessing the same internal
procedures and triggers, in different instances, the varying state can
be accessed. This means it is not possible to implement
persistent
procedures using static methods in Java.
The Progress 4GL concept of persistent procedures can be recreated in
Java by creating additional runtime infrastructure. An analog to
the
session manager can be created which keeps track of the chain of
procedures (persistent or not) and allows one to walk this chain and
inspect attributes and execute methods. To the degree that
Progress 4GL procedure attributes are used, these will need to be
mapped into J2SE equivalents or artificial constructs will need to be
provided to make this functionally equivalent. To the extent that
classes need to implement these features, the generated code will need
to derive from the right object hierarchy (inheritance) or will need to
implement the correct Interface definitions.
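The following sketch illustrates the handle-as-object-reference analogy. The
OrderUtilities class and its methods are hypothetical; the converted
persistent procedure becomes an ordinary instantiated object and its internal
procedures become methods on that object.

// Illustration of "procedure handle == object reference" (names are hypothetical).
public class OrderUtilities {                 // converted from a persistent procedure
    private int callCount;                    // per-instance state, as in Progress

    public void recalcTotals() {              // converted internal procedure
        callCount++;
        // ... business logic ...
    }

    public int getCallCount() { return callCount; }

    public static void main(String[] args) {
        // RUN order-utils.p PERSISTENT SET h  ->  keep an object reference
        OrderUtilities handle = new OrderUtilities();

        // RUN recalc-totals IN h  ->  an ordinary method call on that reference
        handle.recalcTotals();

        // multiple instances with different state are simply separate objects
        OrderUtilities second = new OrderUtilities();
        second.recalcTotals();
        second.recalcTotals();
        System.out.println(handle.getCallCount() + " / " + second.getCallCount());
    }
}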
Detailed Design Process:
- Catalog all uses of handles in the existing (non-dead) 4GL source
code.
- Catalog all uses of the "session" system handle and its traversal
features.
- Document all requirements for the conversion of this usage into
native Java equivalents.
- For anything that can be represented directly using the standard
Java language (object references), define the mapping from source to
target environments.
- Design the interface for any handle oriented features that
require runtime support (cannot be implemented using object references
alone).
Business Logic
The converted application is exposed as transactions to the clients and
possibly to other servers. Each transaction is registered in the
directory and made accessible to clients
that have the proper authorization.
The application's business logic (decoupled from the user interface
processing) runs inside the runtime environment. It handles all
application processing and is the only direct client of the data model.
Input Validation
Transaction entry points into the business logic and many other general
purpose methods can benefit from strong and consistent input
validation. In most applications there are common rules for input
validation which are shared by many different locations in the source
code. Unfortunately, most of the time although the rules are
common, the implementation is not. Instead, the logic that
validates input is hard coded into each location where such validation
is needed.
P2J will implement a separated validation layer that is leveraged at
the transaction interface. Other methods will be able to access
this runtime service as it makes sense. It is not practical to
implement all input validation throughout the application using common
rules since in many cases the code to leverage a common infrastructure
might exceed the code to do the check inline. In these cases a
common implementation might obscure rather than help. However, it
is likely that the transaction interface will consistently implement
such a common approach. Transaction parameters can be
validated (on the server side) before the request is dispatched to the
actual transaction handler.
Note that due to the limitations of the proxy approach, P2J will not
implement this validation layer as a proxy.
Input validation rules will be specified using the common expression
syntax and will be evaluated using the Expression
Engine which will be driven by an Input Validation manager.
Entry points in the Input Validation manager will be called with the
data to be validated and a reference to the rule which defines the
expression. The rules will be stored in the database. Since
this process is data driven, it can be edited without recompiling
the system. This reduces effort and risk.
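A minimal sketch of this data driven validation follows. The
InputValidationManager and ExpressionEngine names are placeholders; the real
rules would be read from the database and evaluated by the common expression
engine described elsewhere in this document.

import java.util.HashMap;
import java.util.Map;

// Sketch of rule-driven input validation at the transaction interface.
public class InputValidationManager {
    private final Map rulesById = new HashMap();           // rule id -> expression text

    public void loadRule(String ruleId, String expression) {
        rulesById.put(ruleId, expression);                  // in practice, read from the DB
    }

    /** Called at the transaction interface before dispatching to the handler. */
    public boolean validate(String ruleId, Map inputValues) {
        String expression = (String) rulesById.get(ruleId);
        if (expression == null) {
            throw new IllegalArgumentException("unknown validation rule: " + ruleId);
        }
        // delegate to the expression engine with the input values as variables
        return ExpressionEngine.evaluate(expression, inputValues);
    }

    /** Stand-in for the real expression engine (derived from the JTA engine). */
    static class ExpressionEngine {
        static boolean evaluate(String expression, Map variables) {
            return true;                                    // placeholder in this sketch
        }
    }
}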
The validation being referenced here concerns program input, although
technically the approach could also be applied to more direct user
interface input validation. The rules are business focused and will be
generated during conversion based on validation done in the current
Progress source code.
Detailed Design Process:
- Catalog the validation logic in the current code base.
- For any logic that cannot be represented using the current
expression language, design and implement user defined functions to
provide the needed tools.
- Design and implement the variable resolution scheme and the
manner in which variable references will be specified in the
expressions.
- Design the format for the storage of the input validation rules.
- Implement the database tables necessary to support this storage.
- Implement the Input Validation manager.
Runtime Libraries
Progress provides a long list of language statements, built-in
functions and widgets or objects with associated
methods/attributes. For each of these kinds of Progress runtime
support, there will be a corresponding conversion plan for a native
Java implementation.
Many of the above features may be implemented by a straight or direct
mapping of code into a Java equivalent. For example, features
such as the IF THEN ELSE language statement directly translate into the
Java if ( ) {} else {} construct. Other features might require a
backing method, but the J2SE platform may provide a valid
replacement. An example would be the EXP(base,exponent) function
which can be easily replaced with the java.lang.Math.pow(double,double)
method.
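To make the distinction concrete, the following fragment shows both kinds of
mapping side by side (the variable names are illustrative): a 4GL IF THEN ELSE
becomes a Java if/else, and EXP(base,exponent) becomes a call to
java.lang.Math.pow.

// Roughly: IF qty > 0 THEN total = EXP(base, exponent). ELSE total = 0.
public class MappingExample {
    public static double convertSample(int qty, double base, double exponent) {
        double total;
        if (qty > 0) {                                   // IF ... THEN ...
            total = java.lang.Math.pow(base, exponent);  // EXP(base, exponent)
        } else {                                         // ELSE ...
            total = 0;
        }
        return total;
    }
}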
During the Stage 5 conversion
process, features will be identified which need to be implemented as
specific helpers (rather than a straight remapping of code into
Java). For each feature there will be a corresponding Java
API defined and implemented. This set of APIs will represent the
runtime libraries for the P2J system.
These libraries will maintain a consistent naming scheme and
structure. The code conversion process will rewrite code as
necessary to use these runtime APIs.
Runtime Environment - Data
Progress 4GL
Transaction Processing Facilities
All database processing in Progress 4GL is performed at the record
level. Each statement which retrieves data from the database always
returns only one record at a time. Each statement which
updates/adds a record does that by
updating/adding one record at a time. This is significantly different
from
the set-oriented approach of a modern RDBMS. This approach has
significant consequences. First of all, many things which usually are
hidden (done automatically by the RDBMS) are exposed and the Progress
4GL programmer has
control of them. This yields great flexibility but causes complexity to
grow
enormously when trying to handle complex data processing.
Depending on the implementation, set-oriented optimizations become
significantly harder to realize. In some cases such optimizations
are impossible without a logic rewrite.
The following things are
exposed to the Progress 4GL programmer:
- Table buffers. A table buffer contains some number of
records from the table.
The actual number of records is not important because only one record
is active at a time. The 4GL automatically maintains one table
buffer for each used table. The programmer can declare and use
additional table buffers if necessary.
- Transaction management. Although the 4GL tries to maintain
transactions automatically, this is not always correct and/or optimal, so
the programmer is encouraged to maintain transaction blocks manually.
- Locking. The 4GL tries to maintain locking
automatically, but the programmer is encouraged to maintain locks
manually, especially because the default behavior may not be obvious
(see below about transactions and locking).
- Temp tables. Usually (in a set-oriented RDBMS) temporary tables
are completely internal and are not visible to the user. But the 4GL allows
the programmer to create and maintain temp tables because this enables
more complex queries and data processing. It appears that temp tables
are created at the level where they are defined and destroyed when the
containing block goes out of scope. Usually temp tables are
defined at the beginning of the procedure and destroyed at the end of
the procedure.
- Handling tables as variables. The 4GL allows tables to be
passed by value or by reference as parameters to a procedure.
When a table is passed by value, the table content is copied into the formal
parameter. This can be a very time and resource consuming
operation. Passing tables by reference (handle) resolves the issue but
requires attention from the programmer because the table content may be
changed by the called procedure.
Progress queries can be implicit when certain fields are referenced or
based on certain language keywords/facilities (e.g. certain looping
constructs automatically query the database and iterate over
records).
These implicit database accesses must be converted to explicit use of
the data model in the target environment.
Automatic Joins
In Progress, database queries can be nested arbitrarily.
Depending on
the structure of the query, Progress may or may not
automatically join these into
a single query. The data model must handle all such
situations of
arbitrary nesting of queries, including the situations in which queries
are automatically joined.
The following are examples of situations in which Progress does support
automatic joining (including inner and
outer joins). From the 'Progress Programming Handbook':
FOR EACH Customer WHERE State = "NH",
  EACH Order OF Customer WHERE ShipDate NE ? :
  DISPLAY Customer.Custnum Name OrderNum ShipDate.
END.
or
DEFINE QUERY CustOrd FOR Customer, Order.
OPEN QUERY CustOrd FOR EACH Customer, EACH Order OF Customer.
or even
DEFINE QUERY CustOrd FOR Customer, Order.
OPEN QUERY CustOrd FOR EACH Customer, EACH Order OF Customer OUTER-JOIN.
Such simple joins can be directly mapped into SQL. This may have
an impact on the implementation and
optimization of the data model (specifically the implementation of
methods exposed to the data model's clients).
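As an illustration only, the first FOR EACH example above might map to SQL
along the following lines (shown through JDBC for clarity). The converted
table and column names are assumptions, "order" is quoted because it is an SQL
reserved word, and "ShipDate NE ?" (not equal to the unknown value) is
rendered as IS NOT NULL.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// One plausible SQL rendering of the nested FOR EACH example (names assumed).
public class CustomerOrderJoin {
    public static void listNhOrders(Connection conn) throws SQLException {
        String sql =
            "SELECT c.cust_num, c.name, o.order_num, o.ship_date " +
            "FROM customer c INNER JOIN \"order\" o ON o.cust_num = c.cust_num " +
            "WHERE c.state = ? AND o.ship_date IS NOT NULL";
        PreparedStatement stmt = conn.prepareStatement(sql);
        try {
            stmt.setString(1, "NH");
            ResultSet rs = stmt.executeQuery();
            while (rs.next()) {
                // DISPLAY Custnum Name OrderNum ShipDate
                System.out.println(rs.getInt("cust_num") + " " + rs.getString("name")
                    + " " + rs.getInt("order_num") + " " + rs.getDate("ship_date"));
            }
        } finally {
            stmt.close();
        }
    }
}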
Transactions
- It appears that transactions in the
4GL assume changing/rolling back ONE RECORD IN EACH TABLE involved in
the
transaction. In other words, if more than one record is changed in one
transaction block, then during roll back (UNDO in 4GL terminology) only one
(current) record is rolled back. So, if there is a statement which
processes many records (for example FOR EACH), then each update of the
database is treated as a separate transaction. There are many signs
that
this assumption is correct, but it should be verified on a real 4GL
installation.
- Transactions in the 4GL can affect table fields and variables.
A variable or temp-table must be explicitly defined as 'NO-UNDO' if rolling
back
a transaction should not restore its value.
- Transaction bounds can be controlled implicitly or explicitly.
An implicit transaction is started by the following statements if they
directly
update the database:
- FOR EACH block.
- REPEAT block.
- Procedure block.
- DO blocks with ON ERROR or ON ENDKEY qualifiers.
- It should be noted that the 4GL may propagate the start of a transaction to
an upper block level, up to the procedure level (making the entire procedure
execute as a
single transaction). Propagation rules are described in the 'Progress
4GL Handbook'.
- Adding the TRANSACTION keyword to a DO, FOR EACH or REPEAT
block defines its boundary explicitly.
- Each statement mentioned above can be marked with a label, and this
label can then be used as a parameter to the UNDO statement (see below)
to specify the transaction to roll back.
- Transactions can be nested, just as the blocks which specify
transaction boundaries can be nested.
- Transactions perform lock management on behalf of the application, but the
following note should be taken into account (citing the 'Handbook'):
- "Never read a record before a transaction starts, even with
NO-LOCK, if you are going to update it inside the transaction. If you
read a record
with NO-LOCK before a transaction starts and then read the same record
with EXCLUSIVE-LOCK within the transaction, the lock is automatically
downgraded to a SHARE-LOCK when the transaction ends. Progress does not
automatically return you to NO-LOCK status."
- At present it is not known whether the application programmers took
this into
account. In either case, this semantic must be implemented or
taken into
account during conversion.
- UNDO-ing (rolling back) transactions can be done using the UNDO
statement. If a transaction label is not specified, then the innermost
transaction containing
the block with the error property is rolled back. Otherwise, the transaction
which
starts at the specified label is rolled back.
- Unlike other systems, the 4GL automatically supports retrying or
skipping failed transactions, i.e. in case of failure the update of a
record can be retried, or the record can be skipped and the update of the next
record started. The actual action depends on the UNDO parameters. This
specific behavior must also be taken into account during conversion.
Locks
- Locks are not applied to temp tables because they are completely
internal to the procedure which contains them.
- Each retrieved record automatically receives a SHARE-LOCK. This
allows others to read the record, but any attempt to apply an EXCLUSIVE-LOCK
(for
example, for an update) will fail.
- An EXCLUSIVE-LOCK can be applied explicitly (by passing options to,
for example, the FIND statement).
- By passing the NO-LOCK option one may read any record (even records
belonging to incomplete transactions).
- A transaction automatically upgrades a lock from any level to
EXCLUSIVE-LOCK and then returns it to SHARE-LOCK.
- The default locking mode for queries with an associated browse is
NO-LOCK.
- The RELEASE statement releases the lock (the locking mode is set to
NO-LOCK).
Data Model
The data model is a set of objects which provide a native Java view of
the business data in the application. This business data may be
comprised of local variables and/or data stored in the database.
For each type of data used in the Progress 4GL environment, there must
be an equivalent representation in the Java language. This
corresponds to a set of Java objects that must be designed with
suitable facilities to replace the matching facilities in the Progress
4GL.
The P2J environment has a strong separation between business logic and
the data model. The business logic is the client or user of the
data model. The data model represents a native Java access and
storage mechanism for data created and used by the business
logic. The objects in the data model provide a gateway from the
object-oriented Java environment to the Relational
Database Management System (RDBMS) environment. This mapping
is defined during Stage 3 of the Data/Database Conversion
process. These objects use Hibernate, an Object to Relational Mapping
(O2R Mapping) technology, to implement this gateway.
The data model must support facilities designed to replace the Progress
4GL implementation of transactions, locking and undo as summarized in the previous
section.
There will be a transaction-oriented interface built into the data
model which is available to business logic. This interface
will be scoped to only include committing or rolling back changes to
the specific data model object in question. This level of
transaction support is called an "Application Level Transaction" in
Hibernate but it will be referred to as a "data model
transaction". This level of transaction support is not equivalent
to database-level transaction support (SQL commit or rollback).
Database level
transactions are very short lived. For example, one may start a
transaction, read a result set and then close the
transaction. However in an application level transaction,
one may read a purchase order (under the covers it executes the
previously
described database transaction) and then one may edit some of the
purchase
order's details and finally save the purchase order. When the
save occurs,
a second database level transaction will occur bracketing the SQL
update. So a single data model object (e.g. an order) level
transaction may yield multiple database level transactions.
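The purchase order example might look roughly as follows in code. The
OrderStore and PurchaseOrder names are hypothetical; the essential point is
that one long-lived data model transaction brackets two short database-level
transactions (the initial read and the later save).

// Sketch of the purchase order example (collaborator names are hypothetical).
public class DataModelTransactionExample {
    public static void editOrder(OrderStore store, long orderId) {
        // --- data model transaction begins (may stay open for minutes) ---
        PurchaseOrder order = store.load(orderId);       // DB transaction #1: read

        order.setShipToAddress("10 Main St");            // in-memory edits only,
        order.addLineItem("widget", 3);                  // no database activity here

        store.save(order);                               // DB transaction #2: update
        // --- data model transaction ends ---
    }

    /** Hypothetical data model collaborators, included to keep the sketch whole. */
    public interface OrderStore {
        PurchaseOrder load(long id);
        void save(PurchaseOrder order);
    }
    public interface PurchaseOrder {
        void setShipToAddress(String address);
        void addLineItem(String item, int quantity);
    }
}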
The following diagram illustrates this example flow:

Any particular data model transaction might be long lived (compared to
the very short duration of the database level transactions). In
fact, it
might take minutes to complete in a situation in which the user pauses
in the middle of a specific application screen. So data model
transactions and database transactions have different granularities and
life spans. To the degree that is practical, the business logic
will not have direct access to the database-level transaction
control. However there are optimization considerations that may
influence the final results in this area.
The functionality of Progress 4GL locking controls must be replaced
with equivalent mechanisms in the data model. These will either
be mapped directly into RDBMS locking concepts or will be implemented
in the data model itself.
There may also be a business level transaction capability in which the
updates to multiple data model objects occur as an atomic entity.
This is not part of the data model itself, though the data model must
be designed to enable it.
This transactional support must be carefully designed to handle a wide
range of business situations.
Closely related to the transactional support is the multi-level undo
functionality. The use of undo is controlled by the block
structure of the Progress procedures and the associated implicit or
explicit properties associated with each block. Depending on the
nesting of these blocks, there may be a complex multi-level set of
transactions including a concept of primary transactions and 1 or more
sub-transactions. The undo implementation by default is
associated with the current transaction/sub-transaction but by labeling
undos the association can be changed to an arbitrary point in the
levels of nested blocks. This means that one can undo multiple
levels up or just one level up at a time.
The data model must provide for an equivalent of
the
Progress 4GL capabilities regarding undo. The data model
must handle
all kinds of variables, not just those backed by the
database. This is because even the simple variables support the
Progress concept of undo. There is a
set of Java classes that back each data type. For each Progress
4GL datatype a specific mapping needs to be made into a Java primitive
or class, although the undo feature may preclude the direct use of
primitives.
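As an illustration of why primitives alone may not suffice, a wrapper type
could record prior values so that a block-level undo can restore them. The
class below is a sketch only, not the final P2J data type design.

import java.util.ArrayList;
import java.util.List;

// Sketch of an undo-aware data type wrapper (name and API are illustrative).
public class UndoableInteger {
    private int value;
    private final List savedValues = new ArrayList();   // one saved value per open block

    public UndoableInteger(int initialValue) { this.value = initialValue; }

    public int get() { return value; }
    public void set(int newValue) { value = newValue; }

    /** Called when an undoable block (transaction/sub-transaction) is entered. */
    public void enterBlock() { savedValues.add(new Integer(value)); }

    /** Block exited normally: discard the saved value for that level. */
    public void commitBlock() { savedValues.remove(savedValues.size() - 1); }

    /** UNDO: restore the value captured when this block level was entered. */
    public void undoBlock() {
        value = ((Integer) savedValues.remove(savedValues.size() - 1)).intValue();
    }
}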
Some of the Progress functionality related to undo is designed to
provide a standard flow control mechanism. In other words, undo
can be used in Progress to change the point of execution in a program
as well as to restore variables or database fields to a specific prior
state. The flow control aspects of undo will be handled by other
parts of the runtime and code conversion as this is quite separate from
the variable/database rollback. There may be cases where undo is
being used for both flow control and variable/database rollback.
However it is also possible that a particular use of undo ignores the
results of the intermediate levels of variable/database rollback.
Detecting this situation may be an important optimization as it may
allow the elimination or reduction of levels of undo to be
supported. This would reduce working set and CPU utilization.
Security will be an important feature of the data model. This is
a critical location because it is the gateway to all business
data. There are
2 types of security mechanisms that will be supported in the data model:
- Whether access to a specific data object is allowed at all.
- Filtering of the results that are valid for this user.
To implement such security capabilities, any such constructs in the
Progress 4GL must be detected and converted into a set of rules that
can then be processed in a generic manner by the data model. In
other words, the data model will implement a standard layer of access
control and filtering. This layer will consult a set of rules
(access control lists or ACLs and filter descriptors) describing the
policies that must be enforced by the data model based on the runtime
context (e.g. the user) and the context in the data model (which
resource is being accessed). In order to implement a generic
approach to filtering, it is likely that an expression language and
expression processing engine will be necessary. The filter
descriptors would be rules written in this expression language and the
expression engine would evaluate the expression for each record of a
result set to determine if the record could be accessed in the current
context.
To optimize filtering performance it will be important to leverage any
possible conditions that can be placed on queries to reduce the result
sets based on known security parameters. This means that to
optimize performance, data model filtering may be implemented in 2
passes:
- Where possible, generate a query to the database that eliminates
as much of the result set as possible, if those results would be
eliminated by the filter descriptors active in this context. This
will be dependent upon:
- Whether the expression operands map directly to columns in the
specific tables being used.
- Whether the expression (or a sub-expression) can be completely
and accurately specified in SQL.
- Run the expression engine against the final (smaller) result set
to handle those parts of the expression that could not be delegated to
the database via SQL.
By maximizing the filtering that will be done by the database, the
overall performance can be maximized. Note that we may wish to
leverage Hibernate's filtering and query capabilities to perform this
filtering at a higher level of abstraction, rather than managing the
SQL directly.
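A rough sketch of this two-pass approach follows. The class and method names
are illustrative; pass one appends whatever portion of the filter descriptor
can be expressed in SQL to the query, and pass two runs the expression engine
over the remaining records.

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Sketch of two-pass result set filtering (names are illustrative).
public class SecurityFilter {
    private final String sqlFragment;        // e.g. "region = 'EAST'" (delegable part)
    private final String residualExpression; // evaluated per record by the engine

    public SecurityFilter(String sqlFragment, String residualExpression) {
        this.sqlFragment = sqlFragment;
        this.residualExpression = residualExpression;
    }

    /** Pass 1: narrow the query before it reaches the database. */
    public String restrictQuery(String baseQuery) {
        if (sqlFragment == null) {
            return baseQuery;
        }
        return baseQuery + " AND (" + sqlFragment + ")";
    }

    /** Pass 2: apply the remaining expression to each returned record. */
    public List filterResults(List records, ExpressionEngine engine, Object userContext) {
        if (residualExpression == null) {
            return records;
        }
        List allowed = new ArrayList();
        for (Iterator i = records.iterator(); i.hasNext(); ) {
            Object record = i.next();
            if (engine.evaluate(residualExpression, record, userContext)) {
                allowed.add(record);
            }
        }
        return allowed;
    }

    /** Stand-in for the JTA-derived expression engine. */
    public interface ExpressionEngine {
        boolean evaluate(String expression, Object record, Object userContext);
    }
}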
Detailed Design Process:
- Document the detailed requirements for the data model.
Include a complete treatment of the following:
- List of Progress 4GL data types that must be supported.
- Features/specifications of each data type.
- Detailed transaction functionality that must be supported.
- Detailed database locking functionality that must be supported.
- Detailed undo functionality that must be supported.
- Design the object hierarchy for the data model. Specific
issues needing resolution include:
- Are we going to use the Data Access Objects pattern? If
so, to what degree?
- To what degree can/should the data model be made to be
independent of the object to relational mapping layer (Hibernate)?
- Where should the Hibernate specific features be
implemented?
(In a superclass, as an Interface that must be implemented or in
specific/parallel Hibernate subclasses)
- To what degree is the Java Data Objects specification useful?
- Design the approach to transaction processing, locking and undo.
- Define the list of patterns in which Progress 4GL code implements
data model security (variable or database access control, result set
filtering).
- Make a complete list of the application source code that
implements data model security.
- Design the logic and locations for implementing access control
security in the data model.
- Define the expression language and requirements for the
expression processing engine. Use the Java Trace Analyzer (JTA)
expression language and engine as a starting point.
- Design the logic and locations for implementing filtering in the
data model.
- Define the format and storage/access mechanism for the access
control lists and filter descriptors.
Data Storage
All persistent data storage is in a Relational Database Management
System (RDBMS). This provides a well structured storage and
access environment which can be accessed via standard Structured Query
Language (SQL). The use of this approach will be kept
generic. No database-specific features will be used, except to
the extent that these are encapsulated and managed transparently by
Hibernate's database dialect abstraction layer. This will allow
the implementor of the P2J environment to choose the database based on
implementation requirements rather than driving the database choice
based on artificial limits in the P2J source code. It must be
equally possible to use open source database technologies such as MySQL
or PostgreSQL, as it is to use the commercial databases such as DB2 or
Oracle.
PostgreSQL is the database of choice for the P2J
implementation. It is important to test using this environment,
but other environments must also be tested to ensure that the SQL
implementation is generic.
Detailed Design Process:
- Ramp up (acquire basic skills) on PostgreSQL RDBMS. Create
a "cheat sheet" of critical processes and commands so that others can
implement each of the below areas without first reading all manuals or
becoming a PostgreSQL expert. Focus on:
- installation
- configuration
- security
- startup/shutdown
- JDBC connectivity
- any
command line interface
- administrative utilities
- database structure
creation/editing
- import/export of data
- transactional logging/recovery
- backup
- Setup PostgreSQL in development and test environments. Make
sure that an adequate number of such environments exist which can
properly support the size of the development team (4-8 people).
Object to Relational
Mapping
Accessing data in a relational database (typically via JDBC) does not
produce a set of native Java objects as a result. In order to
provide simple native database access, an Object to Relational Mapping
(ORM) solution is used. This technology provides the
infrastructure to expose the database as a set of Java objects.
Another common name for ORM solutions is "persistence frameworks" since
the technology is used to store data to and retrieve data from a
persistent storage mechanism (a database).
Hibernate v2.1 was chosen as the ORM implementation for this project
for a number of reasons:
- functional completeness;
- good performance and tuning/optimization capabilities;
- broad and transparent support (via its Dialect mechanism) for
multiple database implementations, both commercial and free;
- a pragmatic yet elegant architecture, which provides an unusual
combination of good flexibility with minimal impact on object design,
and sensible default behavior;
- widespread acceptance in the Java developer community, which
speaks well for reliability and a well-tested code base.
This open source framework was developed by community developers led by
JBoss team members. It is made available via LGPL. Community input and
support seems strong, though paid support and developer training
contracts are available from JBoss as a last resort. The free
documentation is decent but leaves a number of questions unanswered.
The
published API seems well documented via JavaDoc, though the internal
implementation classes are often completely undocumented. Several
excellent books are available for Hibernate:
- Hibernate - A
Developer's Notebook - James Elliot (OReilly), ISBN
0-596-00696-9:
an excellent practical tutorial, light on theory (by design)
- Hibernate In Action
- Christian Bauer, Gavin King (Manning), ISBN 1-932394-15-X: written by
Hibernate's lead documenter and developer (respectively), this book
does
an excellent job plugging the holes in the online docs, giving lots of
ORM theory along with practical advice. Two thumbs up! <g>
Hibernate is designed to work best with "Plain Old Java Objects" and
thus does not require persistable objects to extend or implement
framework-specific classes or interfaces, respectively. It is not
completely transparent, however; data model objects will have some
minimal persistence-related design requirements (at minimum a database
identifier field for most persistable classes). Nevertheless, we
believe
the benefits of this framework far outweigh these inconveniences, which
we expect would be required in any ORM approach.
Hibernate's SchemaExport (hbm2ddl) tool will be used to generate the
Data Definition Language (DDL) and apply it to the target database.
This
isolates us from differences in implementation between database
vendors, significantly simplifying both the abstract definition of the
schema and the creation of DDL. Initially, one Hibernate mapping
document (*.hbm.xml) will be created per relational table. This
approach
may be modified iteratively as we refine our understanding (and schema
hints documents) of the relations between Progress tables.
We will use Hibernate's XML configuration file to configure the
framework, which is the preferred mechanism according to its designers.
Object to table mappings for use at runtime will be configured using
the
*.hbm.xml files used for schema creation. It may also be necessary to
dynamically configure additional (and likely temporary) mappings at
runtime in order to mimic support for Progress' temp table mechanism.
Connection pooling technology will be implemented to mitigate the
overhead of creating new JDBC connections for each database request.
Hibernate implements a plug-in architecture for connection pooling and
manages the connection pool transparently. A default connection pooling
implementation is provided, but it is intended for development use
only,
not deployment. The top, contending, production-level technologies (all
open source) which are currently supported by Hibernate are C3P0 (ships
w/ Hibernate), Proxool, and Apache's DBCP. We have had previous
(satisfactory) experience with C3P0.
Second level caching will be used to minimize database round trips
wherever possible, primarily with read-only and read-mostly data which
is least likely to become stale when cached. For example, a good
candidate for setting up a read-only second level cache is for the
table
to which we map metadata associated with Progress database fields,
which is used by the user interface (format clauses, labels, help text,
etc.). Second level caching is optional, but for all practical purposes
is probably necessary for the P2J system to scale well. Hibernate
implements a plug-in architecture for second level cache. Candidates
here are EHCache (the default), OSCache, SwarmCache, and JBossCache.
Where it is possible and feasible to extract sufficient data query
information from Progress source code, these queries will be mapped to
compile-time, named queries using HQL (Hibernate Query Language). HQL
provides a mechanism to generate polymorphic queries and to perform
more
intuitive substitutions than prepared statements (which are used under
the covers). It uses an SQL-like syntax which refers to object-level
constructs rather than database table-level constructs. Using named
queries (i.e., those stored outside Java code in XML and accessed by
name at runtime) makes for easier query maintenance and cleaner code.
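As an illustration, a converted query might invoke such a named query at
runtime roughly as follows, assuming the Hibernate 2.1 package layout
(net.sf.hibernate) and a named query "customersByState" defined in the
mapping documents. The Customer class and its properties are illustrative.

import java.util.List;

import net.sf.hibernate.HibernateException;
import net.sf.hibernate.Query;
import net.sf.hibernate.Session;

// Sketch of invoking a named HQL query, e.g. "from Customer c where c.state = :state".
public class NamedQueryExample {
    public static List customersIn(Session session, String state) throws HibernateException {
        Query query = session.getNamedQuery("customersByState");
        query.setString("state", state);
        return query.list();          // executed as a prepared statement under the covers
    }
}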
Where the use of HQL is not practical or possible, we will use native
SQL queries (which can also be named and maintained outside Java code).
A possible candidate for this approach is P2J's implementation of
Progress temp table support. If implemented to leverage temp table
support in the backing relational database, this will require the use
of syntax which HQL does not support (e.g. SELECT INTO TEMP...).
Hibernate's Criteria API will not be used, unless we find a compelling
need, due to constructs in Progress source code which are not easily
replaced otherwise.
Detailed Design Process:
- Investigate connection pooling technology candidates. Review the
types of open source licenses associated with these technologies.
- Investigate second level cache technology candidates. Part of
this task is to review the types of open source licenses associated
with these technologies.
Runtime
Environment - User Interface
Character Mode User
Interface
The P2J user interface (UI) architecture is designed to offer multiple
client types (e.g. Swing, Web, ASCII Terminal) by using a pluggable
front-end that handles the client-type specifics.
P2J will initially provide a complete replacement for the terminal
oriented user interface of the Progress 4GL environment. This
character mode UI (sometimes referred to as "CHUI") will have multiple
components:
- CHARVA
- It is an LGPL project written to provide a Swing-like interface
for Java based CHUI applications.
- http://www.pitman.co.za/projects/charva/
- Modifications to this code may be needed to resolve defects or
to modify the look and feel. The documentation suggests that much
of the look and feel changes can be done by subclassing which may be
separated into our own code.
- Because the interface is highly compatible (almost completely
compatible) with Swing, adding Swing support later will be
significantly easier.
- Sub-Components:
- Java classes that provide the interface and implementation of
a character mode infrastructure. Most of the code in these
classes is pure Java, but a small amount relating to platform-specific
terminal interfaces will require native methods. These classes
are packaged in a charva.jar file.
- A JNI library backing the native methods of the Java classes
referenced above. This library is implemented on Linux.
- Progress 4GL Compatible CHUI Widgets
- These are classes that implement the set of specific UI
controls and capabilities that exactly match the Progress 4GL CHUI
environment.
- The exact look and behavior must be maintained completely with
the intention that a user would not know the difference between the
original Progress application and the new Java version.
- Client Application Launcher
- This provides a traditional "main()" function (a minimal launcher sketch appears after this component list).
- It allows the JVM launcher process (such as the "java"
executable) to be called from a script or even to be used as a user's
login shell itself.
- This client JVM will completely implement the application's
user interface and will access the business logic and database through
a secure transaction interface.
- Application Specific Classes
- These are classes which have been programmatically converted from
the Progress 4GL source code.
- These classes represent the real application itself; however,
the code will only be a subset of the original application because this is
only the UI portion of the original Progress
application. As part of the conversion, the UI will be refactored
out of the tightly coupled approach used in Progress. No part of
the business logic or data access will be implemented at the client
level.
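The launcher sketch referenced in the component list above might look roughly
like this. Everything other than the standard main() entry point is a
hypothetical placeholder for the secure transport connection and the
converted CHARVA-based UI.

// Illustrative client application launcher (all names other than main() are hypothetical).
public class ClientLauncher {
    public static void main(String[] args) {
        String serverHost = args.length > 0 ? args[0] : "localhost";
        int serverPort = args.length > 1 ? Integer.parseInt(args[1]) : 3333;

        try {
            // hypothetical runtime pieces: connect, then run the converted UI code
            ServerConnection connection = ServerConnection.open(serverHost, serverPort);
            new ApplicationUi(connection).run();          // blocks until the user exits
            System.exit(0);
        } catch (Exception e) {
            // a login-shell style launcher should fail visibly, not hang
            e.printStackTrace();
            System.exit(1);
        }
    }

    /** Hypothetical stand-ins so the sketch is self-contained. */
    static class ServerConnection {
        static ServerConnection open(String host, int port) { return new ServerConnection(); }
    }
    static class ApplicationUi {
        ApplicationUi(ServerConnection connection) { }
        void run() { /* converted, CHARVA-based UI code would execute here */ }
    }
}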
Future enhancements of P2J will be targeted at providing additional
client types.
Detailed Design Process:
- Make a complete list of all user interface controls and features
used in the current Progress 4GL application.
- For each control or feature, document the exact specification of
its visual look and functional behavior. This will likely require
some amount of prototyping research in the Progress 4GL environment to
"fill in" the inevitable gaps in the documentation.
- Map each control and feature into its CHARVA equivalent, where
there is one. Document all gaps as well as all required changes
or enhancements that will be needed to provide the needed
function. This will require experimentation with the CHARVA code.
- Analysis/testing of the CHARVA code in specific terminal
environments. All issues will be documented in detail. The exact
terminal software used by the client will be used for this testing and
the following terminal types will be tested:
- Implement each Progress 4GL Compatible CHUI Widget and all proper
behaviors.
- Design the process and the specific logic (flow chart +
pseudocode) for refactoring the UI code from the business logic.
This means that all 4GL UI code must be:
- identified
- categorized
- separated from business logic and data validation
- mapped into a standalone Java class or set of classes
- Develop the application launcher class (to startup the client
side).
Menuing and Navigation
Although menuing and navigation is closely related with the user
interface, it is a separate topic. P2J provides a generic
mechanism for implementing a menu-based interface that allows the user
to navigate through an application. The menus are defined in the
Directory and each user's specific access rights are likewise in the
Directory. This allows a generic menuing mechanism to operate
differently for each user.
Security will be implemented by the menu manager. Only those
entries to which a user has access will be available.
The navigation itself will have a user interface component that is
responsible for managing the interaction with the user. It
will display the menu to the user and process the menu choices.
This user interface component must have a customizable look and feel,
such that customers can properly mimic current menu interfaces.
This processing must be locale enabled.
Implementation Process:
- Document requirements for a generic menu/navigation processing
facility.
- Define the syntax and format for the navigation rules.
Online Help
A replacement for the Progress 4GL context sensitive help system will
be built. This facility will look and feel the same (using the CHARVA
widgets) as the Progress equivalent. The key here is that
this is a
hand-built Java application that uses the widgets to implement the UI
for the online help. A back-end set of transactions for accessing
the help data must be built. This processing must be locale
enabled.
Implementation Process:
- Create a set of transactions to access specific help topics based
on context.
- Document look and feel for the UI.
- Implement the UI using the CHARVA widgets.
Data
and Database
Conversion
The following diagram illustrates the 3 stage process through which the
application data and Progress 4GL database schema will be converted:

Database Schema Conversion
(Stage 1)
Using the Progress 4GL development environment, one can export the
current database schema. This creates a set of .df text files that have
a structured definition of the Progress database schema. An early
assumption of this project was that the entire schema definition is
exported into this format, so no other data would be necessary to
perform the schema conversion. This is not the case.
In particular, the schema dump file does not contain information about
the data integrity relations between database tables. There is no
explicit information in the dump file to determine a table's primary
and
foreign keys, which are necessary to create the relational schema
correctly and ensure the data is migrated correctly. The dump file does
include information about a table's indices, one of which must be
designated the primary index. However, the semantics of a primary index
in Progress differ from that of primary keys in a relational data
model,
such that we can't map Progress primary indices directly to relational
primary keys in all cases.
Specifically, Progress primary indices are not necessarily unique. It appears that a Progress primary
index is simply a "regular" index, but with some special significance.
Most likely the "primary" designation merely ensures that the index so
marked will determine the default scanning order for the records in
that
table during a query, and that index will be preferred for this purpose
over other indices (if any) defined for that table.
The net effect of this deficiency is that the 3 stages of database
conversion described herein are actually converging with one another
and
to some degree with the code conversion area of the project. A
dependency upon the ability to scan database access code to analyze
query logic now exists. Table relation information will have to be
gleaned from automated and manual code analysis.
A schema conversion tool parses the .df files and for each element of
that file maps this to the appropriate counterpart in a relational
database structure. The tool generates an output report
indicating any issues encountered during the conversion.
A "P2R Hints" (P2R stands for Progress to Relational) file allows the
persistent specification of any deviations from the default conversion
choices. It will also contain information about the primary-foreign key
relations between tables. The schema conversion process will take the
default conversion mapping and sizes unless otherwise overridden by
entries in this hints file, and will in any event use the relational
information in that file. The default mappings for the data type,
column widths/constraints and name conversions can be overridden.
The hints file also allows the specification of new/deleted tables,
columns, indices... and other refactoring to be implemented in the
conversion process.
There is an optional "Usage Pattern Hints" file. By encoding the
Progress database's pattern of use the schema conversion can be
augmented with any additional columns or features needed to properly
support the Hibernate caching and locking strategies defined in stage
3. This file will be generated through interviews with the
Progress application developers, analysis of the application source and
analysis of the exported data. The knowledge that will be encoded
includes:
- For each table, the frequency of:
- queries
- inserts/updates/deletes
- The relative volumes of data in given tables:
- per record
- per typical transaction (query or modification)
Some data types have no fixed length columns (see below for some needed
investigations here). Note that it is likely that only character
data has this behavior. In addition, there is a "hard limit" on
the total record size (the cumulative size of all columns in a record)
of 32KB. In the case where no fixed column size exists, the
schema only holds hints about the expected size. To find out the
actual size the actual data records must be reviewed. Thus the
Progress database contents will be exported to .dd files. These are
structured text files that have the actual records on a table by table
basis. Analysis of these files will allow the calculation of the
current column size based on production data. In a Relational
Database Management System (RDBMS) most columns require a real maximum
to be set in advance. Calculations from the actual data,
heuristic rules and overrides in the P2R hints file will provide the
means to establish these limits.
Progress does not have a timestamp data type. This means that
timestamps are really implemented by each application developer and
most
applications can have multiple different approaches to handling
this. For example:
- A date field + an integer that holds seconds past midnight.
- A date field + a string.
- A string.
- A decimal.
This is not an exhaustive list. Since this can be done in so many
ways and there are no standards, it may be quite challenging to
implement a conversion to a timestamp data type. At a minimum,
the
brute force approach of leaving the code as is and running the
equivalent in Java will provide the same capabilities as today.
However,
this may forego opportunities to result in a cleaner application by
converting a more complicated approach into a common, standard
timestamp approach. If there is enough commonality in the
patterns of how timestamps are done that we can detect and convert this
code in a reasonable manner, then we will implement such an
approach. Note that this has a big impact on all stages of the
data/database conversion process as the schema, the data conversion and
the data model must all be aware of this non-default conversion.
This may also be a good example of an override in the P2R Hints file.
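As a sketch of what such a conversion rule might do, the fragment below
combines the first pattern listed above (a date field plus an integer holding
seconds past midnight) into a single SQL TIMESTAMP value during data
conversion. The field names are illustrative; each detected pattern would
need its own rule.

import java.sql.Timestamp;
import java.util.Calendar;
import java.util.Date;

// Sketch: combine a DATE field and seconds-past-midnight into one TIMESTAMP.
public class TimestampConversion {
    public static Timestamp combine(Date dateField, int secondsPastMidnight) {
        Calendar cal = Calendar.getInstance();
        cal.setTime(dateField);                            // date portion
        cal.set(Calendar.HOUR_OF_DAY, 0);                  // normalize to midnight
        cal.set(Calendar.MINUTE, 0);
        cal.set(Calendar.SECOND, 0);
        cal.set(Calendar.MILLISECOND, 0);
        cal.add(Calendar.SECOND, secondsPastMidnight);     // add the time-of-day portion
        return new Timestamp(cal.getTime().getTime());
    }
}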
The output from the schema conversion tool is as follows:
- Progress Schema Namespace
- This is a list of all database, table and field names that the
Progress 4GL application can reference.
- It will be used by a simple lookup class to determine if a
specific string is a valid schema reference.
- This lookup class will be used by a predicate action in the Lexical
Analyzer (Lexer) to properly handle the ambiguity of
differentiating
the statement termination character from the database schema element
qualifier (both use the '.' period).
- P2R Mapping
- This is a primary output of the schema conversion.
- It specifies enough information to allow the following
downstream processing:
- Creation of the Data Definition Language (DDL) necessary to
define/create the database on an automated basis (possible intermediate
input into Hibernate mapping document creation).
- Reading the individual records from the exported data files
and converting the Progress format into the chosen RDBMS format.
- Determining the proper ordering and maintaining referential
integrity when bulk loading the converted data into the RDBMS.
- Creation (with additional inputs) of a set of appropriate
object views of the database (the data
model) based on the Progress database references in the application
source code.
- Conversion of all Progress database references in the
application source code into the proper resulting Java code to access
the data model to get the same results as the original Progress 4GL
source code.
- Hibernate Mapping Documents
- A primary output of the schema conversion.
- Allows use of the Hibernate SchemaExport tool to create and
apply to a variety of database dialects the Data Definition Language
(DDL) necessary to define the database on an automated basis.
- Data Model Hints
- There may be hints recorded to assist or enable optimal data
model conversion.
- Issue Report
- This is a list of all issues and/or ambiguities encountered in
this run of the conversion tool.
Detailed Design Process:
- Determine if the variable column width extends to all data types
(e.g. do integers have unlimited size).
- Determine if there is any inappropriate reliance upon the
unlimited column size feature of Progress. Specifically, check
for
situations in which a particular field of a specific record is being
used in an ever growing never shrinking manner to accumulate a log or
other history-like data. If so we will need to either replace
this approach manually or design a method to detect and convert such
situations automatically. The choice of approach will depend upon
how many instances of such use there are and how uniform these
instances are.
- The format of the .df schema files is documented in the data_df.g
ANTLR grammar to a degree sufficient for parsing and creating a
hierarchy of objects which represent the application schema. The meaning and
purpose of the particular keywords/properties must yet be documented.
- Reverse engineer and document the format of the .dd data export
files.
- For each schema element, determine and document the proper
mapping(s) in a standard RDBMS environment.
- This must consider all aspects of the schema conversion
including:
- data type
- column size
- database/table/record structure
- table relations
- primary/foreign keys
- use of surrogate primary keys whenever possible (e.g. to
replace composite primary keys or other primary keys which have
business
meaning)
- indices
- table/column constraints
- naming
- Make a note of any mappings that are ambiguous (where one may
not be able to automatically choose the correct mapping).
- Define the rules for converting Progress names to names
appropriate in a standard RDBMS environment, where it makes sense to
change the names.
- Identify the list of places in the application that use timestamps.
Categorize the number of discrete timestamp implementations. Make
a determination of the feasibility of automatically identifying and
converting each implementation pattern.
- List the requirements for the P2R Hints file. The list of
specific overrides and refactoring features must be documented.
- Define the format of the P2R Hints file.
- List the requirements for the Usage Pattern Hints file.
- Define the format of the Usage Pattern Hints file.
- Extend the format of the P2R Mapping output file prototype.
This needs to contain a complete, bidirectional mapping between every
Progress schema element and the final RDBMS equivalent.
Data Conversion and RDBMS
Creation/Loading (Stage 2)
The first step in the data conversion stage is to create the relational
database. DDL must be generated using the P2R Mapping output from
Stage 1. That DDL is then submitted to the proper database via
JDBC. Any issues that occur during the creation process need to
be
reported clearly and completely.
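As a rough illustration of this first step, the following sketch shows how a
set of generated DDL statements might be submitted through JDBC, with
failures reported rather than aborting the run. The class and method names
are illustrative only; the actual DDL generation from the P2R Mapping is not
shown.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.sql.Statement;
    import java.util.Iterator;
    import java.util.List;

    public class SchemaCreator
    {
       /** Submits each generated DDL statement via JDBC.  Failures are
        *  reported (for the issue report) without stopping the run. */
       public static void createSchema(String url, String user, String pw, List ddl)
          throws SQLException
       {
          Connection conn = DriverManager.getConnection(url, user, pw);
          try
          {
             Statement stmt = conn.createStatement();
             for (Iterator it = ddl.iterator(); it.hasNext(); )
             {
                String sql = (String) it.next();
                try
                {
                   stmt.executeUpdate(sql);
                }
                catch (SQLException exc)
                {
                   // report clearly and completely, then continue
                   System.err.println("DDL failed: " + sql + " (" + exc.getMessage() + ")");
                }
             }
             stmt.close();
          }
          finally
          {
             conn.close();
          }
       }
    }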
The second step is to read the data from the exported .dd text files
and convert each field of each record from the Progress format into the
format needed for the relational database. While most string data
may not need modification, numeric data, timestamps or data in other
formats might need conversion or transformation before it can be loaded
into the RDBMS. Most importantly, this processing must
maintain the correct referential integrity in the target database, such
that all record level relationships are maintained.
where there are new or different primary keys, the conversion
code
or the target database itself must generate these keys. Of
course, all foreign keys that refer to these new or modified primary
keys must be kept synchronized (changed in "lock-step").
The
data will be converted on a per-record basis, with the process starting
at a root point in the tree of dependent tables. Following this
dependency tree, a graph of related record data will be read and
converted into the target format.
The third step is to take the converted data and load the RDBMS.
While this step is displayed as a logically separate step, it is likely
that this will be done at the same time as the data conversion, on a
per-record basis. Each related graph of converted data will be
inserted into the RDBMS at the same time, thus maintaining referential
integrity. The final result will be a fully populated RDBMS with
the production data in a form ready for application access.
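The "lock-step" key handling described above can be illustrated with the
following hedged sketch. The table and column names are invented; the point
is that a related graph of records is inserted within one transaction, with
the surrogate primary key generated by the target database propagated to the
dependent foreign keys.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class RecordGraphLoader
    {
       /** Loads one parent record and its dependent child rows in a single
        *  transaction, propagating the generated surrogate key in lock-step.
        *  Table and column names here are purely illustrative. */
       public static void loadGraph(Connection conn, String custName, String[] orderNums)
          throws SQLException
       {
          conn.setAutoCommit(false);
          try
          {
             PreparedStatement parent = conn.prepareStatement(
                "insert into customer (name) values (?)",
                Statement.RETURN_GENERATED_KEYS);
             parent.setString(1, custName);
             parent.executeUpdate();

             ResultSet keys = parent.getGeneratedKeys();
             keys.next();                        // assumes the driver returns the new key
             long customerId = keys.getLong(1);  // new surrogate primary key

             PreparedStatement child = conn.prepareStatement(
                "insert into orders (customer_id, order_num) values (?, ?)");
             for (int i = 0; i < orderNums.length; i++)
             {
                child.setLong(1, customerId);    // foreign key kept in sync
                child.setString(2, orderNums[i]);
                child.executeUpdate();
             }
             conn.commit();
          }
          catch (SQLException exc)
          {
             conn.rollback();
             throw exc;
          }
       }
    }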
It is likely that the calculation of the tree of table dependencies
(that defines the order in which the tables of the database need to be
loaded) and the graph of relationships between tables (which defines
the
manner in which related records are processed to be loaded at the same
time to maintain referential integrity) will be useful information for
use in Stage 3. Caching this information in the "Data Model
Hints" may be done if this is of value.
While performance is always of some concern, the primary design drivers
of this processing are simplicity and reliability. As long as the
resulting database can be populated with the complete set of the
correct
data in the correct format that maintains all necessary relationships,
then the process is a success. Simplicity is important because
the maintenance and improvement of this processing must be easy and
minimally error prone.
Detailed Design Process:
- Define the specific set of rules by which one determines the tree
of dependent tables.
- Define the specific set of rules by which one determines the
graph of relationships for record level processing.
- Define the list of all data conversions that must be done based
on the P2R Mapping (output from Stage 1) and the format of the .dd
files. This task is dependent upon the documentation of the P2R Mapping
and the .dd file format in the Stage 1 design.
Data Model Conversion (Stage 3)
This stage has the objective of creating a set of business objects that
represent the database in a format that is easily programmed and
understood. This is called the data model
and the core idea is
that Java business logic needs a set of objects that provide a natural
interface to the database, and which abstract the mechanics of how the
data is stored and retrieved. The business logic is "client" code
in this regard: it simply uses and modifies the data model, and the
data model handles the database specifics without the business logic's
involvement.
The data model is a representation of the different (and possibly
overlapping) business-defined views of the database. The
definitions of these views are stored in the Progress 4GL application
source code. Thus the core input for the data model conversion
process is the preprocessed 4GL application source code. It must
be preprocessed because until the preprocessor runs, the source file
may still contain code that is not valid Progress 4GL.
The P2R Mapping is a second input that is extremely important to the
data conversion process. This is the primary output from Stage 1
and it is needed here to allow the data conversion process to convert
the business views in the source code into the correct Java
representation that abstracts the RDBMS structure. In order to
abstract that structure, the data conversion process must have the
structure's definition and it must know the exact Progress 4GL source
of each target element.
Other inputs to the conversion process include usage pattern hints and
data model conversion hints, both of which are created during Stages 1
and 2. These hints define overrides and other non-default
behavior in order to resolve ambiguous decisions or optimize
performance. In particular, the usage pattern hints will define
where non-default approaches are needed in regards to the caching and
locking strategies of the Object
to Relational Mapping layer.
Outputs include:
- Object to Relational Mapping (O2R Mapping)
- This is the definition of the data model in a P2J format (not
specific to the persistence framework).
- Progress 4GL to Object Mapping (P2O Mapping)
- This provides a way to determine the data model replacement for
a given Progress 4GL data structure.
- This is used as an input to the code conversion process which
creates the business logic client code for the data model.
- Code Conversion Hints
- This contains input to guide, override or simplify the code
conversion process.
- An example of the contents is a list of the parts of the
Progress 4GL application source code that can be ignored (because the
data model conversion has already completely converted or eliminated
the need for additional target code).
- Cross-reference of
Progress 4GL variables/schema elements to Java Data Model replacements
(and vice versa) for the following:
- name mapping (relative to the data model only)
- source line mapping from Progress 4GL files to/from Java files
(relative to the data model only)
Once the data model conversion process completes, the O2R mapping
output can be converted into a specific set of Object to Relational
Mapping configuration/mapping files. This will be done by a P2J
specific process and it will generate Hibernate specific output
files. In the future, if other persistence frameworks are
supported, this generation process will require changes. Note
that the O2R mapping will not need to be changed in this case.
Once the Hibernate mapping files are created, the actual Java data
model source code can be generated.
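The exact shape of the generated data model source has not been finalized.
As a purely hypothetical illustration, a converted table might surface as a
simple persistent class such as the following, which a generated Hibernate
mapping document would bind to the corresponding RDBMS table:

    /** Hypothetical generated data model class; the class and field names are
     *  invented for illustration.  Hibernate persists instances of such
     *  classes based on the generated mapping documents. */
    public class Customer
    {
       private Long   id;        // surrogate primary key
       private String name;
       private String address;

       public Long   getId()                  { return id; }
       public void   setId(Long id)           { this.id = id; }
       public String getName()                { return name; }
       public void   setName(String name)     { this.name = name; }
       public String getAddress()             { return address; }
       public void   setAddress(String addr)  { this.address = addr; }
    }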
Detailed Design Process (highly dependent upon the completion of the Data Model detailed design process):
- Define the rules for converting Progress names to data model
names, where it makes sense to
change the names.
- List the requirements for the O2R mapping file.
- Define the format of the O2R mapping file.
- List the requirements for the P2O mapping file.
- Define the format of the P2O mapping file.
- List the requirements for the Code Conversion Hints file.
- Define the format of the Code Conversion Hints file.
- Document the process of writing the Hibernate-specific mapping
files from the O2R output file.
- Determine the approach for generating code for the data
model. Can/will Hibernate utilities be used or leveraged?
Will these utilities need to be enhanced or replaced?
Code
Conversion
The following diagrams illustrate the 6-phase process through which the
Progress 4GL source code will be converted into Java:


Progress 4GL
Lexical Analyzer and Parser
Multiple components of the P2J conversion tools will need the ability
to process Progress 4GL source code. To do this in a standard
way, a Progress 4GL-aware lexical analyzer and parser will be
created. Some details on lexical analyzers and parsers:
Lexical Analyzer and
Parser Primer
By using a common lexical analyzer and parser, the tricky details of
the Progress 4GL language syntax can be centralized and solved properly
in one location. Once handled properly, all other elements of the
conversion tools can rely upon this service and focus on their specific
purpose.
Parser and lexer generators have reached a level of maturity at which
they can easily produce results that meet the
requirements of this project. This saves considerable development
time while generating code whose structure and level of
performance meets or exceeds a hand-coded approach.
In a generated approach, it is important to ensure that the resulting
code is unencumbered from a license perspective.
The current approach is to use a project called ANTLR. This is a public domain
(it is not copyrighted, which eliminates licensing issues) technology
that has been in development
for over 10 years. It is used in a large number of projects and
seems to be quite mature. It is even used by one of the Progress
4GL vendors in a tool to beautify and enhance Progress 4GL source
code (Proparse). The
approach is
to define the Progress 4GL grammar in a modified Extended Backus-Naur
Form (EBNF)
and then run the ANTLR tools to generate a set of Java class files that
encode the language knowledge. These resulting classes provide
the lexical analyzer, the parser and also a tree walking
facility. One feeds an input stream into these classes and the
lexer and parser generate an abstract syntax tree of the results.
Analysis and conversion programs then walk this tree of the source code
for various purposes. There are multiple different "clients" to
the same set of ANTLR generated
classes. Each client would implement a different view and usage
of the source tree. For example, one client would use this tree
to analyze all nodes related to external or internal procedure
invocation. Another client would use this tree to analyze all
user interface features.
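A minimal sketch of this flow, in the standard ANTLR 2.x usage style, is
shown below. The class names ProgressLexer and ProgressParser and the start
rule program() are placeholders for whatever the actual generated grammar
provides.

    import antlr.collections.AST;
    import java.io.FileInputStream;

    public class ParseOneFile
    {
       public static void main(String[] args) throws Exception
       {
          // lex and parse a single (preprocessed) Progress 4GL source file
          ProgressLexer  lexer  = new ProgressLexer(new FileInputStream(args[0]));
          ProgressParser parser = new ProgressParser(lexer);
          parser.program();                 // hypothetical start rule
          AST tree = parser.getAST();       // resulting abstract syntax tree

          // each "client" walks the same tree for its own purpose; here we
          // simply print the text of each top level node
          for (AST node = tree; node != null; node = node.getNextSibling())
          {
             System.out.println(node.getText());
          }
       }
    }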
P2J uses ANTLR 2.7.4. This version, at a minimum, is required to
support the token
stream rewriting feature needed for preprocessing.
Some useful references (in the order in which they should be read):
- Building Recognizers By Hand
- All I know is that I need to build a parser or translator. Give me an
overview of what ANTLR does and how I need to approach building things
with ANTLR
- An Introduction To ANTLR
- BNF and EBNF: What are they and how do they work?
- Notations for Context Free Grammars - BNF and EBNF
- ANTLR Tutorial (Ashley Mills)
- ANTLR 2.7.4 Documentation
- EBNF Standard - ISO/IEC 14977 - 1996(E)
Do not confuse BNF or EBNF with Augmented BNF (ABNF).
The latter is defined by the IETF as a convenient reference to their
own enhanced form of BNF notation which they use in multiple
RFCs. The original BNF and the later ISO standard EBNF are very
close to the ABNF but they are different "standards".
Golden Code has implemented its own lexical analyzers and parsers
before (multiple times); a Java version was done for the Java Trace
Analyzer (JTA). However, the ANTLR
approach has already built a set of general purpose facilities for
language recognition and processing. These facilities include the
representation of an abstract syntax tree in a form that is easily
traversed, read and modified. The resulting tree also includes
the ability to reconstitute the exact, original source. These
facilities can do the job properly, which makes ANTLR the optimal
approach.
The core input to ANTLR is an EBNF grammar (actually the ANTLR syntax
is a mixture of EBNF, regular expressions, Java and some ANTLR unique
constructs). It is important to
note that Progress Software Corp (PSC) uses a BNF format as the basis
for their authoritative language parsing/analysis. They even
published a (copyrighted) BNF as 2 .h files on the "PEG", a Progress
4GL bulletin board (www.peg.com). In addition, the
Proparse grammar (EBNF based and used in ANTLR) can also be found on
the web. Thus it is clear that EBNF is sufficient to properly
define the Progress 4GL language and that ANTLR is an effective
generator for this application. This is the good news.
However, there is
a very important point here: the developers have completely
avoided even looking at ANY
published Progress 4GL grammars so that our hand built grammar is "a
clean room implementation"
that is 100% owned by Golden Code.
The fact that they published these on the Internet does not negate
their copyright in the materials, so these have been treated as if they
did
not exist.
An important implementation choice must be noted here.
ANTLR provides a powerful mechanism in its grammar definition.
The structure of the input stream (e.g. a Progress 4GL source file) is
defined using an EBNF definition. However, the EBNF rules can
have arbitrary Java code attached to them. This attached code is
called an "action". In the preprocessor, we are deliberately
using the action feature to actually implement the preprocessor
expansions and modifications. Since these actions are called by
the lexer and parser at the time the input is being processed, one
essentially is expanding the input directly to output. This usage
corresponds to a "filter" concept which is exactly how a preprocessor
is normally implemented. This is also valid since the
preprocessor and the Progress 4GL language grammars do not have much
commonality. Since we are hard coding the preprocessor logic into
the grammar, reusing this grammar for other purposes will not be
possible. This approach has the advantage of eliminating the
complexity, memory footprint and CPU overhead of handling recursive
expansions in multiple passes (since everything can be expanded inline
in a single pass). The down side to this approach is that the
grammar is hard coded to the preprocessor usage and it is more
complicated since it includes more than just structure. This
makes it harder to maintain because it is not independent of the lexer
or parser.
The larger, more complex Progress 4GL grammar is going to be
implemented in a very generic manner. Any actions will only be
implemented to the extent that is necessary to properly tokenize,
parse
the content or create a tree with the proper structure. For
example, the Progress 4GL language has many
ambiguous aspects which require context (and sometimes external input)
in order to properly tokenize the source. An example is the use
of the period as both a database name qualifier as well as a language
statement terminator. In order to determine the use of a
particular instance, one must consult a list of valid database names
(which is something that is defined in the database schema rather than
in the source code itself). Actions will be used to resolve such
ambiguity but they will not be used for any of the conversion logic
itself. This will allow a generic syntax tree to be generated and
then code external to the grammar will operate on this tree (for
analysis or conversion). In fact, many different
clients of this tree will exist and this allows us to implement a clean
separation between the language structure definition and the resulting
use of this structure.
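For example, the schema lookup described earlier in this document could back
a small helper that is consulted from a lexer predicate action. A hedged
sketch (all names are illustrative) follows:

    import java.util.HashSet;
    import java.util.Set;

    /** Consulted by a lexer/parser action to decide whether "name." is a
     *  schema qualifier or the end of a statement.  Populated from the
     *  converted schema, not from the source code itself. */
    public class SchemaDictionary
    {
       private final Set names = new HashSet();   // valid database/table names

       public void addName(String name)
       {
          names.add(name.toLowerCase());
       }

       /** True if the token preceding a '.' is a known schema element, in
        *  which case the period is a qualifier rather than a statement
        *  terminator. */
       public boolean isSchemaReference(String token)
       {
          return names.contains(token.toLowerCase());
       }
    }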
The following is a list of the more difficult aspects of the Progress
4GL:
- the ability to recognize arbitrarily shortened language
keywords as the full equivalent
- some keywords support abbreviations and there is a minimum
abbreviation that is valid (e.g. "define" can be shortened at most to
"def")
- an arbitrary abbreviation can be chosen if the keyword
supports abbreviations (e.g. def, defi, defin and define are all valid
forms of the define keyword)
- if the keyword is not "reserved", then all of these forms can
also be used as valid symbol names (e.g. the keyword variable can be
shortened down to a minimum of var but these same forms can be used as
names because variable is not a reserved keyword)
- see "Keyword Index" on page 1795, Progress 4GL Language
Reference Volume 2 for a list of the keywords, whether they can be
abbreviated, the minimum abbreviation and whether the keyword is
reserved or not
- differentiating the use of the period '.':
- end of statement delimiter
- database table.field qualifier format for specifying a
specific column in a table
- decimal point in a decimal constant
- a valid character in a filename (e.g. an include file)
- meaningless usage inside comments or a string literal
- distinguishing between functions and language statements of the
same name
- ACCUM function and the ACCUMULATE language statement
that can be abbreviated down to ACCUM
- CURRENT-LANGUAGE
- CURRENT-VALUE
- ENTRY
- IF THEN ELSE
- PROPATH
- how functions are parenthesized
- MOST functions that have parameters take parentheses
- if a function takes parentheses, it always requires them
- these functions are exceptions (they take parameters but do not
use parentheses):
- AMBIGUOUS record
- AVAILABLE record
- CURRENT-CHANGED record
- record ENTERED
- record NOT ENTERED
- INPUT field
- LOCKED record
- NEW record
- IF expression THEN expr1 ELSE expr2
- these non-parenthesized parameter exceptions are commonly used
in conditional expressions such as the IF function, so they are
certainly implemented as functions
- functions with no parameters (e.g. OPSYS, TODAY) never take
parentheses (in other words, built-in functions will never have empty
parentheses) and often act like global variables (some are read-only
and some can be assigned!)
- CURRENT-LANGUAGE
- DBNAME
- NUM-ALIASES
- NUM-DBS
- OPSYS
- PROGRESS
- PROMSGS
- PROPATH
- PROVERSION
- TERMINAL
- TIME
- TODAY
- user defined functions always take parentheses even if there
are no parameters (empty parentheses in this case)
- some language statements have function syntax and act like
lvalues (they can be assigned to!)
- CURRENT-VALUE() = expression.
- ENTRY() = expression
- SET-BYTE-ORDER() = expression
- SET-POINTER-VALUE() = expression
- SET-SIZE() = expression
- SUBSTRING() = expression
- postfixing is sometimes used or allowed in surprising locations
- numeric literals can sometimes have a postfixed negative or
positive sign (as long as there is no intervening whitespace)
- 2 functions are coded with postfixed function names (and no use
of parentheses!)
- record ENTERED
- record NOT ENTERED
- symbol names can include characters that are also used as
operators (e.g. the hyphen is also used as a subtraction operator and
as the unary negation operator)
- symbol names must start with alpha characters
- subsequent characters can include numeric digits and the
special characters # $ % & - _
- symbol lengths
- the maximum variable name is 32 characters (see page 3-30
of the Progress Programming Handbook)
- external procedure names are limited by the operating
system rules as well as some unspecified set of rules where Progress
applies a least common denominator approach
- internal procedure names are not bounded by any documented
rule, but the Handbook suggests there is a limit
- the keyword richness of the language leads to:
- polymorphic usage of the same language keywords for different
meaning depending on which language statement is being processed
- the very large number of language keywords means that many or
most of these must be left as unreserved keywords (the user may define
symbols that hide these keywords)
- the same symbol may appear in multiple different namespaces in
the same procedure, yet these different uses are all properly
differentiated
Based on analysis and testing of a simple Progress 4GL grammar, all of
the above problems can be solved using ANTLR.
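As one small illustration of the keyword abbreviation rules above, a trivial
helper can decide whether a token is a valid abbreviation of a given keyword.
This is a sketch only; in practice this check will live in the keyword
handling of the generated lexer.

    /** Checks whether a token is a valid abbreviation of a keyword,
     *  e.g. matches("define", "def", "defi") == true. */
    public class KeywordMatcher
    {
       public static boolean matches(String fullKeyword, String minAbbrev, String token)
       {
          String t = token.toLowerCase();

          // must be at least as long as the minimum abbreviation and no
          // longer than the full keyword
          if (t.length() < minAbbrev.length() || t.length() > fullKeyword.length())
          {
             return false;
          }

          // must be a leading substring of the full keyword
          return fullKeyword.toLowerCase().startsWith(t);
       }
    }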
For the implementation process, please see the detailed design
document's Implementation Plan section.
4GL Unified Abstract
Syntax Tree
An abstract syntax tree (AST) is a hierarchical data structure which
represents the semantically significant elements of a language in a
manner that is easily processed by a compiler or interpreter.
Such trees can be traversed, inspected and modified. They can be
directly interpreted or used to generate machine instructions. In
the case of P2J, all Progress 4GL programs will be represented by an
AST. The AST associated with a specific source file will be
generated by the lexer and parser.
Since most Progress 4GL procedures call other Progress 4GL procedures
which may reside in separate files, it is important to understand all
such linkages. This concept is described as the "call
tree". The call tree is defined as the entire set of external and
internal procedures (or other blocks of Progress 4GL code such as
trigger blocks) that are accessible from any single "root" or
application entry point. It is necessary to process an entire
Progress 4GL call tree at once during conversion, in order to properly
handle inter-procedure linkages (e.g. shared variables or parameters),
naming and other dependencies between the procedures.
One can think of these 2 types of trees (abstract syntax trees and the
call tree) as 1 larger tree with 2 levels of detail. The call
tree can be considered the higher level root and branches of the
tree. It defines all the files that hold reachable Progress 4GL
code and the structure through which they can be reached. However
as each node in the call tree is a Progress 4GL source file, one can
represent that node as a Progress 4GL AST. This larger tree is
called the "Unified Abstract Syntax Tree" (Unified AST or UAST).
Please see the following diagram:

By combining the 2 levels of trees into a single unified tree, the
processing of an entire call tree is greatly simplified. The
trick is to enable the traversal back and forth between these 2 levels
of detail using an artificial linkage. Thus one must be able to
traverse from any node in the call tree to its associated Progress 4GL
AST (and back again). Likewise one must be able to traverse from
AST nodes that invoke other procedures to the associated call tree node
(and back).
Note that strictly speaking, when these additional linkage points are
added the result is no longer a "tree". However for the sake of
simplicity it is called a tree.
This representation also simplifies the identification of overlaps in
the tree (via recursion or more generally wherever the same code is
reachable from more than 1 location in the tree).
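One possible (purely illustrative) shape for a call tree level node,
including the "artificial linkage" down to the ANTLR-generated AST and the
links to child call tree nodes, is sketched below:

    import antlr.collections.AST;
    import java.util.ArrayList;
    import java.util.List;

    /** One node in the call tree level of the Unified AST.  Each node
     *  represents a reachable external procedure (source file) and is
     *  "artificially" linked to the Progress 4GL AST for that file. */
    public class CallTreeNode
    {
       private final String sourceFile;                 // external procedure file name
       private AST          ast;                        // linkage down to the 4GL AST
       private final List   callees = new ArrayList();  // child call tree nodes

       public CallTreeNode(String sourceFile)
       {
          this.sourceFile = sourceFile;
       }

       public String getSourceFile()               { return sourceFile; }
       public AST    getAst()                      { return ast; }
       public void   setAst(AST ast)               { this.ast = ast; }
       public List   getCallees()                  { return callees; }
       public void   addCallee(CallTreeNode node)  { callees.add(node); }
    }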
To create this unified tree, the call tree must be generated starting
at a known entry point. This call tree will be generated by the
"Unified AST Generator" (UAST Generator) whose behavior can be modified
by a "Call Tree Hints" file which may fill in part of the call tree in
cases where there is no hard coded programmatic link. For
example, it is possible to execute a run statement where the target
external procedure is specified at runtime in a variable or database
field. For this reason, the source code alone is not
deterministic and the call tree hints will be required to resolve such
gaps.
Note that there will be some methods of making external calls which may
be indirect in nature (based on data from the database, calculated or
derived from user input). The call tree analyzer may be informed
about these indirect calls via the Call Tree Hints file. This is
most likely a manually created file to override the default behavior in
situations where it doesn't make sense or where the call tree would not
be able to properly process what it finds.
The UAST Generator will create the root node in the call tree for a
given entry point. Then it will call the lexer/parser to generate
a Progress 4GL AST for that associated external procedure (source
file). An "artificial linkage" will be made between the root call
tree node and its associated AST. It will then walk the AST to
find all reachable linkage points to other external procedures.
Each external linkage point represents a file/external procedure that
is added to the top level call tree and "artificial" linkages are
created from the AST to these call tree nodes. Then the process
repeats for each of the new files. They each have an AST
generated by the lexer/parser and this AST is linked to the file level
node. Then additional call tree/file nodes will be added based on
this AST (e.g. run statements...) and each one gets its AST added and
so on until all reachable code has been added to the Unified AST.
The resulting UAST is built based on input that is valid, preprocessed Progress 4GL source code.
This is used as input to both Stage 2 (Dead Code
Analysis) and Stage 3 (Annotation). Stage 3 modifies the Unified
AST to add "annotations". Annotations are flags or other
attributes attached to specific nodes that provide more information
than just the node type and contents.
The output from Stage 3 is an Annotated Unified AST. The
structure is the same as the Unified AST, but there is much more
information stored in the Annotated Unified AST. This Annotated
Unified AST is an input to Stages 4 (Structure Analysis) and 5 (Code
Conversion).
All of these different progressions of the Unified AST are related to
and based upon the 4GL application source. Since this UAST
represents the source side of the conversion process, this document may
refer to the 4GL UAST as the "Source UAST".
Detailed Design/Implementation Process:
- Make a full list of possible call node types. This means
that all possible methods of program invocation must be documented
including enough information to build a pattern recognition module to
appropriately match such instances.
- Document the requirements for data that must be captured and/or
derived and subsequently stored in each call node. Each node must
store any generic data in addition to type-specific data.
- Analyze the list of indirect or otherwise troublesome program
invocation constructs in the 4GL application source code.
- Design the interface for the call tree level nodes.
- Design the mechanism for "artificial linkage" between the call
tree and each node's AST. This mechanism must allow the seamless
traversal between the 2 levels. It is this facility that allows
this to be called "unified".
- Design a hint format to store enough information to identify all
such troublesome situations as well as to "patch" the call tree to
traverse such
situations.
- Design a persistence mechanism for Unified ASTs. XML may be
an appropriate technology to leverage since it easily represents tree
structured data.
- Implement the UAST call tree nodes and the associated AST node
modifications.
- Implement the call tree creation logic to walk the source tree
and create a UAST using the lexer/parser output on each source file to
create the associated AST. Ensure that all linkages are properly
created.
- Implement persistence.
Java Unified
Abstract Syntax Tree
The Java Unified Abstract Syntax Tree is based on the same UAST concept
described in the preceding
section.
Where the Source UAST represents a call tree of 4GL source code, the
Java UAST represents the converted set of Java class/interface files,
resource bundles, expressions, rules and other output from the
conversion process. For this reason, we may refer to this UAST as
the "Target UAST".
Where the Source UAST is built by lexing/parsing the 4GL source code,
the Target UAST is constructed in memory through multiple stages
(Stages 4 and 5) and has no filesystem backing in Java source
code. For this reason, the nodes in the Target UAST are not based
on ANTLR derived objects but instead are created on a custom basis for
P2J. These nodes represent the semantic equivalent of Java
language constructs and other resources that in total represent the
converted Java application equivalent of the Source UAST. To be
clear: while in the Source UAST each node represents some compilable
4GL source code, in the Target UAST, each node represents the
conceptual Java construct but cannot be rendered into a source code
form without further processing (and formatting). Thus a node
might represent a Java language "for loop" and it would store the
minimum information necessary to describe the loop, such as the
variable initializer(s), the expression that tests for completion, the
expression and assignment that executes at loop end. It would
also have a subtree of other nodes that represent the for loop contents.
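Such a node might be sketched as follows. The class name and fields are
invented for illustration; the real Target UAST node design is part of the
detailed design.

    import java.util.ArrayList;
    import java.util.List;

    /** Illustrative Target UAST node for a Java "for" loop.  It records the
     *  loop's parts abstractly; Stage 6 renders them into source text. */
    public class ForLoopNode
    {
       private String initializer;   // e.g. "int i = 0"
       private String condition;     // e.g. "i < count"
       private String update;        // e.g. "i++"
       private final List body = new ArrayList();   // child Target UAST nodes

       public ForLoopNode(String initializer, String condition, String update)
       {
          this.initializer = initializer;
          this.condition   = condition;
          this.update      = update;
       }

       public void addBodyNode(Object node)
       {
          body.add(node);
       }

       /** Rough rendering used only to show the idea; real output
        *  formatting is handled in Stage 6. */
       public String renderHeader()
       {
          return "for (" + initializer + "; " + condition + "; " + update + ")";
       }
    }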
Stages 4 and 5 analyze the Source UAST (after it has been
annotated) and build up a representation of the converted Java
application, in the form of the Target UAST. By handling all this
separate from the actual source code output and formatting, these tools
can focus on getting the logical equivalence correct. In
addition, this architectural design makes multiple passes significantly
easier. If the output of Stages 4 and 5 were Java source code,
then multiple passes could only be enabled by editing the source code
or parsing it back into a tree form.
The Target UAST will be populated in Stage 4 with structural
information (classes, interfaces, methods, data members...), naming
information and linkage information (method signatures...).
In Stage 5, this skeleton structure is then "fleshed out" with the
actual replacement function.
This split in the building of structure versus the generation of the
details is necessary because the resulting programs are being
refactored into separate components (UI is separate from business logic
which is separate from the data model...). In order for these
components to work together (to handle the logical equivalent of the
previous, highly interdependent 4GL approach), the structure, naming
and linkage between the parts must be known in advance, even when the
actual client components that call these interfaces have not yet been
generated.
So in Stage 5, the actual Java language constructs (and other
supporting facilities like resource bundles or generic
rules/expressions) are added to the Target UAST. When this UAST
is complete, the actual output of the conversion can be prepared (Stage
6). Stage 6 actually writes out the compilable Java source code
(and other resources or rules) that is represented in a more abstract
manner in the Target UAST.
Detailed Design/Implementation Process:
- Design and implement the generic base class(es) that will be
needed to
construct the tree. Specific language constructs (see Stages 4
and 5) will be handled as subclasses of these base classes. Make
sure that all standard behavior and data to be stored is handled by the
base classes.
Tree Processing Harness
Since multiple code conversion stages (3-6) require processing an
entire UAST, the complexity of walking the tree is centralized in a
"Tree Processing Harness". This harness provides the following
capabilities:
- Provides a collection of those nodes of the tree that meet some
criteria (based on annotations, node type or other node data).
- Calls a client-specified callback method for each node in the
tree or a subset of nodes (as in the first item above).
- Provides an option to ignore processing a section of the tree
that represents a recursion or more generally any call to a branch of
the tree that has already been processed.
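A hedged sketch of what the harness's public surface could look like follows;
all names are invented and the real contract will be defined during detailed
design.

    import java.util.List;

    /** Possible public surface of the Tree Processing Harness
     *  (all names here are illustrative). */
    public interface TreeHarness
    {
       /** Selection criteria based on annotations, node type or other data. */
       interface Criteria
       {
          boolean matches(Object node);
       }

       /** Callback invoked once per visited node. */
       interface Visitor
       {
          void visit(Object node);
       }

       /** Returns the collection of nodes that satisfy the criteria. */
       List select(Object root, Criteria criteria);

       /** Walks the tree calling the visitor for each matching node; when
        *  skipProcessed is true, branches that have already been processed
        *  (recursion/overlap) are bypassed. */
       void walk(Object root, Criteria criteria, Visitor visitor, boolean skipProcessed);
    }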
Detailed Design/Implementation Process:
- Design the public interface for the tree processing harness.
- Implement the tree processing harness.
Progress 4GL Preprocessor
(Stage 1)
The objective of this stage of code conversion is to process the entire
4GL source tree and resolve/expand all preprocessor directives
in order to turn an arbitrary source file into a valid Progress 4GL
source
file. In general, the assumption must be that a file is
not syntactically valid Progress 4GL until after the preprocessor
runs.
The Progress 4GL environment uses a purely interpreted execution
model. While this has significant performance limitations, it
does
allow for some advantages over a compiled environment (e.g.
Java). External procedures can be written with preprocessor
references to named or positional arguments (not to be confused with
parameters which are runtime entities that exist on the stack).
These arguments can be static (determined by the source code) or they
can be dynamic (determined at runtime through logic, calculation, user
input or other data input sources such as database fields). All
static references can be resolved in a standalone preprocessor (such as
P2J has). However, any dynamic arguments cannot be known in
advance and thus there can be preprocessor constructions or directives
which cannot be resolved by a standalone preprocessor. Such
external procedures cannot be precompiled in the Progress 4GL
environment. However at runtime, when all such information is
available for the interpretation of the RUN statement, the external
procedure is preprocessed just before it is converted into
R-Code. The interpretive model enables a "late" preprocessor
which is
what allows this feature to be provided.
- Note that such problems do not affect include files as all such
references are statically available in this case. Everything for
an include file is processed as a string where expressions such as
value() can be used in RUN statements.
- A valid use for this feature is to make template code for
something that can otherwise only be referenced statically. An
example is a database name (e.g. customer.address). These
references are statically compiled and are not processed at
runtime. By using an expression to generate the name
customer.address and passing it as an argument to an external
procedure, you don't have to hard code the static customer.address
anywhere. Instead you can derive, calculate or query this value
at runtime. The key point here is you can't pass such a static
value as a parameter because parameters are processed on the stack at
runtime and database names are processed at compile time statically.
- Include files do not solve this particular problem because for
each database name you still would need to have a separate external
procedure in which the specific database name is statically
defined. Thus while some of the logic could be made into a
"template" and centrally maintained, there are parts that could not be
handled dynamically at runtime.
- The feature can be abused and is considered an abuse in cases
where a parameter would have worked (in other words, where the
resulting reference could have been handled dynamically at
runtime). We may well find some such situations.
Some of these cases will be converted in a programmatic manner.
For example, xt/pushhot.p is a very simple external procedure
that uses positional arguments. In this case, the problem can be
resolved by converting the RUN statement into an inline preprocessor
directive (see xt/x-login.p for an example of one such caller in which
this can be done). Other cases may take more complex conversion
where the Java result is written to be equivalent in function while the
logic is very different. Yet other cases may remain that will
require manual intervention.
The preprocessor itself is designed to handle a single external
procedure (and an arbitrary number of included files). To
feasibly preprocess the thousands of 4GL source files in a given
project, a Preprocessor Harness will be created. This harness is
responsible for properly managing the input to the Preprocessor and for
driving the preprocessing of an entire source tree in a batch
mode. To do this the harness must have access to the original or
"raw" 4GL source code. In addition, in order to provide the
proper arguments and options to the Preprocessor, the harness has a
Preprocessor Hints file. This file is manually created and it
stores such input. The Preprocessor Hints file may also
include lists of source files which do not require processing.
These "ignores" may be listed due to manual conversion or they may be
manually inserted due to the code being found to be dead in the Stage 2 analysis.
The Preprocessor is created based on a grammar which is used by ANTLR
to create a set of lexers and a parser. This grammar is being
implemented with all of the preprocessor logic embedded in Java based
actions attached to specific rules as necessary. This usage of
ANTLR is considered proper for the implementation of a filter. A
preprocessor is simple enough to be implemented as a filter and such an
implementation avoids the complexity of an undefined number of
iterations walking through the entire syntax tree. Instead the
expansion of each preprocessor feature is done by the lexer as it
tokenizes. Then this result is just written to output and the
next token is processed. When the lexer is done, the stream has
been processed and all preprocessor expansion is complete.
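In terms of driving the filter, the skeleton below shows the concept of one
input file in and one expanded output file out. The expandLine() placeholder
stands in for the real expansion, which is performed inside the
ANTLR-generated lexer/parser actions rather than line by line.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;

    /** Skeleton of the filter concept: one input file in, one expanded
     *  output file out. */
    public class PreprocessorFilter
    {
       public static void filter(String inFile, String outFile) throws IOException
       {
          BufferedReader in  = new BufferedReader(new FileReader(inFile));
          PrintWriter    out = new PrintWriter(new FileWriter(outFile));
          try
          {
             String line;
             while ((line = in.readLine()) != null)
             {
                out.println(expandLine(line));
             }
          }
          finally
          {
             in.close();
             out.close();
          }
       }

       private static String expandLine(String line)
       {
          // placeholder: real expansion is performed by grammar actions
          return line;
       }
    }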
The Preprocessor conceptually does the following:
- Uses an ANTLR generated set of 3 Lexers (to tokenize) and 3
Parsers (to
syntactically check and structure the input).
- Implements a filter approach where the input stream (a text file)
is preprocessed and written to an output stream (stdout or an output
file).
- At a high level, it is byte-for-byte compatible with the
following Progress Preprocessor features:
- special character combinations are expanded wherever they
appear in the source files
- each one converts into a replacement symbol
- these are also referred to as alternative codings
- see this table for a list of
the supported chars
- comments
- nested comments are supported
- comments are passed through to the output unchanged
- comments can be placed inside preprocessor directives without
causing a failure
- string literals
- preprocessor references (using curly braces) are expanded
- tilde prefixed escape sequences are left behind
- the resulting string is passed through to the output
- preprocessor variables are created and destroyed using:
- &GLOBAL-DEFINE name
replacement_string
- can be abbreviated down to &GLOB
- handles line continuations using tildes
- creates a variable in the global preprocessor namespace
- &SCOPED-DEFINE name
replacement_string
- can be abbreviated down
to &SCOP
- handles line continuations using tildes
- creates a variable in the scoped preprocessor namespace
- &UNDEFINE name
- include file or run statement arguments are added to the
namespace
- a scoped symbol dictionary is maintained with these names
- argument references are expanded
- special preprocessor symbols are replaced
- these are essentially like global preprocessor variables that
always exist
- the following are supported:
- BATCH-MODE
- FILE-NAME
- LINE-NUMBER
- OPSYS
- SEQUENCE
- WINDOW-SYSTEM
- referenced using {&name}
- include file references
- are expanded into the contents of the include file (if it
exists)
- the contents are preprocessed recursively, including any
includes or expansions it may have and so on
- the following forms are recognized:
- {filename}
- {filename arg1 arg2}
- {filename &arg1=val1 &arg2=val2}
- see p. 12
- &MESSAGE and other preprocessor directives are processed
(including abbreviation support)
- tabs are expanded into spaces on output
- all name/variable/argument references are implemented using the
dictionary, while
honoring the global and non-global scope of each reference.
- conditional inclusion directives are supported
- &IF expression
&THEN..block..[&ELSEIF expression
&THEN..block..&ELSE..block..]&ENDIF
- conditionals can be nested and can cross newline boundaries
- expressions using the listed operators
and functions are fully supported
- the only known feature of
the preprocessor that is not implemented is this list of
unsupported functions
- all valid preprocessor expression types are supported
- numeric (integer or decimal) which is always cast to an
integer on return and compared to 0 (0 is false, non-zero is true)
- string which is not empty (true) or empty (false)
- boolean which is true or false
- based on the result of the expression evaluation, the blocks
of code are either passed through to the output or not
- Everything else is unrecognized and is assumed to be the
Progress 4GL language.
- This unrecognized input is left unchanged and is copied
directly to the output.
- Producing statistics and reports.
Special Character support:
Input     Output
;&        @
;<        [
;>        ]
;*        ^
;'        '
;(        {
;%        |
;)        }
;?        ~
Operator and Function support:
Operator/Function   Type
+                   operator
-                   operator
*                   operator
/                   operator
= or EQ             operator
<> or NE            operator
< or LT             operator
> or GT             operator
<= or LE            operator
>= or GE            operator
AND                 operator
OR                  operator
NOT                 operator
BEGINS              operator
MATCHES             operator
MODULO              operator
DEFINED()           preprocessor symbol dictionary function
The
following functions are NOT IMPLEMENTED since they are not necessary to
preprocess the application. These are the only known
features of the Preprocessor that are not supported. All other
features are 100% implemented.
- ABSOLUTE
- ASC
- DATE
- DAY
- DECIMAL
- ENCODE
- ENTRY
- ETIME
- EXP
- FILL
- INDEX
- INTEGER
- KEYWORD
- KEYWORD-ALL
- LEFT-TRIM
- LC
- LENGTH
- LIBRARY
- LOG
- LOOKUP
- MAXIMUM
- MEMBER
- MINIMUM
- MONTH
- NUM-ENTRIES
- OPSYS
- PROPATH
- PROVERSION
- R-INDEX
- RIGHT-TRIM
- RANDOM
- REPLACE
- ROUND
- SQRT
- STRING
- SUBSTITUTE
- SUBSTRING
- TRUNCATE
- WEEKDAY
- YEAR
Due to the problem with using named or
positional arguments in external
procedures, it is possible that some (hopefully small) percentage of
the source cannot be completely preprocessed without custom logic
(probably enabled through manual analysis and the Preprocessor Hints
file). So on output we will have a processed 4GL source file,
though there may be cases in which this processing is partial. In
all these cases, a report will be generated that documents such
problems. Other report output will also be available to allow
easy analysis of any arbitrary source file for preprocessor content.
A good question the reader may be asking is: why not just use the
Progress environment's ability to save a preprocessor output file (by
using an option to the COMPILE language statement) instead of spending
the time to make a clean room replacement for the Progress
preprocessor? There are 4 reasons:
- Having a pure Java, clean room preprocessor eliminates all
dependencies upon having a working Progress runtime environment in
order to make a conversion.
- The code conversion process needs to know exactly which include
files are used and exactly where they are included in order to make
proper decisions about which code can be made into a common class and
used from multiple locations. Only the preprocessor has this
knowledge. Without it the code conversion would have to run
arbitrary comparisons to find all the locations of larger code patterns
that match across the entire project. This is a task that is not
feasible.
- Preprocessor conditionals (&IF directives) cause each file to
potentially have multiple output variants. All of these locations
would have to be known and each variant would have to be manually
generated (using the COMPILE statement) in order to ensure that all
possible code paths are visible to the P2J environment. Instead,
the P2J preprocessor will have a "generate all variants" mode that allows
this to occur automatically (since there are only a finite number
of possible permutations).
- RUN statement arguments cause a
runtime form of multiple variants. The big problem is that these
can be theoretically infinite in number. Thus having the
preprocessor available at all times makes this problem more easily
resolved.
Detailed Design/Implementation Process:
- Design the format for the Preprocessor Hints file.
- Design and implement the set of statistics to be captured.
- Design the format and content of the output report. Based
on options, this must provide a detailed listing of the expansions and
modifications that occurred during this preprocessor run.
Implement this report output.
- Design the format for the Code Conversion Hints file. Note
that from a preprocessor perspective, the detailed listing of the
expansions must be written into the Code Conversion Hints format.
It is most important that the information regarding include file
processing is complete. The later stages of code conversion would
have no knowledge of where includes are used without this output.
The objective is to maximize the ability of the later conversion stages
to write a common class (and then call into that code multiple times)
for as much included code as possible.
- Implement helper class(es) to process the hints files on input
and output.
- Implement the file tree processing harness which takes its input
from the hints file and calls the preprocessor to process the entire
application source tree.
Dead Code Analysis (Stage
2)
The objectives of this stage are twofold:
- To make a complete list of all possible "root" procedure entry
points into the 4GL application. This is a simple list of all
possible external procedures (files) that can ever be run as a starting
point for an end user.
- To make a list of all 4GL application files that cannot ever be
executed due to a lack of a call path that allows them to be
reached. This is the dead code list. These files are not to
be converted.
The effort starts with a manual analysis of the possible methods of
launching Progress (in particular the name referenced in the -p option
on the command line). A list is made of all external procedures
which can be launched into at startup. Then the call tree is
manually analyzed to trace the top level flow. This step is only
necessary if there is a layer of abstraction between the top level
entry points and the real application entry points. The information is
encoded in a form
that can be easily read, parsed and used by the dead code analyzer.
The dead code analyzer is a module that uses the list of root entry
points as a starting point. It reads each root entry point and
generates the resulting UAST. The call tree nodes in this UAST
comprise a list of all possible external procedures accessible from
this root entry point. This process is repeated for all root
entry points and the union of all possible accessible external
procedures is recorded. All other files that are not part of this
union are considered dead code since they can never be executed.
The dead code list must be created with awareness of which include
files are actually in use. Only those include files that can
never be reached should appear in the dead code list. This input
is a combination of the knowledge output by the preprocessor into the
Code Conversion Hints and the first pass at the Dead Code List where
all the dead external procedures are known. This second pass can
complete the list using these first two inputs (any include file only
accessible from a dead external procedure is also considered dead).
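At its core this calculation is a set union followed by a set difference. A
simplified sketch, assuming a hypothetical helper that returns the files
reachable from one root entry point's UAST, follows:

    import java.util.HashSet;
    import java.util.Iterator;
    import java.util.List;
    import java.util.Set;

    /** Simplified core of the dead code analyzer: union the files reachable
     *  from every root entry point; everything else is dead. */
    public class DeadCodeAnalyzer
    {
       public static Set findDeadFiles(List rootEntryPoints, Set allSourceFiles)
       {
          Set reachable = new HashSet();
          for (Iterator it = rootEntryPoints.iterator(); it.hasNext(); )
          {
             String root = (String) it.next();
             reachable.addAll(reachableFrom(root));
          }

          Set dead = new HashSet(allSourceFiles);
          dead.removeAll(reachable);          // never reachable --> dead code
          return dead;
       }

       /** Stand-in for UAST generation plus a walk of the resulting call
        *  tree nodes to collect the source files they name. */
       private static Set reachableFrom(String rootEntryPoint)
       {
          return new HashSet();
       }
    }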
Detailed Design/Implementation Process (starting this is dependent upon
the completion of development of the Progress 4GL
lexer/parser and UAST):
- Design the root entry point list format.
- Design the dead code list format.
- Develop the dead code analyzer.
- Run this against the complete application source tree.
- Review the results with the client and obtain confirmation that the
dead code list is accurate.
Annotation
(Stage 3)
Once the Source UAST is available, the job of conversion cannot be
started without additional analysis. The majority of this
analysis is a matter of pattern recognition. As certain patterns
are found in the Source UAST, it is useful to remember that these
patterns exist, where they exist and other related or derived
information about these patterns. This information will be stored
in the Source UAST itself in the form of "annotations" to the
tree. So the primary input to Stage 3 is the Source UAST and an
Annotated Source UAST is the primary output.
This stage leverages a generic framework and infrastructure for
implementing a data-driven pattern recognition and annotation
process. The infrastructure will be built once and each situation
that calls for an annotation will be described as a rule. This
rule is made up of a simple or complex pattern to be matched and one or
more annotations that should be made in the case of a match. The
patterns themselves are written as expressions using a modified form of
the Java Trace Analyzer (JTA) expression language.
Other options for handling this include:
- Hard coding the pattern recognition and annotation logic into a
large amount of custom Java source code that runs against the UAST.
- Hard coding complex and varied logic into the lexer and parser to
try to annotate as each AST is built.
The second option is not technically viable. One
will need to annotate based on analysis of nodes that exist across
multiple ASTs or between multiple levels of the UAST. Since the
lexer and parser only ever operate on a single Progress 4GL AST, this
would be impossible. In addition, it is troublesome and limiting
to only be able to annotate while lexing or parsing. In other
words, one cannot look ahead enough to make a wide enough range of
pattern recognition possible. For these reasons, there is a
strong separation between the Progress 4GL lexing/parsing and the
subsequent annotation process.
While the first option is very possible, taking the generic, data-driven
approach has the advantage of allowing the same well-debugged,
working Java code to be used thousands of times by just specifying
rules rather than hand coding Java source for each one. This will
save a great deal of coding, debugging and testing time.
Most importantly, the same core
technology used here in Stage 3 will also be used in Stages 4/5 to
create the target tree. It is expected that 80% of the conversion
can be done using this generic set of pattern recognition technologies
and the subsequent rule based actions. Knowing this, the obvious
question is: why split the conversion up into 3 phases when the same
technology is used for all 3? The reason is simple: by using
multiple, sequential passes (a pipelined approach), certain problems
can be solved in one pass, which means that subsequent passes can
ignore these issues. For example, the lexer deals with the
problems of converting characters to tokens and the parser only deals
with tokens and relationships between tokens. This simplifies the
implementation of both the lexer and the parser and allows each to do
one thing well. For the same reason, there are things that must
be done to the entire tree before other processes can easily be
started. These tasks have been split up (loosely) into the stage
3, 4 and 5 processing.
How the technology works:
A list of rules encodes the core data necessary for the annotation
process. This list of rules is called a ruleset. There can
be multiple rulesets, each used for specific purposes.
Each rule includes an expression (either
simple or compound) which matches some subset of all UAST nodes.
If the expression only refers to the contents of a single node
(Progress 4GL language statement) and its immediate children then it is
considered a simple
expression. If the expression refers to data from multiple nodes
(multiple language statements) in order to determine a match, this is a
compound expression. In other words, a compound rule is one that
references stateful information stored regarding previous nodes
(parents and/or siblings and their subtrees) in the tree. This
state information may be stored in a scoped dictionary and can be
referenced to match context sensitive patterns.
A rule also includes an action that should
be executed on any match of the associated expression. The list
of valid actions is the list of annotations that are possible.
Many annotations will edit the UAST itself, while other annotations may
maintain stateful information for other rules or provide
dictionaries for conversion processing. An interface will be
defined to allow annotation actions to be "plugged in" without having
to modify core processing. Another action example is creating a
custom logfile or statistics (e.g. a scanning report).
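The shape of such a rule and its pluggable action might be sketched as
follows (all names are invented; the real interface is part of the detailed
design):

    /** Illustrative contract for a pluggable annotation action. */
    interface AnnotationAction
    {
       /** Applies the annotation (a tree edit, a dictionary update, a log
        *  entry...) to the matched node. */
       void apply(Object node);
    }

    /** One rule: an expression that selects nodes plus the action to run
     *  on every match. */
    public class Rule
    {
       private final String           expression;   // simple or compound
       private final AnnotationAction action;

       public Rule(String expression, AnnotationAction action)
       {
          this.expression = expression;
          this.action     = action;
       }

       public String           getExpression() { return expression; }
       public AnnotationAction getAction()     { return action; }
    }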
The standard Tree Processing Harness
will be used to iteratively process the Source UAST. One pass
will be made for each rule to be processed. Thus each rule will
be tested against every node in the tree.
The Pattern Recognition Engine is the primary component of this
stage. It reads the list of rules and for each rule it is
responsible for utilizing the tree processing harness to walk the
tree. The pattern recognition engine is responsible for providing
user functions and variable resolution to the expression engine. The first time
an expression is used, the generic expression engine parses and
compiles the expression into a Java class which is stored in the
expression cache. Every subsequent time that expression is used,
it is directly run from the cache. Variable values and the
results of user functions are derived at runtime. These callbacks
are made by the compiled Java class, and the pattern recognition engine
services these needs.
As the harness walks the tree, at each node the entire list of rules
will be processed (as opposed to walking the tree once for each
filter). This is an important enabler of compound rules because
this allows a real-time scoping methodology to be used. As nodes
in the tree are encountered that add or remove scopes to a data
structure or dictionary, annotation actions will always be able to
simply trigger their actions on the correct scope (usually the current
scope, but sometimes a global scope). Then subsequent rules will
be able to directly lookup or reference such data that is naturally
scoped to their current node. The alternative would be to make
each added record have a scoped definition and to make the lookup
itself be aware of only returning the value that matches the correct
scope. This is possible but is much more work.
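A scoped dictionary of the kind described here is essentially a stack of
symbol tables. A minimal sketch follows:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    /** Minimal scoped dictionary: lookups search from the innermost scope
     *  outward, so rules always see the value bound in the nearest
     *  enclosing scope (or the global scope at the bottom). */
    public class ScopedDictionary
    {
       private final List scopes = new ArrayList();   // list of Maps

       public ScopedDictionary()
       {
          pushScope();                                 // the global scope
       }

       public void pushScope()    { scopes.add(new HashMap()); }
       public void popScope()     { scopes.remove(scopes.size() - 1); }

       public void define(String name, Object value)
       {
          ((Map) scopes.get(scopes.size() - 1)).put(name, value);
       }

       public Object lookup(String name)
       {
          for (int i = scopes.size() - 1; i >= 0; i--)
          {
             Map scope = (Map) scopes.get(i);
             if (scope.containsKey(name))
             {
                return scope.get(name);
             }
          }
          return null;
       }
    }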
When a match is found, the associated action is triggered which makes
an annotation to the tree or to other related data structures,
dictionaries, trace or log files. Once all rules have been
processed against the current node, the current node is changed to the
next node and the rule list is evaluated again. This process
continues until all nodes have been processed. At that point, the
Source UAST is fully annotated and Stage 3 is complete. The
Source UAST will most likely be stored persistently in the filesystem
to ensure that the results are saved for future stages.
A powerful feature of actions is their ability to calculate or
derive the data which they will insert as an annotation ("smart
actions"). For example, one would imagine that encoding the
conversion category or conversion level would simply set a known
attribute of a node to a constant value. However, one could also
envision translating a Progress name into a Java name and saving this
as an annotation. This latter example is a condition where the
annotation applied by the same rule is different for each node because
it is dynamically generated.
Detailed Design/Implementation Process (starting this is dependent upon
the completion of development of the Progress
4GL lexer/parser and UAST):
- Make a list of the known types of annotations that need to be
made. For each type, design a method to store the annotation in
the Source UAST, Target UAST or in a target data structure if the
annotation is not a tree edit. At a minimum, the following must
be provided for:
- node attribute edits
- tree edits
- insert node
- move node
- delete node
- scope aware resource directory - constants, strings, shared
variables
- scope aware lookup lists to store stateful information for
compound rules
- statistics and logging output (to create reports describing the
source tree)
- Make a list of the user functions (usable inside the expressions)
needed to process the Source
UAST. For each one, detail the signature and function to be
provided.
- Design the approach for variable resolution. This must
include support for both simple (referencing data within the current
node of the tree) and compound (referencing data across multiple nodes
of the tree) rules. Some ideas about handling compound rules:
- Allow actions that store arbitrary data into lookup tables that
can be referenced via user functions or variables. This would
allow some rules to be written to populate the lookup tables and then
subsequent rules could reference the tables and make decisions based on
the contents. This effectively allows a multi-statement analysis
without directly referencing nodes elsewhere in the tree.
- Allow positional or structural references to other nodes in the
tree. For example, one might want to reference data in the root
node of the containing block. This might allow one to reference
data in a loop construct. Likewise one might need to access child
nodes for specific data.
- Allow named and typed references to other nodes in the
tree. For example, one might want to reference data in the
nearest scoped node of type "frame" that has a name of
"warning_message".
- Design the Interface to plug in actions.
- Design the format for encoding each rule (the simple or compound
expression and the associated action); a sketch of one possible shape
appears after this list.
- Implement the Pattern Recognition Engine, variable resolution and
user-defined functions.
- Implement the known list of actions.
- Re-implement (in Java) the REXX based scanning and analysis code
using rulesets and the Pattern Recognition Engine.
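To make the rule/action pairing concrete, the following sketch shows one possible
shape for a rule and the per-node evaluation loop. Every name here (ExpressionEngine,
Rule, RuleSetRunner) is hypothetical; Uast and Action are the invented interfaces
from the earlier "smart action" sketch:

    import java.util.Iterator;
    import java.util.List;

    // Hypothetical expression evaluator; variables in the expression resolve
    // against attributes of the current node (and against scoped dictionaries
    // for compound rules).
    interface ExpressionEngine
    {
       boolean evaluate(String expression, Uast node);
    }

    // A rule pairs a boolean expression with the action it triggers.
    class Rule
    {
       private final String expression;   // e.g. "type == \"FRAME\" && name == \"warning_message\""
       private final Action action;

       Rule(String expression, Action action)
       {
          this.expression = expression;
          this.action = action;
       }

       void apply(Uast node, ExpressionEngine engine)
       {
          if (engine.evaluate(expression, node))
          {
             action.run(node);       // make the annotation
          }
       }
    }

    // The tree processing harness would evaluate the entire rule list at each
    // node before moving on, rather than walking the tree once per rule.
    class RuleSetRunner
    {
       static void processNode(Uast node, List rules, ExpressionEngine engine)
       {
          for (Iterator it = rules.iterator(); it.hasNext(); )
          {
             ((Rule) it.next()).apply(node, engine);
          }
       }
    }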
Structural Analysis
(Stage 4)
In this stage the 4GL application source will be analyzed and a
structural design of the target Java application will be
constructed. The output of this stage can be considered the
"skeleton" of the Target
UAST which will be completed by the Stage 5 Code Conversion.
The structure and interfaces of the Target UAST must be generated
completely in Stage 4, before any code is actually converted.
This is necessary because of the refactoring that is occurring between
the source and target application. Since the code's natural
structure is not staying intact, it is necessary to generate the
complete target structure so that when code is converted, the code
locations and the interfaces (used as linkage between related objects)
are known. Otherwise, there would be a problem in generating any
code that might have to link to code that is not yet generated.
By precalculating the interfaces and structure, each piece of code can
be converted separately in Stage 5.
The tree processing harness module and the Pattern Recognition Engine from Stage 3
will simplify the processing of each
(Annotated) Source UAST as a related unit. Each such tree can be
considered a
standalone application since each tree encompasses all source code
that can be possibly reached from a specific root entry point.
The full set of these trees can have overlapping contents to a greater
or lesser degree, as many files (and thus the resulting subtrees rooted
at those locations) will be reachable from multiple trees. To the
extent that these overlapping branches are already processed, the tree
processing harness allows bypassing reprocessing such branches.
Refactoring is done by defining a set of conversion
categories that in total,
represent the complete set of function of the target application.
Each category is only responsible for a specific subset of the
application function. In Stage 4, the function that is
category-specific is called the "Category Strategy Choice(s)".
Each Category Strategy Choice module will use the Stage 3 Pattern
Recognition Engine and a category specific ruleset to walk the tree
looking for
structural patterns that correspond with a resulting target
structure. Much (if not all) of the recognition logic needed to
categorize the
Source UAST is handled by Stage 3 (Annotation). By starting with
the
annotated UAST, the job of analyzing the category-specific nodes of the
tree is significantly easier. The Stage 4 ruleset is designed to
analyze the category-specific structure and to use actions to generate
the structural nodes or skeleton of the Target UAST. It is
expected that 80% or more of the conversion cases can be handled in
this manner.
Once the pattern recognition engine has run for a specific category,
the Category Strategy Choice module must process all remaining
unhandled nodes of that category in the Source UAST. This is
where category specific custom coding may occur to handle cases that
cannot be properly (or easily) represented using the Pattern
Recognition Engine. At a minimum, it is important that the
conversion ruleset will at least flag those areas that require more
attention.
In cases where there is more than one strategic choice for how to
structure the target, the Category Strategy Choice module must decide
on the correct structure. It is possible that multiple strategies
are chosen for different parts of the source application; however, the
strategic choice can be as simple as makes sense for a given
situation. Note that the Code Conversion Hints file is an input
to this process. This file can store input from prior stages as
well as manually created input. In either case, this input may
have the effect of overriding a default structural choice in favor of
another.
The Category Strategy Choice must generate nodes in the Target UAST for
the target Classes, Interfaces and the corresponding inheritance
model. In addition, the Source UAST will be cross-referenced to
the Target UAST locations (to allow Stage 5 to know exactly where to
place its results) and vice versa. For example, a particular user
interface "Frame" may be analyzed by the UI Category Strategy Choice
module and a class may be created in the Target UAST. Then both
the source and target will be annotated with a reference to the
associated node(s) in the other UAST.
The Code Conversion Hints that are generated by the Preprocessor will
be used to know where inlining has occurred
(and which source files the inlining occurred from). This will
allow the code conversion to generate Java classes once that are reused
in many places where inlining used to occur. We will need to
programmatically determine when it is appropriate to create a class
versus allow the inlining to occur:
- If there are no { } substitutions and the code doesn't directly
reference specific local variable names, the odds increase that a
separate class is possible. Even in these cases, the code may be
rewritten to handle these issues.
- If the inline file contains function/procedure definitions as the
primary or only code, then it is likely these can be moved into a
central class.
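A minimal sketch of such a heuristic check follows. The InlineFile descriptor and
its fields are assumptions invented for this sketch (standing in for whatever the
preprocessor records in the Code Conversion Hints); the intent is only to show the
shape of the decision:

    // Hypothetical descriptor of one inlined (included) Progress source file.
    class InlineFile
    {
       boolean hasCurlyBraceSubstitutions;   // any { } argument substitutions present
       boolean referencesLocalVariables;     // directly names variables of the including file
       boolean definitionsOnly;              // contains only function/procedure definitions
    }

    class InlineStrategy
    {
       // Returns true when it is probably safe and worthwhile to convert the
       // inlined file into a separate, reusable Java class rather than expanding
       // its converted code at every point of inclusion.
       static boolean shouldCreateClass(InlineFile f)
       {
          if (f.definitionsOnly)
          {
             return true;    // likely can be moved into a central class
          }
          // no substitutions and no local variable references raise the odds
          return !f.hasCurlyBraceSubstitutions && !f.referencesLocalVariables;
       }
    }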
The next step in Stage 4 is to perform a Linkage Analysis. This
is
the identification of each place where two pieces of code
interface.
For each of these locations, a decision must be made about the linkage
strategy used in the target. As a result, each Class and
Interface in
the Target UAST must have its methods (including signatures)
generated. This also includes defining accessor methods (getters
and
setters) for data members. This process may be overridden or
modified
by code conversion hints. Both the Source UAST and Target UAST
will be
annotated with a cross-reference to the associated node(s) in the other
UAST.
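As an illustration of the kind of linkage output being described (class, member and
method names below are invented, not produced by any existing tool), a Progress
shared variable referenced from several procedures might become a data member with
generated accessors on a target class:

    // Hypothetical Stage 4 output for a former Progress shared variable, say
    // "SHARED VARIABLE cust-credit-limit AS DECIMAL": the structural analysis
    // defines the owning class, the data member and the accessor signatures;
    // Stage 5 later fills in any converted logic that uses them.
    public class CustomerContext
    {
       private java.math.BigDecimal custCreditLimit;   // was shared variable cust-credit-limit

       // generated accessor methods form the linkage used by other converted code
       public java.math.BigDecimal getCustCreditLimit()
       {
          return custCreditLimit;
       }

       public void setCustCreditLimit(java.math.BigDecimal value)
       {
          custCreditLimit = value;
       }
    }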
Once the object or class hierarchy is generated and all linkage points
have been resolved, the Naming Generation/Conversion must run to convert
or translate Progress 4GL names into valid Java names. Certain
changes will be required since some Progress 4GL symbol characters are
not valid Java symbol characters. For example, all hyphens must
be converted to another character (the
hyphens cannot necessarily be removed because they may then generate a
naming conflict). In particular, it is a challenge to generate
"good" Java names based on the current
(good or bad) Progress 4GL names. Heuristics are written for
doing the best job possible and for identifying those
situations in which the result is just unacceptable (flagging these for
human review). Any human generated overrides or modifications
would be stored in
the hints
input to the code
conversion processing.
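A hedged sketch of one such heuristic follows (NameGenerator is an invented name and
the collision handling shown is only one possible policy): hyphens are converted to
camel case and collisions are detected, since blindly dropping hyphens could map two
distinct Progress names (e.g. "cust-name" and "custname") onto the same Java name:

    import java.util.HashSet;
    import java.util.Set;

    // Hypothetical naming heuristic with conflict detection.
    public class NameGenerator
    {
       private final Set assigned = new HashSet();   // Java names already handed out

       public String toJavaName(String progressName)
       {
          StringBuffer out = new StringBuffer();
          boolean upper = false;
          for (int i = 0; i < progressName.length(); i++)
          {
             char c = progressName.charAt(i);
             if (c == '-')
             {
                upper = true;                        // hyphen is not a legal Java symbol character
             }
             else
             {
                out.append(upper ? Character.toUpperCase(c) : Character.toLowerCase(c));
                upper = false;
             }
          }

          String name = out.toString();
          if (!assigned.add(name))
          {
             // collision: disambiguate and flag the result for human review
             name = name + "_" + assigned.size();
             assigned.add(name);
          }
          return name;
       }
    }

Results that the heuristics cannot make acceptable would be flagged, and the human
supplied overrides would flow back through the hints input as described above.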
When complete, Stage 4 leaves behind the complete structure of the
target application. Please note that due to refactoring, the
Target UAST will not have a single root, but instead it will have many
smaller trees that are each likely to be category specific.
Outputs include:
- Interactive hints mode where the conversion process halts at
ambiguous decisions or decisions that need human review. At these
points, the operator would be prompted with the choices and any
non-default choices would be stored in the Code Conversion Hints file.
- Cross-reference annotations written back into the Source UAST.
- Creation of a skeleton structure in the Target UAST.
Detailed Design/Implementation Process (start is dependent upon
completion of Stage
3 implementation AND upon the Java
UAST implementation):
- Design and implement a mechanism to store cross-references in
both Source and
Target UASTs.
- For each form of possible programmatic linkage (this includes all
call
or program invocation mechanisms as well as direct variable references
using shared variables), define the rules by which the Java equivalent
is made. Document any ambiguous cases where manual review is
required.
- For each Java language construct that must be represented
(structural, naming, linkage), design and implement a Target UAST node
to represent
this construct in the tree.
- Design and implement a mechanism to track the conversion status
of each node in the Source and Target UASTs. This will allow a
standard tree walk to be implemented that only walks those nodes that
still must be processed. This is a concept of saving state to
allow a form of incremental or multi-pass tree processing.
- Strategy Definition
- For each conversion category, define the list of structural
choices ("strategies") in the target application. This can be
considered as a task of 2 levels of complexity:
- analysis must occur based on the structure of the source code
and the knowledge of the structural options which exist in the target
- some structural choices may be driven based on repetition
rather than structure (for example, in some cases, the same sequence of
statements found in many places can be written as a common class(es)
and/or set of methods that can be used in multiple "client" locations)
- For each possible structural choice, define the conditions in
which it is the correct choice.
- Confirm that all possible source conditions are handled.
If this is not the case, a manual conversion is required or additional
strategies are needed.
- Document any ambiguous cases, where manual review will be
required.
- Add rules to Stage 3 to identify and
enhance the conversion processing. In some cases, an alternate
ruleset
will be used to generate category specific reports. A good
example is
identifying all code that needs manual review. The Stage 3
changes are primarily about modifying the Source UAST.
- Create a ruleset for the Stage 4 category specific structure
creation in the Target UAST. This is the core of the conversion
processing. If necessary, multiple sequentially processed
rulesets can be implemented. Remember that each ruleset is
processed fully on a single node before the tree walk continues.
Any processing that cannot be done in one pass, can be handled by
adding additional passes using other rulesets.
- Develop the Category Strategy Choice module that handles all
conditions that cannot be properly or easily defined using the Pattern
Recognition Engine. These are the special cases (the 20% that is
left over).
- Define and implement the Pattern Recognition Engine ruleset for
name generation/conversion. These are
by nature category specific as Java has well defined rules for naming
that differ by type of symbol (e.g. method name versus variable
name). Document any ambiguous cases where manual review is
required.
Code Conversion (Stage 5)
The goal of Code Conversion is to finish the Target UAST skeleton
created in Stage 4. Once Stage 5 is complete, the Target UAST is
done and ready to be output to Java source files.
All nodes in the Source UAST can be classified into one of five possible
levels:
Conversion Levels
- L0 - Elimination
- L1 - Direct Mapping (to Java language, J2SE and/or GCD runtime)
- L2 - Automated Rewriting
- L3 - Manual Rewriting
- L4 - Unknown (review required)
L0 code does not have to be converted because it serves no functional
purpose in the target application. Examples of reasons for L0
code include Progress-specific batch processing and code that is not
reachable because it is platform specific (for a platform that is not
run by the client).
L1 code has a direct mapping from Progress 4GL into Java. This
means that not only is the function equivalent between the two
environments, the source and target logic is identical. An
example of L1 code is an arithmetic expression where the operators and
formatting may be different, but there is a 1 to 1 correspondence in
the logic between the two environments.
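For example (purely illustrative; variable names are invented and the actual mapping
rules are defined during the Code Conversion Plan), an L1 arithmetic assignment maps
one for one:

    class L1Example   // hypothetical wrapper, only to give the statement a home
    {
       double total;

       void compute(double price, double qty, double discount)
       {
          // Progress 4GL source (conversion level L1):  total = price * qty - discount.
          // The Java statement below is logically identical; only syntax differs.
          total = price * qty - discount;
       }
    }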
L2 code can be programmatically converted such that the result is
functionally equivalent, but the resulting Java is logically different
from the original source. This conversion is called a "rewrite"
of the code, and L2 code is rewritten automatically.
L3 code can be rewritten by hand to provide equivalent function, but
(at this time) it cannot be programmatically rewritten. This is
the same as L2 but with a manual conversion.
L4 code is unknown. This means that some manual review is
required.
Each node in the Source UAST has an assigned level. This
assignment occurs using the Stage 3 Annotation process. The Stage
5 processing only handles L1 or L2 code. All other code is
ignored.
As noted in prior stages, all code will be handled by separating the
application function into a finite number of categories. The
following is the current list (it may be incomplete or incorrect):
Conversion Categories
- comments
- constants
- string resources
- string literals
- messages
- help text
- flow of control
- expressions
- logical
- arithmetic
- there are no *documented*
bitwise operators
- assignments
- other base language statements
- functions
- methods/attributes
- variables including array processing
- types
- Progress 4GL data types will be represented by Java equivalents
that provide for features like UNDO (a sketch of this idea follows
this list)
- this must include the processing of the LIKE keyword to create a
variable of the same type as some other referenced variable
- database access
- This Conversion Category will write the Java client code to
instantiate, access and control the data model.
- As the business logic of the application is generated, it is
written with full knowledge of how the
Progress 4GL schema maps to the data model. This knowledge is
encoded in the P2O Mapping which was generated during Stage 3 of the
Data/Database Conversion.
- All explicit and implicit database references need to be
converted using the P2O mapping.
- I/O (files, pipes, devices and printers)
- user input processing (keys)
- accelerator keys (hotkeys)
- event processing (UI, database, procedure)
- screens and dialogs
- UI controls or widgets
- menus, navigation and scrolling
- security rules
- input validation rules (used in transactions, procedures)
- transaction interfaces (client code, service interface)
- operating system commands and external programs/scripts
- dynamic source file creation and execution (it is possible to
write Progress 4GL source code to a file and then execute that code at
runtime)
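A rough sketch of the "types" idea above follows. UndoableInteger is a hypothetical
class invented for illustration, not a defined GCD runtime API; it only shows how a
Java wrapper could carry both the value and the extra Progress semantics such as UNDO:

    // Hypothetical Java equivalent for a Progress 4GL variable that participates
    // in UNDO processing: the wrapper remembers the value that was current when
    // the enclosing undoable scope started and can restore it.
    public class UndoableInteger
    {
       private int value;
       private int savedValue;     // value captured at the start of the undoable scope

       public UndoableInteger(int initial)
       {
          value = initial;
          savedValue = initial;
       }

       public int  get()            { return value; }
       public void set(int v)       { value = v; }

       // called by the (hypothetical) runtime when an undoable scope is entered
       public void checkpoint()     { savedValue = value; }

       // called when the scope is undone; Progress UNDO semantics roll the
       // variable back to its value at the start of the scope
       public void undo()           { value = savedValue; }
    }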
Conversion Categories organize the activities in both Stage 4 and Stage
5. This is the essence of how refactoring is handled by the P2J
environment. Each category handles its portion of the conversion:
- Without mixing in code from other categories.
- In a manner that is optimal for how that category of function is
handled in Java.
The result is a set of well designed Classes, Interfaces, methods, resources and
other output that represents the minimum code necessary to implement the specific
category of code.
In Stage 5, there is a Code Conversion module for each category.
Each such Conversion module processes the Source UAST
using the Tree Processing Harness and the Pattern
Recognition Engine from Stage 3. For each node which is
marked
for
the category being converted and which has conversion level L1 or L2,
the Code Conversion module will run one or more rulesets through the
Pattern Recognition Engine. The ruleset actions will create/edit
the proper Target UAST nodes to ensure that the
resulting application has equivalent function. If necessary,
processing can be implemented as multiple, sequentially run rulesets
(each ruleset is processed across the entire tree before the next
ruleset runs). After the Pattern Recognition Engine is done
processing all category specific rulesets, the Conversion module will
process all category specific nodes for which conversion could not be
properly or easily specified in the rulesets. This is a kind of
post-processing that allows any custom conversion strategy to be
implemented. The idea is to handle 80% or more of the conversion
using the Pattern Recognition Engine and then to handle the rest with
custom code if needed.
Once all category specific nodes in the Source UAST are processed, that
category's Code Conversion is complete. Note that the Code Conversion Hints
provide a mechanism for feedback from other stages and from manual
review. These hints can modify or override the default conversion
behavior that would otherwise occur. In some cases, nodes in the
Source UAST may be ignored using hints. In other cases, a
manually created replacement for some specific nodes may be
specified. In still other cases, a non-default conversion choice
may be forced via hints. This is especially important in any
conversion situation in which the choice is ambiguous.
Note that all hints are customer application specific. These are
ways to
override the default behavior or processing. Some hints are
global (ignore the following list of language keywords: VMS, OS2...)
and some hints are file-specific (tied to a specific piece of Progress
4GL
source code).
An important point must be made regarding the lack of a language
neutral intermediate representation of the application. One can
consider that in between the Source and Target UAST, there could be an
artificial language neutral representation (an "Intermediate
UAST"). This would allow one to more easily abstract (and thus
substitute) front ends (4GLs) and back ends (target languages other
than Java) that provide support for different language
conversions. Instead, we have chosen a direct conversion between
Source and Target UASTs, which means that the P2J code conversion tools
are very specific to a Progress 4GL front end and a Java back-end.
The Source UAST is Progress 4GL specific and the Target UAST is
Java specific. No language neutral Intermediate UAST will be
used. This limits the tools (as implemented at this time) to
Progress to Java conversions. It also allows the process to be
simplified and allows specific conversion problems to be solved with an
optimal solution that is much easier to achieve than it would be with a
decoupled process.
Outputs include:
- Reports mode where nodes with ambiguous decisions or decisions that
need human review are flagged. A manual review would be done from this
report and any non-default choices would be stored in the Code Conversion
Hints file.
- A Target UAST that is ready for output into Java source code
and/or into a bidirectional cross-reference of the conversion.
Detailed Design/Implementation Process (start is dependent upon
completion of Stage 3 implementation
AND upon the Java UAST
implementation):
- Determine if categories are mutually exclusive or if they can be
overlapping. For example, there can be user interface related
options or clauses to variable declarations (e.g. format strings for
how a variable should be displayed). While the greater variable
declaration has nothing to do with the UI, some portion of it may be UI
related.
- For each Java language construct that must be represented, design
and implement
a Target UAST node to represent this construct in the tree.
- Code Conversion Plan
- For each conversion category, a list of all L1 and L2 source
language constructs must be made.
- For each language construct, one or more Java language target
constructs must be defined.
- Patterns must be defined to determine in which cases the
target constructs are chosen, in the case where multiple target
constructs exist for the same source construct.
- Document any ambiguous cases, where manual review will be
required.
- Add rules to Stage 3 to identify and
enhance the conversion processing. In some cases, an alternate
ruleset
will be used to generate category specific reports. A good
example is
identifying all code that needs manual review. The Stage 3
changes are primarily about modifying the Source UAST.
- Create a ruleset for the Stage 5 category specific
details/source code conversion in the Target UAST. This is the
core of the conversion
processing. If necessary, multiple sequentially processed
rulesets can
be implemented. Remember that each ruleset is processed fully on
a
single node before the tree walk continues. Any processing that
cannot
be done in one pass, can be handled by adding additional passes using
other rulesets.
- Develop the Category Conversion module that handles all
conditions that cannot be properly or easily defined using the Pattern
Recognition Engine. These are the special cases (the 20% that is
left
over).
Output Generation (Stage 6)
This stage is designed to generate output based on the Source and
Target UASTs. It does not alter either UAST in any way, nor does
it make any decisions. It simply uses the Tree Processing Harness
to walk the UASTs and for each node, it will generate output.
It is important to note that no conversion logic is embedded in the
output engines. The Java Language Formatting Engine (JLFE)
processes the Target UAST and for each node it generates the
syntactically correct Java source. If the node represents a Java
language "for" loop, the JLFE will output the text that includes the
main "for ( ; ; )" along with the "{" and "}". Between the braces it
would output any enclosed nodes in the tree. The JLFE only knows how to output
Java source for each Java source construct represented in the
tree. It does not make any decisions; it only outputs what it
finds. Using JLFE options, the formatting of the text output can
be controlled in certain ways. For example, the number of spaces
in every indent can be specified as an option though it defaults to 3
spaces.
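A minimal sketch of that behavior follows. The JavaNode and JavaEmitter interfaces
and the ForLoopEmitter class are invented names for illustration; the real JLFE
design is part of the Stage 6 detailed design below:

    import java.io.PrintWriter;
    import java.util.Iterator;

    // Hypothetical Target UAST node type (not a defined P2J class).
    interface JavaNode
    {
       String getAttribute(String name);   // e.g. "init", "test", "update"
       Iterator children();                // enclosed JavaNode instances
    }

    // Hypothetical emitter callback; the real JLFE would have one handler per
    // Java construct represented in the tree.
    interface JavaEmitter
    {
       void emit(JavaNode node, PrintWriter out, int depth);
    }

    // Sketch of the "for" loop handling only: it emits the construct's syntax
    // and delegates each enclosed node to the dispatcher; no conversion
    // decisions are made here.
    class ForLoopEmitter implements JavaEmitter
    {
       private final int indentSize;          // formatting option; the default above is 3
       private final JavaEmitter dispatcher;  // emits enclosed nodes of any type

       ForLoopEmitter(int indentSize, JavaEmitter dispatcher)
       {
          this.indentSize = indentSize;
          this.dispatcher = dispatcher;
       }

       public void emit(JavaNode node, PrintWriter out, int depth)
       {
          String pad = pad(depth);
          out.println(pad + "for (" + node.getAttribute("init") + "; "
                          + node.getAttribute("test") + "; "
                          + node.getAttribute("update") + ")");
          out.println(pad + "{");
          for (Iterator it = node.children(); it.hasNext(); )
          {
             dispatcher.emit((JavaNode) it.next(), out, depth + 1);
          }
          out.println(pad + "}");
       }

       private String pad(int depth)
       {
          StringBuffer sb = new StringBuffer();
          for (int i = 0; i < depth * indentSize; i++)
          {
             sb.append(' ');
          }
          return sb.toString();
       }
    }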
The Report Engine reads both the Source and Target UASTs and it
generates a bidirectional cross-reference of the Progress 4GL and Java
source code. All of the cross-reference information is already
stored in the UASTs so this engine only formats this representation
into HTML and writes this output into multiple files. This output
is used to verify the correctness of the conversion. More
importantly, it is a continuing resource for the application developers
that are to maintain the converted application.
Both engines are based on a core Template Processing Engine that takes
a UAST node and a corresponding template and emits some text as a
result. Once the base templating code is working, the development
of each node-specific output is much faster because the knowledge of
how to handle that node is encoded in a small piece of data that is
distilled down to the minimum necessary information. Rather than
spending time coding, debugging and testing custom Java code for every
node, one just needs to debug the template itself. This is a much
smaller effort than writing custom Java code for each node. The
templates provide the ability to handle literal text, substitution of
node attributes and the specification of how the standard formatting
rules apply to this template. Any such standard formatting rules
would be highly standardized across all output and might be modified by
customer specific formatting options. For example, some customers
may wish to insert a new line before opening any curly brace where
others prefer the open curly brace to be on the same line as the
language construct it is "connected" with. There is a syntax for
specifying these different parts of the template.
There may even be more than one template for each node, with one being
a default and the others being alternate formatting versions that are
used in certain circumstances. For example, there may be 2
different formats of "for" loop that are used based on whether or not
the total line length is smaller than some value (e.g. 78 chars).
The output from multiple nodes may go to the same file. It is
likely that some nodes may contribute output from multiple templates
and each template's output may be destined for different
files/streams. For example, a node representing a Java Class
would (usually) generate source code into a file of the same name as
the class. In addition, it may generate some content into an ANT
build.xml file to ensure that the generated build environment has the
proper knowledge of how to build the resulting class. There is a
mechanism for the template to be matched with the correct output stream.
Outputs:
- Syntactically correct Java source code and any resource bundles
or other files that may be needed. An ANT build.xml file is one
example. This represents the complete converted application
source code.
- A bidirectional, cross-reference of
all source and target constructs. This documentation provides an
HTML view of the source and target trees with both indices and
details. Some of the contents:
- Alphabetized symbol mapping that provides a cross-reference of
Progress 4GL variables to Java replacements (and vice versa).
- Source file/line cross-reference of Progress 4GL to Java (and
vice versa). This would be a set of hyperlinked source files in
HTML form.
- Categorization, statistics and aggregation report on both the
source and target UASTs. This will allow the conversion
tools to be
used to scope and plan a conversion (and find the features that may
need additional infrastructure in the runtime or in the conversion
tools). It also provides a convenient summary of the source and
target projects.
In the future, we want to automate the comparison of the source and
target trees for logical equivalence. This would be
possible with L0 and L1 code. This might be a precursor to Stage
6 or just run at the beginning of Stage 6. At this time, this is
out of scope.
Detailed Design/Implementation Process (start is dependent upon
completion of Stage
3 implementation AND upon the Java
UAST implementation):
- For each Target UAST node type, specify the syntactically correct
Java source code that should be generated and the placement of
substitutions based on attributes in the node. In the "for"
example above, the loop criteria being tested would be an expression
stored in the node. This would need to be emitted in the correct
part of the "for ( ; <criteria> ; )" text.
- Make a list of the standard source code formatting options and
the variations that can be specified.
- Design the HTML formats for the report output (see the bidirectional
cross-reference output described above).
- Design a template syntax that provides sufficient function to
handle both types of output.
- Develop the Template Processing Engine, the Report Engine and the
JLFE.
Other Conversion Tasks
Implementation Process (start is dependent upon completion of the
runtime Menuing
and Navigation implementation):
- Document application-specific requirements for the account and
menuing database.
- Identify the specific Progress account/menuing database tables
and fields
that need to be converted and any changes that need to be made to the
data.
- Develop the schema hints to specify how to redefine the P2R
Mapping (in Database
Schema Conversion) and to subsequently capture this data in Data Conversion. It is
possible that additional types of conversion mappings will be necessary
to implement this function. If so, add these new mappings.
Miscellaneous
Issues
A modest up-front effort to design for a wide range of implementation
scenarios is a significant value when compared to the alternative of
deferring that effort. Adding the ability to control or modify
processing later typically requires the modification or replacement of
working components in the system. This is time consuming, error
prone
and can generally be made unnecessary with the proper design at the
beginning. Stated another way, the least costly way (calculated as total
cost of ownership over the lifetime of a system) to solve problems is to
design the system with those problems in mind. Adding solutions (to problems that a system was
poorly designed to handle) later to an already implemented system is
the most costly approach. Simply put, getting the design right is
critical. If a system wasn't designed to handle a particular
problem,
making it do so is going to be much harder than it ever should be.
Many of the following issues relate to the proper design of P2J to
handle a wider range of implementation requirements. While none
of these problems may be completely resolved in the first version, the
design of P2J will be such that resolving these in future versions will
be a reasonable effort.
Load
Balancing/Failover/Redundancy
Load balancing is the concept that multiple servers share a combined
workload but appear as a single server. This is typically done to
allow multiple physical hardware platforms to provide a service that
appears to be from a single virtual server. Systems implementing
this facility obtain scalability and capacity by using multiple
physical hardware systems instead of a single relatively larger system.
Failover is the concept where a failed server's workload is
transparently shifted to and handled by another server. Systems
designed for failover provide a significantly higher level of
availability than is otherwise possible.
Redundancy is the concept that multiple equivalent paths through a
system can be implemented to reduce possible points of failure.
More specifically, these multiple paths must be implemented using
duplicate components that can failover transparently. In every
place where this is implemented it eliminates a single point of failure
and increases the reliability and availability of the system.
These 3 problems are highly related. Typically, there is a single
solution to all 3 problems.
Naming and routing can be used to enable multiple identical, physically
separated back end services to provide what appears to be a single
unified service. This is very important to ensure that the client
can locate, establish communications and execute transactions with this
service without any knowledge that it is being serviced by multiple
processes, spread across multiple systems on the network.
The physical design of these services must provide for network,
hardware and software redundancy. The best scenario is to ensure
that the network is meshed such that a single failing component cannot
make the service unavailable. Then each cooperating service must
be located on a separate physical platform, on separate network devices.
The directory service combined with the router provides the load
balancing and failover from the client perspective. It must be
aware of the status of each cooperating service and it distributes
workload accordingly. Most importantly, when a service becomes
unavailable the router must ensure that the work is redirected to other
services AND that any transactions in process are restarted as
necessary to continue processing. At worst the client should only
see a temporary failure and a retry will succeed. Optimally, even
this temporary failure would not occur, however this is highly
dependent upon the state of the database/transactions and the session
state of the client. While the session state will be stored at or
near the router, the database and transaction state is typically very
difficult to share.
The P2J environment is quite reliant upon the directory server (an LDAP
implementation) and upon the relational database that the customer
chooses to implement. Both of these facilities must be
implemented as a redundant solution such that these do not become a
single point of failure.
Finally, the transaction server/router itself must be made to be
redundant with automatic failover. This will likely involve an
external (customer-supplied) solution that implements a single IP
address from the client view but multiple back end destinations in
reality. Many such solutions exist (some hardware based and some
software based). These solutions must handle the proper
matching of the session traffic with the right destination. To
make this more seamless, the transaction server/router will be written
to very carefully maintain an identical shared copy (accessible to each
instance of the transaction server/router) of each session's
state. By ensuring consistency of this shared state at all times,
seamless failover can occur because individual transactions can be
handled by different transaction servers/routers each time without any
negative impact.
There is no plan to implement these features for the first
version. It is important to note that the system will be designed
such that adding these features in the future will not require a
complete rewrite of the system.
User Hooks and Implementation Control
It is much less effort to modify a system's behavior via a
configuration file than it is to modify or replace the components of
the system itself. Thus in the areas of P2J where it is likely
that implementor control will have great value, it is important to plan
for such control in the design.
The following are approaches that will be used to provide such control:
- User Hooks
- Some processing can vary greatly from customer to customer,
such that a default implementation will only solve the basic problem
for a subset of customers.
- Other processing can be designed for integration with back-end
or 3rd party systems such that a customer specific implementation of a
generic service is necessary.
- In such cases, it is useful to implement the user hook model of
processing (a sketch follows this list):
- An Interface defines the API that must be provided.
- A default implementation (or multiple possible
implementations) is provided to meet the most standard customer needs.
- The implementor that requires a different behavior writes a
Java class that implements that interface and provides the processing
needed by their specific implementation.
- This "plugin" or "provider" is specified to the P2J system
through some form of configuration or property which overrides the
default processing.
- At runtime, this user hook is called as an integrated part of
the system. It is done in-process, so there are security and
reliability implications for the writer of the hook. Such hooks
must be carefully reviewed and their implementation (at the file system
level and in the P2J configuration) must be carefully secured.
- Each hook will have a custom Interface definition, as a more
generic model is neither practical nor efficient.
- Examples in P2J:
- Configuration Values
- When the number of possible implementation choices is small and the
choices are well known in advance, it makes sense to provide implementor
control via configuration values.
- In some cases, the values specified may not select between
alternatives, but instead the configuration may specify a specific set
of scalar or vector values that constrain processing (e.g. an integer
configuration value that sets a maximum size for a resource) or
identify a resource (e.g. a filename for a log file) or enable/disable
features (e.g. turning on debugging levels or tracing).
- A general approach for the P2J runtime is to implement a
default set of values that can be overridden by centralized
configuration (in the directory) and also by local JVM
properties. The local properties would have precedence
except in cases where this would cause a security problem or where a
system wide default must be maintained.
- Data Driven Rules
- Many types of processing can be defined as a set of expressions
(or rules) using a simple expression language that is evaluated at
runtime rather than hard-coded into the Java source code of an
application.
- The advantage of this approach is significant, since developing and
supporting custom Java code is much more effort than specifying a
different set of rules that express the core function the implementor
requires. This is a more expressive
manner to control the processing (it takes less input) and the fact
that it can be modified at runtime without altering a well-known,
well-tested code base means that it is less risky and much quicker to
implement.
- A standard Expression Engine
will be embedded or called from multiple places in the P2J
environment. This will allow each location to process these rules
based on input that is controlled by the implementor. As with any
other implementor controlled technology, there are security and
reliability implications. Careful review methods and strong
security discipline must be enforced.
- Examples in P2J:
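Separate from the example lists above, the following sketch only illustrates the
user hook mechanism itself: an Interface defines the API, a default implementation
covers the common case, and a configured class name substitutes a customer-written
provider at runtime. Every name in it (AuthenticationHook, the "p2j.auth.hook"
property, etc.) is invented for illustration and is not an actual P2J interface or
configuration key:

    // Hypothetical user hook: an Interface defines the API that must be provided...
    public interface AuthenticationHook
    {
       boolean authenticate(String userId, String credentials);
    }

    // ...a default implementation is supplied for the standard case...
    class DefaultAuthenticationHook implements AuthenticationHook
    {
       public boolean authenticate(String userId, String credentials)
       {
          // default processing would go here
          return credentials != null && credentials.length() > 0;
       }
    }

    // ...and the runtime loads an override named in configuration, if present.
    class HookLoader
    {
       static AuthenticationHook loadAuthenticationHook()
       {
          // "p2j.auth.hook" is an invented property name used only for this sketch
          String className = System.getProperty("p2j.auth.hook");
          if (className != null)
          {
             try
             {
                return (AuthenticationHook) Class.forName(className).newInstance();
             }
             catch (Exception e)
             {
                // a bad hook configuration falls back to the default (and would be logged)
             }
          }
          return new DefaultAuthenticationHook();
       }
    }

Because the configured class runs in-process, this is exactly the point at which the
security and reliability review noted above must be applied.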
Internationalization
All runtime modules will be enabled for internationalization.
This includes the user
interface components, the online
help system, string and other data processing and formatting.
This enablement means that if multiple resource bundles exist, they
will be properly recognized and utilized by the runtime
components. One feature that will not be present in the first
version is right to left (RTL) language support (e.g. Arabic or
Hebrew). The Java language itself handles the normal
problems related to character sets by using UNICODE. P2J picks up
this benefit "for free".
To the extent practical, string resources (literals) of the converted
application will be collected into resource bundles in a format that
facilitates internationalization. In particular, it is important
to enable the simple maintenance, editing and runtime substitution of
alternate resource bundles based on the target locale.
Locale specific data formats (e.g. dates, monetary amounts...) will
likewise be easily maintained and runtime-selected using a
configuration (stored file, database or the directory) rather than hard
coding these values into the converted application code.
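For example (illustrative only; the bundle name and resource key below are invented),
the standard J2SE facilities already cover the runtime selection just described:

    import java.text.DateFormat;
    import java.text.NumberFormat;
    import java.util.Date;
    import java.util.Locale;
    import java.util.ResourceBundle;

    // Illustration of locale driven resource and format selection using standard
    // J2SE classes; "ApplicationStrings" and "order.total.label" are invented
    // names for this sketch only.
    class LocaleExample
    {
       static String formatTotal(Locale userLocale, double amount)
       {
          // string resources come from the bundle matching the user's locale
          ResourceBundle bundle = ResourceBundle.getBundle("ApplicationStrings", userLocale);
          String label = bundle.getString("order.total.label");

          // locale specific data formats (currency, dates) are likewise selected
          NumberFormat money = NumberFormat.getCurrencyInstance(userLocale);
          DateFormat when = DateFormat.getDateInstance(DateFormat.MEDIUM, userLocale);

          return label + " " + money.format(amount) + " (" + when.format(new Date()) + ")";
       }
    }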
Please note that this support is designed to separate out the resources
into sets that can be translated or replaced and then selected at
runtime. This represents a significant amount of the effort in
internationalizing an application, but by no means does it cover the
full effort. In particular, it is extremely likely that any given
Progress 4GL source file has dependencies on a hard coded locale.
An example would be language statements that concatenate English
strings in a given order and with hard coded padding or character
widths (possibly for columns in a report). Internationalizing the
strings alone does not resolve the basic problem since the source code
itself has assumptions regarding the locale. P2J will not address
this issue at all in the first version. If such hard coding
exists in the application, it will exist after conversion as well.
At this time there is no plan to provide multiple translations of the
P2J runtime resources (for standard dialogs, error messages or the
administrative interface) and documentation. The runtime will
support multiple locales, but only the US English locale will be
provided in the first versions. Enhancements to this will be
possible in the future based on customer requirements.
Since the vast majority of the data processing (including string and
data formatting) is done on the server side of the P2J process, it is
very likely that the client's session must define the locale and this
locale must be honored at the server. For example, a user in Japan using a
terminal connected to a CHARVA client running on the P2J server must be
running a client with the expected
locale. This locale will be picked up and stored in the session
state by the server side. Subsequent processing on the server
will use this locale. Different users running clients that have
different locales must each find that the same server honors their
specific settings without regard to the locale in which the server is
running. Likewise these different users must never see any
indication that other locales (besides their own) are in use.
Regression Testing
Unit Testing
Unit testing is designed to determine the conformance of a specific
module or component to its corresponding specification. There are
2 approaches to unit testing which differ based on the type of
processing being tested:
- Batch Processing
- Wherever possible, a command line driver program will be
provided to test a specific package or (more likely) each major
component of a package. This is easily done for batch oriented
processing (of which P2J has a great deal). This facilitates
manual testing of specific subsets of function.
- Inputs will be standardized and accepted from files.
- Baseline output will be generated from each standardized
input. These outputs can be manually and/or automatically
compared to detect any deviation from the expected results.
- Each set of known input and baseline output is considered a
testcase.
- A harness will be created to run a batch of testcases and
compare the results. This will facilitate the automated
regression testing of batch oriented processing.
- API Conformance
- For each API under test, a testcase driver application will be
written.
- This application runs a set of calls to the API, each call
having known inputs.
- The output from each call is compared to the baseline output.
- To the extent practical, the storage of these inputs and
outputs will be made external to the testcase driver application
code. This will facilitate the more rapid creation of additional
testcases by modifying data rather than hard coding the tests.
- A harness will be created to run a set of testcases and compare
the
results. This will facilitate the automated regression testing of
APIs.
- It is possible that there will be APIs that provide interactive
services (e.g. user interface components). While testcase driver
applications can be created for these interactive APIs, it is expected
that these testcases will have to be run and reviewed manually.
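A minimal sketch of the batch testcase comparison described above follows; the file
layout (one baseline file and one actual output file per testcase) and the class
name are assumptions made for this sketch:

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;
    import java.io.IOException;

    // One harness step: run a batch testcase by comparing its actual output to
    // the stored baseline, line by line.
    class TestcaseCompare
    {
       static boolean matchesBaseline(File baseline, File actual) throws IOException
       {
          BufferedReader expected = new BufferedReader(new FileReader(baseline));
          BufferedReader produced = new BufferedReader(new FileReader(actual));
          try
          {
             String e;
             String p;
             do
             {
                e = expected.readLine();
                p = produced.readLine();
                if (e == null || p == null)
                {
                   return e == p;           // both files must end at the same line
                }
             }
             while (e.equals(p));
             return false;                   // first differing line fails the testcase
          }
          finally
          {
             expected.close();
             produced.close();
          }
       }
    }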
System Testing
System testing is designed to certify the proper functioning of the
entire system. By nature, such tests are more end-user
oriented. This means that the testcases are designed to match real
application processing as closely as possible. Testcases will
be generated by both the customer and Golden Code. At this time,
it is expected that system level testing will be a manual process of
running each testcase and confirming the results or noting the
deficiencies.
In the future, automated tests would be the objective. This is
significantly more complicated than the majority of the unit testing,
since most of the system testing is likely to involve interactive
components. It is unlikely that an automated approach can be
achieved before the first production installation.
Build Environment
ANT is used to manage builds. ANT was chosen for the following
reasons:
- It is well accepted by the Java developer community. It is
the de facto standard for Java build environments.
- It is platform independent (since it is written in Java).
- It has many features that are specifically for Java.
No platform specific scripts or components are used in the build
process. The build can run on any platform with a Sun-based
JDK. This may exclude some IBM JDKs but most JDK ports are based
on the Sun reference.
The entire P2J environment (conversion tools and runtime) is built with
ANT.
Trademarks
'Golden Code' is a registered trademark of Golden Code Development
Corporation.
Progress is a registered trademark of Progress Software Corporation.
'Java', 'J2SE', and 'Java Compatible' are trademarks of Sun
Microsystems, Inc.
Other names referenced in this document are the property of their
respective owners.
Copyright (c) 2004-2005, Golden Code
Development Corporation.
ALL RIGHTS RESERVED. Use is subject to license terms.