Internationalization

Internationalization (or I18N) is a very complex subject. This chapter is not designed to provide a comprehensive treatment of the subject. Rather, its purpose is to document the minimum steps necessary to handle the internationalization issues involved in getting an application successfully converted and running.

The most common I18N issue is properly processing character data (strings) in an application. This encompasses a range of issues, including mapping the proper glyphs (visual representations of a character) to each character, sorting, and translating character data from input sources (e.g. files or child processes) and to output targets (e.g. files, child processes or the terminal). Many character sets exist; Unicode is the standard that can handle all of the world's languages with a single definition. The problem is that most of the world's data, databases and source code do not use Unicode yet, especially in the Progress 4GL. The Java language handles character data internally as Unicode, but inputs and outputs are not necessarily Unicode. Java can also process data internally in a character set other than Unicode, but that requires much more intensive coding. The approach that FWD takes is to follow the path of least resistance in Java (everything works in Unicode internally), while ensuring that all input and output data is translated from and to the correct encodings as needed. This means that the proper encoding of every input and output must be known and honored.

At this time, only a single encoding (character set and language combination) is honored in an application and in the conversion. This means that the same encoding must be used for all inputs and outputs. Inputs include the source code, schema files, exported data dump files for database import and any other data files or child processes that may be used at runtime. Outputs include the populated database, the terminal user interface and any files or child processes (e.g. printers) that may be written to. The way to make this work is to ensure that everything uses the same encoding (which is defined by a specific locale).

Besides character data, there are also other I18N issues such as date formats and number separators. These must be specified application-wide as well, to ensure that everything is consistent.

The character encoding and these other configuration items (e.g. date formats) are combined into the definition of an environment that is specific to a language, country and character set. This definition is called a locale. A locale is shared across multiple applications on the same system and is usually a facility provided by or included with the operating system.

All the examples in this chapter are for a Linux system, but other systems such as UNIX or Windows will have similar facilities. The specific commands and techniques will vary but the concepts should be the same.
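As a point of reference, the locale currently in effect for a shell session and the full list of locales installed on the system can be displayed with the locale command (a quick check; the exact output varies from system to system):

locale
locale -a

The first command prints the LANG and LC_* settings in effect, while the second lists every locale the system has installed.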

Conversion Locale

Input File Encoding

All input files should be encoded with the same character set. Find out the character encoding for the source code. Make sure that all source code files use that same encoding. When the schema files (.df files) are exported, make sure to use that same encoding in the export. When the data is dumped/exported from the database (.d files), make sure to use that same encoding. If there are any other files or inputs for the application, ensure they use that encoding. Anything that is not encoded properly should be converted.
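As an illustration (the file name and encodings below are only examples), the encoding of an input file can be checked with the file utility and, if it does not match the chosen project encoding, converted with iconv:

file -bi src/customer.p
iconv -f UTF-8 -t IBM866 src/customer.p -o src/customer.p.converted

The first command reports the detected character set of the file; the second converts it from UTF-8 to IBM866 and writes the result to a new file, which can replace the original once the result has been reviewed.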

This encoding must be the same character set that will be used when running the FWD commands. If it is not the default character set, then the default locale must be overridden when the FWD commands are executed. See Setting the Locale for FWD for details.

Operating System Locale Support

The locale is usually a set of data files that are read and used by the C runtime library and/or the operating system APIs during processing. Subsystems such as Java depend upon the locale definitions to properly handle I18N issues.

If the specific locale needed is not yet defined in the operating system's locale definitions, it must be added. There may be an installation program or other utilities to handle this. On Linux, character sets, locales and the tools to compile new locale definitions are included with the C library. The most important tool is named localedef.

1. To examine the currently installed locales and character sets, identify the paths used on the system. Run this:

localedef --help

Near the end of the output, there will be a display like this:

System's directory for character maps : /usr/share/i18n/charmaps
                       repertoire maps: /usr/share/i18n/repertoiremaps
                       locale path    : /usr/lib/locale:/usr/share/i18n

Look inside /usr/lib/locale/ (in this case) to find the locales that are already installed. If the required locale is not present, it will need to be compiled.
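For example, on a system using the paths shown above, the locales that are already installed can be listed with:

ls /usr/lib/locale/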

2. To compile the proper locale definition for Linux, use a command similar to this:

sudo localedef --no-archive -f IBM866 -i ru_RU ru_RU.IBM866

The --no-archive option causes the compiled locale definition to be created under the same name as the last command line parameter (ru_RU.IBM866). The definition is created in a new directory of that name under the main locale path (usually /usr/lib/locale/). In this case the compiled locale is named ru_RU.IBM866 and is located in /usr/lib/locale/ru_RU.IBM866/.

The -f parameter specifies the character map name to be used. There must be a <character_map_name>.gz in the directory where character maps are stored (usually /usr/share/i18n/charmaps/). In this case the character map is IBM866 and there should be a file named /usr/share/i18n/charmaps/IBM866.gz.

The -i parameter specifies the language and country definitions to use. There must be a <lang>_<country> file of the same name in the directory for the input definitions (usually /usr/share/i18n/locales/). In this case the language is ru and the country code is RU, so the parameter is ru_RU, and there must be a file named /usr/share/i18n/locales/ru_RU.
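Before running localedef, it is easy to confirm that both inputs exist on the system (using the paths reported by localedef --help and the names from the example above):

ls /usr/share/i18n/charmaps/ | grep -i IBM866
ls /usr/share/i18n/locales/ | grep ru_RU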

3. Confirm the new locale is visible in the locale list using locale -a. It is important to check this in a regular user account (not root), because the permissions of the locale directory can hide a locale from normal users. The <lang>_<country>.<charset> locale should appear in the list. If it does not, ensure the new locale's file system permissions are correct. They should look like this:

/usr/lib/locale:
drwxr-xr-x 3 root root pl_PL.IBM852

/usr/lib/locale/pl_PL.IBM852/:
-rw-r--r-- 1 root root LC_ADDRESS
-rw-r--r-- 1 root root LC_COLLATE
-rw-r--r-- 1 root root LC_CTYPE
-rw-r--r-- 1 root root LC_IDENTIFICATION
-rw-r--r-- 1 root root LC_MEASUREMENT
drwxr-xr-x 2 root root LC_MESSAGES
-rw-r--r-- 1 root root LC_MONETARY
-rw-r--r-- 1 root root LC_NAME
-rw-r--r-- 1 root root LC_NUMERIC
-rw-r--r-- 1 root root LC_PAPER
-rw-r--r-- 1 root root LC_TELEPHONE
-rw-r--r-- 1 root root LC_TIME

This can be achieved with the following commands (the order is important):
sudo chmod 0755 /usr/lib/locale/pl_PL.IBM852/
sudo chmod 0644 /usr/lib/locale/pl_PL.IBM852/*
sudo chmod 0755 /usr/lib/locale/pl_PL.IBM852/LC_MESSAGES/
sudo chmod 0644 /usr/lib/locale/pl_PL.IBM852/LC_MESSAGES/*

Note that some Linux system updates may modify permissions on these file system resources. If you find error messages to this effect, reissue the chmod commands above to restore permissions to their proper settings.
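A quick way to verify that the compiled locale is actually usable from a regular user account is to check for it in the locale list and ask the C library to report its character map (using the pl_PL.IBM852 example above):

locale -a | grep -i pl_PL
LANG=pl_PL.IBM852 locale charmap

The second command should report IBM852; if it falls back to a different character map or prints a warning, the locale is not visible or not usable from that account.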

Setting the Locale for FWD

On Linux, the locale of the current process is set using the LANG environment variable. On most Linux systems, LANG is set to en_US.UTF-8 by default (English language, US country, UTF-8 character set). This setting is picked up by all FWD tools (conversion or runtime), since the Java Virtual Machine (JVM) naturally honors the default locale (the LANG setting). The JVM is the infrastructure that allows Java programs to execute.

To force all input/output processing for the JVM to a specific locale, use the following syntax:

LANG=<locale> <command>

This overrides the LANG environment variable for the lifetime of that specific command. Alternatively, this can be set as the system default or as the default for the user's shell based on startup script entries (e.g. ~/.bashrc).
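For example, to make a Russian IBM866 locale the default for a particular user's shell sessions, an entry such as the following could be added to that user's ~/.bashrc (adjust the locale name as needed):

export LANG=ru_RU.IBM866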

This must be used with all FWD conversion tools (e.g. ReportDriver, ConversionDriver and PatternEngine). Likewise, it must be used to start FWD servers, FWD clients and any FWD batch processes (e.g. ServerDriver or ClientDriver). This example runs a fictitious Whatever program with a Russian locale:

LANG=ru_RU.ibm866 java -classpath $P2J_HOME/p2j/lib/p2j.jar com.goldencode.p2j.Whatever

Whenever possible it is best to run the Java process using UTF-8 as the default character set. If the only reason for an override is that the conversion process must read source code files (external procedures, classes and include files) encoded in a specific charset, then the preferred approach is to set a global hint (e.g. in the abl/directory.hints file) using the source-charset hint. See the preprocessor section of the Conversion Hints chapter for details.

PostgreSQL Locale

The PostgreSQL database server on Linux relies upon the locale support of the operating system to encode and collate string data. A PostgreSQL database cluster is initialized using a particular locale, which permanently determines the collation and default character encoding of all databases created in that cluster.

As discussed in the Database Setup chapter, the FWD project provides a custom Linux locale (en_US@p2j_basic) for Progress-like, basic collation using the ISO-8859-1 character set. At the time of this writing, no custom locales have been created for other character sets. Creating a new, custom locale from scratch is beyond the scope of this document; the process is described in the FWD Developer Guide book.

Assuming you have created or obtained a custom locale that you wish to install in Linux and with which you want to initialize a PostgreSQL cluster (for example, ru_RU@p2j_basic), follow the instructions for the en_US@p2j_basic locale provided in the Database Server Setup for PostgreSQL on Linux chapter, replacing all references to en_US@p2j_basic with ru_RU@p2j_basic (or with the name of your particular custom locale).

By default, all database instances created in a cluster will use the default character set of the cluster. So, a database created in a cluster initialized with the en_US@p2j_basic locale will by default encode character data using the ISO-8859-1 character set (identified as LATIN1 by PostgreSQL). Assuming a custom locale named ru_RU@p2j_basic would use the IBM866 character set, a database created in such a cluster would default to the IBM866 character set. This default can be overridden using the -E option to the createdb utility when creating individual databases, but we recommend omitting this option and allowing the default behavior. Please consult the PostgreSQL user documentation for a list of available character sets.
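As a rough sketch only (data directory locations, service management and user accounts vary by distribution; the authoritative steps are in the Database Server Setup for PostgreSQL on Linux chapter), initializing a cluster with a custom locale and creating a database that inherits the cluster's default encoding might look like this:

sudo -u postgres initdb --locale=ru_RU@p2j_basic -D /var/lib/postgresql/data
sudo -u postgres createdb -O fwd_user my_app_db

The database name and owner in the second command are placeholders; note that no -E option is passed, so the database takes the cluster's default character set.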

Once you have initialized your cluster and created a database with the appropriate character encoding, you are ready to import your data. When importing data exported from a Progress database, special care needs to be taken to ensure the proper encoding is maintained. The import process uses the JVM to decode character data from your export (.d) files, to handle that data as Unicode while in the JVM, and finally to encode that data to the encoding expected by your database server. If the default character set of the JVM does not match the character set of the data export files you have dumped from your Progress database, you will need to set the LANG environment variable to specify the encoding of the export files when running the import command.

If the data export files are encoded in the IBM866 character set, for instance, the following command should be run from the $P2J_HOME directory to launch the import program:

LANG=ru_RU.ibm866 java
   -server
   -classpath p2j/build/lib/p2j.jar:build/lib/{my_app}.jar:cfg:
   -Djava.util.logging.config.file=cfg/logging.properties
   -DP2J_HOME=.
   com.goldencode.p2j.pattern.PatternEngine
      -d 2
      "dbName=\"{export_file_path}\"" 
      "maxThreads={num_threads}" 
      schema/import
      data/namespace
      "{schema_name}.p2o" 

Please refer to the discussion of importing data in the Data Migration chapter for the meanings of the substitution parameters {my_app}, {export_file_path}, {num_threads}, and {schema_name} in the command above.
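Once the import completes, the encoding actually in use by the database can be double-checked from psql (the database name here is a placeholder):

psql -d my_app_db -c "SHOW server_encoding;"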

H2 Database Collation

The FWD runtime environment uses the H2 open source database (www.h2database.com) to manage all temporary tables and for certain internal processing; also, H2 can be used to manage the permanent database for testing and development purposes. When used internally (for the temporary tables or housekeeping), this database is embedded in the JVM process which runs the FWD server and as such, it does not use the same locale services as the PostgreSQL server. Generally, if the localization settings of the FWD server JVM are appropriate to your locale, the embedded H2 databases will also behave properly with respect to locale.

However, special care needs to be taken to ensure that character data in database records managed by H2 is collated in the same way as it would be in a Progress database, which differs in some respects from Java's default text collation behavior. To address this mismatch, the FWD project provides a custom implementation of the java.text.spi.CollatorProvider service provider, for use with the FWD server JVM. Installation and configuration of this service provider is discussed in the String Collation Service Provider Installation chapter.
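As a rough sketch only (the exact target directory depends on the Java version and installation layout; the authoritative steps are in the String Collation Service Provider Installation chapter), installing a service provider jar via the Java Extension Mechanism amounts to copying it into the JVM's extension directory:

sudo cp p2jspi.jar $JAVA_HOME/jre/lib/ext/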

To ensure the H2 database collates its records properly for a locale other than en_US, a new java.text.spi.CollatorProvider implementation will have to be developed, packaged, and installed using the Java Extension Mechanism. This would produce a jar file similar to p2jspi.jar, which would be installed in the same manner as that file. A directory entry for the custom collation also needs to be configured, both for any external H2 databases and for the embedded databases (temp-tables, dirty database and so on). Developing and packaging such a service provider is beyond the scope of this document; see the FWD Developer Guide book.

When H2 is used to manage the permanent database, the collation needs to be set using the SET COLLATION statement before any table is created, using H2's web console or its script running tool (see the Data Migration chapter for how this scripting tool is used to install the schema in the database).
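As a sketch (the jar path, database URL, credentials and collation name below are placeholders; consult the H2 documentation for the collation names and strengths it accepts), the statement can be applied to an empty permanent database with H2's RunScript tool before the schema is installed:

echo "SET COLLATION ENGLISH STRENGTH PRIMARY;" > collation.sql
java -cp h2.jar org.h2.tools.RunScript -url jdbc:h2:~/my_app_db -user sa -script collation.sql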

Date Formatting

The order of date components (day, month and year fields) can vary from country to country or place to place. When date values are converted to and from strings, the internal date fields must be parsed or rendered in the right order. By default the order will be month - day - year (MDY). If the default is incorrect for this application, the FWD conversion and runtime tools provide a means to override the date order.

For the conversion, this can be set as the date-order global parameter (see the Project Setup chapter). In particular, this heavily affects how the data import process works, since all date fields in the data dump files (.d files) can only be properly parsed with prior knowledge of the order of the date fields.

To specify the order of the three date sub-components, use a 3-character string containing each of the letters "M", "D" and "Y" exactly once. The order from index 0 to index 2 represents the left-to-right ordering of the components: the leftmost date component is defined by the character at index position 0, the middle component by the character at index 1 and the rightmost date component by the character at index 2. For example, "DMY" specifies a day-month-year ordering.

At runtime, this same knowledge must be provided to the FWD server to allow date processing to work as expected. The chapter on Running Converted Code provides guidance on getting a test server running with the converted application. Before the FWD server is started, the dateFormat value must be specified in the directory. Find the section of the directory under the path /server/default/runtime/default/ and add a section similar to this:

<node class="string" name="dateFormat">
   <node-attribute name="value" value="YMD"/>
</node>

Only specify this if you need to override the default of “MDY”.

Number Formatting

The characters used to parse number input and format number output can vary from country to country or place to place. When numbers are converted to and from strings, the character that separates the integer and fractional portions of a decimal number must be known in order to parse or render the decimal “point”. For example, the period “.” character is the decimal separator in 139.78. Likewise, the integer portion of a number (decimal or not) is often separated into groups by a specific character. For example, the comma “,” character is the group separator in 31,888. If the appropriate number separators can be read from the locale of the JVM, then no extra effort is needed. If the JVM cannot obtain the locale-specific separators, then a comma will be used as the group separator and a period as the decimal separator. If for whatever reason the default values are incorrect, the FWD conversion and runtime tools provide a means to override the separators.
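On Linux, the separator characters that the operating system locale itself defines can be inspected directly, which is a useful sanity check before deciding whether an override is required:

locale -k LC_NUMERIC

The decimal_point and thousands_sep keywords in the output show the characters defined by the locale currently in effect.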

For the conversion, these can be set as the number-group-sep and number-decimal-sep global parameters (see the Project Setup chapter). In particular, this heavily affects how the data import process works, since all number fields in the data dump files (.d files) can only be properly parsed with prior knowledge of the separator characters.

At runtime, this same knowledge must be provided to the FWD server to allow number formatting to work as expected. The chapter on Running Converted Code provides guidance on getting a test server running with the converted application. Before the FWD server is started, the numberGroupSep and numberDecimalSep values must be specified in the directory. Find the section of the directory under the path /server/default/runtime/default/ and add a section similar to this:

<node class="string" name="numberGroupSep">
   <node-attribute name="value" value="."/>
</node>
<node class="string" name="numberDecimalSep">
   <node-attribute name="value" value=","/>
</node>

By default the numberGroupSep is a comma “,” and the numberDecimalSep is a period “.”. Only specify these entries in the directory if the defaults are not acceptable.


© 2004-2017 Golden Code Development Corporation. ALL RIGHTS RESERVED.