Project

General

Profile

Support #5183

database collation

Added by Eric Faulhaber about 3 years ago. Updated almost 3 years ago.

Status:
WIP
Priority:
Normal
Target version:
-
Start date:
Due date:
% Done:

70%

billable:
No
vendor_id:
GCD
case_num:
version:

coll.df (441 Bytes) Eric Faulhaber, 03/05/2021 05:17 PM

cp_coll.p Magnifier (3.4 KB) Eric Faulhaber, 03/05/2021 05:17 PM

CollationTest.java Magnifier (12.1 KB) Eric Faulhaber, 07/01/2021 02:57 AM

cp_coll_en_US_iso8859-1_to_basic_iso8859-1_basic_win32_chui_11.6_perm_case_sensitive_20210702_1.txt Magnifier (15.8 KB) Eric Faulhaber, 07/02/2021 03:10 AM

cp_coll_en_US_iso8859-1_to_basic_iso8859-1_basic_win32_chui_11.6_codepoint_order_20210702.txt Magnifier (15.8 KB) Eric Faulhaber, 07/02/2021 03:10 AM

cp_coll_en_US_iso8859-1_to_basic_iso8859-1_basic_win32_chui_11.6_perm_case_sensitive_20210702_2.txt Magnifier (15.8 KB) Eric Faulhaber, 07/02/2021 03:10 AM

postgresql_iso-8859-1_to_latin1_codepoint_en_US.iso88591_20210702_1.txt Magnifier (15.3 KB) Eric Faulhaber, 07/02/2021 03:10 AM

postgresql_iso-8859-1_to_latin1_case-sensitive_en_US.iso88591_20210702_1.txt Magnifier (15.3 KB) Eric Faulhaber, 07/02/2021 03:10 AM

postgresql_iso-8859-1_to_latin1_case-sensitive_en_US.iso88591_20210702_2.txt Magnifier (15.3 KB) Eric Faulhaber, 07/02/2021 03:10 AM


Related issues

Related to Database - Support #4080: AWS Aurora support New
Related to Database - Feature #5491: custom locale/collation implementations WIP

History

#1 Updated by Eric Faulhaber about 3 years ago

A review of how FWD collates its database data has raised some questions as to whether we are doing it correctly for a given character set and language. The first combination we are reviewing is en_US with the 4GL's default ISO-8859-1 character set.

#2 Updated by Eric Faulhaber about 3 years ago

The attached program (cp_coll.p) was written as a starting point to review the 4GL's collation, inspired by the existing FWD tool, locale/gen-locale.p. A small, local database (coll) must be created on the target system, using the attached schema. The program must be run connected to the database. It expects one parameter: the base file name (no extension). A series of comparable output files will be created using this base name as a prefix. The files show the first 256 code points of a character set, both in code point order and in collated order by encoded character. Four files are created:

  • output in code point order without any use within a database;
  • output in code point order, where the characters of the code page have been stored in a temp-table;
  • output in encoded character order, where the characters of the code page have been stored in a temp-table;
  • output in encoded character order, where the characters of the code page have been stored in a persistent table (chardata table in the coll database).

#3 Updated by Greg Shah about 3 years ago

In the "Internationalize ABL Application" docs there are good details about codepages and all the character configuration tables (e.g. collation tables) and much more. I think we need to look at that content more carefully.

I noticed this comment:

"The figure shows that the collation table has two sections, one for case-insensitive sorts and one for case-sensitive sorts."

That suggests you may need to try multiple tests. One sorting upon a case-sensitive field and another on a case-insensitive field.

#4 Updated by Greg Shah about 3 years ago

Some additional research to consider:

1. Locales

A locale is the standard way for specifying the localized behavior of an internationalized (I18N) application or system. Locales are used by UNIX, Linux, Windows and also by application subsystems such as Java and PostgreSQL. In POSIX environments, a single environment variable LANG sets the locale. The format of LANG specifies the language, territory (usually a country) and a codeset (often called a codepage or "charset" or "character encoding"). In the US, users of Linux often will have en_US.UTF-8 as the LANG. This article why is a charset included with the locale has some interesting history of why the locale is a combination of these different factors.

The locale controls the code page used, the collation settings (which on POSIX systems can be individually set by the LC_COLLATE environment variable), number formatting (LC_NUMERIC), time/date formatting (LC_TIME), currency (LC_MONETARY) and so on.

IEEE and the Open Group have standardized locales, including how to define them. The latest standard is POSIX.1.2017 (see Chapter 6 "Character Set" and Chapter 7 "Locale"). The older 1003.1 standard from 2004 has Chapter 6 Character Set and Chapter 7 Locale. The standard defines how locales should work including the definition file formats and the localedef utility for turning text definitions into the binary files needed at runtime.

The POSIX support also defines an API for using that support. As an example, the standard C runtime includes this API.

Non-POSIX operating systems like Windows have their own locale support. For Windows, the support is proprietary instead of standards/POSIX based. For WIN32 locale support, see Locales and Languages. Note that as of Windows Vista, it is possible to create custom locales. It is hard to understand why it took that long to provide that capability.

2. 4GL Code Pages, convmap.cp and Character Processing Tables

The 4GL does not honor the concept of a locale. It does not appear to use the operating system I18N features, except possibly the low level code page support. For most of the support it has its own proprietary implementation.

For text support, it provides a limited set of configurable capabilities using "Character Processing Tables":

  • character attribute table (defines which code points are alphabetic vs not alphabetic)
  • case table (maps how each character is uppercased and lowercased)
  • collation (defines the sort order for case-sensitive character data and a separate order for case-insensitive character data)
  • code page conversion (defines mappings for converting from some code pages into other code pages)
  • word break (rules for how word indexes split character data into words)

Please note that the character processing tables seem to only have 256 slots. I see no documented way to customize the built-in character processing for multi-byte or Unicode code-pages. The 4GL does provide some support for both multi-byte and Unicode code-pages, but it seems that some level of extra application processing/awareness is needed. For example, the ChUI client does not directly support UTF-8 but can convert from/to UTF-8 in I/O. On the other hand, the GUI client does support UTF-8 as its internal code-page. The programmer must be aware of which language features and which 4GL client types support these multi-byte/Unicode code-pages.

The configuration definition exists in a convmap.dat file, which is a list of supported code-pages and each code-page's associated character processing tables (excluding word break tables). These text definitions in (and included by) convmap.dat are "compiled" into a convmap.cp binary definition using proutil.

At startup of the OE client, the configuration of the session is defined as follows:

  • Use the default convmap.cp file from the installation; OR
  • Find a convmap.cp via the PROCONV environment variable; OR
  • Find the convmap.cp from the -convmap startup parameter.

There are separate settings to select which portions of the convmap.cp are in effect.

  • code-pages
    • Code-pages can be specified for the internal processing of the 4GL and for some different I/O features. The current settings can be discovered using the SESSION handle and the the following read-only attributes: CPINTERNAL (internal processing), CPLOG (log output), CPPRINT (printer output), CPRCODEIN (r-code text segment reading), CPRCODEOUT (r-code text segment writing), CPSTREAM (stream I/O), CPTERM (terminal output).
    • The specific code-page used for each of these can be set on the command line using -cpinternal, -cplog, -cpprint, -cprcodein, -cprcodeout and -cpstream, -cpterm. You cannot set these values after startup.
    • The documentation suggests that the following are defaults (if no command line overrides occur):
      • CPINTERNAL defaults to ISO8859-1
      • CPSTREAM and CPLOG default to IBM850
      • none of the others are documented
    • Please note that ISO8859-1 is not the same as Windows 1252 (which is really closer to ISO8859-15) and neither are the same as IBM850.
  • case processing table
    • In the convmap.cp, for each listed code-page there can be 1 case processing table specified by name.
    • The case processing table in use defaults to BASIC.
    • This can be read (but not set) via SESSION:CPCASE.
    • At startup, it can be set via -cpcase.
  • collation table
    • In the convmap.cp, for each listed code-page there can be 1 collation table specified by name.
    • The collation table in use defaults to BASIC.
    • This can be read (but not set) via SESSION:CPCOLL.
    • At startup, it can be set via -cpcoll.
  • character attribute table
    • In the convmap.cp, for each listed code-page there can be 1 character attribute table. It does not have a name.
    • As far as I know it cannot be chosen at startup and it cannot be queried via a SESSION attribute.
  • code-page conversion tables
    • It seems that these are mostly for converting from some related code-page into a main code-page.
    • As such, when they appear in convmap.cp, they may not be configured for the main code-pagem but rather on the related ones.
    • The documentation is not clear and I haven't investigated this.
    • These cannot be set at startup and there is no way to query it from the SESSION handle.
  • language
    • The value reported by the CURRENT-LANGUAGE built-in function controls which text segment of the r-code strings are read from at runitme.
    • It can be set at startup using -lng or at runtime using the CURRENT-LANGUAGE statement. See #3817 for more details.
    • As far as I know, it does not otherwise affect character data processing.

The other parts of a standard locale are either not present in the 4GL (e.g. currency support, most date/time support...) OR there is some limited proprietary feature. For example:

  • date/time
    • The date order (e.g. usually set to MDY or DMY) can be set via command line -d or SESSION:DATE-FORMAT.
    • The Y2K windowing year (start of 100 year range for yy formats) is set with -yy or SESSION:YEAR-OFFSET. It defaults to 1950.
    • Whether EXPORT, PUT UNFORMATTED and MESSAGE output 4 digit years instead of the default 2 can be set via -yr4def on the command line.
  • numbers
    • The SESSION:NUMERIC-FORMAT will report "American" or European" and defaults to "American". It can be written at runtime and also changed via @SESSION:SET-NUMERIC-FORMAT() method.
    • -E can force European number formatting from the command line.
    • The decimal point character can be set with -numdec and can be read (but not written) using SESSION:NUMERIC-DECIMAL-POINT.
    • The group separator character can be set with -numsep and can be read (but not written) using SESSION:NUMERIC-SEPARATOR.

I think that is the extent of the configurable I18N support for 4GL sessions.

The database character processing is handled a bit differently. It still seems to have a concept of the character processing tables, but they are defined in .df files. You pick an empty database for the core "language" (though I think it is more correct to say that the code-page is chosen) and then load in any customizations via a .df. From there you can add your tables/fields/indexes. There is also a process by which one can change code pages for an existing database. Once set (or modified), all other character processing at the database is done implicitly.

The following are 4GL language features which can exhibit collation behavior:

  • The relational comparison operators (LT, LTE, GT, GTE) when used with text data.
  • The COMPARE() builtin function.
  • Explicit query sorting using a BY clause.
  • Overriding collation table usage in a query (FOR, OPEN QUERY, DO/REPEAT PRESELECT) with a COLLATE clause.
    • Implicit query sorting from an index.

The origin of the collation tables for the 4GL is completely unclear. An important aspect of this task is to see if we can find a proper match to some industry standard so that we don't have to maintain the custom locale for PostgreSQL.

Note that there are aspects of specific language collation in real language use which cannot be represented as a table of 256 single character sort order positions. Such an approach can order individual characters but it misses more complex multi-character ordering that is possible. POSIX collation is more complex and can handle some amount of this real world complexity.

#5 Updated by Greg Shah about 3 years ago

#6 Updated by Eric Faulhaber about 3 years ago

I've committed a newer version of the collation data collector program to 3821c/12160.

The locale/cp_coll/ directory contains the program, the exported schema needed when creating the small test database (coll.df), and a readme with setup and use instructions.

#7 Updated by Eric Faulhaber almost 3 years ago

  • Assignee set to Eric Faulhaber
  • Status changed from New to WIP

#8 Updated by Eric Faulhaber almost 3 years ago

I've reviewed the output of the cp_coll.p program from several environments:

  • en_US / 8859-1
    • Linux
    • Windows
  • en_GB / 8859-1
    • Linux
    • Windows
  • nl_NL / CP1252
    • Windows

For the Windows runs, both GUI and ChUI samples were captured. Only ChUI samples were available for the Linux runs.

I found that within a locale, character set, and case-sensitivity, the collation of temp-table fields seems to be stable. That is, the characters seem to sort in the same order on every run.

The collation of persistent tables is not as stable. From run to run, we can see that bands of characters sort in a stable order, but the individual characters within those bands sort differently from run to run. This indicates that the characters within a band compare equally during a sort operation, and something else (rowid, it seems) governs the sort within those bands.

For case-sensitive fields, the same character in the same case, but with different diacritical marks, can sort interchangeably. For example, see the following two bands of characters from two persistent table samples with lang/locale en_US and cpstream/cpinternal of ISO8959-1:

Sample 1:

 codepoint | hex | character |  lower   |  upper   |       rowid        
-----------+-----+-----------+----------+----------+--------------------
...
        65 | 41  | A         | a        | A        | 0x0000000000001b21
       194 | C2  | Â         | â        | Â        | 0x000000000000218d
       195 | C3  | Ã         | ã        | Ã        | 0x000000000000218e
       196 | C4  | Ä         | ä        | Ä        | 0x000000000000218f
       192 | C0  | À         | à        | À        | 0x0000000000002d3e
       193 | C1  | Á         | á        | Á        | 0x0000000000002d3f
...
       224 | E0  | à         | à        | À        | 0x0000000000000ad4
       225 | E1  | á         | á        | Á        | 0x0000000000000ad5
       226 | E2  | â         | â        | Â        | 0x0000000000000ad6
       227 | E3  | ã         | ã        | Ã        | 0x0000000000000ad7
       228 | E4  | ä         | ä        | Ä        | 0x0000000000000ad8
        97 | 61  | a         | a        | A        | 0x0000000000001641
...

Sample 2:

 codepoint | hex | character |  lower   |  upper   |       rowid        
-----------+-----+-----------+----------+----------+--------------------
...
       192 | C0  | À         | à        | À        | 0x0000000000001660
       193 | C1  | Á         | á        | Á        | 0x0000000000001661
       194 | C2  | Â         | â        | Â        | 0x0000000000001662
       195 | C3  | Ã         | ã        | Ã        | 0x0000000000001663
       196 | C4  | Ä         | ä        | Ä        | 0x0000000000001664
        65 | 41  | A         | a        | A        | 0x00000000000024e1
...
        97 | 61  | a         | a        | A        | 0x0000000000001b21
       226 | E2  | â         | â        | Â        | 0x000000000000218d
       227 | E3  | ã         | ã        | Ã        | 0x000000000000218e
       228 | E4  | ä         | ä        | Ä        | 0x000000000000218f
       224 | E0  | à         | à        | À        | 0x0000000000002d3e
       225 | E1  | á         | á        | Á        | 0x0000000000002d3f
...

For case-insensitive fields, the same character, but in a different case or with different diacritical marks, can sort interchangeably. For example, see the following band of characters from two samples:

Sample 1:

 codepoint | hex | character |  lower   |  upper   |       rowid        
-----------+-----+-----------+----------+----------+--------------------
...
       224 | E0  | à         | à        | À        | 0x0000000000000ad4
       225 | E1  | á         | á        | Á        | 0x0000000000000ad5
       226 | E2  | â         | â        | Â        | 0x0000000000000ad6
       227 | E3  | ã         | ã        | Ã        | 0x0000000000000ad7
       228 | E4  | ä         | ä        | Ä        | 0x0000000000000ad8
        97 | 61  | a         | a        | A        | 0x0000000000001641
        65 | 41  | A         | a        | A        | 0x0000000000001b21
       194 | C2  | Â         | â        | Â        | 0x000000000000218d
       195 | C3  | Ã         | ã        | Ã        | 0x000000000000218e
       196 | C4  | Ä         | ä        | Ä        | 0x000000000000218f
       192 | C0  | À         | à        | À        | 0x0000000000002d3e
       193 | C1  | Á         | á        | Á        | 0x0000000000002d3f
...

 codepoint | hex | character |  lower   |  upper   |       rowid        
-----------+-----+-----------+----------+----------+--------------------
...
       192 | C0  | À         | à        | À        | 0x0000000000001660
       193 | C1  | Á         | á        | Á        | 0x0000000000001661
       194 | C2  | Â         | â        | Â        | 0x0000000000001662
       195 | C3  | Ã         | ã        | Ã        | 0x0000000000001663
       196 | C4  | Ä         | ä        | Ä        | 0x0000000000001664
        97 | 61  | a         | a        | A        | 0x0000000000001b21
       226 | E2  | â         | â        | Â        | 0x000000000000218d
       227 | E3  | ã         | ã        | Ã        | 0x000000000000218e
       228 | E4  | ä         | ä        | Ä        | 0x000000000000218f
        65 | 41  | A         | a        | A        | 0x00000000000024e1
       224 | E0  | à         | à        | À        | 0x0000000000002d3e
       225 | E1  | á         | á        | Á        | 0x0000000000002d3f
...

Regardless of case-sensitivity, some bands of other characters seem to sort interchangeably. The following samples happen to be from a case-insensitive sort, but the same behavior can be seen within case-sensitive sorts.

Sample 1:

 codepoint | hex | character |  lower   |  upper   |       rowid        
-----------+-----+-----------+----------+----------+--------------------
        32 | 20  | <SP>      | <SP>     | <SP>     | 0x00000000000024e0
         0 | 00  | <NULL>    | <NULL>   | <NULL>   | 0x0000000000002a00
...
       215 | D7  | ×         | ×        | ×        | 0x0000000000000acb
       159 | 9F  | <?>       | <?>      | <?>      | 0x0000000000001dff
...

Sample 1:

 codepoint | hex | character |  lower   |  upper   |       rowid        
-----------+-----+-----------+----------+----------+--------------------
         0 | 00  | <NULL>    | <NULL>   | <NULL>   | 0x0000000000000160
        32 | 20  | <SP>      | <SP>     | <SP>     | 0x0000000000002a00
...
       159 | 9F  | <?>       | <?>      | <?>      | 0x000000000000165f
       215 | D7  | ×         | ×        | ×        | 0x0000000000001677
...

The characters in each of these bands seem to evaluate to the same weight in terms of collation, and so the collation within those bands appears to be undefined. We do not see this same variability within the temp-tables samples. All of the above samples of characters sort as follows in all of the temp-table samples from the same environment...

Case-sensitive:

 codepoint | hex | character |  lower   |  upper   |       rowid        
-----------+-----+-----------+----------+----------+--------------------
         0 | 00  | <NULL>    | <NULL>   | <NULL>   | 0x0000000000000900
        32 | 20  | <SP>      | <SP>     | <SP>     | 0x0000000000000920
...
        65 | 41  | A         | a        | A        | 0x0000000000000941
       192 | C0  | À         | à        | À        | 0x0000000000000a1f
       193 | C1  | Á         | á        | Á        | 0x0000000000000a20
       194 | C2  | Â         | â        | Â        | 0x0000000000000a21
       195 | C3  | Ã         | ã        | Ã        | 0x0000000000000a22
       196 | C4  | Ä         | ä        | Ä        | 0x0000000000000a23
...
        97 | 61  | a         | a        | A        | 0x0000000000000961
       224 | E0  | à         | à        | À        | 0x0000000000000a3f
       225 | E1  | á         | á        | Á        | 0x0000000000000a40
       226 | E2  | â         | â        | Â        | 0x0000000000000a41
       227 | E3  | ã         | ã        | Ã        | 0x0000000000000a42
       228 | E4  | ä         | ä        | Ä        | 0x0000000000000a43
...
       159 | 9F  | <?>       | <?>      | <?>      | 0x000000000000099f
       215 | D7  | ×         | ×        | ×        | 0x0000000000000a36
...

Case-insensitive:

 codepoint | hex | character |  lower   |  upper   |       rowid        
-----------+-----+-----------+----------+----------+--------------------
         0 | 00  | <NULL>    | <NULL>   | <NULL>   | 0x0000000000000900
        32 | 20  | <SP>      | <SP>     | <SP>     | 0x0000000000000920
...
        65 | 41  | A         | a        | A        | 0x0000000000000941
        97 | 61  | a         | a        | A        | 0x0000000000000961
       192 | C0  | À         | à        | À        | 0x0000000000000a1f
       193 | C1  | Á         | á        | Á        | 0x0000000000000a20
       194 | C2  | Â         | â        | Â        | 0x0000000000000a21
       195 | C3  | Ã         | ã        | Ã        | 0x0000000000000a22
       196 | C4  | Ä         | ä        | Ä        | 0x0000000000000a23
       224 | E0  | à         | à        | À        | 0x0000000000000a3f
       225 | E1  | á         | á        | Á        | 0x0000000000000a40
       226 | E2  | â         | â        | Â        | 0x0000000000000a41
       227 | E3  | ã         | ã        | Ã        | 0x0000000000000a42
       228 | E4  | ä         | ä        | Ä        | 0x0000000000000a43
...
       159 | 9F  | <?>       | <?>      | <?>      | 0x000000000000099f
       215 | D7  | ×         | ×        | ×        | 0x0000000000000a36
...

I think the relative stability of the temp-table runs is just an illusion, though, because it seems that the "tie-breaker" when two characters sort at the same weight is the rowid assigned to the record. Note that the program generates the records three times and sorts them three times. If we just generate the records once and query them three times, the sort order is identical across the queries, for both temp-tables and persistent tables. Thus, the variability seems to be introduced in the record creation, rather than in the query sort.

The working theory is that the algorithm which assigns rowids for temp-tables is more likely to use sequential values as the records are created, whereas rowids assigned for persistent tables tend to vary more, so these can get out of sequence, even if records are created in the same order. Thus, certain bands of characters will appear to have a stable rowid sort order from run to run with temp-tables, whereas these same characters may appear to have a less stable rowid sort order from run to run with persistent tables.

Bearing in mind the above, I think it is safe to use the temp-table results as the basis for some conclusions about collation. These seem to be the "most canonical" versions of the results, as the rowid order tracks the codepoint order, and I think the codepoint order is a reasonable secondary sort criterion.

#9 Updated by Eric Faulhaber almost 3 years ago

  • Related to Feature #5491: custom locale/collation implementations added

#10 Updated by Eric Faulhaber almost 3 years ago

Ovidiu, I need your help.

In any of your encoding testing lately, have you been able to create a database cluster which uses CP1252 as the encoding, lc_ctype, and lc_collate type? I don't mean a custom locale; I need to do this using the stock Linux locales.

When I try the following, I get an error:

ecf@ecf:~/eclipse-workspace/sandbox$ sudo pg_createcluster 9.5 test_en_US_cp1252b --locale=en_US.cp1252 --encoding CP1252
Creating new cluster 9.5/test_en_US_cp1252b ...
  config /etc/postgresql/9.5/test_en_US_cp1252b
  data   /var/lib/postgresql/9.5/test_en_US_cp1252b
  locale en_US.cp1252
could not change directory to "/home/ecf/eclipse-workspace/sandbox": Permission denied
initdb: "CP1252" is not a valid server encoding name
Error: initdb failed

So, this makes sense, since --encoding must refer to one of the character sets PostgreSQL supports. So, I tried:

ecf@ecf:~/eclipse-workspace/sandbox$ sudo pg_createcluster 9.5 test_en_US_cp1252b --locale=en_US.cp1252 --encoding WIN1252
Creating new cluster 9.5/test_en_US_cp1252b ...
  config /etc/postgresql/9.5/test_en_US_cp1252b
  data   /var/lib/postgresql/9.5/test_en_US_cp1252b
  locale en_US.cp1252
could not change directory to "/home/ecf/eclipse-workspace/sandbox": Permission denied
initdb: encoding mismatch
The encoding you selected (WIN1252) and the encoding that the
selected locale uses (LATIN1) do not match.  This would lead to
misbehavior in various character string processing functions.
Rerun initdb and either do not specify an encoding explicitly,
or choose a matching combination.
Error: initdb failed

I could only get it to work if I leave off the --encoding parameter, but then it uses LATIN1 as the character set:

fwd=# \l+
                                                                     List of databases
   Name    |  Owner   | Encoding |   Collate    |    Ctype     |   Access privileges   |  Size   | Tablespace |                Description                
-----------+----------+----------+--------------+--------------+-----------------------+---------+------------+--------------------------------------------
 fwd       | postgres | LATIN1   | en_US.cp1252 | en_US.cp1252 |                       | 7048 kB | pg_default |
 postgres  | postgres | LATIN1   | en_US.cp1252 | en_US.cp1252 |                       | 7040 kB | pg_default | default administrative connection database
 template0 | postgres | LATIN1   | en_US.cp1252 | en_US.cp1252 | =c/postgres          +| 6865 kB | pg_default | unmodifiable empty database
           |          |          |              |              | postgres=CTc/postgres |         |            |
 template1 | postgres | LATIN1   | en_US.cp1252 | en_US.cp1252 | =c/postgres          +| 6865 kB | pg_default | default template for new databases
           |          |          |              |              | postgres=CTc/postgres |         |            |
(4 rows)

Of course, this leads to a familiar-ish problem when I try to insert a record with an invalid character for LATIN1 from JDBC:

ecf@ecf:~/eclipse-workspace/sandbox$ java -classpath bin:/home/ecf/projects/p2j/build/lib/postgresql-42.2.5.jar sandbox.CollationTest cp1252 1 5437
java.sql.BatchUpdateException: Batch entry 127 insert into tt1(codepoint_ori, encoded_cs, encoded_ci) values (128, '€', '€') was aborted: ERROR: character with byte sequence 0xe2 0x82 0xac in encoding "UTF8" has no equivalent in encoding "LATIN1"  Call getNextException to see other errors in the batch.
    at org.postgresql.jdbc.BatchResultHandler.handleCompletion(BatchResultHandler.java:166)
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:492)
    at org.postgresql.jdbc.PgStatement.executeBatch(PgStatement.java:840)
    at org.postgresql.jdbc.PgPreparedStatement.executeBatch(PgPreparedStatement.java:1538)
    at sandbox.CollationTest.main(CollationTest.java:357)
Caused by: org.postgresql.util.PSQLException: ERROR: character with byte sequence 0xe2 0x82 0xac in encoding "UTF8" has no equivalent in encoding "LATIN1" 
    at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2440)
    at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2183)
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:481)
    ... 3 more

I don't have the same problem when the target database is UTF8:

   Name    |  Owner   | Encoding |   Collate   |    Ctype    |   Access privileges   |  Size   | Tablespace |                Description                 
-----------+----------+----------+-------------+-------------+-----------------------+---------+------------+--------------------------------------------
 fwd       | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 |                       | 7272 kB | pg_default | 

My Java test program (attached for reference) loops from code points 1 to 255, generating a String in the specified character set (in this case cp1252) from each code point. Each string is inserted via JDBC into a temporary table with 3 columns:

  • the original code point (so, just an integer from 1 to 255);
  • a case-sensitive character (type text);
  • a case-insensitive character (type citext).

This is done in the fwd test database listed above (see the class javadoc for how I set this up).

When we get to the Euro character, it blows up because of the LATIN1 encoding. Is this a bug in the Linux en_US.cp1252 locale? I don't understand why it pairs LATIN1 (i.e., ISO-8859-1) encoding with this locale. Isn't the point of the CP1252 locale to manage the characters from the CP1252 code page?

Do you see something obvious I'm doing wrong here, or is what I'm trying to do just not feasible? Thanks.

#11 Updated by Ovidiu Maxiniuc almost 3 years ago

Eric,
I have not tried to use the en_US.cp1252 locale until now. Only the custom locales. However, when I try to do it now, I get some strange errors:

om@ProDesk400G3MT:/home/om$ sudo pg_createcluster 9.5 test_en_US_cp1252b --locale=en_US.CP1252 --encoding CP1252
Creating new PostgreSQL cluster 9.5/test_en_US_cp1252b ...
/usr/lib/postgresql/9.5/bin/initdb -D /var/lib/postgresql/9.5/test_en_US_cp1252b --auth-local peer --auth-host md5 --encoding CP1252 --locale en_US.CP1252
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

initdb: invalid locale name "en_US.CP1252" 
Error: initdb failed

Do I need to add it manually to my system?

#12 Updated by Roger Borrello almost 3 years ago

A general question about the use of a custom locale. The database for an existing application would need to be dropped as well as the cluster, then the cluster rebuilt using the custom locale, and the database recreated?

Or is there a migration possible?

#13 Updated by Eric Faulhaber almost 3 years ago

Ovidiu Maxiniuc wrote:

Eric,
I have not tried to use the en_US.cp1252 locale until now. Only the custom locales. However, when I try to do it now, I get some strange errors:
[...]

Do I need to add it manually to my system?

Most likely. The non-UTF8 locales I needed for my testing were not installed by default on my Ubuntu 16.04 system, so I used the locale-gen utility (first adding locales for CP1252, ISO-8859-1, and ISO-8859-15 for en_US, en_GB, and nl_NL to the /etc/locale.gen config file) to generate them. Unfortunately, this blows away our custom locales, so I had to re-generate those afterward.

These were the specific lines I added to /etc/locale.gen, based on some Googling:

en_GB.CP1252 CP1252
en_GB ISO-8859-1
en_GB.ISO-8859-15 ISO-8859-15
en_US.CP1252 CP1252
en_US ISO-8859-1
en_US.ISO-8859-15 ISO-8859-15
nl_NL.CP1252 CP1252
nl_NL ISO-8859-1
nl_NL.ISO-8859-15 ISO-8859-15

#14 Updated by Eric Faulhaber almost 3 years ago

Roger Borrello wrote:

A general question about the use of a custom locale. The database for an existing application would need to be dropped as well as the cluster, then the cluster rebuilt using the custom locale, and the database recreated?

Yes. You don't technically have to drop the old database and cluster, but you might as well, if it was based on the old en_US@p2j_basic custom locale.

Or is there a migration possible?

Not at this time.

See also the updates to these areas of the documentation:

#15 Updated by Ovidiu Maxiniuc almost 3 years ago

Eric Faulhaber wrote:

These were the specific lines I added to /etc/locale.gen, based on some Googling:
[...]

I added those line (one or two were already there and I uncommented them). But I am stuck here. locale -a does not list them. Next, I tried

$ sudo locale-gen en_US.CP1252
Error: 'en_US.CP1252' is not a supported language or locale

I grep -ped /usr/share/i18n/SUPPORTED but I do not have any occurrences of 1252.

#16 Updated by Eric Faulhaber almost 3 years ago

Do you have /usr/share/i18n/charmaps/CP1252.gz? If so, gunzip it. I think locale-gen just runs localedef under the covers, so it has to be able to read the charmap. But, I would have thought you already have this unzipped from generating the custom locales...

I grep -ped /usr/share/i18n/SUPPORTED but I do not have any occurrences of 1252.

Hm, I don't either, but somehow I got locale-gen to create it. I may have used en_US CP1252 (instead of en_US.CP1252 CP1252) in my /etc/locale.gen file. I wonder if this is related to my problems now.

#17 Updated by Greg Shah almost 3 years ago

locale -a does not list them

If localedef was successful there should be new locale files. If that is the case, then make sure the permissions are OK. You can try sudo locale -a and if there is a difference, then permissions/ownership is the problem.

#18 Updated by Eric Faulhaber almost 3 years ago

Comparison of OE ISO-8859-1 basic collation vs. PostgreSQL ISO-8859-1 default collation (en_US)

I created ISO-8859-1 databases in OE and PostgreSQL and loaded the first 255 encoded characters into a table with a codepoint (integer) field/column, a case-sensitive character field, and a case-insensitive character field. Then I sorted by one of the character fields.

The generated files are attached:

  • 4GL
    • sorted by codepoint (cp_coll_en_US_iso8859-1_to_basic_iso8859-1_basic_win32_chui_11.6_codepoint_order_20210702.txt)
    • sorted first by case-sensitive character ascending, then by codepoint ascending (cp_coll_en_US_iso8859-1_to_basic_iso8859-1_basic_win32_chui_11.6_perm_case_sensitive_20210702_1.txt)
    • sorted first by case-sensitive character ascending, then by codepoint descending (cp_coll_en_US_iso8859-1_to_basic_iso8859-1_basic_win32_chui_11.6_perm_case_sensitive_20210702_2.txt)
  • PostgreSQL
    • sorted by codepoint (postgresql_iso-8859-1_to_latin1_codepoint_en_US.iso88591_20210702_1.txt)
    • sorted first by case-sensitive character ascending, then by codepoint ascending (postgresql_iso-8859-1_to_latin1_case-sensitive_en_US.iso88591_20210702_1.txt)
    • sorted first by case-sensitive character ascending, then by codepoint descending (postgresql_iso-8859-1_to_latin1_case-sensitive_en_US.iso88591_20210702_2.txt)

Case-Sensitive

4GL

The 4GL basic collation for this codepage sorts nearly in codepoint order up through code point 64:

 codepoint src | codepoint tgt | hex codepoint | character 
---------------+---------------+---------------+-----------
             0 |             0 |            00 | <NULL>    
            32 |            32 |            20 | <SP>      
             1 |             1 |            01 | <?>       
             2 |             2 |            02 | <?>       
             3 |             3 |            03 | <?>       
             4 |             4 |            04 | <?>       
             5 |             5 |            05 | <?>       
             6 |             6 |            06 | <?>       
             7 |             7 |            07 | <?>       
             8 |             8 |            08 | <?>       
             9 |             9 |            09 | <?>       
            10 |            10 |            0A | <LF>      
            11 |            11 |            0B | <?>       
            12 |            12 |            0C | <?>       
            13 |            13 |            0D | <CR>      
            14 |            14 |            0E | <?>       
            15 |            15 |            0F | <?>       
            16 |            16 |            10 | <?>       
            17 |            17 |            11 | <?>       
            18 |            18 |            12 | <?>       
            19 |            19 |            13 | <?>       
            20 |            20 |            14 | <?>       
            21 |            21 |            15 | <?>       
            22 |            22 |            16 | <?>       
            23 |            23 |            17 | <?>       
            24 |            24 |            18 | <?>       
            25 |            25 |            19 | <?>       
            26 |            26 |            1A | <?>       
            27 |            27 |            1B | <?>       
            28 |            28 |            1C | <?>       
            29 |            29 |            1D | <?>       
            30 |            30 |            1E | <?>       
            31 |            31 |            1F | <?>       
            33 |            33 |            21 | !         
            34 |            34 |            22 | "         
            35 |            35 |            23 | #         
            36 |            36 |            24 | $         
            37 |            37 |            25 | %         
            38 |            38 |            26 | &         
            39 |            39 |            27 | '         
            40 |            40 |            28 | (         
            41 |            41 |            29 | )         
            42 |            42 |            2A | *         
            43 |            43 |            2B | +         
            44 |            44 |            2C | ,         
            45 |            45 |            2D | -         
            46 |            46 |            2E | .         
            47 |            47 |            2F | /         
            48 |            48 |            30 | 0         
            49 |            49 |            31 | 1         
            50 |            50 |            32 | 2         
            51 |            51 |            33 | 3         
            52 |            52 |            34 | 4         
            53 |            53 |            35 | 5         
            54 |            54 |            36 | 6         
            55 |            55 |            37 | 7         
            56 |            56 |            38 | 8         
            57 |            57 |            39 | 9         
            58 |            58 |            3A | :         
            59 |            59 |            3B | ;         
            60 |            60 |            3C | <         
            61 |            61 |            3D | =         
            62 |            62 |            3E | >         
            63 |            63 |            3F | ?         
            64 |            64 |            40 | @         

This includes many non-printing/control characters, punctuation, symbols, and digits 0-9. Only the SPACE character (codepoint 32) is moved to sort first (with the same weight as NULL).

The 4GL sorts the capital letters of the alphabet next, including all alphabetical characters with diacritical marks:

            65 |            65 |            41 | A         
           192 |           192 |            C0 | À         
           193 |           193 |            C1 | Á         
           194 |           194 |            C2 | Â         
           195 |           195 |            C3 | Ã         
           196 |           196 |            C4 | Ä         
            66 |            66 |            42 | B         
            67 |            67 |            43 | C         
           199 |           199 |            C7 | Ç         
            68 |            68 |            44 | D         
            69 |            69 |            45 | E         
           200 |           200 |            C8 | È         
           201 |           201 |            C9 | É         
           202 |           202 |            CA | Ê         
           203 |           203 |            CB | Ë         
            70 |            70 |            46 | F         
            71 |            71 |            47 | G         
            72 |            72 |            48 | H         
            73 |            73 |            49 | I         
           204 |           204 |            CC | Ì         
           205 |           205 |            CD | Í         
           206 |           206 |            CE | Î         
           207 |           207 |            CF | Ï         
            74 |            74 |            4A | J         
            75 |            75 |            4B | K         
            76 |            76 |            4C | L         
            77 |            77 |            4D | M         
            78 |            78 |            4E | N         
           209 |           209 |            D1 | Ñ         
            79 |            79 |            4F | O         
           210 |           210 |            D2 | Ò         
           211 |           211 |            D3 | Ó         
           212 |           212 |            D4 | Ô         
           213 |           213 |            D5 | Õ         
           214 |           214 |            D6 | Ö         
            80 |            80 |            50 | P         
            81 |            81 |            51 | Q         
            82 |            82 |            52 | R         
            83 |            83 |            53 | S         
           223 |           223 |            DF | ß         
            84 |            84 |            54 | T         
            85 |            85 |            55 | U         
           217 |           217 |            D9 | Ù         
           218 |           218 |            DA | Ú         
           219 |           219 |            DB | Û         
           220 |           220 |            DC | Ü         
            86 |            86 |            56 | V         
            87 |            87 |            57 | W         
            88 |            88 |            58 | X         
            89 |            89 |            59 | Y         
           221 |           221 |            DD | Ý         
            90 |            90 |            5A | Z         

It is worth noting that for each letter, variants with accents/diacritical marks sort with the same weight. If we add a secondary sort by codepoint, we can see this. For example, sorting primarily by character ascending, and secondarily by codepoint ascending yields the following for variants of A:

            65 |            65 |            41 | A         
           192 |           192 |            C0 | À         
           193 |           193 |            C1 | Á         
           194 |           194 |            C2 | Â         
           195 |           195 |            C3 | Ã         
           196 |           196 |            C4 | Ä         

Reversing the secondary sort to descending by codepoint yields:

           196 |           196 |            C4 | Ä         
           195 |           195 |            C3 | Ã         
           194 |           194 |            C2 | Â         
           193 |           193 |            C1 | Á         
           192 |           192 |            C0 | À         
            65 |            65 |            41 | A         

Next, there is a small block of symbols:

            91 |            91 |            5B | [         
            92 |            92 |            5C | \         
            93 |            93 |            5D | ]         
            94 |            94 |            5E | ^         
            95 |            95 |            5F | _         
            96 |            96 |            60 | `         

This is followed by the lowercase alphabet. The same rule regarding accented characters applies:

            97 |            97 |            61 | a         
           224 |           224 |            E0 | à         
           225 |           225 |            E1 | á         
           226 |           226 |            E2 | â         
           227 |           227 |            E3 | ã         
           228 |           228 |            E4 | ä         
            98 |            98 |            62 | b         
            99 |            99 |            63 | c         
           231 |           231 |            E7 | ç         
           100 |           100 |            64 | d         
           101 |           101 |            65 | e         
           232 |           232 |            E8 | è         
           233 |           233 |            E9 | é         
           234 |           234 |            EA | ê         
           235 |           235 |            EB | ë         
           102 |           102 |            66 | f         
           103 |           103 |            67 | g         
           104 |           104 |            68 | h         
           105 |           105 |            69 | i         
           236 |           236 |            EC | ì         
           237 |           237 |            ED | í         
           238 |           238 |            EE | î         
           239 |           239 |            EF | ï         
           106 |           106 |            6A | j         
           107 |           107 |            6B | k         
           108 |           108 |            6C | l         
           109 |           109 |            6D | m         
           110 |           110 |            6E | n         
           241 |           241 |            F1 | ñ         
           111 |           111 |            6F | o         
           242 |           242 |            F2 | ò         
           243 |           243 |            F3 | ó         
           244 |           244 |            F4 | ô         
           245 |           245 |            F5 | õ         
           246 |           246 |            F6 | ö         
           112 |           112 |            70 | p         
           113 |           113 |            71 | q         
           114 |           114 |            72 | r         
           115 |           115 |            73 | s         
           116 |           116 |            74 | t         
           117 |           117 |            75 | u         
           249 |           249 |            F9 | ù         
           250 |           250 |            FA | ú         
           251 |           251 |            FB | û         
           252 |           252 |            FC | ü         
           118 |           118 |            76 | v         
           119 |           119 |            77 | w         
           120 |           120 |            78 | x         
           121 |           121 |            79 | y         
           253 |           253 |            FD | ý         
           255 |           255 |            FF | ÿ         
           122 |           122 |            7A | z         

This is followed by a jumble of additional symbols (including currency), non-printing characters, a handful more alphabetic characters with accents, and numeric fractions and subscripts in no discernable order.

           123 |           123 |            7B | {         
           124 |           124 |            7C | |         
           125 |           125 |            7D | }         
           126 |           126 |            7E | ~         
           127 |           127 |            7F |          
           229 |           229 |            E5 | å         
           197 |           197 |            C5 | Å         
           230 |           230 |            E6 | æ         
           198 |           198 |            C6 | Æ         
           248 |           248 |            F8 | ø         
           163 |           163 |            A3 | £         
           216 |           216 |            D8 | Ø         
           159 |           159 |            9F | <?>       
           215 |           215 |            D7 | ×         
           191 |           191 |            BF | ¿         
           189 |           189 |            BD | ½         
           188 |           188 |            BC | ¼         
           161 |           161 |            A1 | ¡         
           171 |           171 |            AB | «         
           187 |           187 |            BB | »         
           128 |           128 |            80 | <?>       
           129 |           129 |            81 | <?>       
           130 |           130 |            82 | <?>       
           131 |           131 |            83 | <?>       
           132 |           132 |            84 | <?>       
           169 |           169 |            A9 | ©         
           133 |           133 |            85 | <?>       
           134 |           134 |            86 | <?>       
           135 |           135 |            87 | <?>       
           136 |           136 |            88 | <?>       
           162 |           162 |            A2 | ¢         
           165 |           165 |            A5 | ¥         
           137 |           137 |            89 | <?>       
           138 |           138 |            8A | <?>       
           139 |           139 |            8B | <?>       
           140 |           140 |            8C | <?>       
           141 |           141 |            8D | <?>       
           142 |           142 |            8E | <?>       
           143 |           143 |            8F | <?>       
           174 |           174 |            AE | ®         
           172 |           172 |            AC | ¬         
           144 |           144 |            90 | <?>       
           145 |           145 |            91 | <?>       
           146 |           146 |            92 | <?>       
           147 |           147 |            93 | <?>       
           148 |           148 |            94 | <?>       
           149 |           149 |            95 | <?>       
           150 |           150 |            96 | <?>       
           164 |           164 |            A4 | ¤         
           240 |           240 |            F0 | ð         
           208 |           208 |            D0 | Ð         
           151 |           151 |            97 | <?>       
           152 |           152 |            98 | <?>       
           153 |           153 |            99 | <?>       
           154 |           154 |            9A | <?>       
           155 |           155 |            9B | <?>       
           166 |           166 |            A6 | ¦         
           156 |           156 |            9C | <?>       
           170 |           170 |            AA | ª         
           186 |           186 |            BA | º         
           181 |           181 |            B5 | µ         
           254 |           254 |            FE | þ         
           222 |           222 |            DE | Þ         
           175 |           175 |            AF | ¯         
           180 |           180 |            B4 | ´         
           173 |           173 |            AD | ­         
           177 |           177 |            B1 | ±         
           157 |           157 |            9D | <?>       
           190 |           190 |            BE | ¾         
           182 |           182 |            B6 | ¶         
           167 |           167 |            A7 | §         
           247 |           247 |            F7 | ÷         
           184 |           184 |            B8 | ¸         
           176 |           176 |            B0 | °         
           168 |           168 |            A8 | ¨         
           183 |           183 |            B7 | ·         
           185 |           185 |            B9 | ¹         
           179 |           179 |            B3 | ³         
           178 |           178 |            B2 | ²         
           158 |           158 |            9E | <?>

Non-breaking space is always sorted last.

           160 |           160 |            A0 | <NBSP>    

So, to summarize the 4GL basic collation for ISO-8859-1 at a high level:

  1. NULL and SPACE first, interchangeably
  2. Non-printing/control characters, punctuation, symbols, and digits in ASCII order through codepoint 64
  3. Uppercase alphabetic letters, including those with accents/diacritical marks (which each have the same sorting weight for the same, base letter)
  4. A few more symbols
  5. Lowercase alphabetic letters, including those with accents/diacritical marks (which each have the same sorting weight for the same, base letter)
  6. Additional symbols (including currency) and non-printing characters
  7. Non-breaking space last

PostgreSQL

The database was created in a cluster built using the en_US.iso88591 locale provided by Ubuntu 16.04:

   Name    |  Owner   | Encoding |    Collate     |     Ctype      |   Access privileges   |  Size   | Tablespace |                Description                 
-----------+----------+----------+----------------+----------------+-----------------------+---------+------------+--------------------------------------------
 fwd       | postgres | LATIN1   | en_US.iso88591 | en_US.iso88591 |                       | 7104 kB | pg_default | 

LATIN1 is an alias for ISO-8859-1. Note that "Collate" is reported as en_US.iso88591, meaning that the collation behavior in this case comes from how the libc library works with the operating system's collation, defined by the en_US.iso88591 locale.

The default PostgreSQL collation for this codepage sorts all of the following characters before codepoint 1:

 codepoint src | codepoint tgt | hex codepoint | character 
---------------+---------------+---------------+-----------
           180 |           180 |          00B4 | ´         
            96 |            96 |          0060 | `         
            94 |            94 |          005E | ^         
           168 |           168 |          00A8 | ¨         
           126 |           126 |          007E | ~         
           247 |           247 |          00F7 | ÷         
           215 |           215 |          00D7 | ×         
            60 |            60 |          003C | <         
            61 |            61 |          003D | =         
            62 |            62 |          003E | >         
           172 |           172 |          00AC | ¬         
           124 |           124 |          007C | |         
           166 |           166 |          00A6 | ¦         
           176 |           176 |          00B0 | °         
           181 |           181 |          00B5 | µ         
            32 |            32 |          0020 | <SP>      
            95 |            95 |          005F | _         
           175 |           175 |          00AF | ¯         
           173 |           173 |          00AD | ­         
            45 |            45 |          002D | -         
            44 |            44 |          002C | ,         
            59 |            59 |          003B | ;         
            58 |            58 |          003A | :         
            33 |            33 |          0021 | !         
           161 |           161 |          00A1 | ¡         
            63 |            63 |          003F | ?         
           191 |           191 |          00BF | ¿         
            47 |            47 |          002F | /         
            46 |            46 |          002E | .         
           183 |           183 |          00B7 | ·         
           184 |           184 |          00B8 | ¸         
            39 |            39 |          0027 | '         
            34 |            34 |          0022 | "         
           171 |           171 |          00AB | «         
           187 |           187 |          00BB | »         
            40 |            40 |          0028 | (         
            41 |            41 |          0029 | )         
            91 |            91 |          005B | [         
            93 |            93 |          005D | ]         
           123 |           123 |          007B | {         
           125 |           125 |          007D | }         
           167 |           167 |          00A7 | §         
           182 |           182 |          00B6 | ¶         
           169 |           169 |          00A9 | ©         
           174 |           174 |          00AE | ®         
            64 |            64 |          0040 | @         
           164 |           164 |          00A4 | ¤         
           162 |           162 |          00A2 | ¢         
            36 |            36 |          0024 | $         
           163 |           163 |          00A3 | £         
           165 |           165 |          00A5 | ¥         
            42 |            42 |          002A | *         
            92 |            92 |          005C | \         
            38 |            38 |          0026 | &         
            35 |            35 |          0023 | #         
            37 |            37 |          0025 | %         
            43 |            43 |          002B | +         
           177 |           177 |          00B1 | ±         

These are punctuation marks and symbols (apologies, cutting and pasting into Redmine is losing some of the correct glyphs), including currency symbols. Notably, the SPACE character sorts significantly lower than in the OE basic collation, where it sorts at the highest position.

Next, we have a block of non-printing/control characters in codepoint order from codepoint 1 through 31:

             1 |             1 |          0001 | <?>       
             2 |             2 |          0002 | <?>       
             3 |             3 |          0003 | <?>       
             4 |             4 |          0004 | <?>       
             5 |             5 |          0005 | <?>       
             6 |             6 |          0006 | <?>       
             7 |             7 |          0007 | <?>       
             8 |             8 |          0008 | <?>       
             9 |             9 |          0009 | <?>       
            10 |            10 |          000A | <LF>      
            11 |            11 |          000B | <?>       
            12 |            12 |          000C | <?>       
            13 |            13 |          000D | <CR>      
            14 |            14 |          000E | <?>       
            15 |            15 |          000F | <?>       
            16 |            16 |          0010 | <?>       
            17 |            17 |          0011 | <?>       
            18 |            18 |          0012 | <?>       
            19 |            19 |          0013 | <?>       
            20 |            20 |          0014 | <?>       
            21 |            21 |          0015 | <?>       
            22 |            22 |          0016 | <?>       
            23 |            23 |          0017 | <?>       
            24 |            24 |          0018 | <?>       
            25 |            25 |          0019 | <?>       
            26 |            26 |          001A | <?>       
            27 |            27 |          001B | <?>       
            28 |            28 |          001C | <?>       
            29 |            29 |          001D | <?>       
            30 |            30 |          001E | <?>       
            31 |            31 |          001F | <?>       

This is followed by another block of non-printing characters in codepoint order from 127 through 159:

           127 |           127 |          007F |          
           128 |           128 |          0080 | <?>       
           129 |           129 |          0081 | <?>       
           130 |           130 |          0082 | <?>       
           131 |           131 |          0083 | <?>       
           132 |           132 |          0084 | <?>       
           133 |           133 |          0085 | <?>       
           134 |           134 |          0086 | <?>       
           135 |           135 |          0087 | <?>       
           136 |           136 |          0088 | <?>       
           137 |           137 |          0089 | <?>       
           138 |           138 |          008A | <?>       
           139 |           139 |          008B | <?>       
           140 |           140 |          008C | <?>       
           141 |           141 |          008D | <?>       
           142 |           142 |          008E | <?>       
           143 |           143 |          008F | <?>       
           144 |           144 |          0090 | <?>       
           145 |           145 |          0091 | <?>       
           146 |           146 |          0092 | <?>       
           147 |           147 |          0093 | <?>       
           148 |           148 |          0094 | <?>       
           149 |           149 |          0095 | <?>       
           150 |           150 |          0096 | <?>       
           151 |           151 |          0097 | <?>       
           152 |           152 |          0098 | <?>       
           153 |           153 |          0099 | <?>       
           154 |           154 |          009A | <?>       
           155 |           155 |          009B | <?>       
           156 |           156 |          009C | <?>       
           157 |           157 |          009D | <?>       
           158 |           158 |          009E | <?>       
           159 |           159 |          009F | <?>       

Next are digits 0-9, but interleaved with fractions and subscripts in a sensible order:

            48 |            48 |          0030 | 0         
           188 |           188 |          00BC | ¼         
           189 |           189 |          00BD | ½         
           190 |           190 |          00BE | ¾         
            49 |            49 |          0031 | 1         
           185 |           185 |          00B9 | ¹         
            50 |            50 |          0032 | 2         
           178 |           178 |          00B2 | ²         
            51 |            51 |          0033 | 3         
           179 |           179 |          00B3 | ³         
            52 |            52 |          0034 | 4         
            53 |            53 |          0035 | 5         
            54 |            54 |          0036 | 6         
            55 |            55 |          0037 | 7         
            56 |            56 |          0038 | 8         
            57 |            57 |          0039 | 9         

This is followed by the alphabet, including accented characters and those with diacritical marks:

            97 |            97 |          0061 | a         
            65 |            65 |          0041 | A         
           225 |           225 |          00E1 | á         
           193 |           193 |          00C1 | Á         
           224 |           224 |          00E0 | à         
           192 |           192 |          00C0 | À         
           226 |           226 |          00E2 | â         
           194 |           194 |          00C2 | Â         
           229 |           229 |          00E5 | å         
           197 |           197 |          00C5 | Å         
           228 |           228 |          00E4 | ä         
           196 |           196 |          00C4 | Ä         
           227 |           227 |          00E3 | ã         
           195 |           195 |          00C3 | Ã         
           170 |           170 |          00AA | ª         
           230 |           230 |          00E6 | æ         
           198 |           198 |          00C6 | Æ         
            98 |            98 |          0062 | b         
            66 |            66 |          0042 | B         
            99 |            99 |          0063 | c         
            67 |            67 |          0043 | C         
           231 |           231 |          00E7 | ç         
           199 |           199 |          00C7 | Ç         
           100 |           100 |          0064 | d         
            68 |            68 |          0044 | D         
           240 |           240 |          00F0 | ð         
           208 |           208 |          00D0 | Ð         
           101 |           101 |          0065 | e         
            69 |            69 |          0045 | E         
           233 |           233 |          00E9 | é         
           201 |           201 |          00C9 | É         
           232 |           232 |          00E8 | è         
           200 |           200 |          00C8 | È         
           234 |           234 |          00EA | ê         
           202 |           202 |          00CA | Ê         
           235 |           235 |          00EB | ë         
           203 |           203 |          00CB | Ë         
           102 |           102 |          0066 | f         
            70 |            70 |          0046 | F         
           103 |           103 |          0067 | g         
            71 |            71 |          0047 | G         
           104 |           104 |          0068 | h         
            72 |            72 |          0048 | H         
           105 |           105 |          0069 | i         
            73 |            73 |          0049 | I         
           237 |           237 |          00ED | í         
           205 |           205 |          00CD | Í         
           236 |           236 |          00EC | ì         
           204 |           204 |          00CC | Ì         
           238 |           238 |          00EE | î         
           206 |           206 |          00CE | Î         
           239 |           239 |          00EF | ï         
           207 |           207 |          00CF | Ï         
           106 |           106 |          006A | j         
            74 |            74 |          004A | J         
           107 |           107 |          006B | k         
            75 |            75 |          004B | K         
           108 |           108 |          006C | l         
            76 |            76 |          004C | L         
           109 |           109 |          006D | m         
            77 |            77 |          004D | M         
           110 |           110 |          006E | n         
            78 |            78 |          004E | N         
           241 |           241 |          00F1 | ñ         
           209 |           209 |          00D1 | Ñ         
           111 |           111 |          006F | o         
            79 |            79 |          004F | O         
           243 |           243 |          00F3 | ó         
           211 |           211 |          00D3 | Ó         
           242 |           242 |          00F2 | ò         
           210 |           210 |          00D2 | Ò         
           244 |           244 |          00F4 | ô         
           212 |           212 |          00D4 | Ô         
           246 |           246 |          00F6 | ö         
           214 |           214 |          00D6 | Ö         
           245 |           245 |          00F5 | õ         
           213 |           213 |          00D5 | Õ         
           248 |           248 |          00F8 | ø         
           216 |           216 |          00D8 | Ø         
           186 |           186 |          00BA | º         
           112 |           112 |          0070 | p         
            80 |            80 |          0050 | P         
           113 |           113 |          0071 | q         
            81 |            81 |          0051 | Q         
           114 |           114 |          0072 | r         
            82 |            82 |          0052 | R         
           115 |           115 |          0073 | s         
            83 |            83 |          0053 | S         
           223 |           223 |          00DF | ß         
           116 |           116 |          0074 | t         
            84 |            84 |          0054 | T         
           117 |           117 |          0075 | u         
            85 |            85 |          0055 | U         
           250 |           250 |          00FA | ú         
           218 |           218 |          00DA | Ú         
           249 |           249 |          00F9 | ù         
           217 |           217 |          00D9 | Ù         
           251 |           251 |          00FB | û         
           219 |           219 |          00DB | Û         
           252 |           252 |          00FC | ü         
           220 |           220 |          00DC | Ü         
           118 |           118 |          0076 | v         
            86 |            86 |          0056 | V         
           119 |           119 |          0077 | w         
            87 |            87 |          0057 | W         
           120 |           120 |          0078 | x         
            88 |            88 |          0058 | X         
           121 |           121 |          0079 | y         
            89 |            89 |          0059 | Y         
           253 |           253 |          00FD | ý         
           221 |           221 |          00DD | Ý         
           255 |           255 |          00FF | ÿ         
           122 |           122 |          007A | z         
            90 |            90 |          005A | Z         

Note that for each letter (including each accented form), the lowercase character sorts first, then the uppercase form of the same character sorts next. The accented forms sort consistently whether the secondary sort by codepoint is ascending or desceding.

In fact, it should be noted that no two codepoints in this codepage sort with the same weight; the collation of every code point w.r.t. the others is deterministic.

Next, there is a small block of two forms of the same character:

           254 |           254 |          00FE | þ         
           222 |           222 |          00DE | Þ         

As with the 4GL collation, this collation ends with non-breaking space:

           160 |           160 |          00A0 | <NBSP>    

I should note that the NULL character (codepoint 0x00) is missing from the PostgreSQL test output, because inserting this character from JDBC raised an error:

java.sql.BatchUpdateException: Batch entry 0 insert into tt1(codepoint_ori, encoded_cs, encoded_ci) values (0, '', '') was aborted: ERROR: invalid byte sequence for encoding "UTF8": 0x00  Call getNextException to see other errors in the batch.
    at org.postgresql.jdbc.BatchResultHandler.handleCompletion(BatchResultHandler.java:166)
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:492)
    at org.postgresql.jdbc.PgStatement.executeBatch(PgStatement.java:840)
    at org.postgresql.jdbc.PgPreparedStatement.executeBatch(PgPreparedStatement.java:1538)
    at sandbox.CollationTest.main(CollationTest.java:424)
Caused by: org.postgresql.util.PSQLException: ERROR: invalid byte sequence for encoding "UTF8": 0x00
    at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2440)
    at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2183)
    at org.postgresql.core.v3.QueryExecutorImpl.flushIfDeadlockRisk(QueryExecutorImpl.java:1299)
    at org.postgresql.core.v3.QueryExecutorImpl.sendQuery(QueryExecutorImpl.java:1324)
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:467)
    ... 3 more

So, to summarize this collation:

  1. Punctuation marks and symbols (including currency symbols and the SPACE character, which sorts 16th vs. tied for 1st in the 4GL basic collation)
  2. Non-printing/control characters in codepoint order from codepoint 1 through 31
  3. Non-printing characters in codepoint order from 127 through 159
  4. Digits 0-9, interleaved with fractions and subscripts
  5. Alphabetic characters, including accent/diacritical marks, sorting lowercase first for each letter variant
  6. Two forms of the "thorn" character
  7. Non-breaking space

Comparison Notes

Looking at the summaries for each collation provided above, we can see that major categories of characters are organized very differently between the basic 4GL collation and the default PostgreSQL collation.

The 4GL collates accented forms of a letter with the same weight as every other form of that letter, including the base letter.

All uppercase letters sort before all lowercase letters in the 4GL collation. PostgreSQL sorts each lowercase letter before each associated uppercase letter. So, it is important to note that in case-sensitive collations, these two systems sort uppercase and lowercase in the opposite directions.

One thing that cannot be seen here is that SQL NULL (i.e., not the NULL character 0x00, but the absence of data in the SQL sense or unknown value in the 4GL sense), sorts before non-NULL data in both systems.

#19 Updated by Ovidiu Maxiniuc almost 3 years ago

Eric,
The space character (0x20, or 32 decimal) is incorrectly listed in the 2nd position. This is because the compare operation is done 4GL style, after trimming so practically it is equal to empty string. To be sure it 'goes' to the right position, it must be suffixed (and possible prefixed) by some fixed character so that the correct compare is performed.

"a" + field + "z"

Since "a" and "z" are constant, they are 'ignored' when ordered. There is a catch here: the NULL (0x00) character which makes this string 2 char wide instead of 3 so the compound string will be order near 'z' in this case. I avoided this by comparing characters starting at 1 instead, and forcing NULL to be always first by convention.

#20 Updated by Ovidiu Maxiniuc almost 3 years ago

The listing you obtained is pretty much the one we currently have in locale/xx_XX@ISO8859-1@basic@FWD and src/com/goldencode/p2j/spi/ISO_8859_1Collator.java. With two differences:
  1. the 0x20 (space) issue I described in my previous note;
  2. characters 0x53 (S) and 0xDF (ß). Since 0xDF (ß) is a lowercase character (please correct me if I am wrong), according the ASCII/ISO standards, it should sort after the uppercase 0x53 (S). I do not understand why is this difference.

#21 Updated by Eric Faulhaber almost 3 years ago

Ovidiu Maxiniuc wrote:

The listing you obtained is pretty much the one we currently have in locale/xx_XX@ISO8859-1@basic@FWD and src/com/goldencode/p2j/spi/ISO_8859_1Collator.java. With two differences:
  1. the 0x20 (space) issue I described in my previous note;
  2. characters 0x53 (S) and 0xDF (ß). Since 0xDF (ß) is a lowercase character (please correct me if I am wrong), according the ASCII/ISO standards, it should sort after the uppercase 0x53 (S). I do not understand why is this difference.

The 4GL case-sensitive, basic collation for ISO-8859-1 (en_US) does not seem to consider 0xDF (ß) to represent a lowercase character, since it sorts ß between 0x53 (S) and 0x54 (T), not between 0x73 (s) and 0x74 (t). The letter represents a "sharp S" (Eszett) in German. Anglicized, it is replaced with ss (I'm not sure if SS is used, unless text is in all caps, since this letter isn't used at the start of words). So, if you consider the Anglicized form or the fact that it doesn't seem to exist as a capital letter, treating it as lowercase makes logical sense. Nevertheless, it is sorted by this collation in the uppercase letters.

Note that in the case-insensitive form of the 4GL's ISO-8859-1 basic collation (en_US), it sorts after s and S (this pair is weighted evenly) and before t and T (also weighted evenly). I am in the process of analyzing and documenting the case-insensitive form of the collation now.

#22 Updated by Eric Faulhaber almost 3 years ago

  • % Done changed from 0 to 100
  • Tracker changed from Feature to Support
  • Status changed from WIP to Review

This issue became primarily about documenting the differences in database collation between the legacy environment and the default PostgreSQL environment (i.e., without a FWD custom collation implementation). That documentation is now in a wiki page.

#23 Updated by Greg Shah almost 3 years ago

I reviewed the page and it is good for the initial purpose of this task. I've sent the customer the link.

We have another customer requirement for this task: to understand what are the implications of moving from OE with Windows 1252 to PostgreSQL UTF-8. This is not as direct as the previous analysis. I think there are multiple aspects:

  • What is the behavior difference between OE UTF-8 basic and PostgreSQL UTF-8 basic?
    • Ignore ICU.
    • For the current needs, we would focus on the charset (256 characters) from Windows 1252. We may also want to add any missing chars from 8859-1 and 8859-15 which are also charsets that we support today. The point here is to limit the analysis to the characters that currently matter.
    • Can we implement OE compatibility in PostgreSQL and/or H2? If so, how to do it/what needs to be done?
  • Using the understanding of both OE UTF-8 basic and PostgreSQL UTF-8 basic, we can summarize the differences between OE 1252 and OE UTF-8 basic and PostgreSQL UTF-8 basic. Some customers are considering a move to UTF-8 as part of a move to FWD. This will help customers understand the implications so they can assess this decision. The current customer that is planning this move would initially plan to move to an OE UTF-8 basic compatible implementation of PostgreSQL. But they may even need to use Amazon RDS or Aurora in the future, which would potentially drive them to the standard PostgreSQL UTF-8 basic. So while it is not strictly needed here, please summarize the implications of that move as well.

#24 Updated by Greg Shah almost 3 years ago

I would like to have the same 8859-1 analysis done for the other two codepages/collations which we are supporting (8859-15 and Windows 1252). This is not time-critical since we don't have a customer that needs it immediately.

We have an open issue in regard to documenting making PostgreSQL run with the proper 1252 charset. See 5183-10. The problem is that we can't properly make (or haven't figured out how to use) a standard Linux locale which supports Windows 1252 AND also uses the 1252 charset (instead of the LATIN1/8859-1 charset). Without that, there is no comparison environment in "standard locale" (i.e. no use of custom locales) PostgreSQL because the characters can't be properly imported. So we need to resolve this first before the analysis can be finished for 1252.

#25 Updated by Greg Shah almost 3 years ago

  • Status changed from Review to WIP
  • % Done changed from 100 to 70

#26 Updated by Ovidiu Maxiniuc almost 3 years ago

With latest FWD support for locale/collations, if the PG cluster is created like this:

sudo -u postgres pg_createcluster -p 5430 9.5 my_cluster --locale=nl_NL@cp1252_fwd_basic --encoding=win1252 -d /mnt/ssd/pd_data/9.5/my_cluster/

then we can have databases with following configuration:

mydb=# \l+
                                                                               List of databases
   Name    |   Owner   | Encoding |        Collate         |         Ctype          |   Access privileges   |  Size   | Tablespace |                Description                 
-----------+-----------+----------+------------------------+------------------------+-----------------------+---------+------------+--------------------------------------------
 mydb      | fwd_admin | WIN1252  | nl_NL@cp1252_fwd_basic | nl_NL@cp1252_fwd_basic |                       | 168 GB  | pg_default | 

Isn't this enough?

#27 Updated by Greg Shah almost 3 years ago

Isn't this enough?

It is enough for any customer that is able to implement the custom locales. It won't work for any customer that must use an environment where the PostgreSQL installation is maintained by someone else (e.g. Amazon RDS or Amazon Aurora). In such a scenario, the customer must know what the differences in collation will be for our 4GL-compatible 1252 and the "standard" 1252 seen in the standard PostgreSQL install. That is the analysis that Eric did for 8859-1. He can't do the same analysis for 1252 until we figure out the way to make PostgreSQL run with a standard 1252 encoding that has the 1252 charset instead of LATIN1. This is the issue in #5183-10 and we know of no solution yet.

#28 Updated by Ovidiu Maxiniuc almost 3 years ago

I do not think this is a problem. Please follow this steps:
  1. edit /etc/locale.gen (as root) and uncomment en_US.CP1252 CP1252 if not already uncommented;
  2. execute sudo locale-gen --no-purge --no-archive. It should identify the new lines and compile only the new locales. locale -a should print a list containing en_US.cp1252.
  3. create the cluster using
    sudo -u postgres pg_createcluster -p 5439 9.5 test_1252 --locale=en_US.cp1252 --encoding=win1252 -d /mnt/ssd/pd_data/9.5/test_1252/
    
  4. execute sudo pg_ctlcluster 9.5 test_1252 start. The new cluster should be visible (in grren) using pg_lsclusters as:
    Ver Cluster    Port Status Owner    Data directory                   Log file
    9.5 test_1252  5439 online postgres /mnt/ssd/pd_data/9.5/test_1252/  /var/log/postgresql/postgresql-9.5-test_1252.log
    
  5. connect (sudo -u postgres psql -p 5439) and create a sample database:
    postgres=# create database test_1252;
    CREATE DATABASE
    
  6. See how it looks from inside:
    postgres=# \l+
                                                                         List of databases
       Name    |  Owner   | Encoding |   Collate    |    Ctype     |   Access privileges   |  Size   | Tablespace |                Description                 
    -----------+----------+----------+--------------+--------------+-----------------------+---------+------------+--------------------------------------------
     postgres  | postgres | WIN1252  | en_US.cp1252 | en_US.cp1252 |                       | 6960 kB | pg_default | default administrative connection database
     template0 | postgres | WIN1252  | en_US.cp1252 | en_US.cp1252 | =c/postgres          +| 6841 kB | pg_default | unmodifiable empty database
               |          |          |              |              | postgres=CTc/postgres |         |            | 
     template1 | postgres | WIN1252  | en_US.cp1252 | en_US.cp1252 | =c/postgres          +| 6841 kB | pg_default | default template for new databases
               |          |          |              |              | postgres=CTc/postgres |         |            | 
     test_1252 | postgres | WIN1252  | en_US.cp1252 | en_US.cp1252 |                       | 6841 kB | pg_default | 
    (4 rows)
    

As you can see, there is nothing related to FWD.

Also available in: Atom PDF