Support #5183
database collation
70%
Related issues
History
#1 Updated by Eric Faulhaber about 3 years ago
A review of how FWD collates its database data has raised some questions as to whether we are doing it correctly for a given character set and language. The first combination we are reviewing is en_US
with the 4GL's default ISO-8859-1
character set.
#2 Updated by Eric Faulhaber about 3 years ago
The attached program (cp_coll.p
) was written as a starting point to review the 4GL's collation, inspired by the existing FWD tool, locale/gen-locale.p
. A small, local database (coll
) must be created on the target system, using the attached schema. The program must be run connected to the database. It expects one parameter: the base file name (no extension). A series of comparable output files will be created using this base name as a prefix. The files show the first 256 code points of a character set, both in code point order and in collated order by encoded character. Four files are created:
- output in code point order without any use within a database;
- output in code point order, where the characters of the code page have been stored in a temp-table;
- output in encoded character order, where the characters of the code page have been stored in a temp-table;
- output in encoded character order, where the characters of the code page have been stored in a persistent table (
chardata
table in thecoll
database).
#3 Updated by Greg Shah about 3 years ago
In the "Internationalize ABL Application" docs there are good details about codepages and all the character configuration tables (e.g. collation tables) and much more. I think we need to look at that content more carefully.
I noticed this comment:
"The figure shows that the collation table has two sections, one for case-insensitive sorts and one for case-sensitive sorts."
That suggests you may need to try multiple tests. One sorting upon a case-sensitive field and another on a case-insensitive field.
#4 Updated by Greg Shah about 3 years ago
Some additional research to consider:
1. Locales
A locale is the standard way for specifying the localized behavior of an internationalized (I18N) application or system. Locales are used by UNIX, Linux, Windows and also by application subsystems such as Java and PostgreSQL. In POSIX environments, a single environment variable LANG
sets the locale. The format of LANG
specifies the language, territory (usually a country) and a codeset (often called a codepage or "charset" or "character encoding"). In the US, users of Linux often will have en_US.UTF-8
as the LANG
. This article why is a charset included with the locale has some interesting history of why the locale is a combination of these different factors.
The locale controls the code page used, the collation settings (which on POSIX systems can be individually set by the LC_COLLATE
environment variable), number formatting (LC_NUMERIC
), time/date formatting (LC_TIME
), currency (LC_MONETARY
) and so on.
IEEE and the Open Group have standardized locales, including how to define them. The latest standard is POSIX.1.2017 (see Chapter 6 "Character Set" and Chapter 7 "Locale"). The older 1003.1 standard from 2004 has Chapter 6 Character Set and Chapter 7 Locale. The standard defines how locales should work including the definition file formats and the localedef
utility for turning text definitions into the binary files needed at runtime.
The POSIX support also defines an API for using that support. As an example, the standard C runtime includes this API.
Non-POSIX operating systems like Windows have their own locale support. For Windows, the support is proprietary instead of standards/POSIX based. For WIN32 locale support, see Locales and Languages. Note that as of Windows Vista, it is possible to create custom locales. It is hard to understand why it took that long to provide that capability.
2. 4GL Code Pages, convmap.cp
and Character Processing Tables
The 4GL does not honor the concept of a locale. It does not appear to use the operating system I18N features, except possibly the low level code page support. For most of the support it has its own proprietary implementation.
For text support, it provides a limited set of configurable capabilities using "Character Processing Tables":
- character attribute table (defines which code points are alphabetic vs not alphabetic)
- case table (maps how each character is uppercased and lowercased)
- collation (defines the sort order for case-sensitive character data and a separate order for case-insensitive character data)
- code page conversion (defines mappings for converting from some code pages into other code pages)
- word break (rules for how word indexes split character data into words)
Please note that the character processing tables seem to only have 256 slots. I see no documented way to customize the built-in character processing for multi-byte or Unicode code-pages. The 4GL does provide some support for both multi-byte and Unicode code-pages, but it seems that some level of extra application processing/awareness is needed. For example, the ChUI client does not directly support UTF-8 but can convert from/to UTF-8 in I/O. On the other hand, the GUI client does support UTF-8 as its internal code-page. The programmer must be aware of which language features and which 4GL client types support these multi-byte/Unicode code-pages.
The configuration definition exists in a convmap.dat
file, which is a list of supported code-pages and each code-page's associated character processing tables (excluding word break tables). These text definitions in (and included by) convmap.dat
are "compiled" into a convmap.cp
binary definition using proutil
.
At startup of the OE client, the configuration of the session is defined as follows:
- Use the default
convmap.cp
file from the installation; OR - Find a
convmap.cp
via thePROCONV
environment variable; OR - Find the
convmap.cp
from the-convmap
startup parameter.
There are separate settings to select which portions of the convmap.cp
are in effect.
- code-pages
- Code-pages can be specified for the internal processing of the 4GL and for some different I/O features. The current settings can be discovered using the
SESSION
handle and the the following read-only attributes:CPINTERNAL
(internal processing),CPLOG
(log output),CPPRINT
(printer output),CPRCODEIN
(r-code text segment reading),CPRCODEOUT
(r-code text segment writing),CPSTREAM
(stream I/O),CPTERM
(terminal output). - The specific code-page used for each of these can be set on the command line using
-cpinternal
,-cplog
,-cpprint
,-cprcodein
,-cprcodeout
and-cpstream
,-cpterm
. You cannot set these values after startup. - The documentation suggests that the following are defaults (if no command line overrides occur):
CPINTERNAL
defaults to ISO8859-1CPSTREAM
andCPLOG
default toIBM850
- none of the others are documented
- Please note that ISO8859-1 is not the same as Windows 1252 (which is really closer to ISO8859-15) and neither are the same as IBM850.
- Code-pages can be specified for the internal processing of the 4GL and for some different I/O features. The current settings can be discovered using the
- case processing table
- In the
convmap.cp
, for each listed code-page there can be 1 case processing table specified by name. - The case processing table in use defaults to
BASIC
. - This can be read (but not set) via
SESSION:CPCASE
. - At startup, it can be set via
-cpcase
.
- In the
- collation table
- In the
convmap.cp
, for each listed code-page there can be 1 collation table specified by name. - The collation table in use defaults to
BASIC
. - This can be read (but not set) via
SESSION:CPCOLL
. - At startup, it can be set via
-cpcoll
.
- In the
- character attribute table
- In the
convmap.cp
, for each listed code-page there can be 1 character attribute table. It does not have a name. - As far as I know it cannot be chosen at startup and it cannot be queried via a
SESSION
attribute.
- In the
- code-page conversion tables
- It seems that these are mostly for converting from some related code-page into a main code-page.
- As such, when they appear in
convmap.cp
, they may not be configured for the main code-pagem but rather on the related ones. - The documentation is not clear and I haven't investigated this.
- These cannot be set at startup and there is no way to query it from the
SESSION
handle.
- language
- The value reported by the
CURRENT-LANGUAGE
built-in function controls which text segment of the r-code strings are read from at runitme. - It can be set at startup using
-lng
or at runtime using theCURRENT-LANGUAGE
statement. See #3817 for more details. - As far as I know, it does not otherwise affect character data processing.
- The value reported by the
The other parts of a standard locale are either not present in the 4GL (e.g. currency support, most date/time support...) OR there is some limited proprietary feature. For example:
- date/time
- The date order (e.g. usually set to
MDY
orDMY
) can be set via command line-d
orSESSION:DATE-FORMAT
. - The Y2K windowing year (start of 100 year range for yy formats) is set with
-yy
orSESSION:YEAR-OFFSET
. It defaults to 1950. - Whether
EXPORT
,PUT UNFORMATTED
andMESSAGE
output 4 digit years instead of the default 2 can be set via-yr4def
on the command line.
- The date order (e.g. usually set to
- numbers
- The
SESSION:NUMERIC-FORMAT
will report "American" orEuropean" and defaults to "American". It can be written at runtime and also changed via @SESSION:SET-NUMERIC-FORMAT()
method. -E
can force European number formatting from the command line.- The decimal point character can be set with
-numdec
and can be read (but not written) usingSESSION:NUMERIC-DECIMAL-POINT
. - The group separator character can be set with
-numsep
and can be read (but not written) usingSESSION:NUMERIC-SEPARATOR
.
- The
I think that is the extent of the configurable I18N support for 4GL sessions.
The database character processing is handled a bit differently. It still seems to have a concept of the character processing tables, but they are defined in .df
files. You pick an empty database for the core "language" (though I think it is more correct to say that the code-page is chosen) and then load in any customizations via a .df
. From there you can add your tables/fields/indexes. There is also a process by which one can change code pages for an existing database. Once set (or modified), all other character processing at the database is done implicitly.
The following are 4GL language features which can exhibit collation behavior:
- The relational comparison operators (
LT
,LTE
,GT
,GTE
) when used with text data. - The
COMPARE()
builtin function. - Explicit query sorting using a
BY
clause. - Overriding collation table usage in a query (
FOR
,OPEN QUERY
,DO/REPEAT PRESELECT
) with aCOLLATE
clause.
• Implicit query sorting from an index.
The origin of the collation tables for the 4GL is completely unclear. An important aspect of this task is to see if we can find a proper match to some industry standard so that we don't have to maintain the custom locale for PostgreSQL.
Note that there are aspects of specific language collation in real language use which cannot be represented as a table of 256 single character sort order positions. Such an approach can order individual characters but it misses more complex multi-character ordering that is possible. POSIX collation is more complex and can handle some amount of this real world complexity.
#5 Updated by Greg Shah about 3 years ago
- Related to Support #4080: AWS Aurora support added
#6 Updated by Eric Faulhaber about 3 years ago
I've committed a newer version of the collation data collector program to 3821c/12160.
The locale/cp_coll/
directory contains the program, the exported schema needed when creating the small test database (coll.df
), and a readme with setup and use instructions.
#7 Updated by Eric Faulhaber almost 3 years ago
- Assignee set to Eric Faulhaber
- Status changed from New to WIP
#8 Updated by Eric Faulhaber almost 3 years ago
I've reviewed the output of the cp_coll.p
program from several environments:
- en_US / 8859-1
- Linux
- Windows
- en_GB / 8859-1
- Linux
- Windows
- nl_NL / CP1252
- Windows
For the Windows runs, both GUI and ChUI samples were captured. Only ChUI samples were available for the Linux runs.
I found that within a locale, character set, and case-sensitivity, the collation of temp-table fields seems to be stable. That is, the characters seem to sort in the same order on every run.
The collation of persistent tables is not as stable. From run to run, we can see that bands of characters sort in a stable order, but the individual characters within those bands sort differently from run to run. This indicates that the characters within a band compare equally during a sort operation, and something else (rowid, it seems) governs the sort within those bands.
For case-sensitive fields, the same character in the same case, but with different diacritical marks, can sort interchangeably. For example, see the following two bands of characters from two persistent table samples with lang/locale en_US
and cpstream/cpinternal of ISO8959-1
:
Sample 1:
codepoint | hex | character | lower | upper | rowid -----------+-----+-----------+----------+----------+-------------------- ... 65 | 41 | A | a | A | 0x0000000000001b21 194 | C2 | Â | â | Â | 0x000000000000218d 195 | C3 | Ã | ã | Ã | 0x000000000000218e 196 | C4 | Ä | ä | Ä | 0x000000000000218f 192 | C0 | À | à | À | 0x0000000000002d3e 193 | C1 | Á | á | Á | 0x0000000000002d3f ... 224 | E0 | à | à | À | 0x0000000000000ad4 225 | E1 | á | á | Á | 0x0000000000000ad5 226 | E2 | â | â | Â | 0x0000000000000ad6 227 | E3 | ã | ã | Ã | 0x0000000000000ad7 228 | E4 | ä | ä | Ä | 0x0000000000000ad8 97 | 61 | a | a | A | 0x0000000000001641 ...
Sample 2:
codepoint | hex | character | lower | upper | rowid -----------+-----+-----------+----------+----------+-------------------- ... 192 | C0 | À | à | À | 0x0000000000001660 193 | C1 | Á | á | Á | 0x0000000000001661 194 | C2 | Â | â | Â | 0x0000000000001662 195 | C3 | Ã | ã | Ã | 0x0000000000001663 196 | C4 | Ä | ä | Ä | 0x0000000000001664 65 | 41 | A | a | A | 0x00000000000024e1 ... 97 | 61 | a | a | A | 0x0000000000001b21 226 | E2 | â | â | Â | 0x000000000000218d 227 | E3 | ã | ã | Ã | 0x000000000000218e 228 | E4 | ä | ä | Ä | 0x000000000000218f 224 | E0 | à | à | À | 0x0000000000002d3e 225 | E1 | á | á | Á | 0x0000000000002d3f ...
For case-insensitive fields, the same character, but in a different case or with different diacritical marks, can sort interchangeably. For example, see the following band of characters from two samples:
Sample 1:
codepoint | hex | character | lower | upper | rowid -----------+-----+-----------+----------+----------+-------------------- ... 224 | E0 | à | à | À | 0x0000000000000ad4 225 | E1 | á | á | Á | 0x0000000000000ad5 226 | E2 | â | â | Â | 0x0000000000000ad6 227 | E3 | ã | ã | Ã | 0x0000000000000ad7 228 | E4 | ä | ä | Ä | 0x0000000000000ad8 97 | 61 | a | a | A | 0x0000000000001641 65 | 41 | A | a | A | 0x0000000000001b21 194 | C2 | Â | â | Â | 0x000000000000218d 195 | C3 | Ã | ã | Ã | 0x000000000000218e 196 | C4 | Ä | ä | Ä | 0x000000000000218f 192 | C0 | À | à | À | 0x0000000000002d3e 193 | C1 | Á | á | Á | 0x0000000000002d3f ...
codepoint | hex | character | lower | upper | rowid -----------+-----+-----------+----------+----------+-------------------- ... 192 | C0 | À | à | À | 0x0000000000001660 193 | C1 | Á | á | Á | 0x0000000000001661 194 | C2 | Â | â | Â | 0x0000000000001662 195 | C3 | Ã | ã | Ã | 0x0000000000001663 196 | C4 | Ä | ä | Ä | 0x0000000000001664 97 | 61 | a | a | A | 0x0000000000001b21 226 | E2 | â | â | Â | 0x000000000000218d 227 | E3 | ã | ã | Ã | 0x000000000000218e 228 | E4 | ä | ä | Ä | 0x000000000000218f 65 | 41 | A | a | A | 0x00000000000024e1 224 | E0 | à | à | À | 0x0000000000002d3e 225 | E1 | á | á | Á | 0x0000000000002d3f ...
Regardless of case-sensitivity, some bands of other characters seem to sort interchangeably. The following samples happen to be from a case-insensitive sort, but the same behavior can be seen within case-sensitive sorts.
Sample 1:
codepoint | hex | character | lower | upper | rowid -----------+-----+-----------+----------+----------+-------------------- 32 | 20 | <SP> | <SP> | <SP> | 0x00000000000024e0 0 | 00 | <NULL> | <NULL> | <NULL> | 0x0000000000002a00 ... 215 | D7 | × | × | × | 0x0000000000000acb 159 | 9F | <?> | <?> | <?> | 0x0000000000001dff ...
Sample 1:
codepoint | hex | character | lower | upper | rowid -----------+-----+-----------+----------+----------+-------------------- 0 | 00 | <NULL> | <NULL> | <NULL> | 0x0000000000000160 32 | 20 | <SP> | <SP> | <SP> | 0x0000000000002a00 ... 159 | 9F | <?> | <?> | <?> | 0x000000000000165f 215 | D7 | × | × | × | 0x0000000000001677 ...
The characters in each of these bands seem to evaluate to the same weight in terms of collation, and so the collation within those bands appears to be undefined. We do not see this same variability within the temp-tables samples. All of the above samples of characters sort as follows in all of the temp-table samples from the same environment...
Case-sensitive:
codepoint | hex | character | lower | upper | rowid -----------+-----+-----------+----------+----------+-------------------- 0 | 00 | <NULL> | <NULL> | <NULL> | 0x0000000000000900 32 | 20 | <SP> | <SP> | <SP> | 0x0000000000000920 ... 65 | 41 | A | a | A | 0x0000000000000941 192 | C0 | À | à | À | 0x0000000000000a1f 193 | C1 | Á | á | Á | 0x0000000000000a20 194 | C2 | Â | â | Â | 0x0000000000000a21 195 | C3 | Ã | ã | Ã | 0x0000000000000a22 196 | C4 | Ä | ä | Ä | 0x0000000000000a23 ... 97 | 61 | a | a | A | 0x0000000000000961 224 | E0 | à | à | À | 0x0000000000000a3f 225 | E1 | á | á | Á | 0x0000000000000a40 226 | E2 | â | â | Â | 0x0000000000000a41 227 | E3 | ã | ã | Ã | 0x0000000000000a42 228 | E4 | ä | ä | Ä | 0x0000000000000a43 ... 159 | 9F | <?> | <?> | <?> | 0x000000000000099f 215 | D7 | × | × | × | 0x0000000000000a36 ...
Case-insensitive:
codepoint | hex | character | lower | upper | rowid -----------+-----+-----------+----------+----------+-------------------- 0 | 00 | <NULL> | <NULL> | <NULL> | 0x0000000000000900 32 | 20 | <SP> | <SP> | <SP> | 0x0000000000000920 ... 65 | 41 | A | a | A | 0x0000000000000941 97 | 61 | a | a | A | 0x0000000000000961 192 | C0 | À | à | À | 0x0000000000000a1f 193 | C1 | Á | á | Á | 0x0000000000000a20 194 | C2 | Â | â | Â | 0x0000000000000a21 195 | C3 | Ã | ã | Ã | 0x0000000000000a22 196 | C4 | Ä | ä | Ä | 0x0000000000000a23 224 | E0 | à | à | À | 0x0000000000000a3f 225 | E1 | á | á | Á | 0x0000000000000a40 226 | E2 | â | â | Â | 0x0000000000000a41 227 | E3 | ã | ã | Ã | 0x0000000000000a42 228 | E4 | ä | ä | Ä | 0x0000000000000a43 ... 159 | 9F | <?> | <?> | <?> | 0x000000000000099f 215 | D7 | × | × | × | 0x0000000000000a36 ...
I think the relative stability of the temp-table runs is just an illusion, though, because it seems that the "tie-breaker" when two characters sort at the same weight is the rowid assigned to the record. Note that the program generates the records three times and sorts them three times. If we just generate the records once and query them three times, the sort order is identical across the queries, for both temp-tables and persistent tables. Thus, the variability seems to be introduced in the record creation, rather than in the query sort.
The working theory is that the algorithm which assigns rowids for temp-tables is more likely to use sequential values as the records are created, whereas rowids assigned for persistent tables tend to vary more, so these can get out of sequence, even if records are created in the same order. Thus, certain bands of characters will appear to have a stable rowid sort order from run to run with temp-tables, whereas these same characters may appear to have a less stable rowid sort order from run to run with persistent tables.
Bearing in mind the above, I think it is safe to use the temp-table results as the basis for some conclusions about collation. These seem to be the "most canonical" versions of the results, as the rowid order tracks the codepoint order, and I think the codepoint order is a reasonable secondary sort criterion.
#9 Updated by Eric Faulhaber almost 3 years ago
- Related to Feature #5491: custom locale/collation implementations added
#10 Updated by Eric Faulhaber almost 3 years ago
- File CollationTest.java added
Ovidiu, I need your help.
In any of your encoding testing lately, have you been able to create a database cluster which uses CP1252
as the encoding
, lc_ctype
, and lc_collate
type? I don't mean a custom locale; I need to do this using the stock Linux locales.
When I try the following, I get an error:
ecf@ecf:~/eclipse-workspace/sandbox$ sudo pg_createcluster 9.5 test_en_US_cp1252b --locale=en_US.cp1252 --encoding CP1252 Creating new cluster 9.5/test_en_US_cp1252b ... config /etc/postgresql/9.5/test_en_US_cp1252b data /var/lib/postgresql/9.5/test_en_US_cp1252b locale en_US.cp1252 could not change directory to "/home/ecf/eclipse-workspace/sandbox": Permission denied initdb: "CP1252" is not a valid server encoding name Error: initdb failed
So, this makes sense, since --encoding
must refer to one of the character sets PostgreSQL supports. So, I tried:
ecf@ecf:~/eclipse-workspace/sandbox$ sudo pg_createcluster 9.5 test_en_US_cp1252b --locale=en_US.cp1252 --encoding WIN1252 Creating new cluster 9.5/test_en_US_cp1252b ... config /etc/postgresql/9.5/test_en_US_cp1252b data /var/lib/postgresql/9.5/test_en_US_cp1252b locale en_US.cp1252 could not change directory to "/home/ecf/eclipse-workspace/sandbox": Permission denied initdb: encoding mismatch The encoding you selected (WIN1252) and the encoding that the selected locale uses (LATIN1) do not match. This would lead to misbehavior in various character string processing functions. Rerun initdb and either do not specify an encoding explicitly, or choose a matching combination. Error: initdb failed
I could only get it to work if I leave off the --encoding
parameter, but then it uses LATIN1
as the character set:
fwd=# \l+ List of databases Name | Owner | Encoding | Collate | Ctype | Access privileges | Size | Tablespace | Description -----------+----------+----------+--------------+--------------+-----------------------+---------+------------+-------------------------------------------- fwd | postgres | LATIN1 | en_US.cp1252 | en_US.cp1252 | | 7048 kB | pg_default | postgres | postgres | LATIN1 | en_US.cp1252 | en_US.cp1252 | | 7040 kB | pg_default | default administrative connection database template0 | postgres | LATIN1 | en_US.cp1252 | en_US.cp1252 | =c/postgres +| 6865 kB | pg_default | unmodifiable empty database | | | | | postgres=CTc/postgres | | | template1 | postgres | LATIN1 | en_US.cp1252 | en_US.cp1252 | =c/postgres +| 6865 kB | pg_default | default template for new databases | | | | | postgres=CTc/postgres | | | (4 rows)
Of course, this leads to a familiar-ish problem when I try to insert a record with an invalid character for LATIN1
from JDBC:
ecf@ecf:~/eclipse-workspace/sandbox$ java -classpath bin:/home/ecf/projects/p2j/build/lib/postgresql-42.2.5.jar sandbox.CollationTest cp1252 1 5437 java.sql.BatchUpdateException: Batch entry 127 insert into tt1(codepoint_ori, encoded_cs, encoded_ci) values (128, '€', '€') was aborted: ERROR: character with byte sequence 0xe2 0x82 0xac in encoding "UTF8" has no equivalent in encoding "LATIN1" Call getNextException to see other errors in the batch. at org.postgresql.jdbc.BatchResultHandler.handleCompletion(BatchResultHandler.java:166) at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:492) at org.postgresql.jdbc.PgStatement.executeBatch(PgStatement.java:840) at org.postgresql.jdbc.PgPreparedStatement.executeBatch(PgPreparedStatement.java:1538) at sandbox.CollationTest.main(CollationTest.java:357) Caused by: org.postgresql.util.PSQLException: ERROR: character with byte sequence 0xe2 0x82 0xac in encoding "UTF8" has no equivalent in encoding "LATIN1" at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2440) at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2183) at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:481) ... 3 more
I don't have the same problem when the target database is UTF8:
Name | Owner | Encoding | Collate | Ctype | Access privileges | Size | Tablespace | Description -----------+----------+----------+-------------+-------------+-----------------------+---------+------------+-------------------------------------------- fwd | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | | 7272 kB | pg_default |
My Java test program (attached for reference) loops from code points 1 to 255, generating a String
in the specified character set (in this case cp1252
) from each code point. Each string is inserted via JDBC into a temporary table with 3 columns:
- the original code point (so, just an integer from 1 to 255);
- a case-sensitive character (type
text
); - a case-insensitive character (type
citext
).
This is done in the fwd
test database listed above (see the class javadoc for how I set this up).
When we get to the Euro character, it blows up because of the LATIN1
encoding. Is this a bug in the Linux en_US.cp1252
locale? I don't understand why it pairs LATIN1
(i.e., ISO-8859-1
) encoding with this locale. Isn't the point of the CP1252
locale to manage the characters from the CP1252
code page?
Do you see something obvious I'm doing wrong here, or is what I'm trying to do just not feasible? Thanks.
#11 Updated by Ovidiu Maxiniuc almost 3 years ago
Eric,
I have not tried to use the en_US.cp1252
locale until now. Only the custom locales. However, when I try to do it now, I get some strange errors:
om@ProDesk400G3MT:/home/om$ sudo pg_createcluster 9.5 test_en_US_cp1252b --locale=en_US.CP1252 --encoding CP1252 Creating new PostgreSQL cluster 9.5/test_en_US_cp1252b ... /usr/lib/postgresql/9.5/bin/initdb -D /var/lib/postgresql/9.5/test_en_US_cp1252b --auth-local peer --auth-host md5 --encoding CP1252 --locale en_US.CP1252 The files belonging to this database system will be owned by user "postgres". This user must also own the server process. initdb: invalid locale name "en_US.CP1252" Error: initdb failed
Do I need to add it manually to my system?
#12 Updated by Roger Borrello almost 3 years ago
A general question about the use of a custom locale. The database for an existing application would need to be dropped as well as the cluster, then the cluster rebuilt using the custom locale, and the database recreated?
Or is there a migration possible?
#13 Updated by Eric Faulhaber almost 3 years ago
Ovidiu Maxiniuc wrote:
Eric,
I have not tried to use theen_US.cp1252
locale until now. Only the custom locales. However, when I try to do it now, I get some strange errors:
[...]Do I need to add it manually to my system?
Most likely. The non-UTF8 locales I needed for my testing were not installed by default on my Ubuntu 16.04 system, so I used the locale-gen
utility (first adding locales for CP1252, ISO-8859-1, and ISO-8859-15 for en_US, en_GB, and nl_NL to the /etc/locale.gen
config file) to generate them. Unfortunately, this blows away our custom locales, so I had to re-generate those afterward.
These were the specific lines I added to /etc/locale.gen
, based on some Googling:
en_GB.CP1252 CP1252 en_GB ISO-8859-1 en_GB.ISO-8859-15 ISO-8859-15 en_US.CP1252 CP1252 en_US ISO-8859-1 en_US.ISO-8859-15 ISO-8859-15 nl_NL.CP1252 CP1252 nl_NL ISO-8859-1 nl_NL.ISO-8859-15 ISO-8859-15
#14 Updated by Eric Faulhaber almost 3 years ago
Roger Borrello wrote:
A general question about the use of a custom locale. The database for an existing application would need to be dropped as well as the cluster, then the cluster rebuilt using the custom locale, and the database recreated?
Yes. You don't technically have to drop the old database and cluster, but you might as well, if it was based on the old en_US@p2j_basic
custom locale.
Or is there a migration possible?
Not at this time.
See also the updates to these areas of the documentation:
#15 Updated by Ovidiu Maxiniuc almost 3 years ago
Eric Faulhaber wrote:
These were the specific lines I added to
/etc/locale.gen
, based on some Googling:
[...]
I added those line (one or two were already there and I uncommented them). But I am stuck here. locale -a
does not list them. Next, I tried
$ sudo locale-gen en_US.CP1252 Error: 'en_US.CP1252' is not a supported language or locale
I
grep
-ped /usr/share/i18n/SUPPORTED
but I do not have any occurrences of 1252
.#16 Updated by Eric Faulhaber almost 3 years ago
Do you have /usr/share/i18n/charmaps/CP1252.gz
? If so, gunzip
it. I think locale-gen
just runs localedef
under the covers, so it has to be able to read the charmap. But, I would have thought you already have this unzipped from generating the custom locales...
I
grep
-ped/usr/share/i18n/SUPPORTED
but I do not have any occurrences of1252
.
Hm, I don't either, but somehow I got locale-gen
to create it. I may have used en_US CP1252
(instead of en_US.CP1252 CP1252
) in my /etc/locale.gen
file. I wonder if this is related to my problems now.
#17 Updated by Greg Shah almost 3 years ago
locale -a
does not list them
If localedef
was successful there should be new locale files. If that is the case, then make sure the permissions are OK. You can try sudo locale -a
and if there is a difference, then permissions/ownership is the problem.
#18 Updated by Eric Faulhaber almost 3 years ago
- File cp_coll_en_US_iso8859-1_to_basic_iso8859-1_basic_win32_chui_11.6_codepoint_order_20210702.txt added
- File postgresql_iso-8859-1_to_latin1_case-sensitive_en_US.iso88591_20210702_2.txt added
- File postgresql_iso-8859-1_to_latin1_case-sensitive_en_US.iso88591_20210702_1.txt added
- File postgresql_iso-8859-1_to_latin1_codepoint_en_US.iso88591_20210702_1.txt added
- File cp_coll_en_US_iso8859-1_to_basic_iso8859-1_basic_win32_chui_11.6_perm_case_sensitive_20210702_2.txt added
- File cp_coll_en_US_iso8859-1_to_basic_iso8859-1_basic_win32_chui_11.6_perm_case_sensitive_20210702_1.txt added
Comparison of OE ISO-8859-1 basic collation vs. PostgreSQL ISO-8859-1 default collation (en_US)¶
I created ISO-8859-1 databases in OE and PostgreSQL and loaded the first 255 encoded characters into a table with a codepoint (integer) field/column, a case-sensitive character field, and a case-insensitive character field. Then I sorted by one of the character fields.
The generated files are attached:
- 4GL
- sorted by codepoint (
cp_coll_en_US_iso8859-1_to_basic_iso8859-1_basic_win32_chui_11.6_codepoint_order_20210702.txt
) - sorted first by case-sensitive character ascending, then by codepoint ascending (
cp_coll_en_US_iso8859-1_to_basic_iso8859-1_basic_win32_chui_11.6_perm_case_sensitive_20210702_1.txt
) - sorted first by case-sensitive character ascending, then by codepoint descending (
cp_coll_en_US_iso8859-1_to_basic_iso8859-1_basic_win32_chui_11.6_perm_case_sensitive_20210702_2.txt
)
- sorted by codepoint (
- PostgreSQL
- sorted by codepoint (
postgresql_iso-8859-1_to_latin1_codepoint_en_US.iso88591_20210702_1.txt
) - sorted first by case-sensitive character ascending, then by codepoint ascending (
postgresql_iso-8859-1_to_latin1_case-sensitive_en_US.iso88591_20210702_1.txt
) - sorted first by case-sensitive character ascending, then by codepoint descending (
postgresql_iso-8859-1_to_latin1_case-sensitive_en_US.iso88591_20210702_2.txt
)
- sorted by codepoint (
Case-Sensitive¶
4GL¶
The 4GL basic collation for this codepage sorts nearly in codepoint order up through code point 64:
codepoint src | codepoint tgt | hex codepoint | character ---------------+---------------+---------------+----------- 0 | 0 | 00 | <NULL> 32 | 32 | 20 | <SP> 1 | 1 | 01 | <?> 2 | 2 | 02 | <?> 3 | 3 | 03 | <?> 4 | 4 | 04 | <?> 5 | 5 | 05 | <?> 6 | 6 | 06 | <?> 7 | 7 | 07 | <?> 8 | 8 | 08 | <?> 9 | 9 | 09 | <?> 10 | 10 | 0A | <LF> 11 | 11 | 0B | <?> 12 | 12 | 0C | <?> 13 | 13 | 0D | <CR> 14 | 14 | 0E | <?> 15 | 15 | 0F | <?> 16 | 16 | 10 | <?> 17 | 17 | 11 | <?> 18 | 18 | 12 | <?> 19 | 19 | 13 | <?> 20 | 20 | 14 | <?> 21 | 21 | 15 | <?> 22 | 22 | 16 | <?> 23 | 23 | 17 | <?> 24 | 24 | 18 | <?> 25 | 25 | 19 | <?> 26 | 26 | 1A | <?> 27 | 27 | 1B | <?> 28 | 28 | 1C | <?> 29 | 29 | 1D | <?> 30 | 30 | 1E | <?> 31 | 31 | 1F | <?> 33 | 33 | 21 | ! 34 | 34 | 22 | " 35 | 35 | 23 | # 36 | 36 | 24 | $ 37 | 37 | 25 | % 38 | 38 | 26 | & 39 | 39 | 27 | ' 40 | 40 | 28 | ( 41 | 41 | 29 | ) 42 | 42 | 2A | * 43 | 43 | 2B | + 44 | 44 | 2C | , 45 | 45 | 2D | - 46 | 46 | 2E | . 47 | 47 | 2F | / 48 | 48 | 30 | 0 49 | 49 | 31 | 1 50 | 50 | 32 | 2 51 | 51 | 33 | 3 52 | 52 | 34 | 4 53 | 53 | 35 | 5 54 | 54 | 36 | 6 55 | 55 | 37 | 7 56 | 56 | 38 | 8 57 | 57 | 39 | 9 58 | 58 | 3A | : 59 | 59 | 3B | ; 60 | 60 | 3C | < 61 | 61 | 3D | = 62 | 62 | 3E | > 63 | 63 | 3F | ? 64 | 64 | 40 | @
This includes many non-printing/control characters, punctuation, symbols, and digits 0-9. Only the SPACE
character (codepoint 32) is moved to sort first (with the same weight as NULL
).
The 4GL sorts the capital letters of the alphabet next, including all alphabetical characters with diacritical marks:
65 | 65 | 41 | A 192 | 192 | C0 | À 193 | 193 | C1 | Á 194 | 194 | C2 | Â 195 | 195 | C3 | Ã 196 | 196 | C4 | Ä 66 | 66 | 42 | B 67 | 67 | 43 | C 199 | 199 | C7 | Ç 68 | 68 | 44 | D 69 | 69 | 45 | E 200 | 200 | C8 | È 201 | 201 | C9 | É 202 | 202 | CA | Ê 203 | 203 | CB | Ë 70 | 70 | 46 | F 71 | 71 | 47 | G 72 | 72 | 48 | H 73 | 73 | 49 | I 204 | 204 | CC | Ì 205 | 205 | CD | Í 206 | 206 | CE | Î 207 | 207 | CF | Ï 74 | 74 | 4A | J 75 | 75 | 4B | K 76 | 76 | 4C | L 77 | 77 | 4D | M 78 | 78 | 4E | N 209 | 209 | D1 | Ñ 79 | 79 | 4F | O 210 | 210 | D2 | Ò 211 | 211 | D3 | Ó 212 | 212 | D4 | Ô 213 | 213 | D5 | Õ 214 | 214 | D6 | Ö 80 | 80 | 50 | P 81 | 81 | 51 | Q 82 | 82 | 52 | R 83 | 83 | 53 | S 223 | 223 | DF | ß 84 | 84 | 54 | T 85 | 85 | 55 | U 217 | 217 | D9 | Ù 218 | 218 | DA | Ú 219 | 219 | DB | Û 220 | 220 | DC | Ü 86 | 86 | 56 | V 87 | 87 | 57 | W 88 | 88 | 58 | X 89 | 89 | 59 | Y 221 | 221 | DD | Ý 90 | 90 | 5A | Z
It is worth noting that for each letter, variants with accents/diacritical marks sort with the same weight. If we add a secondary sort by codepoint, we can see this. For example, sorting primarily by character ascending, and secondarily by codepoint ascending yields the following for variants of A
:
65 | 65 | 41 | A 192 | 192 | C0 | À 193 | 193 | C1 | Á 194 | 194 | C2 | Â 195 | 195 | C3 | Ã 196 | 196 | C4 | Ä
Reversing the secondary sort to descending by codepoint yields:
196 | 196 | C4 | Ä 195 | 195 | C3 | Ã 194 | 194 | C2 | Â 193 | 193 | C1 | Á 192 | 192 | C0 | À 65 | 65 | 41 | A
Next, there is a small block of symbols:
91 | 91 | 5B | [ 92 | 92 | 5C | \ 93 | 93 | 5D | ] 94 | 94 | 5E | ^ 95 | 95 | 5F | _ 96 | 96 | 60 | `
This is followed by the lowercase alphabet. The same rule regarding accented characters applies:
97 | 97 | 61 | a 224 | 224 | E0 | à 225 | 225 | E1 | á 226 | 226 | E2 | â 227 | 227 | E3 | ã 228 | 228 | E4 | ä 98 | 98 | 62 | b 99 | 99 | 63 | c 231 | 231 | E7 | ç 100 | 100 | 64 | d 101 | 101 | 65 | e 232 | 232 | E8 | è 233 | 233 | E9 | é 234 | 234 | EA | ê 235 | 235 | EB | ë 102 | 102 | 66 | f 103 | 103 | 67 | g 104 | 104 | 68 | h 105 | 105 | 69 | i 236 | 236 | EC | ì 237 | 237 | ED | í 238 | 238 | EE | î 239 | 239 | EF | ï 106 | 106 | 6A | j 107 | 107 | 6B | k 108 | 108 | 6C | l 109 | 109 | 6D | m 110 | 110 | 6E | n 241 | 241 | F1 | ñ 111 | 111 | 6F | o 242 | 242 | F2 | ò 243 | 243 | F3 | ó 244 | 244 | F4 | ô 245 | 245 | F5 | õ 246 | 246 | F6 | ö 112 | 112 | 70 | p 113 | 113 | 71 | q 114 | 114 | 72 | r 115 | 115 | 73 | s 116 | 116 | 74 | t 117 | 117 | 75 | u 249 | 249 | F9 | ù 250 | 250 | FA | ú 251 | 251 | FB | û 252 | 252 | FC | ü 118 | 118 | 76 | v 119 | 119 | 77 | w 120 | 120 | 78 | x 121 | 121 | 79 | y 253 | 253 | FD | ý 255 | 255 | FF | ÿ 122 | 122 | 7A | z
This is followed by a jumble of additional symbols (including currency), non-printing characters, a handful more alphabetic characters with accents, and numeric fractions and subscripts in no discernable order.
123 | 123 | 7B | { 124 | 124 | 7C | | 125 | 125 | 7D | } 126 | 126 | 7E | ~ 127 | 127 | 7F | 229 | 229 | E5 | å 197 | 197 | C5 | Å 230 | 230 | E6 | æ 198 | 198 | C6 | Æ 248 | 248 | F8 | ø 163 | 163 | A3 | £ 216 | 216 | D8 | Ø 159 | 159 | 9F | <?> 215 | 215 | D7 | × 191 | 191 | BF | ¿ 189 | 189 | BD | ½ 188 | 188 | BC | ¼ 161 | 161 | A1 | ¡ 171 | 171 | AB | « 187 | 187 | BB | » 128 | 128 | 80 | <?> 129 | 129 | 81 | <?> 130 | 130 | 82 | <?> 131 | 131 | 83 | <?> 132 | 132 | 84 | <?> 169 | 169 | A9 | © 133 | 133 | 85 | <?> 134 | 134 | 86 | <?> 135 | 135 | 87 | <?> 136 | 136 | 88 | <?> 162 | 162 | A2 | ¢ 165 | 165 | A5 | ¥ 137 | 137 | 89 | <?> 138 | 138 | 8A | <?> 139 | 139 | 8B | <?> 140 | 140 | 8C | <?> 141 | 141 | 8D | <?> 142 | 142 | 8E | <?> 143 | 143 | 8F | <?> 174 | 174 | AE | ® 172 | 172 | AC | ¬ 144 | 144 | 90 | <?> 145 | 145 | 91 | <?> 146 | 146 | 92 | <?> 147 | 147 | 93 | <?> 148 | 148 | 94 | <?> 149 | 149 | 95 | <?> 150 | 150 | 96 | <?> 164 | 164 | A4 | ¤ 240 | 240 | F0 | ð 208 | 208 | D0 | Ð 151 | 151 | 97 | <?> 152 | 152 | 98 | <?> 153 | 153 | 99 | <?> 154 | 154 | 9A | <?> 155 | 155 | 9B | <?> 166 | 166 | A6 | ¦ 156 | 156 | 9C | <?> 170 | 170 | AA | ª 186 | 186 | BA | º 181 | 181 | B5 | µ 254 | 254 | FE | þ 222 | 222 | DE | Þ 175 | 175 | AF | ¯ 180 | 180 | B4 | ´ 173 | 173 | AD | 177 | 177 | B1 | ± 157 | 157 | 9D | <?> 190 | 190 | BE | ¾ 182 | 182 | B6 | ¶ 167 | 167 | A7 | § 247 | 247 | F7 | ÷ 184 | 184 | B8 | ¸ 176 | 176 | B0 | ° 168 | 168 | A8 | ¨ 183 | 183 | B7 | · 185 | 185 | B9 | ¹ 179 | 179 | B3 | ³ 178 | 178 | B2 | ² 158 | 158 | 9E | <?>
Non-breaking space is always sorted last.
160 | 160 | A0 | <NBSP>
So, to summarize the 4GL basic collation for ISO-8859-1 at a high level:
NULL
andSPACE
first, interchangeably- Non-printing/control characters, punctuation, symbols, and digits in ASCII order through codepoint 64
- Uppercase alphabetic letters, including those with accents/diacritical marks (which each have the same sorting weight for the same, base letter)
- A few more symbols
- Lowercase alphabetic letters, including those with accents/diacritical marks (which each have the same sorting weight for the same, base letter)
- Additional symbols (including currency) and non-printing characters
- Non-breaking space last
PostgreSQL¶
The database was created in a cluster built using the en_US.iso88591
locale provided by Ubuntu 16.04:
Name | Owner | Encoding | Collate | Ctype | Access privileges | Size | Tablespace | Description -----------+----------+----------+----------------+----------------+-----------------------+---------+------------+-------------------------------------------- fwd | postgres | LATIN1 | en_US.iso88591 | en_US.iso88591 | | 7104 kB | pg_default |
LATIN1
is an alias for ISO-8859-1
. Note that "Collate" is reported as en_US.iso88591
, meaning that the collation behavior in this case comes from how the libc
library works with the operating system's collation, defined by the en_US.iso88591
locale.
The default PostgreSQL collation for this codepage sorts all of the following characters before codepoint 1:
codepoint src | codepoint tgt | hex codepoint | character ---------------+---------------+---------------+----------- 180 | 180 | 00B4 | ´ 96 | 96 | 0060 | ` 94 | 94 | 005E | ^ 168 | 168 | 00A8 | ¨ 126 | 126 | 007E | ~ 247 | 247 | 00F7 | ÷ 215 | 215 | 00D7 | × 60 | 60 | 003C | < 61 | 61 | 003D | = 62 | 62 | 003E | > 172 | 172 | 00AC | ¬ 124 | 124 | 007C | | 166 | 166 | 00A6 | ¦ 176 | 176 | 00B0 | ° 181 | 181 | 00B5 | µ 32 | 32 | 0020 | <SP> 95 | 95 | 005F | _ 175 | 175 | 00AF | ¯ 173 | 173 | 00AD | 45 | 45 | 002D | - 44 | 44 | 002C | , 59 | 59 | 003B | ; 58 | 58 | 003A | : 33 | 33 | 0021 | ! 161 | 161 | 00A1 | ¡ 63 | 63 | 003F | ? 191 | 191 | 00BF | ¿ 47 | 47 | 002F | / 46 | 46 | 002E | . 183 | 183 | 00B7 | · 184 | 184 | 00B8 | ¸ 39 | 39 | 0027 | ' 34 | 34 | 0022 | " 171 | 171 | 00AB | « 187 | 187 | 00BB | » 40 | 40 | 0028 | ( 41 | 41 | 0029 | ) 91 | 91 | 005B | [ 93 | 93 | 005D | ] 123 | 123 | 007B | { 125 | 125 | 007D | } 167 | 167 | 00A7 | § 182 | 182 | 00B6 | ¶ 169 | 169 | 00A9 | © 174 | 174 | 00AE | ® 64 | 64 | 0040 | @ 164 | 164 | 00A4 | ¤ 162 | 162 | 00A2 | ¢ 36 | 36 | 0024 | $ 163 | 163 | 00A3 | £ 165 | 165 | 00A5 | ¥ 42 | 42 | 002A | * 92 | 92 | 005C | \ 38 | 38 | 0026 | & 35 | 35 | 0023 | # 37 | 37 | 0025 | % 43 | 43 | 002B | + 177 | 177 | 00B1 | ±
These are punctuation marks and symbols (apologies, cutting and pasting into Redmine is losing some of the correct glyphs), including currency symbols. Notably, the SPACE
character sorts significantly lower than in the OE basic collation, where it sorts at the highest position.
Next, we have a block of non-printing/control characters in codepoint order from codepoint 1 through 31:
1 | 1 | 0001 | <?> 2 | 2 | 0002 | <?> 3 | 3 | 0003 | <?> 4 | 4 | 0004 | <?> 5 | 5 | 0005 | <?> 6 | 6 | 0006 | <?> 7 | 7 | 0007 | <?> 8 | 8 | 0008 | <?> 9 | 9 | 0009 | <?> 10 | 10 | 000A | <LF> 11 | 11 | 000B | <?> 12 | 12 | 000C | <?> 13 | 13 | 000D | <CR> 14 | 14 | 000E | <?> 15 | 15 | 000F | <?> 16 | 16 | 0010 | <?> 17 | 17 | 0011 | <?> 18 | 18 | 0012 | <?> 19 | 19 | 0013 | <?> 20 | 20 | 0014 | <?> 21 | 21 | 0015 | <?> 22 | 22 | 0016 | <?> 23 | 23 | 0017 | <?> 24 | 24 | 0018 | <?> 25 | 25 | 0019 | <?> 26 | 26 | 001A | <?> 27 | 27 | 001B | <?> 28 | 28 | 001C | <?> 29 | 29 | 001D | <?> 30 | 30 | 001E | <?> 31 | 31 | 001F | <?>
This is followed by another block of non-printing characters in codepoint order from 127 through 159:
127 | 127 | 007F | 128 | 128 | 0080 | <?> 129 | 129 | 0081 | <?> 130 | 130 | 0082 | <?> 131 | 131 | 0083 | <?> 132 | 132 | 0084 | <?> 133 | 133 | 0085 | <?> 134 | 134 | 0086 | <?> 135 | 135 | 0087 | <?> 136 | 136 | 0088 | <?> 137 | 137 | 0089 | <?> 138 | 138 | 008A | <?> 139 | 139 | 008B | <?> 140 | 140 | 008C | <?> 141 | 141 | 008D | <?> 142 | 142 | 008E | <?> 143 | 143 | 008F | <?> 144 | 144 | 0090 | <?> 145 | 145 | 0091 | <?> 146 | 146 | 0092 | <?> 147 | 147 | 0093 | <?> 148 | 148 | 0094 | <?> 149 | 149 | 0095 | <?> 150 | 150 | 0096 | <?> 151 | 151 | 0097 | <?> 152 | 152 | 0098 | <?> 153 | 153 | 0099 | <?> 154 | 154 | 009A | <?> 155 | 155 | 009B | <?> 156 | 156 | 009C | <?> 157 | 157 | 009D | <?> 158 | 158 | 009E | <?> 159 | 159 | 009F | <?>
Next are digits 0-9, but interleaved with fractions and subscripts in a sensible order:
48 | 48 | 0030 | 0 188 | 188 | 00BC | ¼ 189 | 189 | 00BD | ½ 190 | 190 | 00BE | ¾ 49 | 49 | 0031 | 1 185 | 185 | 00B9 | ¹ 50 | 50 | 0032 | 2 178 | 178 | 00B2 | ² 51 | 51 | 0033 | 3 179 | 179 | 00B3 | ³ 52 | 52 | 0034 | 4 53 | 53 | 0035 | 5 54 | 54 | 0036 | 6 55 | 55 | 0037 | 7 56 | 56 | 0038 | 8 57 | 57 | 0039 | 9
This is followed by the alphabet, including accented characters and those with diacritical marks:
97 | 97 | 0061 | a 65 | 65 | 0041 | A 225 | 225 | 00E1 | á 193 | 193 | 00C1 | Á 224 | 224 | 00E0 | à 192 | 192 | 00C0 | À 226 | 226 | 00E2 | â 194 | 194 | 00C2 | Â 229 | 229 | 00E5 | å 197 | 197 | 00C5 | Å 228 | 228 | 00E4 | ä 196 | 196 | 00C4 | Ä 227 | 227 | 00E3 | ã 195 | 195 | 00C3 | Ã 170 | 170 | 00AA | ª 230 | 230 | 00E6 | æ 198 | 198 | 00C6 | Æ 98 | 98 | 0062 | b 66 | 66 | 0042 | B 99 | 99 | 0063 | c 67 | 67 | 0043 | C 231 | 231 | 00E7 | ç 199 | 199 | 00C7 | Ç 100 | 100 | 0064 | d 68 | 68 | 0044 | D 240 | 240 | 00F0 | ð 208 | 208 | 00D0 | Ð 101 | 101 | 0065 | e 69 | 69 | 0045 | E 233 | 233 | 00E9 | é 201 | 201 | 00C9 | É 232 | 232 | 00E8 | è 200 | 200 | 00C8 | È 234 | 234 | 00EA | ê 202 | 202 | 00CA | Ê 235 | 235 | 00EB | ë 203 | 203 | 00CB | Ë 102 | 102 | 0066 | f 70 | 70 | 0046 | F 103 | 103 | 0067 | g 71 | 71 | 0047 | G 104 | 104 | 0068 | h 72 | 72 | 0048 | H 105 | 105 | 0069 | i 73 | 73 | 0049 | I 237 | 237 | 00ED | í 205 | 205 | 00CD | Í 236 | 236 | 00EC | ì 204 | 204 | 00CC | Ì 238 | 238 | 00EE | î 206 | 206 | 00CE | Î 239 | 239 | 00EF | ï 207 | 207 | 00CF | Ï 106 | 106 | 006A | j 74 | 74 | 004A | J 107 | 107 | 006B | k 75 | 75 | 004B | K 108 | 108 | 006C | l 76 | 76 | 004C | L 109 | 109 | 006D | m 77 | 77 | 004D | M 110 | 110 | 006E | n 78 | 78 | 004E | N 241 | 241 | 00F1 | ñ 209 | 209 | 00D1 | Ñ 111 | 111 | 006F | o 79 | 79 | 004F | O 243 | 243 | 00F3 | ó 211 | 211 | 00D3 | Ó 242 | 242 | 00F2 | ò 210 | 210 | 00D2 | Ò 244 | 244 | 00F4 | ô 212 | 212 | 00D4 | Ô 246 | 246 | 00F6 | ö 214 | 214 | 00D6 | Ö 245 | 245 | 00F5 | õ 213 | 213 | 00D5 | Õ 248 | 248 | 00F8 | ø 216 | 216 | 00D8 | Ø 186 | 186 | 00BA | º 112 | 112 | 0070 | p 80 | 80 | 0050 | P 113 | 113 | 0071 | q 81 | 81 | 0051 | Q 114 | 114 | 0072 | r 82 | 82 | 0052 | R 115 | 115 | 0073 | s 83 | 83 | 0053 | S 223 | 223 | 00DF | ß 116 | 116 | 0074 | t 84 | 84 | 0054 | T 117 | 117 | 0075 | u 85 | 85 | 0055 | U 250 | 250 | 00FA | ú 218 | 218 | 00DA | Ú 249 | 249 | 00F9 | ù 217 | 217 | 00D9 | Ù 251 | 251 | 00FB | û 219 | 219 | 00DB | Û 252 | 252 | 00FC | ü 220 | 220 | 00DC | Ü 118 | 118 | 0076 | v 86 | 86 | 0056 | V 119 | 119 | 0077 | w 87 | 87 | 0057 | W 120 | 120 | 0078 | x 88 | 88 | 0058 | X 121 | 121 | 0079 | y 89 | 89 | 0059 | Y 253 | 253 | 00FD | ý 221 | 221 | 00DD | Ý 255 | 255 | 00FF | ÿ 122 | 122 | 007A | z 90 | 90 | 005A | Z
Note that for each letter (including each accented form), the lowercase character sorts first, then the uppercase form of the same character sorts next. The accented forms sort consistently whether the secondary sort by codepoint is ascending or desceding.
In fact, it should be noted that no two codepoints in this codepage sort with the same weight; the collation of every code point w.r.t. the others is deterministic.
Next, there is a small block of two forms of the same character:
254 | 254 | 00FE | þ 222 | 222 | 00DE | Þ
As with the 4GL collation, this collation ends with non-breaking space:
160 | 160 | 00A0 | <NBSP>
I should note that the NULL character (codepoint 0x00) is missing from the PostgreSQL test output, because inserting this character from JDBC raised an error:
java.sql.BatchUpdateException: Batch entry 0 insert into tt1(codepoint_ori, encoded_cs, encoded_ci) values (0, '', '') was aborted: ERROR: invalid byte sequence for encoding "UTF8": 0x00 Call getNextException to see other errors in the batch. at org.postgresql.jdbc.BatchResultHandler.handleCompletion(BatchResultHandler.java:166) at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:492) at org.postgresql.jdbc.PgStatement.executeBatch(PgStatement.java:840) at org.postgresql.jdbc.PgPreparedStatement.executeBatch(PgPreparedStatement.java:1538) at sandbox.CollationTest.main(CollationTest.java:424) Caused by: org.postgresql.util.PSQLException: ERROR: invalid byte sequence for encoding "UTF8": 0x00 at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2440) at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2183) at org.postgresql.core.v3.QueryExecutorImpl.flushIfDeadlockRisk(QueryExecutorImpl.java:1299) at org.postgresql.core.v3.QueryExecutorImpl.sendQuery(QueryExecutorImpl.java:1324) at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:467) ... 3 more
So, to summarize this collation:
- Punctuation marks and symbols (including currency symbols and the
SPACE
character, which sorts 16th vs. tied for 1st in the 4GL basic collation) - Non-printing/control characters in codepoint order from codepoint 1 through 31
- Non-printing characters in codepoint order from 127 through 159
- Digits 0-9, interleaved with fractions and subscripts
- Alphabetic characters, including accent/diacritical marks, sorting lowercase first for each letter variant
- Two forms of the "thorn" character
- Non-breaking space
Comparison Notes¶
Looking at the summaries for each collation provided above, we can see that major categories of characters are organized very differently between the basic 4GL collation and the default PostgreSQL collation.
The 4GL collates accented forms of a letter with the same weight as every other form of that letter, including the base letter.
All uppercase letters sort before all lowercase letters in the 4GL collation. PostgreSQL sorts each lowercase letter before each associated uppercase letter. So, it is important to note that in case-sensitive collations, these two systems sort uppercase and lowercase in the opposite directions.
One thing that cannot be seen here is that SQL NULL (i.e., not the NULL character 0x00, but the absence of data in the SQL sense or unknown value in the 4GL sense), sorts before non-NULL data in both systems.
#19 Updated by Ovidiu Maxiniuc almost 3 years ago
Eric,
The space character (0x20
, or 32
decimal) is incorrectly listed in the 2nd position. This is because the compare operation is done 4GL style, after trimming so practically it is equal to empty string. To be sure it 'goes' to the right position, it must be suffixed (and possible prefixed) by some fixed character so that the correct compare is performed.
"a" + field + "z"
Since
"a"
and "z"
are constant, they are 'ignored' when ordered. There is a catch here: the NULL
(0x00
) character which makes this string 2 char wide instead of 3 so the compound string will be order near 'z'
in this case. I avoided this by comparing characters starting at 1 instead, and forcing NULL
to be always first by convention.#20 Updated by Ovidiu Maxiniuc almost 3 years ago
locale/xx_XX@ISO8859-1@basic@FWD
and src/com/goldencode/p2j/spi/ISO_8859_1Collator.java
. With two differences:
- the
0x20
(space) issue I described in my previous note; - characters
0x53
(S
) and0xDF
(ß
). Since0xDF
(ß
) is a lowercase character (please correct me if I am wrong), according the ASCII/ISO standards, it should sort after the uppercase0x53
(S
). I do not understand why is this difference.
#21 Updated by Eric Faulhaber almost 3 years ago
Ovidiu Maxiniuc wrote:
The listing you obtained is pretty much the one we currently have inlocale/xx_XX@ISO8859-1@basic@FWD
andsrc/com/goldencode/p2j/spi/ISO_8859_1Collator.java
. With two differences:
- the
0x20
(space) issue I described in my previous note;- characters
0x53
(S
) and0xDF
(ß
). Since0xDF
(ß
) is a lowercase character (please correct me if I am wrong), according the ASCII/ISO standards, it should sort after the uppercase0x53
(S
). I do not understand why is this difference.
The 4GL case-sensitive, basic collation for ISO-8859-1 (en_US) does not seem to consider 0xDF
(ß
) to represent a lowercase character, since it sorts ß
between 0x53
(S
) and 0x54
(T
), not between 0x73
(s
) and 0x74
(t
). The letter represents a "sharp S" (Eszett) in German. Anglicized, it is replaced with ss
(I'm not sure if SS
is used, unless text is in all caps, since this letter isn't used at the start of words). So, if you consider the Anglicized form or the fact that it doesn't seem to exist as a capital letter, treating it as lowercase makes logical sense. Nevertheless, it is sorted by this collation in the uppercase letters.
Note that in the case-insensitive form of the 4GL's ISO-8859-1 basic collation (en_US), it sorts after s
and S
(this pair is weighted evenly) and before t
and T
(also weighted evenly). I am in the process of analyzing and documenting the case-insensitive form of the collation now.
#22 Updated by Eric Faulhaber almost 3 years ago
- % Done changed from 0 to 100
- Tracker changed from Feature to Support
- Status changed from WIP to Review
This issue became primarily about documenting the differences in database collation between the legacy environment and the default PostgreSQL environment (i.e., without a FWD custom collation implementation). That documentation is now in a wiki page.
#23 Updated by Greg Shah almost 3 years ago
I reviewed the page and it is good for the initial purpose of this task. I've sent the customer the link.
We have another customer requirement for this task: to understand what are the implications of moving from OE with Windows 1252 to PostgreSQL UTF-8. This is not as direct as the previous analysis. I think there are multiple aspects:
- What is the behavior difference between OE UTF-8 basic and PostgreSQL UTF-8 basic?
- Ignore ICU.
- For the current needs, we would focus on the charset (256 characters) from Windows 1252. We may also want to add any missing chars from 8859-1 and 8859-15 which are also charsets that we support today. The point here is to limit the analysis to the characters that currently matter.
- Can we implement OE compatibility in PostgreSQL and/or H2? If so, how to do it/what needs to be done?
- Using the understanding of both OE UTF-8 basic and PostgreSQL UTF-8 basic, we can summarize the differences between OE 1252 and OE UTF-8 basic and PostgreSQL UTF-8 basic. Some customers are considering a move to UTF-8 as part of a move to FWD. This will help customers understand the implications so they can assess this decision. The current customer that is planning this move would initially plan to move to an OE UTF-8 basic compatible implementation of PostgreSQL. But they may even need to use Amazon RDS or Aurora in the future, which would potentially drive them to the standard PostgreSQL UTF-8 basic. So while it is not strictly needed here, please summarize the implications of that move as well.
#24 Updated by Greg Shah almost 3 years ago
I would like to have the same 8859-1 analysis done for the other two codepages/collations which we are supporting (8859-15 and Windows 1252). This is not time-critical since we don't have a customer that needs it immediately.
We have an open issue in regard to documenting making PostgreSQL run with the proper 1252 charset. See 5183-10. The problem is that we can't properly make (or haven't figured out how to use) a standard Linux locale which supports Windows 1252 AND also uses the 1252 charset (instead of the LATIN1/8859-1 charset). Without that, there is no comparison environment in "standard locale" (i.e. no use of custom locales) PostgreSQL because the characters can't be properly imported. So we need to resolve this first before the analysis can be finished for 1252.
#25 Updated by Greg Shah almost 3 years ago
- Status changed from Review to WIP
- % Done changed from 100 to 70
#26 Updated by Ovidiu Maxiniuc almost 3 years ago
With latest FWD support for locale/collations, if the PG cluster is created like this:
sudo -u postgres pg_createcluster -p 5430 9.5 my_cluster --locale=nl_NL@cp1252_fwd_basic --encoding=win1252 -d /mnt/ssd/pd_data/9.5/my_cluster/
then we can have databases with following configuration:
mydb=# \l+ List of databases Name | Owner | Encoding | Collate | Ctype | Access privileges | Size | Tablespace | Description -----------+-----------+----------+------------------------+------------------------+-----------------------+---------+------------+-------------------------------------------- mydb | fwd_admin | WIN1252 | nl_NL@cp1252_fwd_basic | nl_NL@cp1252_fwd_basic | | 168 GB | pg_default |
Isn't this enough?
#27 Updated by Greg Shah almost 3 years ago
Isn't this enough?
It is enough for any customer that is able to implement the custom locales. It won't work for any customer that must use an environment where the PostgreSQL installation is maintained by someone else (e.g. Amazon RDS or Amazon Aurora). In such a scenario, the customer must know what the differences in collation will be for our 4GL-compatible 1252 and the "standard" 1252 seen in the standard PostgreSQL install. That is the analysis that Eric did for 8859-1. He can't do the same analysis for 1252 until we figure out the way to make PostgreSQL run with a standard 1252 encoding that has the 1252 charset instead of LATIN1. This is the issue in #5183-10 and we know of no solution yet.
#28 Updated by Ovidiu Maxiniuc almost 3 years ago
- edit
/etc/locale.gen
(as root) and uncommenten_US.CP1252 CP1252
if not already uncommented; - execute
sudo locale-gen --no-purge --no-archive
. It should identify the new lines and compile only the new locales.locale -a
should print a list containingen_US.cp1252
. - create the cluster using
sudo -u postgres pg_createcluster -p 5439 9.5 test_1252 --locale=en_US.cp1252 --encoding=win1252 -d /mnt/ssd/pd_data/9.5/test_1252/
- execute
sudo pg_ctlcluster 9.5 test_1252 start
. The new cluster should be visible (in grren) usingpg_lsclusters
as:Ver Cluster Port Status Owner Data directory Log file 9.5 test_1252 5439 online postgres /mnt/ssd/pd_data/9.5/test_1252/ /var/log/postgresql/postgresql-9.5-test_1252.log
- connect (
sudo -u postgres psql -p 5439
) and create a sample database:postgres=# create database test_1252; CREATE DATABASE
- See how it looks from inside:
postgres=# \l+ List of databases Name | Owner | Encoding | Collate | Ctype | Access privileges | Size | Tablespace | Description -----------+----------+----------+--------------+--------------+-----------------------+---------+------------+-------------------------------------------- postgres | postgres | WIN1252 | en_US.cp1252 | en_US.cp1252 | | 6960 kB | pg_default | default administrative connection database template0 | postgres | WIN1252 | en_US.cp1252 | en_US.cp1252 | =c/postgres +| 6841 kB | pg_default | unmodifiable empty database | | | | | postgres=CTc/postgres | | | template1 | postgres | WIN1252 | en_US.cp1252 | en_US.cp1252 | =c/postgres +| 6841 kB | pg_default | default template for new databases | | | | | postgres=CTc/postgres | | | test_1252 | postgres | WIN1252 | en_US.cp1252 | en_US.cp1252 | | 6841 kB | pg_default | (4 rows)
As you can see, there is nothing related to FWD.