Bug #5085
uppercasing and string comparisons are incorrect for some character sets
Related issues
History
#1 Updated by Greg Shah over 3 years ago
During the work on the CONTAINS operator (starting at #1587-229), Igor encountered a problem in data import:
In the import of a large customer database I've got 4 errors for a single table on populating the related word table:
ERROR: character with byte sequence 0xce 0x9c in encoding "UTF8" has no equivalent in encoding "LATIN1"
Ovidiu noted this is something we've seen before:
There are some very similar messages posted in #3871.
However, this article has a short paragraph about iso8859-1 Client to 1252 conversion, if this is the problem. Most likely, it is the € (euro) sign, which is not present in WIN 1252 CP.
Between Igor and Ovidiu, they determined:
BTW: in LATIN1 µ is called not "Greek mu", but "micro", I think this is the reason why there is not its uppercase version in the codepage.
This is an interesting observation. The Latin micro (µ) is U00B5. The Greek letters μ and Μ are U03BC and U039C respectively. Notice that they are not the very same glyph. I do not know who does this but converting U00B5 to uppercase U039C is wrong.
and this:
The problem is that we convert to uppercase using the Java toUpperCase() method, which works with Unicode strings and knows nothing about the codepage. The codepage itself is not the problem. It seems to me like a flaw in Java's uppercase implementation. Here is why:
char micro = '\u00B5';
char mulc = '\u03BC';
char muuc = '\u039C';
System.out.println(micro + "" + mulc + "" + muuc + "->" +
                   Character.toUpperCase(micro) +
                   Character.toUpperCase(mulc) +
                   Character.toUpperCase(muuc));
System.out.println(Character.toUpperCase(micro) == Character.toUpperCase(mulc));
The output is:
µμΜ->ΜΜΜ
true
As you can see, there is no CP involved. Maybe we should add a filter for this micro character when uppercasing it? I have not written/run the same code in 4GL yet. Question: are there other characters which behave the same?
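The question about other characters can be checked with a quick scan of the Latin-1 range, listing every code point whose Character.toUpperCase() result falls outside that range. This is only a sketch: the names Latin1UpperScan and escapingLatin1 are made up for illustration, and it looks at single-code-point mappings only, so multi-character mappings performed by String.toUpperCase() (such as ß to SS) are not covered.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of FWD: lists the Latin-1 code points whose
// simple uppercase mapping leaves the Latin-1 range (0x00..0xFF).
class Latin1UpperScan
{
   static List<Integer> escapingLatin1()
   {
      List<Integer> result = new ArrayList<>();
      for (int cp = 0; cp <= 0xFF; cp++)
      {
         // Character.toUpperCase(int) is the same call path quoted in this thread
         if (Character.toUpperCase(cp) > 0xFF)
         {
            result.add(cp);
         }
      }
      return result;
   }

   public static void main(String[] args)
   {
      for (int cp : escapingLatin1())
      {
         System.out.printf("U+%04X -> U+%04X%n", cp, Character.toUpperCase(cp));
      }
   }
}
```

Besides U+00B5 (micro), this should also report U+00FF (ÿ), whose uppercase Ÿ is U+0178, so a filter for the micro sign alone may not be enough.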
and this:
PostgreSQL UPPER function for the LATIN1 database converts µ to µ (doesn't change it). BTW: 4GL UPPER function also leaves µ as-is, at least if SESSION:CHARSET is ISO8859-1.
and this:
I peeked at java.lang.CharacterDataLatin1, method int toUpperCase(int) which is called from Character.toUpperCase(int) and saw this (see line 152, it starts with else):
int toUpperCase(int ch) {
    int mapChar = ch;
    int val = getProperties(ch);
    if ((val & 0x00010000) != 0) {
        if ((val & 0x07FC0000) != 0x07FC0000) {
            int offset = val << 5 >> (5+18);
            mapChar = ch - offset;
        } else if (ch == 0x00B5) {
            mapChar = 0x039C;
        }
    }
    return mapChar;
}
Of all characters in the Latin1 charset, this is the only one that is handled individually! The others are either bitwise offset or left unchanged.
This led Igor to ask:
However, it means the 4GL UPPER and Java toUpperCase are not compatible. Should we re-work FWD toUpperCase() functions implementation to match 4GL behavior?
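One possible shape for such a rework is a per-character uppercase that refuses to move a Latin-1 character out of the Latin-1 range. The sketch below is hypothetical and is not the FWD implementation. It assumes that the 4GL leaves every such character unchanged, which this thread only confirms for µ with SESSION:CHARSET set to ISO8859-1, and it ignores supplementary characters and multi-character mappings.

```java
// Hypothetical sketch, not FWD code: uppercase that keeps Latin-1 input inside
// Latin-1. ASSUMPTION: the 4GL leaves such characters as-is (only verified for
// the micro sign in this thread).
class Latin1SafeUpper
{
   static String upper(String s)
   {
      StringBuilder sb = new StringBuilder(s.length());
      for (int i = 0; i < s.length(); i++)
      {
         char c = s.charAt(i);
         char u = Character.toUpperCase(c);
         // keep the original when a Latin-1 character would leave Latin-1
         sb.append(c <= 0xFF && u > 0xFF ? c : u);
      }
      return sb.toString();
   }

   public static void main(String[] args)
   {
      // the micro sign stays U+00B5 while 'm' becomes 'M'
      System.out.println(upper("5 \u00B5m"));
   }
}
```

Characters that are already outside Latin-1, such as the Greek μ (U+03BC), still uppercase normally with this approach.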
#2 Updated by Greg Shah over 3 years ago
It clearly is something that needs to be fixed in some way.
Looking at our implementation, it is more than just the implementation of the UPPER() built-in function. There are also:
- CAPS() (used in WHERE clauses)
- ExpressionConversionWorker.toUpperCase() used at conversion time
- Text.compareTo() where we uppercase before comparison
- hundreds (or thousands?) of locations in collections where we uppercase before inserting a key in a map, an item in a set or lookup via get(), contains() or containsKey()
- and more, I'm sure
On top of this, today these locations all operate using the default JVM encoding AND I think we have (in the distant past) implicitly encoded dependencies that assume that the default JVM encoding is UTF-8. Our approach to I18N assumes that inside the JVM we will process everything as Unicode but when we handle input or output we will honor sources or targets which have a different encoding by translating from that encoding into Unicode (when reading input) and to that encoding (when writing output). The 4GL has a similar concept with CPSTREAM, CPTERM, CPPRINT and CPINTERNAL. The CPINTERNAL is their equivalent of the default JVM encoding. The main deviation we have (conceptually) is that we have been assuming Unicode for CPINTERNAL while OE defaults to ISO-8859-1/LATIN1/Windows 1252.
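The output-side translation described above is where the uppercased micro sign breaks: U+039C has no ISO-8859-1 representation, and its UTF-8 form is the 0xce 0x9c byte sequence from the import error. A minimal sketch using the standard CharsetEncoder follows; the class and method names are hypothetical and not FWD code.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical demo, not FWD code: checks whether text survives translation
// to ISO-8859-1 and shows the UTF-8 bytes of the uppercased micro sign.
class Latin1EncodeCheck
{
   static boolean fitsLatin1(String s)
   {
      CharsetEncoder enc = StandardCharsets.ISO_8859_1.newEncoder();
      return enc.canEncode(s);
   }

   static String utf8Hex(String s)
   {
      StringBuilder sb = new StringBuilder();
      for (byte b : s.getBytes(StandardCharsets.UTF_8))
      {
         sb.append(String.format("%02x ", b & 0xFF));
      }
      return sb.toString().trim();
   }

   public static void main(String[] args)
   {
      String micro = "\u00B5";
      String upper = micro.toUpperCase(Locale.ROOT);   // becomes U+039C
      System.out.println(fitsLatin1(micro));           // true
      System.out.println(fitsLatin1(upper));           // false
      System.out.println(utf8Hex(upper));              // ce 9c
   }
}
```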
I think we first must determine if our current approach can meet all of our compatibility requirements. If not, then we need to consider the alternative. If so, then we need to make a list of the changes needed to implement it properly.
#3 Updated by Greg Shah over 3 years ago
- Related to Feature #1587: implement full support for word indexes added
#4 Updated by Greg Shah over 3 years ago
- Related to Support #3871: determine how to change codepages/locales during import added
#5 Updated by Ovidiu Maxiniuc over 3 years ago
If we talk about case-insensitive comparison of character data, there is a dirty workaround. I noticed that toLowerCase() in java.lang.CharacterDataLatin1 has no exceptions. So we can use LOWER instead of UPPER. However, one issue still remains: the CAPS() function. According to Igor, it will work correctly with inlined PSQL SQL functions, but for the rest of the internal FWD uses of CAPS() and for H2 SQL we need to re-implement it. Not sure about the MSSQL dialect.
#6 Updated by Greg Shah over 3 years ago
As noted in #1587-235, lowercasing has issues too:
One thing here which is inconsistent in the 4GL is about using uppercase or lowercase in case-insensitive comparisons. Long ago, we found that (at least for character types inside the 4GL code) the case insensitive comparisons used UPPERCASE. This can be seen in our comments from Text.compareTo() in the case insensitive path:
// DO NOT use String.compareToIgnoreCase() since this lowercases and yields different
// results for >, <, >= and <= forms when [ \ ^ _ ' characters are included in the operands
return s1.toUpperCase().compareTo(s2.toUpperCase());
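The ordering difference that comment warns about can be reproduced with a character that sorts between the uppercase and lowercase ASCII letters, such as the underscore. This is a small illustrative sketch; the CaseOrderDemo class is hypothetical, and upperCompare only mirrors the quoted return statement.

```java
// Hypothetical demo, not FWD code: shows that uppercase-based comparison and
// String.compareToIgnoreCase() disagree on the order of "_" and "a".
class CaseOrderDemo
{
   // mirrors the statement quoted from Text.compareTo()
   static int upperCompare(String s1, String s2)
   {
      return s1.toUpperCase().compareTo(s2.toUpperCase());
   }

   public static void main(String[] args)
   {
      String a = "_";   // 0x5F, sorts after 'A' (0x41) but before 'a' (0x61)
      String b = "a";
      System.out.println(Integer.signum(upperCompare(a, b)));          // 1
      System.out.println(Integer.signum(a.compareToIgnoreCase(b)));    // -1
   }
}
```

With the uppercase rule "_" > "a" holds, while the lowercase-based comparison reports "_" < "a", which is why switching everything to LOWER would change the results of >, <, >= and <=.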
I don't think we can implement a quick and dirty solution here. It is time to address the broader set of problems.