Feature #3971
add support for unicode escape sequences in the preprocessor
50%
History
#1 Updated by Greg Shah about 5 years ago
At a minimum, ClearStream.read() must be updated to support the new Unicode escape sequences that can be encoded in 4GL source code.
This section describes inputting Unicode codepoints in ABL code, in support of supplementary characters.

To input Unicode scalar codepoints in plane 0 (U+0000 to U+FFFF), use this syntax:

   Syntax
      ~uXXXX

   XXXX
      A 4-digit, case-insensitive hex digit.

To input Unicode scalar codepoints in planes 0 - 16 (U+0000 to U+10FFFF), use this syntax:

   Syntax
      ~UXXXXXX

   XXXXXX
      A 6-digit, case-insensitive hex digit.

When the ABL code is parsed, this character value is converted from Unicode to -cpinternal. If the character is not a valid character in -cpinternal, the entire escaped string is passed through. For example, if -cpinternal is 1252, ~u4E00 is passed to ABL as is.
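The rules quoted above can be sketched roughly as follows. This is only an illustration and not the FWD implementation: the class and method names are hypothetical, a Java Charset stands in for -cpinternal, and passing surrogate values through unchanged is an assumption based on the "scalar codepoints" wording.

```java
import java.nio.charset.Charset;

// Hypothetical helper for illustration; not part of FWD's ClearStream.
final class TildeUnicodeSketch
{
   /** Returns true for 0-9, a-f and A-F only. */
   static boolean isHex(char c)
   {
      return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
   }

   /**
    * Decodes one complete escape (tilde, marker, hex digits). The input is
    * returned unchanged when it is malformed or when the character cannot
    * be encoded in the given charset, which stands in for -cpinternal.
    */
   static String decode(String escape, Charset cpInternal)
   {
      if (escape.length() < 2 || escape.charAt(0) != '~')
      {
         return escape;
      }

      // lowercase marker takes 4 hex digits, uppercase marker takes 6
      char marker = escape.charAt(1);
      int digits = (marker == 'u') ? 4 : (marker == 'U') ? 6 : -1;
      if (digits < 0 || escape.length() != 2 + digits)
      {
         return escape;
      }

      for (int i = 2; i < escape.length(); i++)
      {
         if (!isHex(escape.charAt(i)))
         {
            return escape;
         }
      }

      int cp = Integer.parseInt(escape.substring(2), 16);
      boolean surrogate = cp >= 0xD800 && cp <= 0xDFFF;
      if (cp > 0x10FFFF || surrogate)
      {
         return escape;   // assumption: non-scalar values pass through
      }

      String text = new String(Character.toChars(cp));
      return cpInternal.newEncoder().canEncode(text) ? text : escape;
   }
}
```

With a single-byte charset such as ISO-8859-1, decode("~u4E00", charset) returns the escape unchanged, which mirrors the 1252 example in the quoted text.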
#5 Updated by Alexandru Lungu 3 months ago
The fix for this is simply allowing the unicode marker to "escape" the preprocessor:
=== modified file 'src/com/goldencode/p2j/preproc/ClearStream.java'
--- old/src/com/goldencode/p2j/preproc/ClearStream.java 2023-05-12 10:05:12 +0000
+++ new/src/com/goldencode/p2j/preproc/ClearStream.java 2024-02-22 11:10:00 +0000
@@ -144,6 +144,7 @@
 ** TJD 20220504 Java 11 compatibility minor changes
 ** CA 20221129 Do not process unix escapes found in comments.
 ** 030 GBB 20230512 Logging methods replaced by CentralLogger/ConversionStatus.
+** 031 AL2 20240222 Added support for unicode characters (~uXXXX).
 */
 
 /*
@@ -714,6 +715,9 @@
             case 'f':
                nextChar = 0x0C;
                break;
+            case 'u':
+               passThru = true;
+               break;
             default:
                translated = false;
          }
I committed this to 3971a/rev. 15000. The conversion rules will then turn ~ into \ so that the native Java representation is used (~u2019 becomes \u2019). This is a partial fix: it handles only ~uXXXX and not ~UXXXXXX characters (which are not supported in FWD yet).
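As a rough illustration of the ~uXXXX to \uXXXX mapping described above (not the actual FWD conversion rules), the following sketch rewrites only well-formed 4-digit escapes in the text of a literal. The class and method names are hypothetical, and a doubled tilde (~~u...) is deliberately not handled here.

```java
import java.util.regex.Pattern;

// Hypothetical illustration; the real work is done by FWD conversion rules.
final class TildeToJavaSketch
{
   // a tilde, a lowercase u and exactly four hex digits
   private static final Pattern ESCAPE = Pattern.compile("~u([0-9A-Fa-f]{4})");

   /** Replaces each well-formed 4GL escape with the Java source form. */
   static String toJavaLiteralText(String fourGlText)
   {
      // in a regex replacement, a doubled backslash yields one literal backslash
      return ESCAPE.matcher(fourGlText).replaceAll("\\\\u$1");
   }
}
```

Malformed escapes such as ~uxxxx do not match the pattern and are left untouched by this sketch.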
1. An incomplete escape (such as ~uX) is accepted by 4GL without errors, but it produces unusable results. For instance:
   &glob test ~u2
   &glob test 1
   works, but replaces any occurrence of test with [^]lob test 1 (where [^] is the unicode character obtained by reading ~u2\n&g as a unicode escape).
2. Invalid unicode escapes are also accepted by 4GL:
   &glob test ~uxxxx
   This doesn't show any character in 4GL, but in FWD it breaks because \uxxxx is not a valid unicode escape in Java.
Problem 1. was not present before, because unicode characters in the preprocessor were not supported at all. It can be fixed separately, at a very low priority.
Problem 2. is not related to the preprocessor, but to how Java handles invalid unicode escapes coming from 4GL. It can also be fixed separately.
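One possible way to catch both problems before the Java compiler does is to scan for ~u escapes that are not followed by four hex digits. This is only a sketch of the idea, not something in 3971a: the class and method names are hypothetical, and what to do with a reported offset (warn, pass through, drop) is left open.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical checker; nothing like this is part of 3971a.
final class EscapeCheckSketch
{
   static boolean isHex(char c)
   {
      return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
   }

   /** Returns the offsets of tilde-u escapes lacking four hex digits. */
   static List<Integer> malformedOffsets(String text)
   {
      List<Integer> offsets = new ArrayList<>();
      for (int i = 0; i + 1 < text.length(); i++)
      {
         if (text.charAt(i) != '~')
         {
            continue;
         }
         if (text.charAt(i + 1) != 'u')
         {
            i++;   // skip other escapes, including a doubled tilde
            continue;
         }

         boolean ok = i + 6 <= text.length();
         for (int j = i + 2; ok && j < i + 6; j++)
         {
            ok = isHex(text.charAt(j));
         }
         if (!ok)
         {
            offsets.add(i);
         }
         i++;   // continue scanning after the marker
      }
      return offsets;
   }
}
```

Both examples from the list above are reported by this check: the escape cut short by a newline and the one with non-hex digits.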
Greg, please review 3971a. I tested with a large customer application (with incremental conversion of some files) and it behaves OK.
#6 Updated by Alexandru Lungu 3 months ago
- Assignee set to Alexandru Lungu
- % Done changed from 0 to 50
- Status changed from New to WIP
I talked with Constantin and the fix is good to go. I am merging it now.
#7 Updated by Alexandru Lungu 3 months ago
3971a was merged to trunk as rev. 15000 and archived.
#8 Updated by Alexandru Lungu 3 months ago
- Status changed from WIP to Review
There was an issue with unicode processing in CASE-WHEN statements:
def var a as char.
case a:
   when '~u00ea' then message "abc".
end case.
This is converted to Java as a switch statement with a \U00EA case label, which is a syntax error. Instead of upper-casing the character that the escape denotes, the conversion upper-cases the text of the escape itself.
I fixed this in 3971b by emitting an if-else structure instead of a switch statement. The code now converts and compiles properly.
Please review 3971b.
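The difference between the two kinds of upper-casing can be sketched as below. This is an illustration only: the names are hypothetical, and String.equalsIgnoreCase is used as a stand-in for whatever comparison the converted if-else code actually calls.

```java
import java.util.Locale;

// Hypothetical illustration of the CASE-WHEN issue; not generated FWD code.
final class CaseLabelSketch
{
   /** Upper-cases the source text of an escape, which corrupts the marker. */
   static String upperCaseEscapeText(String javaLiteralText)
   {
      return javaLiteralText.toUpperCase(Locale.ROOT);
   }

   /** Upper-cases the decoded value, which is what the label needs. */
   static String upperCaseValue(String value)
   {
      return value.toUpperCase(Locale.ROOT);
   }

   /** An if-else dispatch compares values, so no label text is rewritten. */
   static String dispatch(String a)
   {
      if ("\u00ea".equalsIgnoreCase(a))
      {
         return "abc";
      }
      return "";
   }
}
```

Upper-casing the six characters of the escape text produces an uppercase marker that Java does not accept, while upper-casing the decoded value produces U+00CA.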
#11 Updated by Alexandru Lungu 3 months ago
- Status changed from Merge Pending to WIP
3971b was merged to trunk as rev. 15004 and archived.
This task will stay open until the ~UXXXXXX cases are implemented.
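For the remaining 6-digit cases, note that a Java unicode escape holds only four hex digits, so a supplementary code point has to be written as a surrogate pair. A minimal sketch of that mapping, assuming the code point has already been parsed and validated (the class and method names are hypothetical and not part of FWD):

```java
// Hypothetical helper; FWD does not support the 6-digit form yet.
final class SupplementarySketch
{
   /** Formats a code point as one or two 4-digit Java escapes. */
   static String toJavaEscapes(int codePoint)
   {
      StringBuilder sb = new StringBuilder();

      // toChars yields one char for plane 0 and a surrogate pair otherwise;
      // it throws IllegalArgumentException for an invalid code point
      for (char unit : Character.toChars(codePoint))
      {
         sb.append(String.format("\\u%04X", (int) unit));
      }
      return sb.toString();
   }
}
```

For example, code point 0x1F600 (written ~U01F600 in 4GL) maps to the pair D83D and DE00.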