Feature #3971

add support for unicode escape sequences in the preprocessor

Added by Greg Shah about 5 years ago. Updated 3 months ago.

Status: WIP
Priority: Normal
Target version: -
Start date:
Due date:
% Done: 50%
billable: No
vendor_id: GCD
version:

History

#1 Updated by Greg Shah about 5 years ago

At a minimum, ClearStream.read() must be updated to support the new Unicode escape sequences that can be encoded in 4GL source code.

From https://documentation.progress.com/output/ua/OpenEdge_latest/index.html#page/dvint%2Fnew-and-modified-keywords.html%23

This section describes inputting Unicode codepoints in ABL code, in support of supplementary characters.

To input Unicode scalar codepoints in plane 0 (U+0000 to U+FFFF), use this syntax:

Syntax

~uXXXX

XXXX

A 4-digit, case-insensitive hex digit.

To input Unicode scalar codepoints in planes 0 - 16 (U+0000 to U+10FFFF), use this syntax:

Syntax

~UXXXXXX

XXXXXX

A 6-digit, case-insensitive hex digit.

When the ABL code is parsed, this character value is converted from Unicode to -cpinternal. If the character is not a valid character in -cpinternal, the entire escaped string is passed through. For example, if -cpinternal is 1252, ~u4E00 is passed to ABL as is.
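For reference, the two syntaxes quoted above can be decoded roughly as follows. This is only a minimal sketch: TildeUnicodeDecoder and its methods are hypothetical names (nothing here is existing FWD code), it ignores the -cpinternal conversion entirely, and it assumes that a malformed sequence is simply left untouched.

```java
/** Hypothetical helper (not FWD code): decodes ~uXXXX and ~UXXXXXX escapes. */
final class TildeUnicodeDecoder
{
   /** Returns the input with every well-formed escape replaced by its character. */
   static String decode(String src)
   {
      StringBuilder out = new StringBuilder();
      int i = 0;
      while (i < src.length())
      {
         char c = src.charAt(i);
         int digits = 0;
         if (c == '~' && i + 1 < src.length())
         {
            char marker = src.charAt(i + 1);
            // lowercase marker: 4 hex digits, uppercase marker: 6 hex digits
            digits = (marker == 'u') ? 4 : (marker == 'U') ? 6 : 0;
         }
         int cp = (digits > 0) ? parseHex(src, i + 2, digits) : -1;
         if (isScalar(cp))
         {
            out.appendCodePoint(cp);
            i += 2 + digits;
         }
         else
         {
            // assumption: anything malformed is copied through unchanged
            out.append(c);
            i++;
         }
      }
      return out.toString();
   }

   /** Parses exactly count hex digits at start; returns -1 if that is not possible. */
   static int parseHex(String s, int start, int count)
   {
      if (start + count > s.length())
      {
         return -1;
      }
      int value = 0;
      for (int k = start; k < start + count; k++)
      {
         // note: Character.digit also accepts some non-ASCII digits
         int d = Character.digit(s.charAt(k), 16);
         if (d < 0)
         {
            return -1;
         }
         value = value * 16 + d;
      }
      return value;
   }

   /** True for U+0000..U+10FFFF, excluding the surrogate range. */
   static boolean isScalar(int cp)
   {
      return cp >= 0 && cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
   }
}
```

With this sketch, TildeUnicodeDecoder.decode("~u4E00") yields the single character U+4E00, while "~uxxxx" comes back unchanged.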

#5 Updated by Alexandru Lungu 3 months ago

The fix is simply to allow the Unicode marker to "escape" the preprocessor, i.e. to pass through untranslated:

=== modified file 'src/com/goldencode/p2j/preproc/ClearStream.java'
--- old/src/com/goldencode/p2j/preproc/ClearStream.java    2023-05-12 10:05:12 +0000
+++ new/src/com/goldencode/p2j/preproc/ClearStream.java    2024-02-22 11:10:00 +0000
@@ -144,6 +144,7 @@
 **     TJD 20220504          Java 11 compatibility minor changes
 **     CA  20221129          Do not process unix escapes found in comments.
 ** 030 GBB 20230512          Logging methods replaced by CentralLogger/ConversionStatus.
+** 031 AL2 20240222          Added support for unicode characters (~uXXXX).
 */

 /*
@@ -714,6 +715,9 @@
                case 'f':
                   nextChar = 0x0C;
                   break;
+               case 'u':
+                  passThru = true;
+                  break;
                default:
                   translated = false;
             }

I committed this to 3971a/rev. 15000. The conversion rules will then turn ~ into \ and use the native Java representation, so ~u2019 becomes \u2019. This is a partial fix: it solves only ~uXXXX and not ~UXXXXXX characters (which are not supported in FWD yet).
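As a note on why the 6-digit form needs separate work: Java source has only the 4-digit backslash-u escape, so a supplementary code point has to be written as a surrogate pair. A minimal sketch of that mapping is below; JavaEscapeSketch and toJavaEscape are hypothetical names and this is not how FWD does (or will do) it, just one possible approach.

```java
/** Hypothetical helper (not FWD code): renders a code point as Java escape text. */
final class JavaEscapeSketch
{
   /**
    * Returns the Java source escape text for a code point: one 4-digit escape
    * for plane 0, two escapes (a surrogate pair) for planes 1 - 16.
    */
   static String toJavaEscape(int codePoint)
   {
      StringBuilder sb = new StringBuilder();
      // Character.toChars yields one UTF-16 unit for plane 0, two otherwise
      for (char unit : Character.toChars(codePoint))
      {
         sb.append('\\').append('u').append(String.format("%04X", (int) unit));
      }
      return sb.toString();
   }
}
```

For example, toJavaEscape(0x2019) gives the text \u2019, and toJavaEscape(0x1F600) gives \uD83D\uDE00.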

  1. An incomplete representation (like ~uX) is accepted in 4GL (no error is thrown), but produces unusable results. For instance:
    &glob test ~u2
    &glob test 1

works, but will replace any occurrence of test with [^]lob test 1 (where [^] is the Unicode character that 4GL decodes from ~u2\n&g; the escape swallows the newline and the first two characters of the next line).

  2. Also, invalid Unicode escapes are accepted in 4GL:
    &glob test ~uxxxx

This doesn't show any character in 4GL, but in FWD it breaks because \uxxxx is not a valid Unicode escape in Java.

Problem 1. was not present before because Unicode characters in the preprocessor were not supported at all. It can be fixed separately, at very low priority.
Problem 2. is not related to the preprocessor, but to how Java handles invalid Unicode escapes coming from 4GL. It can also be fixed separately.
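If problem 2. is picked up later, one option would be to emit the Java escape only when the tilde sequence is well-formed. The sketch below shows the idea; EscapeGuardSketch and rewriteWellFormed are hypothetical names, not FWD code. It only keeps the generated Java compilable; it does not reproduce what 4GL actually does with the malformed sequences described above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical guard (not FWD code): rewrites only well-formed ~uXXXX escapes. */
final class EscapeGuardSketch
{
   /** Matches the lowercase marker followed by exactly four hex digits. */
   private static final Pattern WELL_FORMED = Pattern.compile("~u([0-9A-Fa-f]{4})");

   /** Replaces the tilde with a backslash for well-formed escapes only. */
   static String rewriteWellFormed(String text)
   {
      Matcher m = WELL_FORMED.matcher(text);
      StringBuilder out = new StringBuilder();
      int last = 0;
      while (m.find())
      {
         // copy the text before the match, then the escape with a backslash
         out.append(text, last, m.start());
         out.append('\\').append('u').append(m.group(1));
         last = m.end();
      }
      // incomplete or non-hex sequences fall through here untouched
      out.append(text, last, text.length());
      return out.toString();
   }
}
```

With this sketch, ~u2019 is rewritten to \u2019, while ~uxxxx and ~u2 are left as they are.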

Greg, please review 3971a. I tested with a large customer application (with incremental conversion of some files) and it behaves OK.

#6 Updated by Alexandru Lungu 3 months ago

  • Assignee set to Alexandru Lungu
  • % Done changed from 0 to 50
  • Status changed from New to WIP

I talked with Constantin and the fix is good to go. I am merging it now.

#7 Updated by Alexandru Lungu 3 months ago

3971a was merged to trunk as rev. 15000 and archived.

#8 Updated by Alexandru Lungu 3 months ago

  • Status changed from WIP to Review

There was an issue with unicode processing in CASE-WHEN statements:

def var a as char.
case a:
    when '~u00ea' then message "abc".
end case.

This is converted to a Java switch with a \U00EA case label, which is a syntax error. Instead of upper-casing the Unicode character itself, the conversion upper-cases the text of the escape sequence (including the u marker).

I fixed this in 3971b by enforcing an if-else structure instead of a switch-case one. The code now converts and compiles properly.
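To illustrate the difference between the two kinds of upper-casing, here is a small standalone sketch. CaseFoldingSketch and its methods are hypothetical names and this is not the conversion code itself.

```java
import java.util.Locale;

/** Hypothetical demo (not FWD code): two ways of upper-casing an escape. */
final class CaseFoldingSketch
{
   /** Upper-cases the escape text itself, which also upper-cases the marker. */
   static String upperCaseEscapeText(String escapeText)
   {
      return escapeText.toUpperCase(Locale.ROOT);
   }

   /** Upper-cases the character that the escape stands for. */
   static String upperCaseCharacter(int codePoint)
   {
      return new String(Character.toChars(Character.toUpperCase(codePoint)));
   }
}
```

Passing the six-character text \u00ea to upperCaseEscapeText gives \U00EA, which is not a valid Java escape, while upperCaseCharacter(0x00EA) gives the character U+00CA.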

Please review 3971b.

#9 Updated by Greg Shah 3 months ago

  • Status changed from Review to Internal Test

Code Review Task Branch 3971b Revision 15004

The change looks good. What other testing is needed?

#10 Updated by Greg Shah 3 months ago

  • Status changed from Internal Test to Merge Pending

I understand it is already tested. It can merge to trunk now.

#11 Updated by Alexandru Lungu 3 months ago

  • Status changed from Merge Pending to WIP

3971b was merged to trunk as rev. 15004 and archived.

This task will stay open until the ~UXXXXXX cases are implemented.
