Project

General

Profile

Feature #4389

escape support for supplementary unicode characters

Added by Greg Shah over 4 years ago. Updated over 4 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Target version:
-
Start date:
Due date:
% Done:

0%

billable:
No
vendor_id:
GCD

History

#1 Updated by Greg Shah over 4 years ago

Branch 4069a 11374 conversion added support for encoding Basic Multilingual Plane (BMP) unicode characters as represented in the 4GL. BMP chars are the standard UTF-16 (fixed 2 byte representation) characters that are built into the low level Java char and String operations. In the 4GL these are encoded in source using ~uXXXX (Windows and Linux/UNIX) and \uXXXX (only in Linux/UNIX) where X is a hexadecimal digit.

There do exist other unicode characters outside of the BMP which cannot be represented in a single UTF-16 character. These are called supplementary characters and can be encoded naturally in UTF-32. In the 4GL these are encoded in source using ~UXXXXXX (Windows and Linux/UNIX) and \UXXXXXX (only in Linux/UNIX) where X is a hexadecimal digit. The previous 4069a code did not support this form of the escape sequence conversion.

Java can work with these using int types and the Character class to take UTF-32 characters and represent them as two sequential UTF-16 characters (Java char values) which must follow special encoding rules. These are called the high surrogate and low surrogate respectively. These can be encoded in Java source using this form.

Branch 4069a revision 11450 adds conversion support for this supplementary character escape encoding.

Please read the following references for the full details.

Basic Multilingual Plane
Java Lexical Structure
How Java Supplementary Character Encoding Works (really great reference!)
Unicode Escapes in Java
Testing Supplementary Characters
Unicode 12.1 Technical Site

Also available in: Atom PDF