Project

General

Profile

Feature #6428

implement IS-LEAD-BYTE() built-in function

Added by Greg Shah almost 2 years ago. Updated 8 months ago.

Status:
New
Priority:
Normal
Assignee:
-
Target version:
-
Start date:
Due date:
% Done:

0%

billable:
No
vendor_id:
GCD
version:

Related issues

Related to Base Language - Bug #4766: fix CHR and ASC New

History

#1 Updated by Greg Shah about 1 year ago

  • Assignee set to Joe Davis

This task is to implement the runtime support for IS-LEAD-BYTE(). Conversion support is already complete (search on prog.kw_is_lead in rules/convert/builtin_functions.rules). The runtime support is stubbed (see TextOps.isLeadByte()).

ABLUnit testcase code will need to be written which covers a suitable range of cases to show the full behavior of the function. Then the backing implementation can be written in I18nOps and called from TextOps.

#2 Updated by Joe Davis about 1 year ago

It's taken me a bit of time to grok the Text class, and how ABL codepages correspond to JVM ones, but I believe I have a working implementation (still working on the tests). I'd just like to confirm that it sounds reasonable, and I haven't committed any obvious missteps.

  • As I understand it, Text stores strings internally as Java String objects in UTF-16, and may optionally store a codepage along side it.
  • My implementation takes the String, calls asBytes on it with the correct character set / codepage, and takes the first character (I would prefer to just convert the first character of the string, but I suspect there would be issues there when using unicode and combining characters, or characters that need a surrogate pair to represent in UTF-16).
  • Then, depending on the codepage of the system or Text, the initial byte returned is then compared against the values in the page from Progress' documentation linked above.

#3 Updated by Greg Shah about 1 year ago

Roughly, this seems right.

  1. Before you go very far, please implement the 4GL testcase(s) to ensure we understand the actual behavior. Do not rely upon the 4GL reference docs to be correct or complete. I think you should even check any lead-byte cases for multi-byte encodings which they do not list but which would normally be seen in other common encodings (if there are any).
  2. I would look at the Java I18N support to see if any of the lead-byte detection methods can be used as helpers. It is less common that this can happen but for I18N the 4GL might be closer to the standards than they usually are.

#4 Updated by Joe Davis about 1 year ago

Here's a short summary of findings from today, I'll follow up with a status update shortly.

Greg Shah wrote:

Before you go very far, please implement the 4GL testcase(s) to ensure we understand the actual behavior. Do not rely upon the 4GL reference docs to be correct or complete. I think you should even check any lead-byte cases for multi-byte encodings which they do not list but which would normally be seen in other common encodings (if there are any).

Working on these. More details below.

I would look at the Java I18N support to see if any of the lead-byte detection methods can be used as helpers. It is less common that this can happen but for I18N the 4GL might be closer to the standards than they usually are.

From what I've found today, there doesn't appear to be anything in the standard library that would assist with this. There are a few internal classes that have private methods like isDoubleByteCharacter, but the details of encoding/decoding don't appear to be exposed.


I also took some time to go through our own I18NOps class, and while looking at the I18NOps.chr implementation, it appears not to support any unicode characters outside the Unicode Basic Multilingual plane:

// interpret key code using source codepage
String chr = get4glCharacter(ascCode, sourceJavaCP);
// 4gl checks for lead byte, however there should be only one character
if (chr == null || chr.length() > 1) {
return null;
}
return chr;
// interpret key code using source codepage
String chr = get4glCharacter(ascCode, sourceJavaCP);
// 4gl checks for lead byte, however there should be only one character
if (chr == null || chr.length() > 1) {
return null;
}
return chr;

The above will return null for any multi-byte characters (those with codepoints above 65534). This is noted in character.chr, but the Progress documentation seems to indicate that it should support a wider range of characters:

CHR: If the INTEGER value is greater than 255 and less than 15,712,191, CHR checks for a valid lead-byte value. If found, CHR returns a double-byte character.

I haven't yet tested this on the OpenEdge environment, and will pick up here in the morning. The rough testing strategy will be to iterate through valid characters (using CHR to checki valid), and then passing the result of CHR to IS-LEAD-BYTE. If it turns out that Progress does indeed support encodings above 65534 characters, then as far as I can tell, it will also be necessary to modify the CHR implementation to support that as well.

#5 Updated by Greg Shah about 1 year ago

Good work, yes this makes sense.

#6 Updated by Joe Davis about 1 year ago

Summary of findings for today:

  • I've confirmed that character.chr behaves differently to Progress' CHR for multi-byte characters. Take U+20341 (Encoded as surrogate pairs in UTF-16 as 0xD840DF41, decimal: 3628130113), a CJK character outside of the BMP for example: Progress' CHR successfully decodes it. While I18NOps.chr encounters a 2-character String upon decoding and returns null instead. Note as well that 3628130113 is greater than the stated range supported by CHR. For the most part, CJK characters will be within the BMP, so this may not be an issue in practice. Nevertheless, it's still an incompatibility. I think it can be solved by just removing the check for length in I18NOps.chr, but without knowing why that check was there in the first place I'd be hesitant to remove it.
  • I haven't yet tested all encodings with IS-LEAD-BYTE, but from the ones I have (those listed in the link above), there hasn't been any unexpected behaviour. I've implemented some tests cases for this, which I'll try to make more comprehensive come next week. Still getting used to ABL.

#7 Updated by Greg Shah about 1 year ago

but without knowing why that check was there in the first place I'd be hesitant to remove it.

If I recall correctly, it is there because we didn't implement the more complete support for multi-byte characters.

#8 Updated by Joe Davis about 1 year ago

I had misread the current docs, and didn't use ABLUnit for tests (instead writing standalone tests). I'm in the process of converting them to use ABLUnit, which is easy enough in and of itself and seems to be going smoothly, but it's not clear to me from the testcases docs how the ABLUnit tests are actually run (as opposed to standalone cases which can be passed as the startup-procedure parameter). There's a fair chance that I'm missing something obvious as to how to run them, do we have any documentation regarding this?

#9 Updated by Greg Shah about 1 year ago

No, you haven't missed anything. See #3827. The related code in 3827a is close to merging to trunk. Vladimir is writing the docs now.

#10 Updated by Vladimir Tsichevski about 1 year ago

Greg Shah wrote:

No, you haven't missed anything. See #3827. The related code in 3827a is close to merging to trunk. Vladimir is writing the docs now.

Yes, we are all waiting for 3827a is merged to the trunk and thus available to general public :-) I am in the course of writing some instructions on how to use ABLUnit in FWD.

#11 Updated by Joe Davis about 1 year ago

I've hit a wall here, as I misunderstood some Progress behaviour. During my initial exploration on the OpenEdge system, I checked the behaviour of IS-LEAD-BYTE with modified -cpinternal. I constructed strings using CHR of various charsets, and converted between them, noting the behaviour of IS-LEAD-BYTE.

When writing tests however, I wanted something more easily integrated into the test runner. Because it's not possible to change session:cpinternal at runtime (it has to be specified using the JVM -Dfile.encoding parameter), I used LONGCHAR in my tests and manually specified the codepage, an abridged example:

   def var test_chr as longchar.
   fix-codepage(test_chr) = "SHIFT-JIS".
   test_chr = chr(37882, "SHIFT-JIS", "SHIFT-JIS").
   ilb = is-lead-byte(c).

Because I'd run all of the CHR and IS-LEAD_BYTE(...) calls in Progress beforehand, I never retested the whole thing until now. It turns out that Progress does not allow for IS-LEAD-BYTE to be called on a LONGCHAR object.

So now I'm at a loss for how to proceed with testing, as the only way forward I can see is to write a script to rerun server.sh with a different -Dfile.encoding each time (which may cause additional problems when reading other files from disk).

#12 Updated by Greg Shah about 1 year ago

I think we have this same problem for all of our I18N tests, many of which still need to be written.

Because it's not possible to change session:cpinternal at runtime (it has to be specified using the JVM -Dfile.encoding parameter),

Actually, the CPINTERNAL is not supposed to be set using -Dfile.encoding. Instead, it should be set using the directory (see I18nOps.WorkArea which has private String cpInternal = getOverrideFromDirectory("i18n/cpinternal");. The code for we have for the "fall back" to 8859-1 when the default UTF-8 is set as the JVM -Dfile.encoding is some old nonsense which we haven't cleaned up. We should not use it to actually set the CPINTERNAL even though it is possible. I don't think it will work properly because we actually assume that the default JVM encoding is UTF-8 in some places in our code.

I don't think we've made that override possible from the command line. Technically we can use different accounts in the directory and have them set different CPINTERAL values but that would be pretty messy. Thus the problem here is the same as you've already identified: the server will need to be restarted in between test runs with different CPINTERNAL values unless we expose this directory override as a bootstrap config value. Then we can set that value for a given set of tests without restarting the server. I think this is the best approach.

It will require some additional development on your part to add this feature. Please see how we can initialize some client-side values/overrides in the class ClientParameters and then pass it to the server to set these values for the session.

#13 Updated by Greg Shah about 1 year ago

  • Related to Bug #4766: fix CHR and ASC added

#14 Updated by Ovidiu Maxiniuc about 1 year ago

In 4GL, the CPINTERNAL attribute is rather a constant than a variable. It is configured when the application is executed using -cpinternal command-line parameter; it can be obtained during the execution of a procedure using session:cpinternal but an attempt to alter it will raise error 4052 (not a settable attribute).

This setting drive the way the string literals are stored in memory as binary data. It's logical that you cannot alter it during runtime - this would make any character values stored up to that moment unreadable. The problem is that in Java, the String class we use for storing character data uses a fixed UTF-16 representation and the String has a char value[] as data payload. There is a single alternative to this -XX:+UseCompressedStrings which would switch to byte[] instead. As result, as long as we use the native String data type, we cannot directly emulate the CPINTERNAL. Each time the 4GL program requires to analyse data from this POV, FWD has to temporarily create a replica (of byte[]) of the internal storage (char[]) and return the requested value.

Here is an example using sign (we encountered issues with this char before).
  • It has Unicode U+20AC (8364). In Java it is represented as char[1] {0 = '€' 8364}. If we split it in byes we get {0x20, 0xAC}.
  • If the CPINTERNAL is the default ISO8859-1, it cannot be represented because it does not exist in this CP.
  • However, it exist in ISO8859-15 and the in-memory representation would be 1 byte long 164 0xA4 (normally since it is a single character and each character takes one byte to be represented in single-byte CPs). (Note: I will ignore the C-string terminator '\0' here).
  • When CPINTERNAL is set to UTF-8 in 4GL, this character takes three bytes: E2, 82, AC. The asc("€") is 14844588.

So, for the same character, IS-LEAD-BYTE will return NO if CPINTERNAL is ISO8859-1 and YES for UTF-8. This is not immediately visible form FWD/Java internal representation. We need to convert the (first) character from Java (String / char[]) to CPINTERNAL (byte[]) and analyse the latter to return the correct result.

For 1-byte CPs, it seems logic that IS-LEAD-BYTE will always return No. For the rest (UTF-8), I suspect that we might use 0xFF80 mask to get the result of this function like this:

return (myString.charAt(0) & 0xFF80) != 0;
, but conjecture must be proven.

#15 Updated by Vladimir Tsichevski about 1 year ago

I think, the unit testing for the function can be done in the form of JUnit5 Jupiter engine tests at the Java level, as we did before ABLUnit era.

#16 Updated by Greg Shah about 1 year ago

I think, the unit testing for the function can be done in the form of JUnit5 Jupiter engine tests at the Java level, as we did before ABLUnit era.

The problem with this is that setting things like CPINTERNAL will be needed for most or all of the I18N tests. We need to be able to run these tests in 4GL compatibility mode to ensure that all our I18N support actually works for customer applications. Let's plan on an approach where we set the values as bootstrap cfg overrides and then run a set of tests. We will have to run these same tests many times with different values.

#17 Updated by Vladimir Tsichevski about 1 year ago

Greg Shah wrote:

I think, the unit testing for the function can be done in the form of JUnit5 Jupiter engine tests at the Java level, as we did before ABLUnit era.

The problem with this is that setting things like CPINTERNAL will be needed for most or all of the I18N tests. We need to be able to run these tests in 4GL compatibility mode to ensure that all our I18N support actually works for customer applications. Let's plan on an approach where we set the values as bootstrap cfg overrides and then run a set of tests. We will have to run these same tests many times with different values.

I think, we can also provide a "backdoor" way to change the CPINTERNAL in runtime for unit testing exclusively. Each test or a set of related tests will change the value for the test duration and restore it back to the original value on when the test is complete.

#18 Updated by Greg Shah about 1 year ago

Vladimir Tsichevski wrote:

Greg Shah wrote:

I think, the unit testing for the function can be done in the form of JUnit5 Jupiter engine tests at the Java level, as we did before ABLUnit era.

The problem with this is that setting things like CPINTERNAL will be needed for most or all of the I18N tests. We need to be able to run these tests in 4GL compatibility mode to ensure that all our I18N support actually works for customer applications. Let's plan on an approach where we set the values as bootstrap cfg overrides and then run a set of tests. We will have to run these same tests many times with different values.

I think, we can also provide a "backdoor" way to change the CPINTERNAL in runtime for unit testing exclusively. Each test or a set of related tests will change the value for the test duration and restore it back to the original value on when the test is complete.

I did think about this, but the bootstrap cfg approach does provided this same capability. One would set the new value in the bootstrap cfg used to start the test run, then that value would be there for the duration of all tests. The advantage of this is that it won't have any weird side effects or problems introduced because it will be set the same way as the production install would do it. The only down side is that all tests will share the same values. This means that different values would require multiple test runs, which I think is OK.

#19 Updated by Greg Shah 8 months ago

  • Assignee deleted (Joe Davis)

Also available in: Atom PDF