Feature #3753: I18N additions - Base Language - Golden Code Redmine

Feature #3753

I18N additions

Added by Greg Shah over 5 years ago. Updated almost 4 years ago.

Status: Closed
Priority: Normal
Assignee:
Target version:
Start date:
Due date:
% Done: 100%
billable:
vendor_id: GCD
version_reported:
version_resolved:

code_pages_order.jpg - GET-CODEPAGES result order (71.4 KB) Eugenie Lyzenko, 05/02/2019 08:58 PM

Related issues

History

#1 Updated by Greg Shah over 5 years ago

Related to Feature #3292: i18n improvements added

#2 Updated by Greg Shah over 5 years ago

Implement the following I18N features for Phase 1 (POC):

FIX-CODEPAGE statement
CODEPAGE-CONVERT (runtime)
GET-CODEPAGE
CURRENT-LANGUAGE conversion
INPUT STREAM CONVERT runtime support
SESSION:CPSTREAM (improve the runtime)

#3 Updated by Greg Shah over 5 years ago

Implement these for Phase 2 (main project):

generalized CONVERT option support (e.g. not just INPUT but also used in OUTPUT TO)
NO-MAP I/O option
NO-CONVERT runtime support
GET-CODEPAGES() function
IS-CODEPAGE-FIXED() function
CURRENT-LANGUAGE function and statement (it is partial today, complete the runtime support; please see #3817 which is related to this)
#3817 string resource bundles and translation manager replacement

#4 Updated by Constantin Asofiei over 5 years ago

Assignee set to Constantin Asofiei
Status changed from New to WIP

#5 Updated by Constantin Asofiei over 5 years ago

The conversion issues for this task (phase 1) are solved in 3750a rev 11297.

#6 Updated by Constantin Asofiei over 5 years ago

A note to check: editor with LONGCHAR value, having a non-default codepage.

#7 Updated by Greg Shah over 5 years ago

The CURRENT-LANGUAGE implementation really should not be needed for the POC, since the default language's string constants can be used. As such, this will be deferred to phase 2.

#8 Updated by Greg Shah over 5 years ago

Assignee changed from Constantin Asofiei to Eugenie Lyzenko

Constantin: Eugenie is going to take this task. I know you finished part of the work on phase 1 (conversion and some runtime), but not all of it was complete. Please do the following:

1. If you have any pending work, we need to get that into a branch somewhere.
2. We need an update to this task to detail what is done and what is left to do.
3. A specific list of items and questions for which we need testcases.

Eugenie: If there are parts of the work for which we don't need testcases, you can start work on those.

#9 Updated by Eugenie Lyzenko over 5 years ago

Greg Shah wrote:

Eugenie: If there are parts of the work for which we don't need testcases, you can start work on those.

OK.

#10 Updated by Constantin Asofiei over 5 years ago

All of the runtime needs to be implemented. Below is a list of what needs to be done.

For FIX-CODEPAGE and GET-CODEPAGE, both conversion and runtime support are required. Testing should be done for:

is the codepage copied from one longchar value to another?
is the codepage involved in comparison operators?
FIX-CODEPAGE with empty, unknown, non-empty longchar vars
what if the codepage is already set?
clob fields - can they work with fix-codepage and get-codepage? Can the codepage be set in some other way?
editor with large-object (which can display a LONGCHAR val, with or without a codepage set) - how is the text displayed?
assigning a longchar to a char - is the codepage inherited from the rvalue?
assignment between longchars - the same, is the code page included in the assign?
is the codepage affecting the character bytes? For example:
```
// lc1a, lc1b - same codepage
// lc2 - other codepage
lc2 = "some text which may differ in the codepage".
lc1a = "some text which may differ in the codepage".
lc1b = lc2.
```
Are lc1a and lc1b equal, i.e. is the final text in lc1b unaffected by the initial codepage in lc2? The idea here is to determine whether the longchar's codepage is used when assigning a text to it (so that the text is kept in memory converted to the target codepage).
how are other statements which work with strings, affected?

dlc/convmap.cp support:

we need to specify the list of known codepages (or default to the Java's available codepages)
CODEPAGE-CONVERT, INPUT STREAM CONVERT work with this
what is the default codepage value - some explanation is in https://documentation.progress.com/output/ua/OpenEdge_latest/index.html#page/dvint/determining-the-code-page.html

CODEPAGE-CONVERT and INPUT STREAM CONVERT:

combinations of source and target codepages
the source codepage is not the real text's codepage
source/target codepages are not in the convmap.cp list

#11 Updated by Constantin Asofiei about 5 years ago

Greg, see above for the runtime I18N - are these what you are looking for?

About the translation manager and translatable strings; we need tests to prove:

Are all strings without :U translatable? - see these https://documentation.progress.com/output/ua/OpenEdge_latest/index.html#page/dvref/-22--22character-string-literal.html
Translation Database
- How is the translation database saved? Is this a simple Progress DB? If so, what is the schema?
- What is the character encoding of the database?
- Can the source text and translated text be in different encodings?
- Is there any functionality in OpenEdge that reads the translation database at runtime, or is this only for building the compile-time r-code text segments?
How can you switch between translations? Is this related to CURRENT-LANGUAGE?
Are 4GL system error messages translated, too? (we have been assuming YES, that CURRENT-LANGUAGE will select different message sources in OpenEdge)
Are only standalone static strings translated? What if the string is in an expression, like "there is an error in program" + pname: will the "there is an error in program" string be translatable?
How does the 4GL behave if one string has a translation and others do not? Is this done at compile time, so that a translation can't be added later, or at runtime, so that the .r will see any newly added translation?

#12 Updated by Greg Shah about 5 years ago

see above for the runtime I18N - are these what you are looking for?

Yes, this is what I was looking for.

#13 Updated by Constantin Asofiei about 5 years ago

Something else to add:

texts at the schema definition (labels, and so on) - are these translatable?

#14 Updated by Greg Shah about 5 years ago

Eugenie: Please make a list of the items in this task which do not need any testcases written. These are items for which you already have enough information to implement. I imagine that all the conversion and some of the runtime features can be worked now.

#15 Updated by Eugenie Lyzenko about 5 years ago

Greg Shah wrote:

Eugenie: Please make a list of the items in this task which do not need any testcases written. These are items for which you already have enough information to implement. I imagine that all the conversion and some of the runtime features can be worked now.

OK.

#16 Updated by Eugenie Lyzenko about 5 years ago

Constantin,

Can you provide the short list of what is already DONE in this task?

I'm creating the implementation plan and need to clearly separate the things that are ready from the rest of the TODO list.

#17 Updated by Constantin Asofiei about 5 years ago

Eugenie Lyzenko wrote:

Constantin,

Can you provide the short list of what is already DONE in this task?

Items in #3753-2 should have full conversion support already, with stubbed (or partial) runtime. Items in #3753-3 I don't think have conversion support, and no runtime.

I would start with conversion support for #3753-3, as the syntax is not that complex.

#18 Updated by Eugenie Lyzenko about 5 years ago

My plan for the first steps is to:
1. Investigate in detail all external resources to be used for the I18N runtime support. For now I see only the dlc/convmap.cp file. We need to properly define where these resources will be located and how we will use them.
2. Implement some simple features that use the external resources from point 1:
- FIX-CODEPAGE() statement
- IS-CODEPAGE-FIXED() function
- CODEPAGE-CONVERT() function
- GET-CODEPAGE() function
- GET-CODEPAGES() function
- CURRENT-LANGUAGE() function and statement

Let me know if this plan needs corrections according to current projects requirements.

#19 Updated by Eugenie Lyzenko about 5 years ago

And I will create 3753a branch to upload the changes if no objections.

#20 Updated by Eugenie Lyzenko about 5 years ago

Created task branch 3753a from trunk revision 11301.

#21 Updated by Greg Shah about 5 years ago

Let me know if this plan needs corrections according to current projects requirements.

I think the plan is OK. The tricky part is trying to avoid the areas where tests are being written. If you must explore some of these topics with your own testcases (due to time constraints), please note these here in advance so that people writing tests don't duplicate work.

#22 Updated by Eugenie Lyzenko about 5 years ago

Greg Shah wrote:

Let me know if this plan needs corrections according to current projects requirements.

I think the plan is OK. The tricky part is trying to avoid the areas where tests are being written. If you must explore some of these topics with your own testcases (due to time constraints), please note these here in advance so that people writing tests don't duplicate work.

OK. I think I cannot completely avoid testcases, just to verify that the implemented functionality works. Something simple, like:

result = function|statement(args).
message result.

This is required so that I am not completely blind during the implementation.

#23 Updated by Eugenie Lyzenko about 5 years ago

Task branch 3753a has been updated for review to revision 11302.

The first steps in I18N implementation. Added the conversion and runtime support for GET-CODEPAGES function. Currently runtime is based on Java internal variables, no convmap.cp feature yet. This is in research/design stage.

The testcases repo has been updated to rev 1832 with simple testcases for GET-CODEPAGE(S) functions.

Several considerations.

A point to clarify about the FIX-CODEPAGE/IS-CODEPAGE-FIXED statement/function; correct me if I'm wrong. My understanding is that in ABL the parameter passed to them is the name of a variable that is not yet initialized (not assigned anything), not the variable itself. For example, the following code is correct:

  def var chVar as longchar.
  FIX-CODEPAGE("chVar") = "IBM850".

but the following code is not correct:

  def var chVar as longchar.
  FIX-CODEPAGE(chVar) = "IBM850".

This means that internally Progress keeps a map of fixed variables in the format:
"name"<-->"codepage"
Then, when a code page dependent operation is called, Progress checks whether the operand is in the map of fixed variable names. If it is, Progress uses the fixed code page for this variable as the source CP, instead of the default one (or the one defined for other transformations), when the business logic needs to convert the variable to another code page. Is this understanding correct?

On the other hand, Java strings have no code page assigned per String object; a String object is just a bounded sequence of character data. The code pages are specified at the time of a transformation. So we will need to keep a registry map of all variable names in the current session that use a "fixed" codepage.

Is it OK to implement all the I18N specific code inside the p2j/util/EnvironmentOps class? Or would it be better to create another helper class completely dedicated to the I18N implementation? I guess the new I18N code could become a big part of the file.

Continue working. Next step will be FIX-CODEPAGE(), IS-CODEPAGE-FIXED(), GET-CODEPAGE(). And attaching the FWD functionality of the convmap.cp.

#24 Updated by Greg Shah about 5 years ago

I don't understand. The following code works:

def var lc as longchar.
message "LC CP (before) = " + get-codepage(lc).
message get-codepages.
fix-codepage(lc) = "1252".
message "LC CP (after) = " + get-codepage(lc).

/*
this generates:
?
1256,709,708,721,711,786,714,710,720,BIG-5,GB2312,CP936,CP950,IBM852,1250,ISO8859-2,1253,IBM851,ISO8859-8,IBM862,IBM850,IBM858,ISO8859-1,ISO8859-15,SHIFT-JIS,EUCJIS,KSC5601,CP949,CP1361,1252,1257,MAZOVIA,ROMAN-8,KOI8-R,1251,IBM866,ISO8859-5,620-2533,1254,IBM857,UNDEFINED,IBM861,IBM437,UTF-8,UCS2,UTF-32,UTF-16,UTF-16BE,UTF-16LE,UTF-32BE,UTF-32LE,ISO6937,CP950-HKSCS,GB18030
LC CP (after) = 1252
*/

Using a string literal or char expression as the parameter to FIX-CODEPAGE() does not work.

#25 Updated by Greg Shah about 5 years ago

Currently runtime is based on Java internal variables, no convmap.cp feature yet.

We should start with a standard set of known codepages and a way to map them to Java charsets.

Customers will need a way to customize this. The additional codepage to charset mappings should be implemented in the directory. But the standard set should be built in to the runtime, no directory entries needed.
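The split between a built-in set and customer additions could look roughly like the sketch below. This is only an illustration: `CodePageRegistry`, `addCustom` and `lookup` are hypothetical names, the name pairs in the built-in table are assumptions rather than a verified mapping, and the customer overrides are held in a plain map here instead of being read from the directory.

```java
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical registry mapping 4GL codepage names to Java charsets. */
final class CodePageRegistry
{
   /** Built-in defaults; the pairs below are illustrative assumptions. */
   private static final Map<String, String> BUILT_IN = new HashMap<>();

   static
   {
      BUILT_IN.put("ISO8859-1", "ISO-8859-1");
      BUILT_IN.put("ISO8859-15", "ISO-8859-15");
      BUILT_IN.put("UTF-8", "UTF-8");
      BUILT_IN.put("1252", "windows-1252");
      BUILT_IN.put("IBM850", "IBM850");
   }

   /** Customer additions; in a real runtime these would come from configuration. */
   private final Map<String, String> custom = new HashMap<>();

   void addCustom(String codePage, String javaName)
   {
      custom.put(codePage.toUpperCase(Locale.ROOT), javaName);
   }

   /** Returns the Java charset, or null if the name is unknown or unsupported in this JVM. */
   Charset lookup(String codePage)
   {
      String key = codePage.toUpperCase(Locale.ROOT);
      String javaName = custom.getOrDefault(key, BUILT_IN.get(key));
      if (javaName == null)
      {
         return null;
      }
      try
      {
         return Charset.isSupported(javaName) ? Charset.forName(javaName) : null;
      }
      catch (IllegalArgumentException e)
      {
         // the configured name is not a legal charset name
         return null;
      }
   }
}
```

A custom entry takes precedence over a built-in one with the same name, so a customer can both add codepages and redirect a standard one.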

Then, when a code page dependent operation is called, Progress checks whether the operand is in the map of fixed variable names. If it is, Progress uses the fixed code page for this variable as the source CP, instead of the default one (or the one defined for other transformations), when the business logic needs to convert the variable to another code page. Is this understanding correct?

I don't think so. I think this just sets a value inside the longchar var itself.

You can only set this value before the var has real data. We will need to test to see how it affects assignment, copy-lob, overlay, substring and other statements that can assign data.

On the other hand, Java strings have no code page assigned per String object; a String object is just a bounded sequence of character data. The code pages are specified at the time of a transformation. So we will need to keep a registry map of all variable names in the current session that use a "fixed" codepage.

This should not be needed. I think the codepage just changes how the data is transformed at assignment.
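A minimal sketch of this alternative, with the codepage stored inside the longchar value itself rather than in a session-wide name registry. `LongcharSketch` and its methods are hypothetical and are not the FWD longchar class. The rule that the codepage can be fixed only while the value holds no data follows the comment above and still has to be confirmed by tests; returning `false` instead of raising a 4GL error is a simplification.

```java
/** Hypothetical stand-in for a longchar value that carries its own codepage. */
final class LongcharSketch
{
   private String value;      // null models "no data yet"
   private String codePage;   // null until a codepage is fixed
   private boolean fixed;

   /** FIX-CODEPAGE analogue: assumed to succeed only while no data is held. */
   boolean fixCodePage(String cp)
   {
      if (value != null && !value.isEmpty())
      {
         return false;
      }
      codePage = cp;
      fixed = true;
      return true;
   }

   /** IS-CODEPAGE-FIXED analogue. */
   boolean isCodePageFixed()
   {
      return fixed;
   }

   /** GET-CODEPAGE analogue; null plays the role of the unknown value. */
   String getCodePage()
   {
      return codePage;
   }

   void assign(String text)
   {
      value = text;
   }

   String getValue()
   {
      return value;
   }
}
```

How assignment, COPY-LOB, OVERLAY and SUBSTRING interact with the fixed codepage is deliberately left out, since that behavior is exactly what the pending testcases are meant to establish.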

Is it OK to implement all the I18N specific code inside the p2j/util/EnvironmentOps class? Or would it be better to create another helper class completely dedicated to the I18N implementation? I guess the new I18N code could become a big part of the file.

Better to create a new helper class. But the functions/statements that operate only on longchar should be in the longchar class.

#26 Updated by Eugenie Lyzenko about 5 years ago

Greg Shah wrote:

I don't understand. The following code works:

[...]

Using a string literal or char expression as the parameter to FIX-CODEPAGE() does not work.

OK. I need to add conversion support for FIX-CODEPAGE() statement because it is missing.

#27 Updated by Eugenie Lyzenko about 5 years ago

Greg Shah wrote:

...
Better to create a new helper class. But the functions/statements that operate only on longchar should be in the longchar class.

I'm going to introduce a new helper class in p2j/utils/I18nOps.java to separate all the I18N server-side specifics from the rest of the environment processing (except longchar variable related calls, which will be handled in longchar). Please let me know if the class name or location is wrong.

#28 Updated by Eugenie Lyzenko about 5 years ago

Task branch 3753a has been updated for review to revision 11303.

This is the support for FIX-CODEPAGE statement(conversion and runtime). Reworked support for GET-CODEPAGES, added support for IS-CODEPAGE-FIXED, GET-CODEPAGE. If the feature is longchar dependent - the implementation is inside longchar class.

The testcases updated to revision 1833, added simple test for FIX-CODEPAGE/IS-CODEPAGE-FIXED.

The reworked approach for the GET-CODEPAGES call is based on the idea of having a 4GL to Java mapping for the currently supported charset names. For now the default convmap code page set is what we have in the original 4GL system. From this set we select only the character sets that are supported in the Java base package. I have scanned the Java charsets: some code pages were found, but some were not. Additional work is needed to find out what to do with these 4GL encodings. The problematic code pages:

709
708
721
711
786
714
710
720
CP936
CP950
IBM858
EUCJIS
KSC5601
CP949 - is it x-windows-949?
CP1361
MAZOVIA
ROMAN-8
UNDEFINED
UCS2
ISO6937
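A scan like the one described can be reproduced with `Charset.isSupported`. The sketch below is an assumption about how such a probe might look, not the code in the branch; `CharsetProbe` is a hypothetical name, and the result depends on the JDK build and on which charset modules are installed.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that reports which candidate names the JVM cannot resolve. */
final class CharsetProbe
{
   /** Returns the names from the input that this JVM does not resolve to a charset. */
   static List<String> unsupported(List<String> names)
   {
      List<String> missing = new ArrayList<>();
      for (String name : names)
      {
         boolean ok;
         try
         {
            ok = Charset.isSupported(name);
         }
         catch (IllegalArgumentException e)
         {
            // the name is not even syntactically legal as a charset name
            ok = false;
         }
         if (!ok)
         {
            missing.add(name);
         }
      }
      return missing;
   }
}
```

Candidate Java aliases for a 4GL name (for example the `x-windows-949` guess for CP949) can be passed through the same method to see whether the running JVM knows them.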

Continue working. I also need to understand what the special UNDEFINED code page value means. Is it the default for the current OS?

The next step will be adding full support for CURRENT-LANGUAGE() and CODEPAGE-CONVERT().

#29 Updated by Eugenie Lyzenko about 5 years ago

Task branch 3753a has been rebased with trunk 11302, new revision is 11304.

#30 Updated by Eugenie Lyzenko about 5 years ago

The testcases bzr repo updated to rev 1834 with simple test for CURRENT-LANGUAGE statement/function to verify the support.

The result - we already have full conversion and runtime support for this based on current directory configuration. The current language value is stored permanently in directory.xml from one session to another. If this approach is OK we have nothing to do for this statement/function.

Starting to work on CODEPAGE-CONVERT() function.

#31 Updated by Greg Shah about 5 years ago

The result - we already have full conversion and runtime support for this based on current directory configuration. The current language value is stored permanently in directory.xml from one session to another. If this approach is OK we have nothing to do for this statement/function.

Please read #3817 carefully. Setting the CURRENT-LANGUAGE will replace string literals in the code with a different version stored at compile time. The trick to seeing this is that the CURRENT-LANGUAGE must be changed and then the next programs loaded will be affected. Of course, they must have had the replacement strings setup using the Translation Manager. I really doubt we support any of this. Just being able to query and set the value is not enough.

#32 Updated by Eugenie Lyzenko about 5 years ago

Task branch 3753a has been updated for review to revision 11305.

Added stub to handle generic case of the CODEPAGE-CONVERT() function in new helper class I18nOps. Several versions of calls in TextOps will finally call single method in I18nOps to generalize processing. Current conversion approach is not changed. Continue working.

#33 Updated by Eugenie Lyzenko about 5 years ago

Greg Shah wrote:

The result - we already have full conversion and runtime support for this based on current directory configuration. The current language value is stored permanently in directory.xml from one session to another. If this approach is OK we have nothing to do for this statement/function.

Please read #3817 carefully. Setting the CURRENT-LANGUAGE will replace string literals in the code with a different version stored at compile time. The trick to seeing this is that the CURRENT-LANGUAGE must be changed and then the next programs loaded will be affected. Of course, they must have had the replacement strings setup using the Translation Manager. I really doubt we support any of this. Just being able to query and set the value is not enough.

OK. Reading.

So all the translatable strings (labels, static text, ...) that are loaded after a CURRENT-LANGUAGE change should be replaced with the new language specific version, right? Or absolutely all character based text (even if CURRENT-LANGUAGE had the old value at the time of loading)?

I mean, do we need to have a text set for every language (or translate it dynamically) to auto-refresh in reaction to a CURRENT-LANGUAGE change?

#34 Updated by Greg Shah about 5 years ago

Don't implement the CURRENT-LANGUAGE runtime right now. I think this needs to wait until we have tests that show the specific behavior. The way it was described, we must track the setting of this value for each program that is loaded and only that language is used for that program even if CURRENT-LANGUAGE is changed while it is running. This needs to be proven, but it means a more complicated implementation since it is set at the time the program loads.

So all the translatable strings (labels, static text, ...) that are loaded after a CURRENT-LANGUAGE change should be replaced with the new language specific version, right?

No, I think it is all translatable strings in programs that are loaded after CURRENT-LANGUAGE change.

Or absolutely all character based text (even if CURRENT-LANGUAGE had the old value at the time of loading)?

No, I think it is only the string literals that are not marked :U (untranslatable).

I mean, do we need to have a text set for every language (or translate it dynamically) to auto-refresh in reaction to a CURRENT-LANGUAGE change?

No. The customer has a database that has these translations. At conversion, we would read these and create Java resource bundles that would be used by specific programs depending on the CURRENT-LANGUAGE setting when the containing program was loaded.
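The load-time binding described above can be illustrated with the sketch below. It is only a model of the idea under discussion: `TranslationSketch`, `Program` and `literal` are hypothetical names, in-memory maps stand in for the generated Java resource bundles, and the rule that a program keeps the language in effect when it was loaded is the behavior that still needs to be proven by tests.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of per-program translations selected at load time. */
final class TranslationSketch
{
   /** language -> (source literal -> translated literal); stands in for resource bundles. */
   private static final Map<String, Map<String, String>> BUNDLES = new HashMap<>();

   /** Session-wide CURRENT-LANGUAGE analogue. */
   private static String currentLanguage = "";

   static void addTranslation(String lang, String source, String translated)
   {
      BUNDLES.computeIfAbsent(lang, k -> new HashMap<>()).put(source, translated);
   }

   static void setCurrentLanguage(String lang)
   {
      currentLanguage = lang;
   }

   /** A loaded program; it remembers the language in effect when it was loaded. */
   static final class Program
   {
      private final String language = currentLanguage;

      /** Returns the translation for this program's language, or the source text if none exists. */
      String literal(String source)
      {
         Map<String, String> bundle = BUNDLES.get(language);
         return bundle == null ? source : bundle.getOrDefault(source, source);
      }
   }
}
```

With this shape, changing the current language after a `Program` instance has been created does not affect that instance; only programs loaded afterwards pick up the new language.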

#35 Updated by Greg Shah about 5 years ago

Related to Feature #3817: create resource bundles from string literals and implement optional support for setting values from the translation manager database added

#36 Updated by Eugenie Lyzenko about 5 years ago

Task branch 3753a has been rebased with trunk 11303, new revision is 11306.

#37 Updated by Eugenie Lyzenko about 5 years ago

Task branch 3753a has been updated for review to revision 11307.

This adds base level of the CODEPAGE-CONVERT() function implementation from source CP to target one. The idea is to use Java integrated converters with Java byte[] as intermediate buffer.
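As an illustration of the byte[] based idea (a sketch under assumptions, not the code committed in revision 11307), a byte-level conversion between two codepages using only standard Java calls could look like this. `CodePageConvertSketch` and `convert` are hypothetical names, and the codepage arguments are expected to be Java charset names.

```java
import java.nio.charset.Charset;

/** Hypothetical byte-level codepage conversion built on the standard Java converters. */
final class CodePageConvertSketch
{
   /** Re-encodes raw bytes from the source codepage into the target codepage. */
   static byte[] convert(byte[] data, String sourceCp, String targetCp)
   {
      // bytes in the source codepage -> Unicode text
      String decoded = new String(data, Charset.forName(sourceCp));

      // Unicode text -> bytes in the target codepage
      return decoded.getBytes(Charset.forName(targetCp));
   }
}
```

With these Java calls, a character that does not exist in the target charset is replaced by that charset's replacement byte sequence (typically `?`), so a round trip is only lossless when both codepages can represent every character involved.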

The testcases repo has been updated to rev 1835 with simple testcase for CODEPAGE-CONVERT() method.

I have to say that the testcase is very simple; it can be used to check the conversion support and that the proper runtime logic is called. It is a bad idea to use it to test the output produced by CODEPAGE-CONVERT(), for several reasons:
- The original input string content can vary depending on the OS default code page where the test is running (for both 4GL and FWD).
- In Linux the char inside the braces is a small a umlaut, while in Windows it contains 2 chars: a capital A with tilde and some other char. So the output is different, and it is not clear whether this is due to the OS code page or a FWD bug.

So we certainly need an OS neutral testcase which produces a stable result on every OS. Maybe we need to read the original strings from a file or represent the strings as integer arrays.

What I think, and need confirmed or refuted: if we have some char shape in one charset (say small a umlaut), then no matter what target CP it is converted to, the char glyph remains the same (small a umlaut), correct? If there is no such char in the target CP, some undefined char shape will be the result.

For any char the transformations 1.(code page 1) -> 2.(code page 2) -> 3.(code page 1) must get the original character(before step 1), correct?

The default char set for 4GL/FWD is ISO8859-1, right? Does it mean the JVM current charset must be reset to ISO8859-1 instead of UTF-8 used in Linux? I mean not only getting SESSION:CHARSET for FWD code to return this value but also using this value inside all Java string conversions calls. I guess we will have to go this way to duplicate the 4GL behavior.

#38 Updated by Greg Shah about 5 years ago

Progress handles its various text processing/sorting/conversion using 5 possible input tables:

character attributes (whether something is an alphabetic char or not)
case tables (how to translate between upper/lower case)
collation (how to sort)
code page conversion (how to translate a char in a source cp to the same char in the target cp)
word break (how to delimit words)

We will implement a 4GL program that explicitly uses 4GL code that must depend upon these tables. By processing all possible inputs we can observe the output and save the input to output mapping in our own file format.

I think this is easily done.

character attributes (use LC and CAPS to convert case of characters, only characters that change case are alphabetic)
case conversion (use LC and CAPS to convert case of strings)
collation (use EQ, NE, GT, LT, GE or LE operators to compare strings)
code page conversion (use CODEPAGE-CONVERT() or ASC() or CHR() to convert using specific source and target codepages).
word break tables will require the right set of input strings and then the use of contains with the right possible match targets to determine how word breaks work for a specific codepage (this one is trickier but should be possible)

Before you do that, please read the following references:

I18N documentation (in v12.0, it is named internationalize-abl.pdf)
dlc/prolang/README (text file with some details about their I18N implementation)

Once you have 4GL code that can calculate these values, we should run them on some common input/output codepage combinations. Then we need to check the Java version of these conversions to see if it is the same. If exactly the same, then we can use the standard Java implementation. If it is not the same, then we will have to override at least some of the implementation.
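For the Java side of that comparison, a table of the same kind can be generated directly from the JVM. The sketch below builds an upper-case table for a single-byte charset; `CaseTableSketch` and `upperCaseTable` are hypothetical names, and the choice to leave a byte unchanged when its upper-case form cannot be encoded in the same charset is an assumption that would have to be checked against the 4GL output.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

/** Hypothetical generator of a Java-side case table for a single-byte charset. */
final class CaseTableSketch
{
   /** For each byte value 0..255, the byte value of its upper-case form in the charset. */
   static int[] upperCaseTable(String charsetName)
   {
      Charset cs = Charset.forName(charsetName);
      CharsetEncoder enc = cs.newEncoder();
      int[] table = new int[256];

      for (int b = 0; b < 256; b++)
      {
         table[b] = b;   // default: the byte maps to itself

         String s = new String(new byte[] { (byte) b }, cs);
         if (s.length() != 1)
         {
            continue;
         }

         char lower = s.charAt(0);
         char upper = Character.toUpperCase(lower);
         if (upper != lower && enc.canEncode(upper))
         {
            byte[] out = String.valueOf(upper).getBytes(cs);
            if (out.length == 1)
            {
               table[b] = out[0] & 0xFF;
            }
         }
      }

      return table;
   }
}
```

Dumping this table next to the output of a 4GL program that applies CAPS to every character code would show directly whether the standard Java behavior matches for a given codepage.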

#39 Updated by Greg Shah about 5 years ago

For any char the transformations 1.(code page 1) -> 2.(code page 2) -> 3.(code page 1) must get the original character(before step 1), correct?

Yes.

The default char set for 4GL/FWD is ISO8859-1, right?

At one time, I think ISO8859-1 was possibly the 4GL default. It may still be the default. When you create a database, I think you can select any value. And when you run, you can set the session default using the -cpinternal command line option. And at installation time, I think you might be able choose the default as well.

For FWD, the default is determined by the operating system locale. On Linux, this tends to be UTF-8.

Does it mean the JVM current charset must be reset to ISO8859-1 instead of UTF-8 used in Linux? I mean not only getting SESSION:CHARSET for FWD code to return this value but also using this value inside all Java string conversions calls. I guess we will have to go this way to duplicate the 4GL behavior.

So far, for string processing in the converted code we have not found a requirement to internally store string data in anything other than Unicode. Generally speaking, it seems that all of the charset/codepage settings in the 4GL are related to controlling how text data is read from input or written to output. For example:

CPLOG (logfile output)
CPPRINT (output to printer)
CPRCODEIN (reading text from compiled code)
CPRCODEOUT (how text will be converted during compilation)
CPSTREAM (stream IO like files and processes)
CPTERM (character terminals)

None of those relate to how the data is internally stored or compared. Those are controlled using these:

CPCASE (read-only, set using the -cpcase command line parm or the default database value is used)
CPCOLL (read-only, set using the -cpcoll command line parm or the default database value is used)
CPINTERNAL (read-only, set using the -cpinternal command line parm or ? if not specified)

I do wonder what happens if these are:

conflicting with each other
conflicting with the database setting
conflicting with the operating system locale

For now, we will assume that all users will have the same CPINTERNAL, CPCOLL and CPCASE values and that UTF-8 is OK for these values. If this changes, then we will need to implement a deeper approach to all of our string processing so that each character variable has knowledge of these values. I'd like to avoid that for now.

At the database, this is different because we have found the need to implement a custom locale for both H2 and PostgreSQL to sort properly. I think this is related to the input databases being in ISO8859-1. I don't know if we haven't had access to a suitable ISO8859-1 locale or if the Progress version of it was just customized so that we needed to do our own version.

Eric/Ovidiu: Perhaps you can comment on this?

#40 Updated by Eugenie Lyzenko about 5 years ago

After reading some background info and the Progress documentation, some points have become clearer from the implementation perspective.
1. The suggestion that the character shape should be the same before and after conversion (example: a umlaut) is correct; there is no special processing in 4GL.
2. But the integer code point (character code) is changing.
3. Point 1 is problematic to verify because either way we see the compared texts in a single predefined code page, and in the general case the character shape will be different (example: a umlaut).
4. So far the only predictable way is to verify the consistency of the integer codes behind the character set. Here is an example:

DEFINE VARIABLE char850 AS CHARACTER NO-UNDO.
DEFINE VARIABLE charsetstring AS CHARACTER NO-UNDO.
define variable intCode as integer no-undo.

intCode = 132.
char850 = CHR(intCode, "UNDEFINED").

message "Current session charset: " SESSION:CHARSET.
message "Original char is: " intCode.

charsetstring = CODEPAGE-CONVERT(char850, "ISO8859-1", "ibm850").
message "IBM850 -> ISO8859-1 conversion is: " asc(charsetstring, "UNDEFINED").

In this example the integer character code 132 will be converted to code 228 in the ibm850 -> ISO8859-1 conversion. This can easily be seen in 4GL on Windows.

The issue with this test is that we have no proper ASC/CHR runtime support for the source/target code page.

So I'm planning to implement missing ASC/CHR features related to code page. This way we will have good testing environment for further research.
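A rough idea of what codepage-aware CHR/ASC could do on the Java side is sketched below. It is a simplification under stated assumptions: `AscChrSketch`, `chr` and `asc` are hypothetical names, each method takes a single Java charset name instead of the separate target and source codepage arguments of the 4GL functions, only single-byte codes 0..255 are handled, and returning -1 for an empty string is a choice made for this sketch.

```java
import java.nio.charset.Charset;

/** Hypothetical single-byte CHR/ASC analogues with an explicit codepage. */
final class AscChrSketch
{
   /** CHR analogue: the character whose code is 'code' in the given single-byte codepage. */
   static String chr(int code, String codePage)
   {
      return new String(new byte[] { (byte) code }, Charset.forName(codePage));
   }

   /** ASC analogue: the code of the first byte of 'text' when encoded in the given codepage. */
   static int asc(String text, String codePage)
   {
      byte[] bytes = text.getBytes(Charset.forName(codePage));
      return bytes.length == 0 ? -1 : bytes[0] & 0xFF;
   }
}
```

On a JVM that provides the IBM850 charset, `asc(chr(132, "IBM850"), "ISO-8859-1")` corresponds to the 132 to 228 case from the 4GL example above, without depending on how the character is rendered on screen.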

Continue investigation.

I need to understand what a Progress character in a particular code page means. For example, char text with CP IBM850 means a single byte character set (not Unicode). The Java internal string representation is UTF-16. If some output is required, this is internally converted to the OS supported code page, for example UTF-8 in Linux or Windows-1252 in Windows.

I also need to research how Progress native Unicode support differs.

And I need to consider another possible internal representation of the 4GL character type. Maybe an integer array is better than String, because with String we do a character conversion every time we use the String. Not sure, need to investigate.

#41 Updated by Greg Shah about 5 years ago

And I need to consider another possible internal representation of the 4GL character type. Maybe an integer array is better than String, because with String we do a character conversion every time we use the String. Not sure, need to investigate.

Implicit conversion is only needed in the following cases:

text data is being read from or written to an external source/target (a file or the screen); AND
the input or output codepage is configured to be different from the internal codepage

This is a very small number of cases. For this reason I think we definitely want to keep character data as strings. Otherwise all of the internal usage (which is most of the usage) will be very expensive because we will constantly be converting the int[] data into a String and then back to int[].
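The boundary-only conversion can be sketched as follows: text stays a Java String internally and is converted only when it crosses a stream whose codepage is known. `StreamBoundarySketch`, `write` and `read` are hypothetical names, the in-memory streams stand in for files or processes, and the codepage argument is expected to be a Java charset name.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;

/** Hypothetical illustration of converting text only at the I/O boundary. */
final class StreamBoundarySketch
{
   /** Encodes the text into the stream codepage while writing. */
   static byte[] write(String text, String streamCp)
   {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      try (Writer w = new OutputStreamWriter(out, Charset.forName(streamCp)))
      {
         w.write(text);
      }
      catch (IOException e)
      {
         throw new UncheckedIOException(e);
      }
      return out.toByteArray();
   }

   /** Decodes the bytes from the stream codepage while reading. */
   static String read(byte[] data, String streamCp)
   {
      StringBuilder sb = new StringBuilder();
      try (Reader r = new InputStreamReader(new ByteArrayInputStream(data),
                                            Charset.forName(streamCp)))
      {
         int c;
         while ((c = r.read()) != -1)
         {
            sb.append((char) c);
         }
      }
      catch (IOException e)
      {
         throw new UncheckedIOException(e);
      }
      return sb.toString();
   }
}
```

All processing between `read` and `write` works on ordinary strings, which is why no int[] representation is needed for the internal case.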

#43 Updated by Eugenie Lyzenko about 5 years ago

Task branch 3753a has been updated for review to revision 11308. The tescases repo has been updated to rev. 1836.

This adds some new testcases to get the status of the current CODEPAGE-CONVERT function result. Also the change has support for UNDEFINED special code page. The idea is applying to source or target the UNDEFINED means no conversion. Usually both Java calls to String.getBytes(CodePage) and new String(inputByteArray, CodePage) does the respective conversion inside JVM. If CP is UNDEFINED we need to discard the conversion using String.getBytes() and new String(inputByteArray).

The testcase uast\i18n\cp_convert.p demonstrates the consistency of the FWD CODEPAGE-CONVERT implementation. Double conversion gets the initial string value. And this means looks like we do not need to implement the 4GL specific processing for text transformation. Everything can be done with regular Java tools.

The test uast\i18n\char_convert.p is the demo for getting integer value of the current character with provided code page. The test is UI independent, just displaying char code, not char itself. It shows the FWD current implementation for CHR/ASC are need to be updated to add code page support. These functions are useful for OS independent code page testing so I'm planning to add respective support next.

Another constraint was found for character constants that are not supported by the current Java charset (UTF-8 on Linux). If we encounter a character that cannot be represented in the Java CP, it will be converted to the UTF-8 ? char. We have to take this into account when working with the source tree. Even if we can avoid this during conversion, we will not be able to compile the result. We get errors like this:

...
error: unmappable character for encoding utf-8
[ant:javac]       character cp850string = TypeFactory.character("text with umlaut (�)");
[ant:javac] 
...

So it is better to avoid the hardcoded text constants with extended chars in a source tree.
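The "?" substitution mentioned above can be reproduced in isolation. A minimal sketch, assuming the byte 0x81 (the code of u-umlaut in code page 850) ends up being read as UTF-8; the class and method names are hypothetical:

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical demo class; shows what happens to a non-UTF-8 byte. */
final class UnmappableDemo
{
   /** Decode raw bytes as UTF-8; malformed input is replaced with U+FFFD. */
   static String decodeAsUtf8(byte[] raw)
   {
      return new String(raw, StandardCharsets.UTF_8);
   }

   public static void main(String[] args)
   {
      // assumption: the source file stored u-umlaut as the single CP850 byte 0x81
      byte[] cp850Umlaut = { (byte) 0x81 };
      String decoded = decodeAsUtf8(cp850Umlaut);

      // print the code of the resulting character in hex
      System.out.println(Integer.toHexString(decoded.charAt(0)));
   }
}
```

One way to keep such constants out of the source tree is to write them as Unicode escapes (for example "\u00fc"), which keeps the Java source pure ASCII regardless of the compiler's encoding setting.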

The preliminary conclusion:
1. The implementation of the CODEPAGE-CONVERT is OK, at least for now I do not see the issues.
2. For further testing we need the CHR/ASC to have the support for code-page options for source and target.

Working on the point 2.

#44 Updated by Eugenie Lyzenko about 5 years ago

Greg Shah wrote:

And I need to consider another possible internal representation of the 4GL character type. May be integer array is better than String. Just because using String we do character conversion every time we use String. Not sure, need to investigate.

Implicit conversion is only needed in the following cases:

text data is being read from or written to an external source/target (a file or the screen); AND

the input or output codepage is configured to be different from the internal codepage

This is a very small number of cases. For this reason I think we definitely want to keep character data as strings. Otherwise all of the internal usage (which is most of the usage) will be very expensive because we will constantly be converting the int[] data into a String and then back to int[].

Agreed, we need to keep the implementation as efficient as possible, leaving String as the backend for character.

#45 Updated by Eugenie Lyzenko about 5 years ago

Task branch 3753a has been updated for review to revision 11309.

The update adds target/source code page handling for CHR function. The implementation is still under debugging but the base functionality is here. The next step will be to complete CHR and add ASC.

#46 Updated by Eugenie Lyzenko about 5 years ago

Task branch 3753a has been updated for review to revision 11310.

This is the completed implementation of the CHR/ASC functions with support for target and source code pages. New testcases have been written to verify the approach (testcases repo updated to rev. 1837).

The current implementation of ASC supports DBCS and Unicode, allowing return values greater than 255. I need to update the gaps rule file to reflect the status and do more tests to verify the implementation. The planned next steps are to handle the upper/lower/collation transformation functions and to prepare a schedule for the next implementation steps.
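To illustrate why values above 255 (and even above 65535) matter, here is a small sketch based on Unicode code points. It is not the FWD CHR/ASC code: the class name, the method names and the -1 result for an empty string are assumptions, and code page handling is left out entirely.

```java
/** Hypothetical illustration of int-based character codes in Java. */
final class CodePointSketch
{
   /** Build a string from one Unicode code point; it may need two Java chars. */
   static String chr(int codePoint)
   {
      return new String(Character.toChars(codePoint));
   }

   /** Code point of the first character; -1 for an empty string (assumed convention). */
   static int asc(String text)
   {
      return text.isEmpty() ? -1 : text.codePointAt(0);
   }
}
```

For a code point such as 0x1F600 the resulting String has length 2 (a surrogate pair), so a single Java char cannot carry the value while an int can.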

#47 Updated by Eugenie Lyzenko about 5 years ago

Task branch 3753a has been updated for review to revision 11311.

The update reflects the gap marking status change for I18N related features that are already implemented.

#48 Updated by Eugenie Lyzenko about 5 years ago

Task branch 3753a has been updated for review to revision 11312.

This small update fixes a duplicate element for KW_LC in expressions.rules for gap markup.

Also the testcases repo has been updated to rev. 1838, with new testcases to verify the LC/CAPS functions for CHARACTER and LONGCHAR variables. Testing shows the implementation is OK without constructing additional tables for UPPER case to LOWER case conversion and back.

Preparing plan for further work.

#49 Updated by Eugenie Lyzenko about 5 years ago

So far the next steps plan will be to focus on stream related codepage processing:
- SESSION:CPSTREAM
- INPUT STREAM CONVERT runtime support
- (OUTPUT TO)/(INPUT THROUGH) CONVERT option
- NO-MAP I/O option
- NO-CONVERT runtime support

#50 Updated by Eugenie Lyzenko about 5 years ago

Task branch 3753a has been rebased with trunk 11304, new revision is 11313.

#51 Updated by Eugenie Lyzenko about 5 years ago

The testing shows the 4GL returns ISO8859-1 for both SESSION:CPSTREAM and SESSION:CPINTERNAL. In FWD we have UTF-8 as SESSION:CPSTREAM and ISO8859-1 as SESSION:CPINTERNAL. Actually we have UTF-8 for both but substitute with ISO8859-1 for SESSION:CPINTERNAL.

Should we return UTF-8 too at least for Linux server instance?

These options are very important for proper code page related IO handling because the result will be different depending on source/target CP. So we need proper strategy. What are our plans here?

What about to define -cpstream and -cpinternal overrides in directory.xml file? This way we could fine tune code page options for particular customer application.

#52 Updated by Eugenie Lyzenko about 5 years ago

Task branch 3753a has been updated for review to revision 11314.

Testcases repo has been updated to revision 1840.

The update adds runtime support for the INPUT ... FROM CONVERT statement, using the same approach as the CODEPAGE-CONVERT function. A new testcase demonstrates this. Also the attribute SESSION:CPSTREAM and SESSION:CPINTERNAL are also fully supported in FWD, with the notes I've previously mentioned. I think we need to give the customer the opportunity to customize the -cpinternal and -cpstream options in directory.xml so that conversion is correct when reading from or writing to an external file.

Continue working on implementing (OUTPUT TO)/(INPUT THROUGH) CONVERT option.

#53 Updated by Eugenie Lyzenko about 5 years ago

Task branch 3753a has been updated for review to revision 11315.

Testcases repo has been updated to revision 1841.

Update adds runtime support for CONVERT/NO-CONVERT options in stream based IO. Also the handling approach has been changed to have conversion by default. So to ignore the conversion process the option NO-CONVERT must be explicitly specified. In addition stream constructor gets default values for -cpstream and -cpinternal variables to be used if source or target code page overrides are not defined.

The note for usage -cpstream and -cpinternal as source and target. When the stream is in reading the source code page is -cpstream while target code page is -cpinternal. But in the case of writing the source and target should be swapped, -cpinternal become a source code page and -cpstream become a target code page respectively. The implementation should support both named and unnamed streams. Including INPUT THROUGH version.
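The source/target swap described above could look roughly like this. It is a sketch under assumptions: the class StreamCodepages and its resolve() method are hypothetical names, and a null override stands for "no CONVERT SOURCE/TARGET given in the 4GL code".

```java
/** Hypothetical holder for the effective code pages of one stream. */
final class StreamCodepages
{
   final String source;
   final String target;

   private StreamCodepages(String source, String target)
   {
      this.source = source;
      this.target = target;
   }

   /**
    * Pick the defaults by direction, then apply explicit overrides.
    * Reading: -cpstream to -cpinternal. Writing: -cpinternal to -cpstream.
    */
   static StreamCodepages resolve(boolean reading,
                                  String cpStream,
                                  String cpInternal,
                                  String sourceOverride,
                                  String targetOverride)
   {
      String defSource = reading ? cpStream : cpInternal;
      String defTarget = reading ? cpInternal : cpStream;

      return new StreamCodepages(sourceOverride != null ? sourceOverride : defSource,
                                 targetOverride != null ? targetOverride : defTarget);
   }
}
```

The same resolution would apply to named and unnamed streams, since only the direction and the optional overrides are inputs here.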

Several testcases were added to the testcases repo as simple tests to confirm implementation consistency. However, we need some complex tests from 4GL experts to debug/verify the implementation.

The next step is to check/implement NO-MAP option and string resource bundles and translation manager replacement described in #3817.

#54 Updated by Eugenie Lyzenko about 5 years ago

Task branch 3753a has been updated for review to revision 11316.

Testcases repo has been updated to revision 1842.

Adding the conversion and runtime support for NO-MAP stream IO option. Testcases updated to check MAP/NO-MAP option.

Greg,

Do we need to implement MAP option here? It is related to PROTERMCAP file entry used to char conversion. If we need the option to be implemented, I have a related question. This simple 4GL code (from i18n/stream_cp_map.p):

...
INPUT FROM "stream_cp_input.txt" MAP "hp/italian".
...

converts to an incorrect call:

...
         UnnamedStreams.assignIn(StreamFactory.openFileStream("stream_cp_input.txt", false, false), "hp/italian");
...

And I'm not clear why MAP option is considering as parameter for UnnamedStreams.assignIn call. No KW_MAP related rules have been added. Why does this not happen with the CONVERT TARGET "targetCP" stream option for "targetCP"? Is there a simple answer I'm missing? It could save me the time I would otherwise spend finding the root cause.

#55 Updated by Greg Shah about 5 years ago

The testing shows the 4GL returns ISO8859-1 for both SESSION:CPSTREAM and SESSION:CPINTERNAL.

This will depend on the installation. I don't know if it is explicitly set during OpenEdge installation or if it is inferred from the locale.

In FWD we have UTF-8 as SESSION:CPSTREAM and ISO8859-1 as SESSION:CPINTERNAL. Actually we have UTF-8 for both but substitute with ISO8859-1 for SESSION:CPINTERNAL.

We did this long ago because we had encountered 4GL code that expected ISO8859-1 but there was no real reason that we couldn't use UTF-8 internally. So we "hacked" this.

The current approach is not correct and it may cause that application to have an issue, but we need to fix this now.

Should we return UTF-8 too at least for Linux server instance?

For now, we are not going to have different CPINTERNAL values based on the user's session. Instead we need to base this on the JVM default encoding. It should have nothing to do with the operating system. In the future, we will need to honor different CPINTERNAL by session. But this will require much more than just how we set and report that value, we would also need to handle all String operations in that encoding. That is not for now.

I do think we need to always return a codepage name that is recognizable. I'm not sure that the JVM default encoding will always have a name that matches what OpenEdge would return. For example, in Java what is the Windows 1252 encoding name? I think it is windows-1252. In the 4GL, it will return as 1252. Please create a map of the names that can translate between these.
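A name map of this kind could be sketched as below. The three entries come from names mentioned in this task (the "1252" spelling is as assumed above, not verified against the 4GL); the class and method names are hypothetical and this is not the I18nOps code.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical two-way map between 4GL code page names and Java charset names. */
final class CodepageNames
{
   private static final Map<String, String> TO_JAVA = new HashMap<>();
   private static final Map<String, String> TO_4GL  = new HashMap<>();

   static
   {
      register("ISO8859-1", "ISO-8859-1");
      register("1252",      "windows-1252");
      register("UTF-8",     "UTF-8");
   }

   /** Store one pair in both directions, keyed case-insensitively. */
   private static void register(String name4gl, String javaName)
   {
      TO_JAVA.put(name4gl.toUpperCase(Locale.ROOT), javaName);
      TO_4GL.put(javaName.toUpperCase(Locale.ROOT), name4gl);
   }

   /** Java charset name for a 4GL code page name, or null if unmapped. */
   static String toJava(String name4gl)
   {
      return TO_JAVA.get(name4gl.toUpperCase(Locale.ROOT));
   }

   /** 4GL code page name for a Java charset name, or null if unmapped. */
   static String to4gl(String javaName)
   {
      return TO_4GL.get(javaName.toUpperCase(Locale.ROOT));
   }
}
```

Keeping both directions precomputed avoids scanning the map when a getter has to report the 4GL name for the JVM charset.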

What about to define -cpstream and -cpinternal overrides in directory.xml file? This way we could fine tune code page options for particular customer application.

I agree that we need the equivalent of -cpstream in the directory. Please go ahead and add a mechanism to set the value of all of the codepages (-cpstream, -cpinternal, -cpprint...) from the directory. It should be able to be defined/overridden at all the different levels (global default, server default, group, account...).

This value should be specified using the 4GL CP name, which we must map to the Java encoding name (see above).

Please add conversion and simple runtime support for the attribute getters for all of the codepage values (most are missing). This must include CHARSET (which I think is the same thing as CPINTERNAL, but we need to check) and for CODEPAGE.

These values should have a compatible default if not overridden in the directory. But if specified in the directory (look this up using the directory access methods that do the hierarchical search), then the value specified should be returned.

We will only honor the actual value for CPSTREAM right now. The others won't have support, so the runtime support for those should be marked as stubs.

For now, I think the CPINTERNAL value returned must be based purely on the JVM default charset, which is set by the JVM based on the JVM's locale.

This testcase proves that the CP* attributes are not settable (see testcases/uast/i18n/):

/* setting CP* values causes **CHARSET is not a settable attribute for PSEUDO-WIDGET. (4052) */
/* but the ERROR-STATUS:ERROR will be false */

/* it doesn't matter the value that is set, it could be "utf-8", "garbage" or even ? (unknown value) */
/* the result is always the same warning */

session:cpinternal = "ISO8859-1" no-error.
if error-status:error or not error-status:get-number(1) = 4052 then message "ERROR: expected error 4052 to be returned.".

session:cpstream = "ISO8859-1" no-error.
if error-status:error or not error-status:get-number(1) = 4052 then message "ERROR: expected error 4052 to be returned.".

session:cpterm = "ISO8859-1" no-error.
if error-status:error or not error-status:get-number(1) = 4052 then message "ERROR: expected error 4052 to be returned.".

session:cplog = "ISO8859-1" no-error.
if error-status:error or not error-status:get-number(1) = 4052 then message "ERROR: expected error 4052 to be returned.".

session:cpprint = "ISO8859-1" no-error.
if error-status:error or not error-status:get-number(1) = 4052 then message "ERROR: expected error 4052 to be returned.".

session:cprcodein = "ISO8859-1" no-error.
if error-status:error or not error-status:get-number(1) = 4052 then message "ERROR: expected error 4052 to be returned.".

session:cprcodeout = "ISO8859-1" no-error.
if error-status:error or not error-status:get-number(1) = 4052 then message "ERROR: expected error 4052 to be returned.".

session:cpcase = "basic" no-error.
if error-status:error or not error-status:get-number(1) = 4052 then message "ERROR: expected error 4052 to be returned.".

session:cpcoll = "basic" no-error.
if error-status:error or not error-status:get-number(1) = 4052 then message "ERROR: expected error 4052 to be returned.".

message "Finished successfully.".

Also the attribute SESSION:CPSTREAM and SESSION:CPINTERNAL are also fully supported in FWD.

I don't think we can say that CPINTERNAL is fully supported. We don't modify the encoding used for internal string operations in a FWD user's session, so the real support is not there. This is still "partial" level support right now. Please add a comment to the gap rules for cpinternal.

Also the handling approach has been changed to have conversion by default.

This should only be the case if CPSTREAM is different from CPINTERNAL, right? So in the default case there should be NO conversion because these two values are always the same in the 4GL unless the command line override has been provided (e.g. -cpstream). Of course, specifying the source or target codepage in the 4GL code will override the CPSTREAM or CPINTERNAL in this calculation, so we need to handle that at runtime.

Please check the default (whether CPSTREAM = CPINTERNAL) once and save the result in the context-local area. Then implement the default conversion or no conversion based on this.
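A minimal sketch of the "check once and cache" idea, assuming one instance of this object lives in the context-local area of a session. The class name and the supplier-based constructor are hypothetical; how the two values are actually obtained is outside the sketch.

```java
import java.util.function.Supplier;

/** Hypothetical per-context cache of the default convert/no-convert decision. */
final class DefaultConversionFlag
{
   private final Supplier<String> cpStream;
   private final Supplier<String> cpInternal;

   /** null until the first query, then the cached result. */
   private Boolean convertByDefault;

   DefaultConversionFlag(Supplier<String> cpStream, Supplier<String> cpInternal)
   {
      this.cpStream = cpStream;
      this.cpInternal = cpInternal;
   }

   /** True if CPSTREAM differs from CPINTERNAL; computed only on first use. */
   boolean convertByDefault()
   {
      if (convertByDefault == null)
      {
         convertByDefault = !cpStream.get().equalsIgnoreCase(cpInternal.get());
      }

      return convertByDefault;
   }
}
```

Explicit source or target code pages given in the 4GL code would still have to be compared per stream, since they override the cached default.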

The note for usage -cpstream and -cpinternal as source and target. When the stream is in reading the source code page is -cpstream while target code page is -cpinternal. But in the case of writing the source and target should be swapped, -cpinternal become a source code page and -cpstream become a target code page respectively. The implementation should support both named and unnamed streams.

Yes, understood.

Do we need to implement MAP option here?

Maybe. At least the conversion should be supported. But I need to understand what the runtime behavior is for MAP.

It is related to PROTERMCAP file entry used to char conversion.

Can you provide more details? What actually happens here when this is specified? Where does the translation mapping data come from (is it hard coded into the protermcap)? How does it mix with the CP* attribute support?

And I'm not clear why MAP option is considering as parameter for UnnamedStreams.assignIn call.

This is because there is no processing for the KW_MAP option. If there was, then there would be a peerid and it would properly emit into the parent. This should be fixed easily.

#56 Updated by Eugenie Lyzenko about 5 years ago

Task branch 3753a has been updated for review to revision 11317.

The update adds conversion and runtime support for the MAP option (still no real work at runtime, just storing the option value).

Yes, the issue was a missing createPeerAst call, thanks for the help. I have also modified progress.g to change the KW_MAP option's approach to literal | filename[null]. This way we can support both MAP proterm-entry and MAP "proterm-entry", the same way the 4GL does. The previous version supported only the MAP "proterm-entry" case. Please review this change.

#57 Updated by Eugenie Lyzenko about 5 years ago

Greg Shah wrote:

Do we need to implement MAP option here?

...
Maybe. At least the conversion should be supported. But I need to understand what the runtime behavior is for MAP.

It is related to PROTERMCAP file entry used to char conversion.

Can you provide more details? What actually happens here when this is specified? Where does the translation mapping data come from (is it hard coded into the protermcap)? How does it mix with the CP* attribute support?

The 4GL doc states (for INPUT FROM statement):

The protermcap-entry value is an entry from the PROTERMCAP file. Use MAP to read from an
input stream that uses a different character translation from the current stream. Typically,
protermcap-entry is a slash-separated combination of a standard device entry and one or more
language-specific add-on entries (MAP laserwriter/french or MAP hp2/spanish/italian, for example).
The AVM uses the PROTERMCAP entries to build a translation table for the stream. Use NO-MAP
to make the AVM bypass character translation altogether.

For now we are limited by the fact that the PROTERMCAP file is used on Linux/Unix systems only. I think it does not work on Windows. As with other 4GL parts, we need a testcase to investigate how it actually works (the documentation alone is not a good enough source to get the real picture, as we know from previous experience). So the issue is the lack of a Linux-based system with ABL installed. We suspect the doc's "build a translation table for the stream" means overriding source/target codepage translation handling for some or all characters.

#58 Updated by Eugenie Lyzenko about 5 years ago

Greg Shah wrote:

The testing shows the 4GL returns ISO8859-1 for both SESSION:CPSTREAM and SESSION:CPINTERNAL.

This will depend on the installation. I don't know if it is explicitly set during OpenEdge installation or if it is inferred from the locale.

In FWD we have UTF-8 as SESSION:CPSTREAM and ISO8859-1 as SESSION:CPINTERNAL. Actually we have UTF-8 for both but substitute with ISO8859-1 for SESSION:CPINTERNAL.

We did this long ago because we had encountered 4GL code that expected ISO8859-1 but there was no real reason that we couldn't use UTF-8 internally. So we "hacked" this.

The current approach is not correct and it may cause that application to have an issue, but we need to fix this now.

OK.

Should we return UTF-8 too at least for Linux server instance?

For now, we are not going to have different CPINTERNAL values based on the user's session. Instead we need to base this on the JVM default encoding. It should have nothing to do with the operating system. In the future, we will need to honor different CPINTERNAL by session. But this will require much more than just how we set and report that value, we would also need to handle all String operations in that encoding. That is not for now.

OK. Understood.

I do think we need to always return a codepage name that is recognizable. I'm not sure that the JVM default encoding will always have a name that matches what OpenEdge would return. For example, in Java what is the Windows 1252 encoding name? I think it is windows-1252. In the 4GL, it will return as 1252. Please create a map of the names that can translate between these.

Such a map is already implemented in the I18nOps helper class.

#59 Updated by Eugenie Lyzenko about 5 years ago

Greg Shah wrote:

Please add conversion and simple runtime support for the attribute getters for all of the codepage values (most are missing). This must include CHARSET (which I think is the same thing as CPINTERNAL, but we need to check) and for CODEPAGE.

These values should have a compatible default if not overridden in the directory. But if specified in the directory (look this up using the directory access methods that do the hierarchical search), then the value specified should be returned.

We will only honor the actual value for CPSTREAM right now. The others won't have support, so the runtime support for those should be marked as stubs.

Task branch 3753a has been updated for review to revisions 11318, 11319.

This changes support level for CPINTERNAL attribute to partial.

Also implementing conversion support for all CP* code page related attributes, marking runtime support as stubs.

Currently, because the attributes have read-only access, only getters are implemented. But according to your testcase (settable_cp_attributes.p) the setters need to be implemented as well, correct? Their only purpose would be to generate the error during execution. If we have no conversion support for setters, any call like session:cpterm = "ISO8859-1" will cause a conversion error, I think. So please clarify this point.

Planning to rebase 3753a onto the recent trunk in 5-10 min.

#60 Updated by Eugenie Lyzenko about 5 years ago

Task branch 3753a has been rebased with trunk 11305, new revision is 11320.

#61 Updated by Eugenie Lyzenko about 5 years ago

Greg Shah wrote:

...

Also the handling approach has been changed to have conversion by default.

This should only be the case if CPSTREAM is different from CPINTERNAL, right?

Yes.

So in the default case there should be NO conversion because these two values are always the same in the 4GL unless the command line override has been provided (e.g. -cpstream). Of course, specifying the source or target codepage in the 4GL code will override the CPSTREAM or CPINTERNAL in this calculation, so we need to handle that at runtime.

This is already implemented in the I18nOps conversion worker. If the source CP is the same as the target CP, no conversion happens. This check is performed for every call that is code page capable. So I agree, it may not be the most optimal approach.

Please check the default (whether CPSTREAM = CPINTERNAL) once and save the result in the context-local area. Then implement the default conversion or no conversion based on this.

OK. Will re-work on local context basis.

#62 Updated by Eugenie Lyzenko about 5 years ago

The settable_cp_attributes.p

session:cpinternal = "ISO8859-1" no-error.

converts to:

...
         silent(() -> SessionUtils.readOnlyError("cpinternal"));

         if (_or(ErrorManager.isError(), () -> not(isEqual(ErrorManager.getErrorNumber(1), 4052))))
         {
            message("ERROR: expected error 4052 to be returned.");
         }
...

So there is no conversion/compilation issues. I think we do not need to implement setters for SESSION code pages related attributes.

#63 Updated by Greg Shah about 5 years ago

So there is no conversion/compilation issues. I think we do not need to implement setters for SESSION code pages related attributes.

Correct.

#64 Updated by Eugenie Lyzenko about 5 years ago

Task branch 3753a has been rebased with trunk 11306, new revision is 11321.

#65 Updated by Eugenie Lyzenko about 5 years ago

Task branch 3753a has been updated for review to revisions 11322, 11323.

The update adds a session context area to store cached variables. It also introduces support for directory.xml overrides for all CP-related internal variables. When the service is called for the first time, FWD requests the overrides. If a variable is still null, there is no directory definition and we use the current charset value obtained from the JVM.

Also, the no-translation mode (when the target CP equals the source CP) is handled automatically at the Stream class level when no explicit target/source is defined. Otherwise we need to check on every stream-based operation. We can determine whether a real transformation is needed once both source and target are defined for the stream.
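The "directory override, else JVM default" fallback described in this note can be sketched as follows. The class name and the lookup function are hypothetical stand-ins for the real directory access, and the sketch returns the raw Java charset name without mapping it to a 4GL name.

```java
import java.nio.charset.Charset;
import java.util.function.Function;

/** Hypothetical resolver for one CP* setting. */
final class CpSettingSketch
{
   /**
    * Return the directory override for the given setting id, or the JVM
    * default charset name if the lookup yields null.
    */
   static String resolve(String id, Function<String, String> directoryLookup)
   {
      String override = directoryLookup.apply(id);

      return override != null ? override : Charset.defaultCharset().name();
   }
}
```

In a real session the result would be stored in the context area after the first call, so the directory is not queried again.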

#66 Updated by Eugenie Lyzenko about 5 years ago

Task branch 3753a has been updated for review to revision 11324.

The update fixes a bug in Stream in the convert flag computation. It also adds code to properly handle the directory service on both the client and server sides. This unification is required because some operations use the server-side local session area, while others need the client side to ask for the directory values. This is done using DirectoryManager.getInstance(). Continue working with next part of the task - translation manager replacement.

#67 Updated by Greg Shah about 5 years ago

Continue working with next part of the task - translation manager replacement.

Actually, we are going to defer this work (and #3817) until the summertime. This is not needed for our next customer milestones.

As long as there is no conversion/compilation issue with reading and writing SESSION:CURRENT-LANGUAGE, then I think the rest of the work on translation manager support can be paused.

#68 Updated by Eugenie Lyzenko about 5 years ago

Greg Shah wrote:

Continue working with next part of the task - translation manager replacement.

Actually, we are going to defer this work (and #3817) until the summertime. This is not needed for our next customer milestones.

OK.

As long as there is no conversion/compilation issue with reading and writing SESSION:CURRENT-LANGUAGE, then I think the rest of the work on translation manager support can be paused.

The conversion/compilation is OK for CURRENT-LANGUAGE function/statement. What should be investigated additionally I think is how changing CURRENT-LANGUAGE affects all others CP* attributes for SESSION. For example CPINTERNAL, CPSTREAM, CHARSET etc. Do we need to implement 4GL behavior here?

#69 Updated by Greg Shah about 5 years ago

We are arranging for someone to write 4GL testcases to do a deep/comprehensive look at the I18N features.

What should be investigated additionally I think is how changing CURRENT-LANGUAGE affects all others CP* attributes for SESSION. For example CPINTERNAL, CPSTREAM, CHARSET etc.

I agree. If there is any relationship there, we need to understand what it is. Please make a list of all the questions that you have which are not covered by the items in notes 11, 38 and 39. Post that list in this task so that we can include those questions in the testcases work.

#70 Updated by Eugenie Lyzenko about 5 years ago

4GL tests requirement

The dependency of the following options on the CURRENT-LANGUAGE setting needs to be tested in a 4GL environment (most of them are SESSION attributes):

CPINTERNAL
CPSTREAM
CPCASE
CPCOLL
CPLOG
CPPRINT
CPRCODEIN
CPRCODEOUT
CPTERM
CHARSET
CODEPAGE

The possible 4GL test scenario is:
1. Check the initial attribute value from the list.
2. Change the CURRENT-LANGUAGE.
3. Re-check the attribute value to find out if it is changed.

Ideally we need to know the behavior on both Linux and Windows 4GL systems.

#71 Updated by Eugenie Lyzenko about 5 years ago

4GL tests requirement

Another point of interest for 4GL testing is the behavior of the CHR() function for character codes outside the 0-255 range.

For now I'm leaving the TODO with commented-out code in place until we have a clear picture of how CHR() handles integer code values > 255.

#72 Updated by Eugenie Lyzenko about 5 years ago

Task branch 3753a has been updated for review to revision 11325.

This is the improvement for mapping Java code page names to Progress ones. The idea is to have bi-directional mapping to speed up the process of returning 4GL compatible values in getCP*() functions.

As far as I understand the only work left for now in this task is to create the list of the specifications for 4GL testcases to have comprehensive base for further code debugging. The tests itself will be written by someone other, right?

Do I need to shift to another task? Make regression tests and merge the 3753a into trunk?

#73 Updated by Greg Shah about 5 years ago

As far as I understand the only work left for now in this task is to create the list of the specifications for 4GL testcases to have comprehensive base for further code debugging. The tests itself will be written by someone other, right?

Yes.

Make regression tests and merge the 3753a into trunk?

Yes. I will do a code review.

Do I need to shift to another task?

I'll let you know.

#74 Updated by Eugenie Lyzenko about 5 years ago

4GL tests requirement

More test-related areas to investigate:

LC()/CAPS() functions for different code page combinations(-cpinternal/-cpcase). Single/double byte charset support.
The set of tests to verify different -cpinternal/-cpstream combinations for stream based input and output.
ASC()/CHR() functions for different code page combinations in source/target/missing defaults. Single/double byte charset support.

#75 Updated by Eugenie Lyzenko about 5 years ago

4GL tests additional wishes

It would also be good to have 4GL testcases that exercise the following Progress CONVMAP.CP related transformations:

table that defines whether the current character is alpha
collation table

Also, for Linux-based systems it would be good to have a working test that uses PROTERMCAP file mapping.

#76 Updated by Greg Shah about 5 years ago

Code Review 3753a Revision 11325

This is a really good update. Some feedback:

1. The change in DirectoryServer is not correct. The idea of ID_ABSOLUTE is that the full path and node is specified by the caller. What you have implemented is an approach that uses Utils.getDirectoryNodeWorker() to service the request of both ID_ABSOLUTE and ID_RELATIVE.

Utils.getDirectoryNodeWorker() can only be used to implement ID_RELATIVE because it checks multiple paths to see if the given node is there:

Utils.DirScope.ACCOUNT search:

1. /server/<serverID>/runtime/<account_or_group>/<id>.<project>
2. /server/<serverID>/runtime/<account_or_group>/<id>
3. /server/<serverID>/runtime/default/<id>.<project>
4. /server/<serverID>/runtime/default/<id>
5. /server/default/runtime/<account_or_group>/<id>.<project>
6. /server/default/runtime/<account_or_group>/<id>
7. /server/default/runtime/default/<id>.<project>
8. /server/default/runtime/default/<id>

Utils.DirScope.SERVER search:

1. /server/<serverID>/<id>.<project>
2. /server/<serverID>/<id>
3. /server/default/<id>.<project>
4. /server/default/<id>

If search scope is Utils.DirScope.BOTH, then we do Utils.DirScope.ACCOUNT and if we don't find something then we do Utils.DirScope.SERVER.
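The search order listed above can be written down as a small path generator. This is only a sketch of the ordering, not Utils.getDirectoryNodeWorker(): the class and method names are hypothetical, and the real lookup would stop at the first existing node rather than build the whole list.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical generator of candidate directory paths, in search order. */
final class DirSearchPaths
{
   /** The 8 ACCOUNT-scope candidates: server id before default, project form first. */
   static List<String> accountPaths(String serverId, String subject, String id, String project)
   {
      List<String> paths = new ArrayList<>();

      for (String server : new String[] { serverId, "default" })
      {
         for (String subj : new String[] { subject, "default" })
         {
            String base = "/server/" + server + "/runtime/" + subj + "/" + id;
            paths.add(base + "." + project);
            paths.add(base);
         }
      }

      return paths;
   }

   /** The 4 SERVER-scope candidates. */
   static List<String> serverPaths(String serverId, String id, String project)
   {
      List<String> paths = new ArrayList<>();

      for (String server : new String[] { serverId, "default" })
      {
         String base = "/server/" + server + "/" + id;
         paths.add(base + "." + project);
         paths.add(base);
      }

      return paths;
   }

   /** BOTH scope: all ACCOUNT candidates first, then the SERVER candidates. */
   static List<String> bothPaths(String serverId, String subject, String id, String project)
   {
      List<String> paths = accountPaths(serverId, subject, id, project);
      paths.addAll(serverPaths(serverId, id, project));

      return paths;
   }
}
```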

ID_ABSOLUTE would be checking ONLY the given path. For example, the caller might provide this: /server/default/runtime/default/some_node. There is an argument that we should probably check 2 paths here (the exact one given and then another with .<project> added so that project tokens can be honored). But the point I'm trying to make is that we don't have any Utils helpers to implement ID_ABSOLUTE. If we had these, we would have already resolved the TODOs in DirectoryService.

The problem here is that we have implemented ID_RELATIVE to mean Utils.DirScope.ACCOUNT. Perhaps we need to provide additional options like ID_RELATIVE_ACCOUNT, ID_RELATIVE_SERVER and ID_RELATIVE_BOTH. We can "alias" ID_RELATIVE to be the same meaning as ID_RELATIVE_ACCOUNT so that existing code will not break.

2. In progress.g, the change in io_options should reference STRING instead of literal. The reason: if you specify literal, then you can also match lots of non-string things like true or 01/01/1999 or -3.14.

In gaps/expressions.rules, I think the kw_cp_cvt, kw_get_codp, kw_is_cp_fx and kw_get_cp should probably be marked rt_lvl_basic (instead of rt_lvl_full) until we have run the 4GL testcases to confirm full compatibility.

3. The getters for SESSION:CP* should be all in one place. Today we have them in both EnvironmentOps and I18nOps. Let's put them all in I18nOps.

4. In I18nOps.getCodePages(), I wonder if the order will match the 4GL. This could be a compatibility issue.

5. In I18nOps.getCodePages(), the returned string will always end in a ,. Is that how the 4GL does it?

6. We need explicit processing of unknown value for Text and character parameters to TextOps.codepageConvert(), character.asc() and character.chr(). It is not safe to call getValue() on these. The value member can be out of sync with the unknown flag, leading to wrong results.

7. In I18nOps.codepageConvert(), this code:

sourceJavaCP = convmap2Java.get(((longchar)text).getCodePage().
                                                toStringMessage().
                                                toUpperCase());

should use an alternate version longchar._getCodePage() which returns the proper string directly instead of using the wrapper version. You'll have to create this new version.

8. In I18nOps.codepageConvert(), in this code:

               // target CP is valid, check the source CP
               if (text instanceof longchar)
               {
...
               }

I think this should have a check to see if the longchar has its codepage fixed. If it is not fixed, won't this return unknown value, which will appear like a codepage named "?"? This seems wrong. I think if the longchar has no fixed codepage, then it should probably default to SESSION:CPINTERNAL, right?

9. In I18nOps.chr(), the use of a byte[4] to convert seems wrong. It seems like this could treat values greater than 255 as 4 single byte characters depending on the source/target codepages specified.

10. I think it is incorrect to pass a char as input to I18nOps.asc() and it is incorrect to return a char from I18nOps.chr(). Can't we encounter unicode characters that would be more than 2 bytes in size? I think we should be passing this as int to handle all possible unicode characters (e.g. UTF-32).

11. In Stream.setConvertSource(String), why is convert set to true when targetCp is null? It seems like this is the opposite of what should happen, especially since sourceCp may also be null. There is a similar question for Stream.setConvertTarget() and its use of a null sourceCp.

12. Stream is used on both the server and the client. I think the direct usage of EnvironmentOps is a problem in this case. I also think we need to send cpinternal and cpstream values to the client once for the whole session, otherwise it will be expensive to use an up-call to the server each time a stream instance is created.

13. The Stream.convert member is modified multiple times, during construction, possibly when the converted 4GL code calls Stream.setConvert() and during setConvertSource|Target(). This seems like the order may be different from the 4GL processing, causing the flag to be set to the wrong value.

14. In gaps/lang_stmts.rules, I think the kw_fix_cp should probably be marked rt_lvl_basic (instead of rt_lvl_full) until we have run the 4GL testcases to confirm full compatibility.

15. methods_attributes.rules needs a history entry.
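
Notes 9 and 10 can be illustrated with plain Java, outside of FWD. This is only a minimal sketch: CodePointDemo and its chr()/asc() methods are hypothetical stand-ins, not the actual I18nOps API. It shows that an int code point can represent a supplementary Unicode character, while a single char cannot.

```java
final class CodePointDemo
{
   /** Hypothetical chr() analogue: build a string from any Unicode code point. */
   static String chr(int codePoint)
   {
      // Character.toChars() yields one char for BMP values and a surrogate pair otherwise
      return new String(Character.toChars(codePoint));
   }

   /** Hypothetical asc() analogue: return the first code point of the text as an int. */
   static int asc(String text)
   {
      return text.codePointAt(0);
   }

   public static void main(String[] args)
   {
      int supplementary = 0x1F600;          // a code point outside the Basic Multilingual Plane
      String s = chr(supplementary);

      // the string occupies 2 UTF-16 code units, so one char cannot carry the value
      System.out.println(s.length());
      System.out.println(asc(s) == supplementary);
   }
}
```

Passing and returning int keeps the round trip lossless; a char-based signature would only ever see one half of the surrogate pair.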

#77 Updated by Eugenie Lyzenko about 5 years ago

Greg Shah wrote:

5. In I18nOps.getCodePages(), the returned string will always end in a ,. Is that how the 4GL does it?

I think it is not the case. I thought about this possibility and the code:

...
   public static character getCodePages()
   {
      // TODO: implement me with taking into account the convmap.cp compatibility
      StringBuilder sbRes = new StringBuilder();

      // getting available code pages
      Iterator csIter = convmap2Java.keySet().iterator();
      while (csIter.hasNext())
      {
         sbRes.append(csIter.next());
         if (csIter.hasNext())
         {
            sbRes.append(",");
         }
      }
...

has protection against it. After csIter.next(), csIter.hasNext() returns true only when there are more items in the iterator. For the last item it returns false, so no , is appended to the end of the string.
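
The behaviour can be checked in isolation. A minimal sketch, assuming a hypothetical JoinDemo class and placeholder map entries (not the real convmap2Java contents): it uses the same iterator pattern as the code above.

```java
import java.util.Iterator;
import java.util.Map;

final class JoinDemo
{
   /** Joins the map keys with commas using the same hasNext() guard as getCodePages(). */
   static String joinKeys(Map<String, String> map)
   {
      StringBuilder sb = new StringBuilder();
      Iterator<String> iter = map.keySet().iterator();
      while (iter.hasNext())
      {
         sb.append(iter.next());
         // only append the separator when another element follows
         if (iter.hasNext())
         {
            sb.append(",");
         }
      }
      return sb.toString();
   }
}
```

For a map with the keys A and B this yields "A,B", the same as String.join(",", map.keySet()), and an empty map yields an empty string.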

#78 Updated by Eugenie Lyzenko about 5 years ago

File code_pages_order.jpg added

The original 4GL ordering for GET-CODEPAGES():

GET-CODEPAGES result order

#79 Updated by Greg Shah about 5 years ago

I think it is not the case. I thought about this possibility and the code:

Sorry, I mis-read the code.

The original 4GL ordering for GET-CODEPAGES:

Please add this comment in the static {} initializer (where convmap2JavaDefault is initialized):

// These mappings are explicitly being added in the same exact order
// they appear in the 4GL GET-CODEPAGES() function. Do not change
// the order, otherwise the result of that function will be incorrect.
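
The comment only helps if the map itself preserves insertion order. A minimal sketch, assuming the map is (or could be) a LinkedHashMap; the OrderedCodepages class and the three entries are placeholders, NOT the real 4GL GET-CODEPAGES order:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class OrderedCodepages
{
   /** LinkedHashMap iterates in insertion order; a plain HashMap gives no such guarantee. */
   private static final Map<String, String> CONVMAP_TO_JAVA = new LinkedHashMap<>();

   static
   {
      // Placeholder entries, added in a fixed order. Do not change
      // the order, otherwise names() will report a different sequence.
      CONVMAP_TO_JAVA.put("UTF-8", "UTF-8");
      CONVMAP_TO_JAVA.put("ISO8859-1", "ISO-8859-1");
      CONVMAP_TO_JAVA.put("1252", "windows-1252");
   }

   /** Returns the codepage names in the order they were registered. */
   static List<String> names()
   {
      return new ArrayList<>(CONVMAP_TO_JAVA.keySet());
   }
}
```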

#80 Updated by Eugenie Lyzenko about 5 years ago

Greg Shah wrote:

11. In Stream.setConvertSource(String), why is convert set to true when targetCp is null? It seems like this is the opposite of what should happen, especially since sourceCp may also be null. There is a similar question for Stream.setConvertTarget() and its use of a null sourceCp.

The idea behind this logic is: assume only sourceCP or targetCP is set to a valid value while the other (targetCP or sourceCP) remains null. What does that mean from a conversion perspective? I think it means source != target in general, so the conversion should be active. Then, if the missing CP is resolved to the default in the I18nOps code and we finally find that the source and target are the same, the actual conversion code will be skipped in I18nOps. Yes, this check happens rather late, but I guess an extra check is better than wrong handling.

#81 Updated by Eugenie Lyzenko about 5 years ago

Greg Shah wrote:

13. The Stream.convert member is modified multiple times, during construction, possibly when the converted 4GL code calls Stream.setConvert() and during setConvertSource|Target(). This seems like the order may be different from the 4GL processing, causing the flag to be set to the wrong value.

Yes, this is the intended design. In the 4GL, the explicit setting for this member is the NO-CONVERT option, while the default value is true. There are no 4GL calls to get the current option value. My idea is that the most recent related call (Stream.setConvert() or setConvertSource|Target()) defines the current effective value of the flag. All of these calls happen at the stream definition step. I think it is safe, but maybe I'm missing something.

BTW, I've got one more idea for the 4GL testcases. We need to know how code page conversion behaves if we change CPINTERNAL or CPSTREAM several times between I/O operations. Will the conversion be affected? I've updated the related 4GL test entry.

#82 Updated by Eugenie Lyzenko about 5 years ago

Greg Shah wrote:

12. Stream is used on both the server and the client. I think the direct usage of EnvironmentOps is a problem in this case. I also think we need to send cpinternal and cpstream values to the client once for the whole session, otherwise it will be expensive to use an up-call to the server each time a stream instance is created.

With my upcoming notes resolution update there will be no EnvironmentOps calls for code page related code in Stream. The I18nOps class has local context area for both server and client. The respective constants are initialized once per session so if my understanding is correct there will be no extra calls from client to server for every IO operation.

#83 Updated by Eugenie Lyzenko about 5 years ago

Task branch 3753a has been updated for review to revision 11326.

The notes are resolved except point 1 (not yet finished) and points 11, 12 and 13 (there is something to discuss there).

Also, for note 9: changed to use the code point string constructor for characters greater than 255.

Continuing to work on note 1.

#84 Updated by Eugenie Lyzenko about 5 years ago

Task branch 3753a has been updated for review to revision 11327.

Completed the code review notes resolution. To continue work on notes 11, 12 and 13, I need some feedback to decide what to do.

#85 Updated by Greg Shah about 5 years ago

We need to know how code page conversion behaves if we change CPINTERNAL or CPSTREAM several times between I/O operations. Will the conversion be affected? I've updated the related 4GL test entry.

I don't think this is possible. These attributes are read-only as found in #3753-55. The only way I have seen to set these is with the command line options. If that is correct, then these are fixed at the start of the 4GL process. Do you know of a way to change these values during the 4GL session instead of just with command line options?

If not, please edit the test entry to remove this item.

#86 Updated by Eugenie Lyzenko about 5 years ago

Greg Shah wrote:

We need to know how code page conversion behaves if we change CPINTERNAL or CPSTREAM several times between I/O operations. Will the conversion be affected? I've updated the related 4GL test entry.

I don't think this is possible. These attributes are read-only as found in #3753-55. The only way I have seen to set these is with the command line options. If that is correct, then these are fixed at the start of the 4GL process. Do you know of a way to change these values during the 4GL session instead of just with command line options?

Yes, you are right; these are read-only attributes that are set once at application start. Sorry for the confusion, I lost track of that for a while.

If not, please edit the test entry to remove this item.

Removed.

#87 Updated by Greg Shah about 5 years ago

Eugenie Lyzenko wrote:

Greg Shah wrote:

11. In Stream.setConvertSource(String), why is convert set to true when targetCp is null? It seems like this is the opposite of what should happen, especially since sourceCp may also be null. There is a similar question for Stream.setConvertTarget() and its use of a null sourceCp.

The idea behind this logic is: assume only sourceCP or targetCP is set to a valid value while the other (targetCP or sourceCP) remains null. What does that mean from a conversion perspective? I think it means source != target in general, so the conversion should be active. Then, if the missing CP is resolved to the default in the I18nOps code and we finally find that the source and target are the same, the actual conversion code will be skipped in I18nOps. Yes, this check happens rather late, but I guess an extra check is better than wrong handling.

The first call to either one of setConvertSource() or setConvertTarget() will always find the other value to be null. For example, if both setConvertSource() and setConvertTarget() are being called for a stream, the first one called will set convert true and the result of the second call will depend on the specific values being used.

It is possible to specify NO-CONVERT in the stream definition. That will call setConvert(false). Then if other code specifies CONVERT SOURCE x TARGET y (or one of the other forms), then this will be overridden. Is that what the 4GL does in this case?

Interestingly enough, when neither NO-CONVERT or CONVERT ... is present, then the convert flag should default to true and the source/target codepages are simply the cpinternal and cpstream (on output and the other way around for input).

My concern is that we can lose state in all this processing. It seems to me that convert should default to true and only ever be flipped to false if NO-CONVERT is specified.

The setConvertSource() and setConvertTarget() don't need to change that flag (unless you find evidence that the 4GL does it that way). For example, that an earlier NO-CONVERT is overridden by a later CONVERT .... We should check the other ordering too (CONVERT... followed by NO-CONVERT).

Then at input or output time, we should be able to resolve the source/target codepages and whether conversion is needed as follows:

   String sourceCodepage(boolean input)
   {
      return (sourceCp == null) ? sourceCp : (input ? streamCp : internalCp);
   }

   String targetCodepage(boolean input)
   {
      return (targetCp == null) ? targetCp : (input ? internalCp : streamCp);
   }

   boolean needsConvert(String sourceCp, String targetCp)
   {
      return convert && 
             ((sourceCp == null && targetCp != null) ||
              (sourceCp != null && targetCp == null) ||
              !sourceCp.equalsIgnoreCase(targetCp));
   }

Does this make sense? I think this will work so long as we ensure that sourceCp and targetCp are never set to "" or a string with " " (whitespace).
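
For experimenting with the idea outside of FWD, here is a self-contained sketch. It is not the code committed to 3753a. It assumes the intent is "use the explicit codepage when one was given, otherwise fall back to the session default", and that internalCp and streamCp are never null. The StreamCodepages class and its setters are hypothetical names used only for illustration.

```java
final class StreamCodepages
{
   private final String internalCp;   // session default, assumed non-null
   private final String streamCp;     // session default, assumed non-null
   private String sourceCp;           // explicit CONVERT SOURCE value, may be null
   private String targetCp;           // explicit CONVERT TARGET value, may be null
   private boolean convert = true;    // only NO-CONVERT flips this to false

   StreamCodepages(String internalCp, String streamCp)
   {
      this.internalCp = internalCp;
      this.streamCp = streamCp;
   }

   void setConvert(boolean convert) { this.convert = convert; }
   void setSourceCp(String cp)      { this.sourceCp = cp; }
   void setTargetCp(String cp)      { this.targetCp = cp; }

   /** Explicit source if given; otherwise stream CP on input, internal CP on output. */
   String sourceCodepage(boolean input)
   {
      return (sourceCp != null) ? sourceCp : (input ? streamCp : internalCp);
   }

   /** Explicit target if given; otherwise internal CP on input, stream CP on output. */
   String targetCodepage(boolean input)
   {
      return (targetCp != null) ? targetCp : (input ? internalCp : streamCp);
   }

   /** Conversion is needed only when enabled and the resolved codepages differ. */
   boolean needsConvert(boolean input)
   {
      return convert &&
             !sourceCodepage(input).equalsIgnoreCase(targetCodepage(input));
   }
}
```

Because both helpers always resolve to a non-null name under these assumptions, needsConvert() does not need separate null checks.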

#88 Updated by Eugenie Lyzenko about 5 years ago

Greg Shah wrote:

The first call to either one of setConvertSource() or setConvertTarget() will always find the other value to be null. For example, if both setConvertSource() and setConvertTarget() are being called for a stream, the first one called will set convert true and the result of the second call will depend on the specific values being used.

It is possible to specify NO-CONVERT in the stream definition. That will call setConvert(false). Then if other code specifies CONVERT SOURCE x TARGET y (or one of the other forms), then this will be overridden. Is that what the 4GL does in this case?

Interestingly enough, when neither NO-CONVERT or CONVERT ... is present, then the convert flag should default to true and the source/target codepages are simply the cpinternal and cpstream (on output and the other way around for input).

My concern is that we can lose state in all this processing. It seems to me that convert should default to true and only ever be flipped to false if NO-CONVERT is specified.

The setConvertSource() and setConvertTarget() don't need to change that flag (unless you find evidence that the 4GL does it that way). For example, that an earlier NO-CONVERT is overridden by a later CONVERT .... We should check the other ordering too (CONVERT... followed by NO-CONVERT).

Then at input or output time, we should be able to resolve the source/target codepages and whether conversion is needed as follows:

[...]

Does this make sense?

Yes, I'm still thinking about this too, and I have a similar solution (actually, I prepared the changes and was going to upload them):
1. Introduce another flag: convertInt.
2. The legacy convert flag will store only the explicit change made by the NO-CONVERT option (it will always match the current CONVERT read-only attribute):

...
   public void setConvert(boolean convert)
   {
      this.convert = convert;
      this.convertInt = convert;
   }   
...
   public boolean getConvert()
   {
      return convert;
   }

3. The effective convert flag can be changed in different places (leaving convert untouched):

   public void setConvertSource(String cp)
   {
...
      if (sourceCp != null && targetCp != null)
      {
         convertInt = !targetCp.equalsIgnoreCase(sourceCp);
      }
   }
...
   public void setConvertTarget(String cp)
   {
...
      if (sourceCp != null && targetCp != null)
      {
         convertInt = !sourceCp.equalsIgnoreCase(targetCp);
      }
   }
...
   private void initDefaultCodePages()
   {
...
      convertInt = !streamCp.equalsIgnoreCase(internalCp);
   }

The initDefaultCodePages() method is always called on Stream construction, while setConvert(Source|Target) are optional and we cannot be sure which one will be called first, so I've duplicated the check.

4. Then, when the I/O happens, the conversion condition becomes:

      if (convertInt && convert)
      {
         // perform the I/O operation with conversion
      }

Is this an acceptable approach? If not, please let me know and I will rework it to use your version.

I think this will work so long as we ensure that sourceCp and targetCp are never set to "" or a string with " " (whitespace).

We certainly need to add the protection for this case.
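
One possible form of that protection, as a minimal sketch: the CodepageNames class and normalize() are hypothetical, and treating a blank name as "not specified" is an assumption that would still need to be checked against the 4GL.

```java
final class CodepageNames
{
   /**
    * Returns null for null, empty or whitespace-only names and the trimmed
    * name otherwise, so callers only ever have to test for null.
    */
   static String normalize(String cp)
   {
      if (cp == null)
      {
         return null;
      }
      String trimmed = cp.trim();
      return trimmed.isEmpty() ? null : trimmed;
   }
}
```

Calling this in setConvertSource() and setConvertTarget() before storing the value would keep "" and " " out of sourceCp and targetCp.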

#89 Updated by Greg Shah about 5 years ago

I prefer to go with the workers as I documented them. The primary reason is that the logic of whether or not we are converting (and which codepages are used) is very clear. It can be seen from just the 3 helper methods. A secondary reason is that the code as you've recorded it does not handle the cases where we have only one (sourceCp or targetCp) set null and the other not set. I am not opposed to caching the result of the first input or output, but I don't want to do it "as we go". There is no advantage to doing that and it just spreads the calculation out over lots of places, making it harder to see the complete logic.

#90 Updated by Eugenie Lyzenko about 5 years ago

Greg Shah wrote:

I prefer to go with the workers as I documented them. The primary reason is that the logic of whether or not we are converting (and which codepages are used) is very clear. It can be seen from just the 3 helper methods. A secondary reason is that the code as you've recorded it does not handle the cases where we have only one (sourceCp or targetCp) set null and the other not set. I am not opposed to caching the result of the first input or output, but I don't want to do it "as we go". There is no advantage to doing that and it just spreads the calculation out over lots of places, making it harder to see the complete logic.

OK. You mean the helper methods should be implemented in the Stream class, correct?

#91 Updated by Greg Shah about 5 years ago

You mean the helper methods should be implemented in the Stream class, correct?

Yes.

#92 Updated by Eugenie Lyzenko about 5 years ago

Greg Shah wrote:

You mean the helper methods should be implemented in the Stream class, correct?

Yes.

Done. Task branch 3753a has been updated for review to revision 11328.

This is the notes resolution for points 11 and 13.

#93 Updated by Greg Shah about 5 years ago

Code Review Task Branch 3753a Revision 11328

The changes are a good step.

I have reworked the code to make Stream more efficient, to fix a bug in my original Stream code from #3753-87, to eliminate code duplication in Stream and to move the actual conversion processing in the I18nOps worker code which avoids use of BDT wrappers when called from Stream.

Please see revision 11329. If you have no concerns, then retest your testcases with this version. Fix any issues. Then you can start regression testing.

#94 Updated by Greg Shah about 5 years ago

Revision 11330 fixes a bug in my management of the caching flag.

#95 Updated by Eugenie Lyzenko about 5 years ago

I'm OK with the changes. I made one minor fix.

Task branch 3753a has been updated for review to revision 11331.

This fixes an issue in the switch statement handling that calculated the wrong search type and prevented the BOTH mode search.

So far the local tests look OK to me. Starting the regression tests for conversion and runtime.

#96 Updated by Eugenie Lyzenko about 5 years ago

Greg,

The conversion testing is in progress.

If the conversion is OK, can we include the changes for 4010 and 4066 in the 3753a branch to get those fixes into trunk sooner?

#97 Updated by Greg Shah about 5 years ago

If the conversion is OK, can we include the changes for 4010 and 4066 in the 3753a branch to get those fixes into trunk sooner?

Yes, go ahead with this.

#98 Updated by Eugenie Lyzenko about 5 years ago

Conversion passed; the sources are identical except for one newly added call to setNoMap(), which is OK considering the new NO-MAP option support.

But there are many compilation errors like this:

...
    [javac] /home/evl/testing/majic/src/aero/timco/majic/item/Item58R.java:589: error: incompatible types: int cannot be converted to int64
    [javac]                         new PutField(() -> chr(10))
    [javac]                                                ^
...

Working on a resolution. It looks like the character.java code has a regression in its handling of the chr() method.
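
The actual cause in character.java is not shown here. As a general illustration only, this kind of javac error appears when the only applicable method takes a wrapper class and the caller passes an int literal, because Java does not convert an int to a user-defined class implicitly. A minimal sketch, where Int64Stub and ChrOverloads are hypothetical stand-ins and not the FWD classes:

```java
final class ChrOverloads
{
   /** Hypothetical stand-in for a numeric wrapper type such as int64. */
   static final class Int64Stub
   {
      final long value;

      Int64Stub(long value)
      {
         this.value = value;
      }
   }

   /** Wrapper-typed variant. */
   static String chr(Int64Stub code)
   {
      return new String(Character.toChars((int) code.value));
   }

   /**
    * Primitive overload. Without it, a call like chr(10) does not compile:
    * "incompatible types: int cannot be converted to Int64Stub".
    */
   static String chr(int code)
   {
      return chr(new Int64Stub(code));
   }
}
```

Keeping a primitive overload next to the wrapper-typed one lets already converted call sites such as chr(10) continue to compile.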

#99 Updated by Eugenie Lyzenko about 5 years ago

Task branch 3753a has been updated for review to revision 11332.

Fixed the regression in character.java and merged the fixes from 4010 and 4066. Starting the runtime regression tests.

#100 Updated by Eugenie Lyzenko about 5 years ago

One main round of runtime testing has passed; I started another one to rule out false failures. The CTRL-C part is OK.

#101 Updated by Eugenie Lyzenko about 5 years ago

Testing completed. No regression has been found. The results: 3753a_11332_32748a0_20190510_evl.zip.

So branch 3753a rev 11332 is ready to be merged to trunk. Please let me know if I can do this now.

#102 Updated by Greg Shah about 5 years ago

You can merge to trunk.

#103 Updated by Eugenie Lyzenko about 5 years ago

Greg Shah wrote:

You can merge to trunk.

OK. Starting the merge process.

#104 Updated by Eugenie Lyzenko about 5 years ago

Branch 3753a was merged to trunk as revno 11307 then it was archived.

#105 Updated by Greg Shah about 5 years ago

TODO: We need to check the FWD runtime for references like this: a_string.getBytes(Charset.forName("ISO-8859-1")) (this example is from BinaryData). I suspect these kinds of cases need to be switched to honoring one of the CP* attributes (e.g. CPINTERNAL).
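
To show why the hard-coded charset matters, here is a minimal sketch in plain Java. The ConfiguredCharsetDemo class is hypothetical, and the cpInternal parameter stands for whatever Java charset the session's CPINTERNAL would be mapped to; how FWD would obtain that value is not covered here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ConfiguredCharsetDemo
{
   /** Encodes text using the supplied charset instead of a hard-coded one. */
   static byte[] encode(String text, Charset cpInternal)
   {
      return text.getBytes(cpInternal);
   }

   public static void main(String[] args)
   {
      String text = "\u00e9";   // LATIN SMALL LETTER E WITH ACUTE

      // the same character has a different byte representation per charset
      System.out.println(encode(text, StandardCharsets.ISO_8859_1).length);
      System.out.println(encode(text, StandardCharsets.UTF_8).length);
   }
}
```

Note that String.getBytes(Charset) silently substitutes a replacement byte for characters the charset cannot represent, so a hard-coded ISO-8859-1 can lose data without raising an error.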

#106 Updated by Greg Shah about 5 years ago

Are 4GL system error messages translated, too? (we have been assuming YES, that CURRENT-LANGUAGE will select different message sources in OpenEdge)

The answer to this is YES. We will need to localize the messages.

Please note that we will need to find out if this will vary by current setting (at error time) of SESSION:CURRENT-LANGUAGE, by SESSION:CURRENT-LANGUAGE at the time the failing code was loaded or if it is global to the 4GL process (based on locale).

Fixing this will also be a good opportunity to create a better set of error helpers that can allow us to centralize the error processing.

#107 Updated by Greg Shah about 5 years ago

TODO: Stanislav notes the following:

When a longchar/clob value is assigned to a clob field, it is implicitly converted into the codepage of the target field. I suppose we'll have to handle it in RecordBuffer.invoke in the future.

We will need to handle this in assignments in both directions (e.g. also TO longchar) if we don't already do it properly.

I think BUFFER-COPY will need this too.

#108 Updated by Eric Faulhaber about 5 years ago

Greg Shah wrote:

TODO: Stanislav notes the following:

When a longchar/clob value is assigned to a clob field, it is implicitly converted into the codepage of the target field. I suppose we'll have to handle it in RecordBuffer.invoke in the future.

We will need to handle this in assignments in both directions (e.g. also TO longchar) if we don't already do it properly.

I think BUFFER-COPY will need this too.

FWD's implementation of BUFFER-COPY uses the RecordBuffer invocation handler for its individual fields, so if we implement it in the invocation handler, we are covered for BUFFER-COPY.

#109 Updated by Marian Edu almost 5 years ago

Hi @Greg, that task was mentioned in our last list so if you still need some test cases there can someone please make a list of what needs to be covered here?

Thanks

#110 Updated by Greg Shah almost 5 years ago

Marian Edu wrote:

Hi @Greg, that task was mentioned in our last list so if you still need some test cases there can someone please make a list of what needs to be covered here?

Thanks

Yes, we still need tests. Please see the items in these notes:

#111 Updated by Greg Shah over 4 years ago

Related to Feature #4378: properly handle clob/lonchar assignment, especially the implicit codepage conversion added

#112 Updated by Greg Shah over 4 years ago

Eugenie: I don't think that the NO-MAP I/O option was ever implemented in the runtime. It just looks like a stub.

#113 Updated by Greg Shah over 4 years ago

Eugenie: The NO-CONVERT and CONVERT I/O options are listed as full and stub. I think this is incorrect. I think NO-CONVERT should be runtime full and CONVERT should be runtime "basic" (first working implementation but needs full testing and compatibility). Is that correct?

#114 Updated by Eugenie Lyzenko over 4 years ago

Greg Shah wrote:

Eugenie: The NO-CONVERT and CONVERT I/O options are listed as full and stub. I think this is incorrect. I think NO-CONVERT should be runtime full and CONVERT should be runtime "basic" (first working implementation but needs full testing and compatibility). Is that correct?

Yes.

#115 Updated by Eugenie Lyzenko over 4 years ago

Greg Shah wrote:

Eugenie: I don't think that the NO-MAP I/O option was ever implemented in the runtime. It just looks like a stub.

Yes, I think MAP/NO-MAP should have conversion support and be stubbed in the runtime (because it was not clear what we need to do with this option at runtime).

#116 Updated by Mihai Popescu-Tiganea over 4 years ago

Eugenie Lyzenko wrote:

4GL tests requirement
The dependency of the following options (most of them are SESSION attributes) on the CURRENT-LANGUAGE setting needs to be tested in a 4GL environment:

CPINTERNAL - session_cpinternal.p

CPSTREAM - session_cpstream.p

CPCASE - session_cpcase.p

CPCOLL - session_cpcoll.p

CPLOG - session_cplog.p

CPPRINT - session_cpprint.p

CPRCODEIN - session_cprcodein.p

CPRCODEOUT - session_cprcodeout.p

CPTERM - session_cpterm.p

CHARSET - session_charset.p

CODEPAGE - rcode_info_codepage.p

The possible 4GL test scenario is:
1. Check the initial attribute value from the list.
2. Change the CURRENT-LANGUAGE.
3. Re-check the attribute value to find out if it is changed.

In a perfect world we need to know the behavior for both Linux and Windows 4GL systems.

The tests requested in this note have been created.
For each attribute, the corresponding test file is mentioned.
All files are in the same directory: /testcases/i18n.
The tests were made on a Windows 4GL system.

#117 Updated by Mihai Popescu-Tiganea over 4 years ago

Constantin Asofiei wrote in #3753-10

All runtime needs to be implemented. Below is a list of what it needs to be done.
For FIX-CODEPAGE and GET-CODEPAGE, both conversion and runtime support required. Testing should be done for:

is the codepage copied from one longchar value to another? - codepage_copy.p

is the codepage involved in comparison operators? - codepage_operators.p

FIX-CODEPAGE with empty, unknown, non-empty longchar vars - codepage_fix.p

what if the codepage is already set? - codepage_fix.p

clob fields - can they work with fix-codepage and get-codepage? Can the codepage be set in some other way? - codepage_clob.p

editor with large-object (which can display a LONGCHAR val, with or without a codepage set) - how is the text displayed?

assigning a longchar to a char - is the codepage inherited from the rvalue? - codepage_character.p

assignment between longchars - the same, is the code page included in the assign? - codepage_longchar.p

is the codepage affecting the character bytes? For example: - codepage_char_bytes.p
[...]
Are lc1a and lc1b equal - is the final text in lc1b unaffected by the initial codepage in lc2? The idea here is to determine if the longchar's codepage is used when assigning a text to it (thus the reference text is kept in memory converted in the target codepage).

how are other statements which work with strings, affected? - codepage_asc.p, codepage_caps.p, codepage_lc.p, compare_abl.p

dlc/convmap.cp support:

we need to specify the list of known codepages (or default to the Java's available codepages) -

CODEPAGE-CONVERT, INPUT STREAM CONVERT work with this

what is the default codepage value - some explanation is in https://documentation.progress.com/output/ua/OpenEdge_latest/index.html#page/dvint/determining-the-code-page.html

We have developed the tests suggested by Constantin in this note.
The tests are located in /testcases/i18n.
For:

CODEPAGE-CONVERT and INPUT STREAM CONVERT:

combinations of source and target codepages

xml files with the accepted combinations are in /testcases/i18n/4gl
these files are generated by asc_support.p, chr_support.p, cpconvert_support.p and input_stream_support.p
when these files are run with the FWD_VERSION env. variable set, the output directory will be /testcases/i18n/fwd. Comparing the xml files from the 4gl and fwd directories should help.
we added the ASC and CHR functions to the tests

the source codepage is not the real text's codepage

test made for CODEPAGE-CONVERT - cpconvert_incorrect.p
for INPUT STREAM CONVERT, ASC and CHR, a source codepage mismatch error cannot be raised because this information seems to exist only in longchar variables.

source/target codepages are not in the convmap.cp list

test made for ASC - asc_no_convmap.p
test made for CHR - chr_no_convmap.p
test made for CODEPAGE-CONVERT - cpconvert_no_convmap.p
test made for INPUT STREAM CONVERT - input_stream_no_convmap.p

#118 Updated by Greg Shah over 4 years ago

Assignee deleted (~~Eugenie Lyzenko~~)

#119 Updated by Mihai Popescu-Tiganea over 4 years ago

To run the tests, the following steps should be done:

Create a database named tran in folder testcases/tran_man/db
Load the definitions and content from testcases/tran_man/data.
After that, please disconnect from the db.

Update or create the database fwd with the definitions found in testcases/db using files: fwd.db
Because the fwd database is used in the majority of tests, do not forget to load the users and domains from the same folder.

Constantin Asofiei wrote on #3753-11:

Greg, see above for the runtime I18N - are these what you are looking for?
About the translation manager and translatable strings; we need tests to prove:

Are all strings without :U translatable? - see these https://documentation.progress.com/output/ua/OpenEdge_latest/index.html#page/dvref/-22--22character-string-literal.html

if a string in the code that is delimited with quotes (e.g. "Test string") or apostrophes (e.g. 'Test string') is found in the translation table, then it is translated
strings without :U that are not in the translation table are not translated
strings with :U are not translated even if they are in the translation table
the translation table contains only translations of strings from the code
file u_option.p tests this

Translation Database

How is the translation database saved? Is this a simple Progress DB? If so, what is the schema?

the translation database is created by the translation manager
it is a simple Progress DB
after a dump operation, the files (schema and content) are in the testcases/tran_man/data folder
the database consists of 13 tables, but it seems that only 4 are important for performing translation without tranman
the tables are: XL_instance, XL_Language, XL_string_info and XL_translation
the translation mechanism could be:
add the file and the length of the string to be translated from that file to XL_instance
add the language to XL_Language
add the string to be translated to XL_string_info
add the language and the translated string to XL_translation
detailed explanations
- XL_instance contains lines like 11 14 "tran_man\static_string.p" 1 2458685.5095 no 17 1 "MESSAGE" "" ? 1
  - 11 is the index from XL_string_info that identifies the string to be translated
  - 14 is this table's index
  - tran_man\static_string.p is the procedure that contains the string to be translated
  - 17 is the number of characters in the string
  - the remaining information does not seem to be important
- XL_Language contains lines like "German" 0, which name the translation language
- XL_string_info has lines like 11 "transparent glass" 2458683.38797 "" "transparent glass"
  - 11 is the index of this table; it is also mentioned in XL_instance
  - "transparent glass" is the string that should be translated
  - we do not know why this text appears twice in every line
- XL_translation has lines like 11 14 "German" "farbloses glas" 2458685.51116
  - 11 14 identifies the string to be translated, from XL_instance
  - "German" is the translation language
  - "farbloses glas" is the translated version of the string

What is the character encoding of the database?

Code Page : UNDEFINED and Collation : TRANMAN

Can the source text and translated text be in different encodings?

the compiled file (.r) contains the source strings and the translated strings for all languages specified in the compile statement.
this file uses only one codepage

Is there any functionality in OpenEdge that read the translation database at runtime or is this only for building the compile-time r-code text segments?

the translation database is only used at compile-time for building the r-code text segments.
I do not know of any other functionality that reads the translation database at runtime

How you can switch between translations, is this related to CURRENT-LANGUAGE?

switching between translations is possible using CURRENT-LANGUAGE
to use CURRENT-LANGUAGE, you should compile the code with the languages option
the language set in CURRENT-LANGUAGE should match a language specified in the compile statement

Are 4GL system error messages translated, too? (we have been assuming YES, that CURRENT-LANGUAGE will select different message sources in OpenEdge)

System error messages are translated using the PROMSGS statement, for example PROMSGS = prolang/promsg.rus.
the extension indicates the language that will be used for error messages
we did not find any connection between the CURRENT-LANGUAGE and PROMSGS statements

Are only standalone static strings translated? What if the string is in an expression, like "there is an error in program" + pname - the there is an error in program, will this string be translatable?

file static_string.p demonstrates that only static standalone strings are translated (those that are found at compile-time and are identical to one defined in XL_string_info)

How does the 4GL behave if a string has a translation and others do not; is this something done at compile time, so a translation can't be done in future, or at runtime, and the .r will see any newly added translation?

If a string does not have a translation, it is left untranslated.
Translation is done at compile time, not at runtime.
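
To make the observed behavior concrete, here is a minimal Java sketch of a lookup with the same fallback: a string is replaced only when an identical source string has a translation for the selected language, and otherwise it is returned unchanged. The class and method names are hypothetical. This only models the 4GL behavior described above and is not how FWD or the 4GL compiler implements it.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of the translations captured at compile time. */
final class CompiledTranslations {
    // language -> (source string -> translated string)
    private final Map<String, Map<String, String>> byLanguage = new HashMap<>();

    /** Registers one translation, as if it had been read at compile time. */
    void add(String language, String source, String translated) {
        byLanguage.computeIfAbsent(language, k -> new HashMap<>()).put(source, translated);
    }

    /** Returns the translation for an exact match, otherwise the source string itself. */
    String translate(String language, String source) {
        Map<String, String> table = byLanguage.get(language);
        if (table == null) {
            return source; // language was not part of the compile
        }
        return table.getOrDefault(source, source);
    }
}
```

Because the table is filled once up front, a translation added later is not seen, which mirrors the compile-time behavior noted above.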

Constantin Asofiei wrote on #3753-13:

Something else to add:

texts at the schema definition (labels, and so on) - are these translatable?

Texts from the schema definition are not translatable.
In the fwd db we added the label 'Search' to the field customerAddress of the customer table.
The file db_meta.p reveals this label, but it is not translated.

#120 Updated by Mihai Popescu-Tiganea over 4 years ago

We added tests for translation using the Chinese language.
For Chinese we have to use the UTF-8 code page for all involved entities.
To run these tests, the following actions must be taken:

create a UTF-8 database named tran.db in testcases/tran_man/db
- starting from Data Dictionary -> Create Database -> A Copy of Some Other Database
- use empty.db from DLC/prolang/utf
load definitions and content from testcases/tran_man/data
start a session using UTF-8 code page

The procedures that use the test files are located in testcases/tran_man/run.
The test files are compiled using the COMPILE statement with different languages.
The compiled code is saved in testcases/tran_man/obj/tran_man.
The compiled code is run using the languages mentioned in the COMPILE statement, and the translation is checked.
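
As a side illustration of why a single-byte code page is not enough for the Chinese tests, the Java sketch below checks whether a text can be encoded in a given charset. The helper class is hypothetical and uses only the standard java.nio.charset API. It says nothing about how the 4GL or FWD handle code pages internally.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: can every character of the text be encoded in the charset? */
final class CodePageCheck {
    static boolean canEncode(String text, Charset charset) {
        // A fresh encoder is created per call because encoders are not thread-safe.
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String cjk = "\u73bb\u7483"; // two CJK characters
        System.out.println(canEncode(cjk, StandardCharsets.ISO_8859_1));
        System.out.println(canEncode(cjk, StandardCharsets.UTF_8));
    }
}
```

CJK characters cannot be represented in ISO-8859-1, while UTF-8 can encode them, so the database, the session and the test files all need to agree on UTF-8.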

#121 Updated by Marian Edu about 4 years ago

Some things we've found and tried to fix so far while implementing the OO base classes...
- codepage conversion support is not using any 'convmap table'; not sure if there are any plans for custom codepage conversion at all, or whether to just use what is available in Java
- 'iso8859-1' seems to be considered the 4GL default
- CHR and ASC are not double-byte enabled; codepage defaults and validation are incomplete

Hopefully we will get some of those fixes in so our tests on the OO implementation pass... there are probably more, like keyboard, stream, code codepage support.

Ah, as a side note, we've found a strange issue with conversion: for the backslash escape in strings present in source code, the backslash simply disappears in the generated Java code. Otherwise the 4GL escape character (tilde) seems to work just fine.

#122 Updated by Greg Shah almost 4 years ago

Ah, as a side note, we've found a strange issue with conversion: for the backslash escape in strings present in source code, the backslash simply disappears in the generated Java code. Otherwise the 4GL escape character (tilde) seems to work just fine.

In cfg/p2j.cfg.xml we have the opsys parameter. If it is set to UNIX, then we act like the 4GL compiler on Linux/UNIX and honor both the \ and ~ as escape chars. If it is set to WIN32, then we act like the 4GL compiler on Windows and we only honor ~. Perhaps this is what you are seeing.
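
The opsys rule described above can be sketched in Java as follows. This is a simplified model with hypothetical names, not the FWD conversion code: the character after an escape character is copied literally, and special sequences such as ~n are deliberately not handled.

```java
import java.util.Locale;

/** Hypothetical model of which characters act as escapes for a given opsys value. */
final class EscapeChars {
    /** The tilde is always an escape; the backslash only when opsys is UNIX. */
    static boolean isEscapeChar(char c, String opsys) {
        boolean unix = "UNIX".equals(opsys.toUpperCase(Locale.ROOT));
        return c == '~' || (unix && c == '\\');
    }

    /** Drops each escape character and copies the following character as-is (simplification). */
    static String dropEscapes(String literal, String opsys) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < literal.length(); i++) {
            char c = literal.charAt(i);
            if (isEscapeChar(c, opsys) && i + 1 < literal.length()) {
                i++; // skip the escape character itself
                out.append(literal.charAt(i));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Under this model, the 4GL text dir\sub keeps its backslash with opsys set to WIN32 and loses it with UNIX, which matches the disappearing backslash reported above.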

#123 Updated by Greg Shah almost 4 years ago

Related to Feature #4761: I18N phase 3 added

#124 Updated by Greg Shah almost 4 years ago

% Done changed from 0 to 100
Status changed from WIP to Closed

The support from this task is already in trunk. The remaining items will be tracked in #4761.

#125 Updated by Greg Shah about 2 years ago

Related to Feature #6451: I18N phase 4 added


	Related to Base Language - Feature #3292: i18n improvements	Closed
	Related to Base Language - Feature #3817: create resource bundles from string literals and implement optional support for setting values from the translation manager database	Closed
	Related to Base Language - Feature #4378: properly handle clob/lonchar assignment, especially the implicit codepage conversion	Closed
	Related to Base Language - Feature #4761: I18N phase 3	New
	Related to Base Language - Feature #6451: I18N phase 4	New
