Feature #27: implement a compatible ENCODE built-in function - Base Language - Golden Code Redmine

Feature #27

implement a compatible ENCODE built-in function

Added by Greg Shah almost 19 years ago. Updated over 7 years ago.

Status:

Closed

Priority:

Normal

Assignee:

Greg Shah

Target version:

FWD - Cleanup and Stablization for Server Features

Start date:

08/17/2013

Due date:

10/25/2013

% Done:

Estimated time:

160.00 h

billable:

vendor_id:

GCD

version_reported:

version_resolved:

progress_knowledge_base_article_21685_one_way_uniqueness_of_progress_4gl_encode_function.pdf (56.5 KB) Greg Shah, 05/02/2013 02:16 PM

progress_knowledge_base_article_10715_what_is_the_algorith_used_by_encode_4gl_function.pdf (44 KB) Greg Shah, 05/02/2013 02:16 PM

stack_overflow_article_11742310_progress_10.1C_4gl_encode_function.pdf (148 KB) Greg Shah, 05/02/2013 02:16 PM

crypto_pet_peeves_hashing_encoding_its_all_the_same_right_patrick_toomey_neohapsis_20080825.pdf (479 KB) Greg Shah, 05/02/2013 03:16 PM

crc_v3.txt (87 KB) Greg Shah, 09/03/2014 12:59 PM

stack_overflow_11742310_progress_10.1c_4gl_encode_function_updated_20140902.pdf - Updated version of previously linked article. (178 KB) Greg Shah, 09/03/2014 01:14 PM

stack_exchange_how_can_i_reverse_engineer_a_hash_code_4gl_encode_20140902.pdf - New article found about Progress encode. (146 KB) Greg Shah, 09/03/2014 01:14 PM

lgpl_github_project_for_progress_encode_20140902.pdf - Main Github project page for LGPL C# open source encode implementation. (125 KB) Greg Shah, 09/03/2014 01:14 PM

lgpl_c_sharp_encode_source_20140902.pdf - Primary source code for LGPL C# open source encode implementation. (135 KB) Greg Shah, 09/03/2014 01:14 PM

CustomCRC16Encoder.java - compatible Java ENCODE implementation (NOT integrated into P2J) (23.2 KB) Greg Shah, 11/04/2014 03:01 PM

encode_binary_data_input.p (1.32 KB) Greg Shah, 11/04/2014 03:02 PM

encode_text_input.p (537 Bytes) Greg Shah, 11/04/2014 03:02 PM

StringGenerator.java (3.95 KB) Greg Shah, 11/04/2014 03:02 PM

ges_upd20141107a.zip (680 KB) Greg Shah, 11/07/2014 12:10 PM

ges_upd20141110a.zip (682 KB) Greg Shah, 11/10/2014 02:27 PM

ges_upd20141110b.zip (682 KB) Greg Shah, 11/10/2014 03:17 PM

ges_upd20141111a.zip (684 KB) Greg Shah, 11/11/2014 11:22 AM

ges_upd20141112a.zip (684 KB) Greg Shah, 11/12/2014 03:12 PM

History

#1 Updated by Nick Saxon almost 19 years ago

@22028 - the need to reimplement ENCODE in P2J:

In MAJIC, ENCODE is used to verify users passwords. The encoded passwords are kept in the MAJIC database. The conversion choices are:
- reimplement ENCODE so that a 100% compatible Java implementation can be used in the converted P2J application
- make the users chose new passwords so that no 100% compatibility is required
- change the existing MAJIC application and make it collect the source passwords before migration and then encrypt those password by regular P2J means.

The ENCODE reimplementation was seen as the preferred solution.

#2 Updated by Nick Saxon almost 19 years ago

@22029 - facts about Progress ENCODE:

1. Progress' builtin function ENCODE is a one way lossy encryption function which takes an arbitrarily long input and produces a 16-character encoding using the A-Za-z alphabet (52 characters).

2. All characters on input are taken into consideration and the output varies with the variations of input.

3. The result looks like a pretty solid implementation of a cryptographic hash function. For instance, encode("aaaaaaaabbbbbbbb") != encode("bbbbbbbbaaaaaaaa") etc.

#3 Updated by Nick Saxon almost 19 years ago

@22030 - approaches to the reimplementation of ENCODE:

1. No reverse engineering of any kind is allowed as it is the most likely against the Progress license terms.

2. Using ENCODE to produce samples of the output does not violate the license.

3. ENCODE will be used to produce a large number of samples and those samples will be used for comparisons with the output of some well known encryprion and hashing functions. The expected outcome is a pattern that helps in guessing the applied algorithm.

#4 Updated by Nick Saxon almost 19 years ago

@22031 - first findings:

When producing a 16M records file with ENCODE result using all binary combinations from 00000000 through 00ffffff, I've found that ENCODE is sensitive to the embedded nulls.

The conclusion is that the implementation is based on a function call that operates the null terminated strings.

Instead of the original range, I've modified the loop to run from 01010101 through 01ffffff and bypass all 00s.

The file name is enc4-1.txt on phantom, /home/nvs.
Below is the progress code used to generate the file:

def var i as int.
def var j as int init 1.
def var k as int init 1.
def var l as int init 1.
def var m as int init 1.
def var c1 as char.
def var c2 as char.
def var c3 as char.
def var c4 as char.
def var c  as char.

output to enc4-1.txt.

do i = 1 to 16777215:
   c1 = chr(j).
   c2 = chr(k).
   c3 = chr(l).
   c4 = chr(m).
   c = c1 + c2 + c3 + c4.
   put j format "999" space k format "999" space l format "999" space  
       m format "999" space encode(c) format "x(16)" skip.

   m = m + 1.
   if m = 256 
   then do:
      m = 1.
      l = l + 1.
      if l = 256
      then do:
         l = 1.
         k = k + 1.
         if k = 256
         then 
            leave.
      end.
   end.
end.

output close.

#5 Updated by Nick Saxon almost 19 years ago

@22032 - statistical analysis of ENCODE results:

The following code is a tool used to collect some statistics about the ENCODE results.

#define _XOPEN_SOURCE
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

long stats[52][16];
int main(int argc, char* argv[])
{
   char* a52 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
   FILE* f;
   int i = 0;
   int j, k;
   char rec[34];
   char c;

   f = fopen(argv[1], "r");
   if (f == NULL)
   {
      printf("could not open %s\\n", argv[1]);
      return 1;
   }

   while(fread(rec, 33, 1, f) == 1)
   {
      i++;
      if (i % 1000000 == 0)
         printf("%u records read\\n", i);

      for (j = 0; j < 16; j ++)
      {
         c = rec[16 + j];
         k = strchr(a52, c) - a52;
         stats[k][j] ++;
      }
   }

   printf("Total %u records read\\n", i);

   // print statistics
   for (i = 0; i < 52; i ++)
      printf("%c   %04u  %04u  %04u  %04u  %04u  %04u  %04u  %04u  %04u  %04u  %04u  %04u  %04u  %04u  %04u  %04u\\n",
             a52[i],
             stats[i][0]/10000,
             stats[i][1]/10000,
             stats[i][2]/10000,
             stats[i][3]/10000,
             stats[i][4]/10000,
             stats[i][5]/10000,
             stats[i][6]/10000,
             stats[i][7]/10000,
             stats[i][8]/10000,
             stats[i][9]/10000,
             stats[i][10]/10000,
             stats[i][11]/10000,
             stats[i][12]/10000,
             stats[i][13]/10000,
             stats[i][14]/10000,
             stats[i][15]/10000);

   return 0;
}

#6 Updated by Nick Saxon almost 19 years ago

@22033 - statistical analysis of ENCODE results:

This program prints a table where thr rows represents characters from the ENCODE output alphabet, A-Za-z, and the columns represents the output string positions, from 0 to 15.

The table cells are character counts for this character and this output position. In this original version, the program prints the counts divided by 10,000 for improved readability.

#7 Updated by Nick Saxon almost 19 years ago

@22034 - statistics, part 1 A-Z:

Histogram of ENCODE values from enc4-1.txt
A   0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012
B   0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012
C   0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012
D   0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012
E   0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012
F   0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012
G   0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012
H   0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012
I   0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012
J   0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012
K   0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012
L   0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012
M   0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012
N   0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012
O   0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012
P   0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012
Q   0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012
R   0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012
S   0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012
T   0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012
U   0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012
V   0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012
W   0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012
X   0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012
Y   0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012
Z   0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012

#8 Updated by Nick Saxon almost 19 years ago

@22035 - statistics, part 2 z-a:

a   0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116
b   0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116
c   0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116
d   0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116
e   0019  0019  0019  0019  0019  0019  0019  0019  0019  0019  0019  0019  0019  0019  0019  0019
f   0045  0045  0045  0045  0045  0045  0045  0045  0045  0045  0045  0045  0045  0045  0045  0045
g   0019  0019  0019  0019  0019  0019  0019  0019  0019  0019  0019  0019  0019  0019  0019  0019
h   0045  0045  0045  0045  0045  0045  0045  0045  0045  0045  0045  0045  0045  0045  0045  0045
i   0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116
j   0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116
k   0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116
l   0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116
m   0019  0019  0019  0019  0019  0019  0019  0019  0019  0019  0019  0019  0019  0019  0019  0019
n   0045  0045  0045  0045  0045  0045  0045  0045  0045  0045  0045  0045  0045  0045  0045  0045
o   0019  0019  0019  0019  0019  0019  0019  0019  0019  0019  0019  0019  0019  0019  0019  0019
p   0045  0045  0045  0045  0045  0045  0045  0045  0045  0045  0045  0045  0045  0045  0045  0045
q   0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012
r   0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012
s   0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012
t   0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012
u   0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012
v   0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012
w   0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012
x   0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012
y   0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012
z   0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012

#9 Updated by Nick Saxon almost 19 years ago

@22036 - statistics, part 3 - conclusions:

1. All characters are evenly distributed across string positions.

2. The alphabet is used unevenly. There are four groups:
- the least used characters (the score is 12): A-Z and q-z
- the most used characters (the score is 116): a-d, i-l
- the high use characters (the score is 45): f,h,n,p
- the low use characters (the score is 19): e,g,m,o

#10 Updated by Nick Saxon almost 19 years ago

@22037 - assumptions about the Progress implementation:

There is no exact information about the time frame the ENCODE function first appeared in Progress, but due to compatibility considerations. once it has been there, it could not be reimplemented using a different (better) hashing algorithm.

Assuming it's appeared around 1990-1991, what are the choices of good hashing functions?

1. UNIX has had crypt() function call since the day one. Although it is a decent one way encryption function, it suffers from one limitation: only the first 8 characters of the input are significant. Some additional manipulations are necessary to make it work adequately for longer inputs. There is also an inconvenience of transforming 11 bytes of the crypt() output into 16-byte encoding plus the need to guess what value of salt is to be used (4K variations).

2. MD2, M4, and, possibly, MD5. These are cryptographic hash functions, taking any number of bytes on input and producing 16 bytes of binary output. The output is immediately usable as indices into a translation table that produces the encoding out of the 52 character alphabet.

3. A newer modification of crypt(), known as crypt-md5. Takes any number of characters on input and produces a 22-character output. The same inconvenience to transform 22 bytes into 16 bytes is here.
-

#11 Updated by Nick Saxon almost 19 years ago

@22038 - trying crypt(), part 1:

The crypt() function call has one obvious advantage: it's been with UNIX from the origin. The questions with crypt() are:

- what value to use as the salt?

- how to transform 11 bytes from the alphabet A-Za-z0-9./ (64 characters) into the alphabet a-Za-z (52 characters)?

- how to handle long input strings?

#12 Updated by Nick Saxon almost 19 years ago

@22039 - trying crypt(), part 2:

I used 4-byte input to separate the issues.

crypt() output can be taken as whole bytes and used as 8 indices for a half of ENCODE output emulation.

The following program ran the simulation and printed the distribution of characters for comparison.

#define _XOPEN_SOURCE
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

long stats[64][11];
int main(int argc, char* argv[])
{
   char* a64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789./";
   long i = 0;
   int  k, j;
   int  i1 = 1, i2 = 1, i3 = 1, i4 = 1;
   char inp[5];
   char* crpt;
   char c;

   inp[4] = 0;

   /* main loop; generating 4-byte input strings */

   for (i = 1; i < 16777215; i ++)
   {
      if (i % 1000000 == 0)
         printf("%u records generated\\n", i);

      inp[0] = (char)i1;
      inp[1] = (char)i2;
      inp[2] = (char)i3;
      inp[3] = (char)i4;

      crpt = crypt(inp, "./") + 2;

      i4++;
      if (i4 == 256) 
      {
         i4 = 1;
         i3++;
         if (i3 == 256)
         {
            i3 = 1;
            i2++;
            if (i2 == 256)
               break;
         }
      }

      for (j = 0; j < 11; j ++)
      {
         c = *(crpt + j);
         k = strchr(a64, c) - a64;
         stats[k][j] ++;
      }
   }

   printf("Total %u records generated\\n", i);

   // print statistics
   for (i = 0; i < 64; i ++)
      printf("%c   %04u  %04u  %04u  %04u  %04u  %04u  %04u  %04u  %04u  %04u  %04u\\n",
             a64[i],
             stats[i][0]/10000,
             stats[i][1]/10000,
             stats[i][2]/10000,
             stats[i][3]/10000,
             stats[i][4]/10000,
             stats[i][5]/10000,
             stats[i][6]/10000,
             stats[i][7]/10000,
             stats[i][8]/10000,
             stats[i][9]/10000,
             stats[i][10]/10000);
   return 0;
}

#13 Updated by Nick Saxon almost 19 years ago

@22040 - trying crypt(), part 3:

The collected statistics were (A-Z):
A   0103  0000  0000  0025  0103  0000  0000  0025  0103  0000  0000  0025  0103  0000  0000  0025
B   0104  0000  0000  0000  0103  0000  0000  0000  0103  0000  0000  0000  0103  0000  0000  0026
C   0103  0000  0024  0000  0103  0000  0024  0000  0103  0000  0024  0000  0103  0000  0102  0025
D   0103  0000  0105  0000  0103  0000  0105  0000  0103  0000  0105  0000  0103  0000  0104  0026
E   0078  0000  0000  0000  0077  0000  0000  0000  0077  0000  0000  0000  0077  0000  0000  0025
F   0000  0000  0000  0000  0000  0000  0000  0000  0000  0000  0000  0000  0000  0000  0000  0026
G   0000  0000  0024  0000  0000  0000  0024  0000  0000  0000  0024  0000  0000  0000  0000  0025
H   0000  0000  0105  0025  0000  0000  0105  0025  0000  0000  0105  0026  0000  0000  0000  0025
I   0000  0013  0000  0025  0000  0013  0000  0026  0000  0013  0000  0025  0000  0027  0000  0025
J   0000  0068  0000  0025  0000  0068  0000  0025  0000  0068  0000  0025  0000  0055  0000  0025
K   0000  0102  0024  0025  0000  0103  0024  0025  0000  0103  0024  0025  0000  0110  0102  0026
L   0000  0075  0105  0025  0000  0075  0105  0026  0000  0075  0105  0025  0000  0082  0104  0025
M   0000  0102  0000  0025  0000  0103  0000  0025  0000  0102  0000  0026  0000  0109  0000  0026
N   0000  0076  0000  0025  0000  0075  0000  0026  0000  0076  0000  0025  0000  0055  0000  0025
O   0000  0000  0019  0026  0000  0000  0019  0025  0000  0000  0019  0025  0000  0000  0000  0025
P   0000  0000  0084  0025  0000  0000  0084  0025  0000  0000  0084  0025  0000  0000  0000  0025
Q   0000  0000  0000  0026  0000  0000  0000  0025  0000  0000  0000  0026  0000  0000  0000  0025
R   0000  0000  0000  0026  0000  0000  0000  0025  0000  0000  0000  0026  0000  0000  0000  0025
S   0000  0000  0009  0025  0000  0000  0009  0025  0000  0000  0009  0025  0000  0000  0051  0025
T   0000  0000  0041  0025  0000  0000  0041  0025  0000  0000  0041  0025  0000  0000  0052  0025
U   0000  0000  0000  0051  0000  0000  0000  0051  0000  0000  0000  0051  0000  0000  0000  0025
V   0000  0000  0000  0051  0000  0000  0000  0051  0000  0000  0000  0051  0000  0000  0000  0025
W   0000  0000  0009  0051  0000  0000  0009  0051  0000  0000  0009  0051  0000  0000  0000  0025
X   0000  0000  0042  0052  0000  0000  0042  0051  0000  0000  0042  0051  0000  0000  0000  0025
Y   0000  0011  0000  0051  0000  0012  0000  0051  0000  0012  0000  0051  0000  0024  0000  0025
Z   0000  0060  0000  0051  0000  0060  0000  0051  0000  0060  0000  0051  0000  0048  0000  0026

#14 Updated by Nick Saxon almost 19 years ago

@22041 - trying crypt(), part 4:

The collected statistics were (a-z):
a   0000  0090  0024  0051  0000  0090  0024  0052  0000  0090  0024  0051  0000  0097  0051  0051
b   0000  0067  0105  0077  0000  0066  0104  0077  0000  0067  0105  0077  0000  0072  0052  0051
c   0000  0103  0000  0077  0000  0102  0000  0077  0000  0103  0000  0077  0000  0121  0000  0051
d   0000  0127  0000  0077  0000  0127  0000  0078  0000  0127  0000  0077  0000  0097  0000  0051
e   0000  0090  0038  0078  0000  0091  0038  0077  0000  0091  0039  0077  0000  0097  0154  0052
f   0000  0066  0168  0077  0000  0066  0169  0077  0000  0066  0168  0077  0000  0073  0156  0051
g   0000  0090  0000  0051  0000  0090  0000  0052  0000  0091  0000  0052  0000  0097  0000  0052
h   0000  0066  0000  0025  0000  0067  0000  0026  0000  0067  0000  0025  0000  0048  0000  0051
i   0000  0000  0038  0026  0000  0000  0039  0025  0000  0000  0038  0025  0000  0000  0103  0051
j   0000  0000  0168  0025  0000  0000  0168  0025  0000  0000  0168  0025  0000  0000  0103  0051
k   0000  0000  0000  0025  0000  0000  0000  0025  0000  0000  0000  0025  0000  0000  0000  0051
l   0051  0000  0000  0025  0051  0000  0000  0025  0051  0000  0000  0025  0052  0000  0000  0052
m   0103  0000  0024  0025  0103  0000  0024  0026  0103  0000  0023  0025  0103  0000  0102  0025
n   0103  0000  0105  0025  0103  0000  0104  0025  0103  0000  0104  0025  0103  0000  0104  0025
o   0051  0000  0000  0025  0051  0000  0000  0025  0052  0000  0000  0025  0051  0000  0000  0025
p   0000  0000  0000  0025  0000  0000  0000  0025  0000  0000  0000  0026  0000  0000  0000  0026
q   0077  0000  0024  0025  0077  0000  0024  0025  0077  0000  0024  0025  0077  0000  0000  0025
r   0103  0000  0105  0025  0103  0000  0105  0025  0103  0000  0105  0025  0104  0000  0000  0025
s   0103  0013  0000  0026  0103  0013  0000  0025  0103  0013  0000  0026  0103  0027  0000  0026
t   0103  0069  0000  0026  0103  0068  0000  0026  0103  0068  0000  0026  0103  0054  0000  0025
u   0103  0103  0024  0026  0104  0103  0024  0025  0103  0103  0024  0025  0103  0109  0102  0025
v   0103  0076  0105  0026  0103  0075  0105  0026  0103  0075  0105  0026  0103  0082  0104  0025
w   0077  0102  0000  0025  0077  0103  0000  0025  0078  0103  0000  0025  0077  0110  0000  0025
x   0000  0075  0000  0025  0000  0075  0000  0026  0000  0075  0000  0026  0000  0055  0000  0025
y   0077  0000  0024  0026  0077  0000  0024  0025  0077  0000  0024  0025  0077  0000  0052  0025
z   0103  0000  0104  0025  0103  0000  0104  0026  0103  0000  0105  0025  0103  0000  0051  0025

#15 Updated by Nick Saxon almost 19 years ago

@22042 - trying crypt(), part 5:

The distribution is highly uneven across the output positions, which is no surprise since every two high order bits in every byte of crypt() output show a regular pattern (tested with more than one salt). Only the rest 6 bits of every byte are truely hashed. So, the next thing to try is a method of converting 66 hashed bits out of 11 bytes of crypt() output.

The crypt output is made of A-Za-z0-9./
By simply throwing away two high order bits, we get colliding 6 bits. So, instead, we use indices into this range of 64 characters, which are 6-bit wide, of course. Additional question arises: is the order of characters important?

The answer is yes. A-Z-az0-9./ and a-zA-Z0-9./ would yield different results. We get another degree of freedom (max 64! permutations of this input translation table).

#16 Updated by Nick Saxon almost 19 years ago

@22044 - trying crypt(), part 6:

Nevertheless, it's important to check how close the character distribution would be for a typical implementation of crypt(). Below is the code.

/* This program generates simulated ENCODE results (by using tran2 method)
   and calculates the histogram.
*/
#define _XOPEN_SOURCE
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

char* tran2(char* cr);
char* tran4(char* cr1, char* cr2);
int a64i(char a64c);

long stats[52][16];
int main(int argc, char* argv[])
{
   char* a52 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
   long i = 0;
   int  k, j;
   int  i1 = 1, i2 = 1, i3 = 1, i4 = 1;
   char inp[5];
   char* crpt;
   char* crp2;
   char crypted[23];
   char* encd;
   char c;

   inp[4] = 0;

   /* main loop; generating 4-byte input strings */

   for (i = 1; i < 16777215; i ++)
   {
      if (i % 1000000 == 0)
         fprintf(stderr, "%u records generated\\n", i);

      inp[0] = (char)i1;
      inp[1] = (char)i2;
      inp[2] = (char)i3;
      inp[3] = (char)i4;

      crpt = crypt(inp, "./") + 2;
      crp2 = crypt(inp, "/.") + 2;
      strcpy(crypted, crpt);
      strcat(crypted, crp2);
//    encd = tran2(crpt);
      encd = tran4(crpt, crp2);

      i4++;
      if (i4 == 256) 
      {
         i4 = 1;
         i3++;
         if (i3 == 256)
         {
            i3 = 1;
            i2++;
            if (i2 == 256)
               break;
         }
      }

      for (j = 0; j < 16; j ++)
      {
         c = *(encd + j);
         k = strchr(a52, c) - a52;
         stats[k][j] ++;
      }
   }

   fprintf(stderr, "Total %u records generated\\n", i);

   // print statistics
   // print statistics
   for (i = 0; i < 52; i ++)
      printf("%c   %04u  %04u  %04u  %04u  %04u  %04u  %04u  %04u  %04u  %04u  %04u  %04u  %04u  %04u  %04u  %04u\\n",
             a52[i],
             stats[i][0]/10000,
             stats[i][1]/10000,
             stats[i][2]/10000,
             stats[i][3]/10000,
             stats[i][4]/10000,
             stats[i][5]/10000,
             stats[i][6]/10000,
             stats[i][7]/10000,
             stats[i][8]/10000,
             stats[i][9]/10000,
             stats[i][10]/10000,
             stats[i][11]/10000,
             stats[i][12]/10000,
             stats[i][13]/10000,
             stats[i][14]/10000,
             stats[i][15]/10000);

   return 0;
}

/* converts a byte of crypt output into 6-bit index into the 64-char table */
int a64i(char a64c)
{

   static char* a64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZ" 
                      "abcdefghijklmnopqrstuvwxyz" 
                      "0123456789./";
   return index(a64, a64c) - a64;
}

#17 Updated by Nick Saxon almost 19 years ago

@22045 - trying crypt(), part 7:

/*
   transformation 4: 22 bytes on input provide 132 significant bits
   (every bytes sypplies 6 least significant bits).
   128 bits (64 bits from every 66) are used as 16 8-bit indices into a 
   translation table. The latter contains:
   - 2 groups of A-Z
   - 2 groups of q-z
   - 3 groups of e,g,m,o
   - 7 groups of f,h,n,p,
   - 18 groups of a,b,c,d,i,j,k,l
*/
char* tran4(char* cr1, char* cr2)
{
   static char tran4s[17];
   static char* tab4 = 
                       "ABCD" "EFGH" "IJKL" "MNOP" 
                       "QRST" "UVWX" "YZqr" "stuv" 
                       "wxyz" "egmo" "egmo" "egmo" 
                       "ABCD" "EFGH" "IJKL" "MNOP" 
                       "QRST" "UVWX" "YZqr" "stuv" 
                       "wxyz" "fhnp" "fhnp" "fhnp" 
                       "fhnp" "fhnp" "fhnp" "fhnp" 
                       "abcd" "ijkl" "abcd" "ijkl" 
                       "abcd" "ijkl" "abcd" "ijkl" 
                       "abcd" "ijkl" "abcd" "ijkl" 
                       "abcd" "ijkl" "abcd" "ijkl" 
                       "abcd" "ijkl" "abcd" "ijkl" 
                       "abcd" "ijkl" "abcd" "ijkl" 
                       "abcd" "ijkl" "abcd" "ijkl" 
                       "abcd" "ijkl" "abcd" "ijkl" 
                       "abcd" "ijkl" "abcd" "ijkl" 
                     ;
   char *t = tran4s;
   int i, j, ind;
   unsigned long u;

   unsigned char* c;

   /* 3 groups of 4 bytes from input 1 (including terminating 0) */
   /* converting every group into 3 bytes but using only 8 bytes */
   for (c = (unsigned char*)cr1, i = 0; i < 3; i ++)
   {
      u = ((unsigned long)(a64i(*c)) & 0x0000003f) << 18;
      c ++;
      u += ((unsigned long)(a64i(*c)) & 0x0000003f) << 12;
      c ++;
      u += ((unsigned long)(a64i(*c)) & 0x0000003f) << 6;
      c ++;
      u += (unsigned long)(a64i(*c)) & 0x0000003f;
      c++;

      /* converting 32 bits into 24 bits */
      for (j = 0; j < 3; j ++)
      {
         ind = (int)(u & 0x000000ff);
         *(t + 2 - j) = *(tab4 + ind);
         u >>= 8;
      }
      t += 3;
   }
   t = tran4s + 8;   

   /* 3 groups of 4 bytes from input 2 (including terminating 0) */
   /* converting every group into 3 bytes but using only 8 bytes */
   for (c = (unsigned char*)cr1, i = 0; i < 3; i ++)
   {
      u = ((unsigned long)(a64i(*c)) & 0x0000003f) << 18;
      c ++;
      u += ((unsigned long)(a64i(*c)) & 0x0000003f) << 12;
      c ++;
      u += ((unsigned long)(a64i(*c)) & 0x0000003f) << 6;
      c ++;
      u += (unsigned long)(a64i(*c)) & 0x0000003f;
      c++;

      /* converting 32 bits into 24 bits */
      for (j = 0; j < 3; j ++)
      {
         ind = (int)(u & 0x000000ff);
         *(t + 2 - j) = *(tab4 + ind);
         u >>= 8;
      }
      t += 3;
   }

   tran4s[16] = 0;

   return tran4s;
}

#18 Updated by Nick Saxon almost 19 years ago

@22046 - trying crypt(), part 8:

This is how the statistics looked like (A-Z):
A   0012  0012  0012  0013  0013  0013  0012  0012  0012  0012  0012  0013  0013  0013  0012  0012
B   0012  0012  0012  0012  0012  0012  0012  0013  0012  0012  0012  0012  0012  0012  0012  0013
C   0012  0012  0012  0012  0012  0013  0013  0012  0012  0012  0012  0012  0012  0013  0013  0012
D   0013  0013  0013  0012  0013  0012  0012  0013  0013  0013  0013  0012  0013  0012  0012  0013
E   0012  0013  0012  0013  0012  0012  0013  0012  0012  0013  0012  0013  0012  0012  0013  0012
F   0012  0012  0013  0012  0012  0013  0012  0013  0012  0012  0013  0012  0012  0013  0012  0013
G   0012  0012  0012  0013  0012  0012  0012  0012  0012  0012  0012  0013  0012  0012  0012  0012
H   0013  0012  0012  0013  0012  0012  0012  0012  0013  0012  0012  0013  0012  0012  0012  0012
I   0013  0013  0013  0012  0012  0012  0012  0013  0013  0013  0013  0012  0012  0012  0012  0013
J   0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012  0012
K   0012  0012  0012  0012  0012  0013  0013  0013  0012  0012  0012  0012  0012  0013  0013  0013
L   0012  0013  0012  0012  0012  0012  0013  0012  0012  0013  0012  0012  0012  0012  0013  0012
M   0012  0012  0012  0013  0012  0012  0012  0013  0012  0012  0012  0013  0012  0012  0012  0013
N   0013  0013  0012  0012  0012  0013  0013  0012  0013  0013  0012  0012  0012  0013  0013  0012
O   0012  0012  0012  0013  0012  0012  0013  0013  0012  0012  0012  0013  0012  0012  0013  0013
P   0012  0013  0012  0013  0012  0012  0013  0012  0012  0013  0012  0013  0012  0012  0013  0012
Q   0013  0012  0012  0013  0013  0012  0012  0013  0013  0012  0012  0013  0013  0012  0012  0013
R   0013  0012  0013  0012  0013  0012  0012  0013  0013  0012  0013  0012  0013  0012  0012  0013
S   0013  0013  0013  0013  0012  0012  0013  0012  0013  0013  0013  0013  0012  0012  0013  0012
T   0012  0012  0012  0013  0012  0012  0012  0012  0012  0012  0012  0013  0012  0012  0012  0012
U   0012  0013  0012  0012  0012  0012  0013  0012  0012  0013  0012  0012  0012  0012  0013  0012
V   0012  0012  0013  0012  0013  0013  0012  0012  0012  0012  0013  0012  0013  0013  0012  0012
W   0013  0012  0012  0012  0012  0012  0013  0012  0013  0012  0012  0012  0012  0012  0013  0012
X   0012  0012  0012  0013  0012  0013  0012  0012  0012  0012  0012  0013  0012  0013  0012  0012
Y   0012  0013  0012  0013  0012  0012  0013  0012  0012  0013  0012  0013  0012  0012  0013  0012
Z   0012  0012  0013  0012  0012  0013  0012  0012  0012  0012  0013  0012  0012  0013  0012  0012

#19 Updated by Nick Saxon almost 19 years ago

@22047 - trying crypt(), part 9:

This is how the statistics looked like (a-z):
a   0115  0116  0117  0116  0115  0116  0116  0116  0115  0116  0117  0116  0115  0116  0116  0116
b   0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116
c   0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116
d   0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116
e   0019  0019  0019  0019  0019  0019  0019  0019  0019  0019  0019  0019  0019  0019  0019  0019
f   0045  0045  0045  0045  0045  0045  0044  0045  0045  0045  0045  0045  0045  0045  0044  0045
g   0019  0019  0019  0019  0019  0019  0019  0019  0019  0019  0019  0019  0019  0019  0019  0019
h   0045  0045  0045  0045  0045  0045  0045  0045  0045  0045  0045  0045  0045  0045  0045  0045
i   0117  0116  0116  0116  0116  0116  0116  0116  0117  0116  0116  0116  0116  0116  0116  0116
j   0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116
k   0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116
l   0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116  0116
m   0019  0019  0019  0019  0019  0019  0019  0019  0019  0019  0019  0019  0019  0019  0019  0019
n   0045  0045  0045  0045  0045  0045  0045  0045  0045  0045  0045  0045  0045  0045  0045  0045
o   0019  0019  0019  0019  0019  0019  0019  0019  0019  0019  0019  0019  0019  0019  0019  0019
p   0045  0045  0045  0045  0045  0045  0045  0045  0045  0045  0045  0045  0045  0045  0045  0045
q   0012  0012  0013  0013  0013  0012  0013  0013  0012  0012  0013  0013  0013  0012  0013  0013
r   0012  0013  0013  0012  0013  0013  0013  0013  0012  0013  0013  0012  0013  0013  0013  0013
s   0012  0013  0012  0013  0013  0012  0013  0012  0012  0013  0012  0013  0013  0012  0013  0012
t   0012  0012  0012  0012  0012  0012  0012  0013  0012  0012  0012  0012  0012  0012  0012  0013
u   0013  0012  0012  0012  0012  0012  0012  0012  0013  0012  0012  0012  0012  0012  0012  0012
v   0012  0012  0012  0013  0012  0012  0013  0012  0012  0012  0012  0013  0012  0012  0013  0012
w   0012  0012  0013  0012  0012  0012  0012  0012  0012  0012  0013  0012  0012  0012  0012  0012
x   0012  0013  0012  0013  0012  0013  0013  0012  0012  0013  0012  0013  0012  0013  0013  0012
y   0012  0012  0012  0012  0013  0012  0013  0012  0012  0012  0012  0012  0013  0012  0013  0012
z   0012  0013  0013  0013  0012  0013  0012  0012  0012  0013  0013  0013  0012  0013  0012  0012

#20 Updated by Nick Saxon almost 19 years ago

@22048 - trying crypt(), part 10:

It looks extremely close to ENCODE. So, the at least the final translation step should use something like this (of course, positions may vary greatly):

   static char* tab4 = 
                       "ABCD" "EFGH" "IJKL" "MNOP" 
                       "QRST" "UVWX" "YZqr" "stuv" 
                       "wxyz" "egmo" "egmo" "egmo" 
                       "ABCD" "EFGH" "IJKL" "MNOP" 
                       "QRST" "UVWX" "YZqr" "stuv" 
                       "wxyz" "fhnp" "fhnp" "fhnp" 
                       "fhnp" "fhnp" "fhnp" "fhnp" 
                       "abcd" "ijkl" "abcd" "ijkl" 
                       "abcd" "ijkl" "abcd" "ijkl" 
                       "abcd" "ijkl" "abcd" "ijkl" 
                       "abcd" "ijkl" "abcd" "ijkl" 
                       "abcd" "ijkl" "abcd" "ijkl" 
                       "abcd" "ijkl" "abcd" "ijkl" 
                       "abcd" "ijkl" "abcd" "ijkl" 
                       "abcd" "ijkl" "abcd" "ijkl" 
                       "abcd" "ijkl" "abcd" "ijkl" 
                     ;

This table provides the observed distribution. The variations of distribution may be due to the difference in the input translation table or the crypt() salt used.

#21 Updated by Nick Saxon almost 19 years ago

@22049 - trying crypt(), part 11:

The next step is an attempt to find the salt and the input translation table that make the distribution to match precisely that of ENCODE. The code below tries that. It also tries to reconstruct the output translation table comparing side by side the emulated encryption and the true ENCODE output.

/* This program tries all 4K possible salt variations that feed the crypt()
   call used to produce (hypothetically) the first 8 characters of
   the ENCODE result.
   For every salt tried, the file with the known ENCODE results is used to
   try the translation table reconstruction. If a conflict is detected,
   the salt is rejected and the next one is tried. No conflicts found means
   this salt IS the one used in ENCODE and the translation table is built
   automatically.
*/
#define _XOPEN_SOURCE
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

int trySalt(char* salt, char* filename);
int a64i(char a64c);

// This main program controls the main loop over all possible salt variations 
int main(int argc, char* argv[])
{
   static char* a64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789./";
   int i, j, k;
   char salt[3];
   char* crpt;

   // generating all possible variations of salt
   salt[2] = 0;
   for (i = 0; i < 64; i ++)
   {
      for (j = 0; j < 64; j ++)
      {
         salt[0] = *(a64 + i);
         salt[1] = *(a64 + j);
         k = trySalt(salt, argv[1]);
         printf("Salt %s tried. rc is %u\\n", salt, k);
         if (k == 0 || k == 3)
            break;
      }
      if (k == 0 || k == 3)
         break;
   }

   return 0;
}

/* This function tries all records of the file using given salt to see if
   crypt(record, salt) can construct a non-colliding translation table.

   Return code:
   0 - success. the successfully built translation table is printed out.
   1 - conflict: more that one charactes points at a table cell.
   2 - conflict: ran out of stock for a character.
   3 - couldn't open the file
*/
int trySalt(char* salt, char* filename)
{
   // translation table being built. 0 - unoccupied; x - occupied by x
   int table[256];
   int i = 0, j, k = 0, in = 0;

   // these are all allowed characters in encoded string 
   char* a52 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

   // these counts below reflect the available stocks for every character 
   int stock[52] = {  2,   2,   2,   2,   2,   2,   2,   2,// A B C D E F G H
                      2,   2,   2,   2,   2,   2,   2,   2,// I J K L M N O P
                      2,   2,   2,   2,   2,   2,   2,   2,// Q R S T U V W X
                      2,   2,  18,  18,  18,  18,   3,   7,// Y Z a b c d e f
                      3,   7,  18,  18,  18,  18,   3,   7,// g h i j k l m n
                      3,   7,   2,   2,   2,   2,   2,   2,// o p q r s t u v
                      2,   2,   2,   2                     // w x y z
                   };

   FILE* f;
   char rec[34];
   char inp[5];
   char* crpt;
   char c;
   int  i1, i2, i3, i4;
   unsigned long u;
   unsigned char* cc;
   int ind[9];    // one extra cell is needed temporarily
   int idx;
   int *t;
   int diff = 0, total = 0;

   for (i = 0; i < 256; i ++)
      table[i] = 0;

   f = fopen(filename, "r");
   if (f == NULL)
      return 3;

   inp[4] = 0;

   /* iterate through records of the file */
   while(fread(rec, 33, 1, f) == 1)
   {
      in++;

#22 Updated by Nick Saxon almost 19 years ago

@22051 - trying crypt(), part 12:

// encrypt the input 
      sscanf(rec, "%u %u %u %u", &i1, &i2, &i3, &i4);
      inp[0] = (char)i1;
      inp[1] = (char)i2;
      inp[2] = (char)i3;
      inp[3] = (char)i4;
      crpt = crypt(inp, salt) + 2;

      // turn the encrypted input into 8 indices ind[i]:
      // 3 groups of 4 bytes (including terminating 0)
      // converting every group into 3 bytes but using only 8 bytes
      for (cc = (unsigned char*)crpt, t = ind, i = 0; i < 3; i ++)
      {
         u = ((unsigned long)(a64i(*cc)) & 0x0000003f) << 18;
         cc ++;
         u += ((unsigned long)(a64i(*cc)) & 0x0000003f) << 12;
         cc ++;
         u += ((unsigned long)(a64i(*cc)) & 0x0000003f) << 6;
         cc ++;
         u += (unsigned long)(a64i(*cc)) & 0x0000003f;
         cc ++;

         // converting 32 bits into 24 bits
         for (j = 0; j < 3; j ++)
         {
            idx = (int)(u & 0x000000ff);
            *(t + 2 - j) = idx;
            u >>= 8;
         }
         t += 3;
      }

      // inspect first 8 characters of the ENCODED string 
      for (j = 0; j < 8; j ++)
      {
         c = rec[16 + j];
         // character c should correspond to the index ind[j]
         if (table[ind[j]] != 0 && table[ind[j]] != c)
         {
            printf("  scanned %u records; diff %u, total %u, char %c conflicts with %c\\n",
                    in, diff, total, c, table[ind[j]]);
            fclose(f);   
            return 1;      // conflicting characters
         }
         if (table[ind[j]] == c)
         {
            total ++;
            continue;      // repeatable characters
         }
         // this is the first character for the cell
         idx = index(a52, c) - a52;
         if (stock[idx] == 0)
         {
            printf("  scanned %u records; diff %u, total %u, char %c is used up\\n",
                   in, diff, total, c);
            fclose(f);   
            return 2;      // these characters are used up
         }
         stock[idx] --;
         table[ind[j]] = c;
         total ++;
         diff ++;
      }
   }
   fclose(f);   

   return 0;
}

#23 Updated by Nick Saxon almost 19 years ago

@22052 - trying crypt(), part 13:

// converts a byte of crypt output into 6-bit index into the 64-char table
int a64i(char a64c)
{
/* version 1
   static char* a64 = 
                      "ABCDEFGHIJKLMNOPQRSTUVWXYZ" 
                      "abcdefghijklmnopqrstuvwxyz" 
                      "0123456789" 
                      "./" 
                    ;
*/                      
/* version 2
   static char* a64 = 
                      "./" 
                      "0123456789" 
                      "ABCDEFGHIJKLMNOPQRSTUVWXYZ" 
                      "abcdefghijklmnopqrstuvwxyz" 
                    ;
*/                      
/* version 3
   static char* a64 = 
                      "./" 
                      "0123456789" 
                      "abcdefghijklmnopqrstuvwxyz" 
                      "ABCDEFGHIJKLMNOPQRSTUVWXYZ" 
                    ;
*/                      
/* version 4
   static char* a64 = 
                      "0123456789" 
                      "./" 
                      "abcdefghijklmnopqrstuvwxyz" 
                      "ABCDEFGHIJKLMNOPQRSTUVWXYZ" 
                    ;
*/                      
/* version 5
   static char* a64 = 
                      "0123456789" 
                      "./" 
                      "ABCDEFGHIJKLMNOPQRSTUVWXYZ" 
                      "abcdefghijklmnopqrstuvwxyz" 
                    ;
*/                      
/* version 6
   static char* a64 = 
                      "abcdefghijklmnopqrstuvwxyz" 
                      "ABCDEFGHIJKLMNOPQRSTUVWXYZ" 
                      "0123456789" 
                      "./" 
                    ;

   return index(a64, a64c) - a64;
*/
/* version 7 - just the least significant 6 bits:
   return a64c & 0x3f;
*/   
// version 8 - resequencing & using the least significant 6 bits
   if (a64c == '.')
      a64c = '@';
   else
      if (a64c >= '/' && a64c <= '4')
         a64c += 44;
      else
         if (a64c >= '5' && a64c <= '9')
            a64c += 70;
   return a64c & 0x3f;
}

All tried variations of the input translation are shown here as well.

#24 Updated by Nick Saxon almost 19 years ago

@22053 - trying crypt(), part 14:

This program runs a loop over all 4K variations of salt and tries to construct a translation table which has no conflicts. A conflict is detected whenever:
- a numerical index points at more than one character cell in the table;
- a character requires placement into the table beyond existing limits (to maintain the required statistical distribution).

The results are represented in the files res1.txt through res8.txt on phantom, /home/nvs.

Briefly, there was no salt found, that would allow building a non-conflicting translation table for all tried input translations and salt combinations. Typical number of records requred to hit a conflict was just a few. The maximum number of lines ever scanned was just 9!

It was easy to try all 4K combinations of salt. However, it was not an exaustive search. There were some unknowns:
- how exactly we pick the raw bytes from one or more crypted strings (too many possible variations to try);
- 64! variations of the input translation fom the significant bits into indices.

The expectation was to try something obvious and easy and see if it matched. It didn't work out well.

#25 Updated by Nick Saxon almost 19 years ago

@22055 - trying MDx, part 1:

MD2, MD4 and MD5 all produce 16 bytes of binary data, which are directly suitable for use as indices into the output translation table. MD5 is the strongest function among them, but chances are it wasn't available yet by the time Progress implemented ENCODE.

The beauty of MDx algorithms comparing with crypt() is they don't need an unknown salt parameter nor do they require any input translation. it would a perfect match for ENCODE.

#26 Updated by Nick Saxon almost 19 years ago

@22056 - trying MDx, part 2:

Having no convenient source in either C or Java for the MDx functions, i've decided to create separate files of preprocessed inputs: md2.txt, md4.txt amd md5.txt using jacksum package.

jacksum allows various checksum, CRC and cryptographic hash calculations. This is the driver script I created to run jacksum.

/* creating a file with MDx hash results */

parse arg mdx

/* loop for all i1 i2 i3 i4 from 1 to 255 */
i1 = 1
i2 = 1
i3 = 1
i4 = 1

do i = 1 to 16777215

   if i // 1000000 == 0 then
      say i "records generated" 

   inp = "dec:"i1","i2","i3","i4
   "java -jar jacksum.jar -a" mdx "-q" inp ">>"mdx".txt" 

   i4 = i4 + 1
   if i4 == 256 
   then do
      i4 = 1
      i3 = i3 + 1
      if i3 == 256
      then do
         i3 = 1;
         i2 = i2 + 1
         if i2 == 256 then leave
      end
   end

end

I've run this script as
 rexx mdx.rexx md2
 rexx mdx.rexx md4
 rexx mdx.rexx md5

Unfortunately, this utility is slow so I didn't produce the full range of encryptions. For the rough first pass of testing I needed very few records in each file.

#27 Updated by Nick Saxon almost 19 years ago

@22057 - trying MDx, part 3:

The modified test program for use MDx files follows.

/* This program tries encrypted input from the file which typically is  
   the result of one of the MDx hashing functions.
   Another file with the known ENCODE results is used to
   try the translation table reconstruction. If a conflict is detected,
   the algorithm is rejected. No conflicts found means
   this algorithm IS the one used in ENCODE and the translation table is built
   automatically.
*/
#define _XOPEN_SOURCE
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

int tryAlgo(FILE* fenc, char* filename);
unsigned char hex(unsigned char h);

// This main program controls the main loop over all records of the MDx file 
int main(int argc, char* argv[])
{
   FILE* f;
   int k;

   f = fopen(argv[1], "r");
   if (f == NULL)
   {
      printf("can't open %s\\n", argv[1]);
      return 3;
   }

   k = tryAlgo(f, argv[2]);
   printf("File %s tried. rc is %u\\n", argv[1], k);

   return k;
}

/* This function tries all records of the file using given encryption to see 
   if it can construct a non-colliding translation table.

   Return code:
   0 - success. the successfully built translation table is printed out.
   1 - conflict: more that one charactes points at a table cell.
   2 - conflict: ran out of stock for a character.
   3 - couldn't open the file
*/
int tryAlgo(FILE* fenc, char* filename)
{
   // translation table being built. 0 - unoccupied; x - occupied by x
   int table[256];
   int i = 0, j, k = 0, in = 0;

   // these are all allowed characters in encoded string 
   char* a52 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

   // these counts below reflect the available stocks for every character 
   int stock[52] = {  2,   2,   2,   2,   2,   2,   2,   2,// A B C D E F G H
                      2,   2,   2,   2,   2,   2,   2,   2,// I J K L M N O P
                      2,   2,   2,   2,   2,   2,   2,   2,// Q R S T U V W X
                      2,   2,  18,  18,  18,  18,   3,   7,// Y Z a b c d e f
                      3,   7,  18,  18,  18,  18,   3,   7,// g h i j k l m n
                      3,   7,   2,   2,   2,   2,   2,   2,// o p q r s t u v
                      2,   2,   2,   2                     // w x y z
                   };

   FILE* f;
   char enc[34];
   char rec[34];
   char inp[5];
   char* crpt;
   char c;
   int  i1, i2, i3, i4;
   unsigned long u;
   unsigned char* cc;
   unsigned char ind[17];
   int idx;
   int *t;
   int diff = 0, total = 0;

   for (i = 0; i < 256; i ++)
      table[i] = 0;

   f = fopen(filename, "r");
   if (f == NULL)
      return 3;

   inp[4] = 0;

   /* iterate through records of the file */
   while(fread(rec, 33, 1, f) == 1)
   {
      in++;

      // get encryption from the file
      if (fread(enc, 33, 1, fenc) != 1)
      {
         printf("End of encrypted input after %u records\\n", in --);
         return 4;
      }

      // turn the encrypted input into 16 indices ind[i]:
      for (i = 0; i < 16; i ++)
      {
         c = enc[2 * i];
         ind[i] = 16 * hex(c);
         c = enc[2 * i + 1];
         ind[i] += hex(c);
      }

#28 Updated by Nick Saxon almost 19 years ago

@22058 - trying MDx, part 4:

enc[32] = 0;
      printf("Rec %s, hex1 %u hex2 %u\\n", enc, hex(enc[0]), hex(enc[1]));
      printf("Rec %s\\n", enc);
      printf("Rec %u indices %03u %03u %03u %03u %03u %03u %03u %03u %03u %03u %03u %03u %03u %03u %03u %03u\\n",
             in,  ind[0], ind[1], ind[2], ind[3], ind[4], ind[5], ind[6], ind[7], ind[8],
             ind[9], ind[10], ind[11], ind[12], ind[13], ind[14], ind[15]);

      // inspect all 16 characters of the ENCODED string 
      for (j = 0; j < 16; j ++)
      {
         c = rec[16 + j];
         // character c should correspond to the index ind[j]
         if (table[ind[j]] != 0 && table[ind[j]] != c)
         {
            printf("  scanned %u records; diff %u, total %u, char %c conflicts with %c\\n",
                    in, diff, total, c, table[ind[j]]);
            fclose(f);   
            return 1;      // conflicting characters
         }
         if (table[ind[j]] == c)
         {
            total ++;
            continue;      // repeatable characters
         }
         // this is the first character for the cell
         idx = index(a52, c) - a52;
         if (stock[idx] == 0)
         {
            printf("  scanned %u records; diff %u, total %u, char %c is used up\\n",
                   in, diff, total, c);
            fclose(f);   
            return 2;      // these characters are used up
         }
         stock[idx] --;
         table[ind[j]] = c;
         total ++;
         diff ++;
      }
   }
   fclose(f);   

   return 0;
}

// converts a hex char into an integer
unsigned char hex(unsigned char h)
{
   if (h >= 'a')
      h = h - 'a' + 10;
   else
      if (h >= 'A')
         h = h - 'A' + 10;
      else
         h = h - '0';
   return h;
}

#29 Updated by Nick Saxon almost 19 years ago

@22059 - trying MDx, part 5:

The results of all MDx tests were similar.

./tmdx md2.txt enc4-1.txt
Rec bd86b73d26b8f71598f8986363e95afd, hex1 11 hex2 13
Rec bd86b73d26b8f71598f8986363e95afd
Rec 1 indices 189 134 183 061 038 184 247 021 152 248 152 099 099 233 090 253
  scanned 1 records; diff 10, total 10, char f conflicts with l
File md2.txt tried. rc is 1

./tmdx md4.txt enc4-1.txt
Rec 9709c981e59b4ec93c6e1cdcf248b32b, hex1 9 hex2 7
Rec 9709c981e59b4ec93c6e1cdcf248b32b
Rec 1 indices 151 009 201 129 229 155 078 201 060 110 028 220 242 072 179 043
  scanned 1 records; diff 7, total 7, char n conflicts with Z
File md4.txt tried. rc is 1

./tmdx md5.txt enc4-1.txt
Rec 3b5b9852567ef7618aac7f5f2d74ef74, hex1 3 hex2 11
Rec 3b5b9852567ef7618aac7f5f2d74ef74
Rec 1 indices 059 091 152 082 086 126 247 097 138 172 127 095 045 116 239 116
  scanned 1 records; diff 15, total 15, char e conflicts with a
File md5.txt tried. rc is 1

In all cases, the algorithms couldn't stand even a single record, which is worse than with crypt(), although this is not a good measure to basy any decision on.

#30 Updated by Nick Saxon almost 19 years ago

@22060 - possible next steps:

Since we consider md5 as an alternative, we should consider crypt-md5 as well. crypt-md5 is slightly better choice of hashing function than the plain crypt.

Pros:
All input characters are significant and the result is 132 bits from where 128 bits needed to create 16 indices are easily extractable.

Cons:
It takes an arbitrary string as the salt, although I don't understand where that salt fits.

#31 Updated by Redmine Admin almost 19 years ago

@22070 - Status changed from WIP to Hold

#32 Updated by Redmine Admin almost 13 years ago

@45256 - Status changed from Hold to Pending

#33 Updated by Greg Shah about 12 years ago

Imported from JPRM on 2012-04-12 11:23:22.035:

TASKID      = 3762
PROJECTID   = 125
STATUS      = 2 (Pending)
DESCRIPTION = implement a compatible ENCODE built-in function
OWNER       = GES
ASSIGNEE    = 
BILLABLE    = false
PRIORITY    = 5 (Normal)
PHASE       = 31 (Development)
COMPONENT   = 97
ESTEFFORT   = 0.000000
ESTSTART    = 2005-07-30
ESTSTOP     = 2005-08-10
ACTSTOP     = 
CASENUM     = 
VENDORID    = 
COMMENT     = ''
LASTWIP     = 2005-08-09

#34 Updated by Greg Shah over 11 years ago

Status changed from Hold to WIP
Estimated time set to 160.00

#35 Updated by Greg Shah over 11 years ago

Project changed from Core Development to Base Language

#36 Updated by Greg Shah over 11 years ago

Target version set to Milestone 7

#37 Updated by Greg Shah about 11 years ago

Target version changed from Milestone 7 to Milestone 11

#38 Updated by Greg Shah about 11 years ago

File stack_overflow_article_11742310_progress_10.1C_4gl_encode_function.pdf added
File progress_knowledge_base_article_21685_one_way_uniqueness_of_progress_4gl_encode_function.pdf added
File progress_knowledge_base_article_10715_what_is_the_algorith_used_by_encode_4gl_function.pdf added

Some useful information:

http://knowledgebase.progress.com/articles/Article/21685
http://knowledgebase.progress.com/articles/Article/P111508
http://stackoverflow.com/questions/11742310/progress-10-1c-4gl-encode-function

Progress reports that they use a non-standard CRC-16 algorithm.

#39 Updated by Greg Shah about 11 years ago

More resources:

http://en.wikipedia.org/wiki/Cyclic_redundancy_check
http://stackoverflow.com/questions/10564491/function-to-calculate-a-crc16-checksum
http://www.cosc.canterbury.ac.nz/greg.ewing/essays/CRC-Reverse-Engineering.html

The 2nd article has some useful notes about variations in CRC algorithms that use the same polynomial. The 3rd article defines a useful mathematical approach to determine the CRC algorithm using only valid combinations of inputs and outputs. It is well known that CRC algorithms can be discovered just by knowing some valid output. Our problem is that we don't strictly have the exact outputs, but a transformed output from a 52-character alphabet.

#40 Updated by Greg Shah about 11 years ago

File crypto_pet_peeves_hashing_encoding_its_all_the_same_right_patrick_toomey_neohapsis_20080825.pdf added

And there is this:

http://labs.neohapsis.com/2008/08/25/crypto-pet-peeves-hashingencodingits-all-the-same-right/

#41 Updated by Greg Shah about 11 years ago

Due date changed from 08/10/2005 to 10/25/2013
Start date changed from 07/30/2005 to 08/17/2013

#42 Updated by Greg Shah about 10 years ago

Target version changed from Milestone 11 to 23

#43 Updated by Greg Shah almost 10 years ago

File crc_v3.txt added

The following link (see attached for the actual text file that is linked) is a really useful primer on the theory behind CRC algorithms.

http://www.ross.net/crc/download/crc_v3.txt

#44 Updated by Greg Shah almost 10 years ago

File lgpl_c_sharp_encode_source_20140902.pdf added
File stack_overflow_11742310_progress_10.1c_4gl_encode_function_updated_20140902.pdf added
File stack_exchange_how_can_i_reverse_engineer_a_hash_code_4gl_encode_20140902.pdf added
File lgpl_github_project_for_progress_encode_20140902.pdf added

New/updated articles have been found:

http://programmers.stackexchange.com/questions/206645/how-can-i-reverse-engineer-a-hash-code

The interesting thing that is linked in the above article is that the author actually "was able to extract the algorithm" and has created an LGPL licensed C# implementation that appears to be compatible!

I don't know the means by which he extracted the algorithm, but it really doesn't matter since the algorithm has now been published for the world to see. That means that the algorithm is no longer a trade secret. In other words, the "genie" has been let out of the bottle.

We won't use the LGPL source code (attached here) directly. Instead, I will analyze the open source code and use that analysis to document the algorithm. Then we will write our own version from scratch. This way it doesn't matter whether or not the algorithm was obtained in a legal manner, since we are only using publicly available information to implement our own version.

https://github.com/pvginkel/ProgressEncode
https://github.com/pvginkel/ProgressEncode/blob/master/ProgressEncode/Progress.cs
http://programmers.stackexchange.com/users/5769/pieter-van-ginkel
http://webathome.org/

#45 Updated by Greg Shah almost 10 years ago

Here is the Progress 4GL compatible ENCODE algorithm specification.

Summary

The function takes a source byte array (of any size) and generates a 16-byte one-way hash using a proprietary approach combined with a variant of CRC-16.

General Description

The basic approach is to spread out the source data in an intermediate accumulator array of 16 unsigned bytes and to repeatedly CRC that data
and copy the resulting 2-byte CRC into successive elements of this intermediate accumulator, while building the next CRC results on the accumulated CRC value and the recently modified intermediate accumulator. This set of CRCs is calculated 8 times (since that will generate 16 bytes).

This basic approach described above is executed 5 times in a row. Each byte of the resulting hashed intermediate accumulator array is then translated into one of the 52 possible uppercase or lowercase English alphabetic characters. That translated result is returned.

Detailed Algorithm

1. Initialize storage that can hold at least 2 unsigned bytes to 0x11. This is the accumulator for the calculated CRC.

2. Allocate storage that can hold 16 unsigned bytes. This will be used as an intermediate output accumulator which will combine the input data and the calculated CRC data in a manner that will result in each of the 16 bytes having a value between 0 and 255 inclusive.

3. Iterate the following steps 5 times:

3a. Walk forward through the array of source data provided and XOR each byte into the intermediate output accumulator (target array), but with the index of the target matching a backwards walk. Since the source array can be of any length, the lower index positions in the target may not ever get source data XOR'd. Source arrays longer than the target array cause wrapping and that means that the higher index positions in the target array will have data from multiple source positions XOR'd. See the section below entitled Spread Function for more specifics as well as the significant cryptographic limitations this causes.

3b. Iterate 8 times, doing the following:

3b1. Calculate the CRC-16 of the current data in both the intermediate output accumulator as well as the CRC accumulator. This is the core CRC algorithm and it will calculate the CRC of the current state of the intermediate output accumulator while including the initial state of the CRC accumulator. The resulting calculated CRC is assigned back to the CRC accumulator. The CRC-16 algorithm used is not cryptographically sound. It generates a non-uniform distribution of hashed data and collisions are not rare. See the section below entitled CRC-16 Function for details.

3b2. The least significant byte (byte 0) of the CRC accumulator will be directly stored into an index position in the intermediate output accumulator. This index position starts at 0 and increments by 2 for every iteration of the nearest containing loop. Since the intermediate output accumulator has 16 elements, that is why the containing loop iterates only 8 times. This byte must be treated as an unsigned value.

3b3. The next to least significant byte (byte 1) of the CRC accumulator will be directly stored into an index position in the intermediate output accumulator. This index position starts at 1 and increments by 2 for every iteration of the nearest containing loop. This byte must be treated as an unsigned value.

4. At this point the intermediate output accumulator contains 16 poorly distributed hashed bytes. Translate each byte of the intermediate output accumulator into a byte from the valid output character set. Valid output characters include all English alphabetic letters (both uppercase and lowercase), for 52 possible output characters. See the section below entitled Output Translation Function for details on this process and for the severe limitations of the resulting non-uniform distribution.

The result of this algorithm is NOT cryptographically sound. It should not be used for secure purposes. The flaws in this algorithm come from the processing associated with the Spread Function, the CRC-16 Function and the Output Translation Function.

Spread Function

Walk forward through the array of source data provided and XOR each byte into the target array, but with the index of the target matching a backwards walk.

The source data can be an array of any length and the target length is not expected to match. The target indexes are calculated modulo the length of the target array, which means that for a source array longer than the target array, one or more target array elements will store data XOR'd from more than one source source array element (i.e. the XOR processing will wrap around).

For a source array smaller than the target array, there will be some low index elements of the target array which will not receive any XOR'd data. Only in the case where the two arrays are the same length will there be no wrapping and no unmerged elements in the target array. Only in the case where the source array is modulo the size of the target array will all elements of the target array be modified.

From a cryptographic perspective, the algorithm is quite poor. Source arrays smaller than the target arrays will leave target array bytes unmodified. Source arrays larger than the target array size will have some (or all) elements modified multiple times. Both cases are causes of concern. Not modifying elements means that there is less input provided for distribution of results. Modifying elements more than once can
cause issues as well. Consider the case where the same character appears in the source array in the first byte (index 0) and in the byte modulo the size of the target. Because of wrapping, this character will be XOR'd twice into the same element of the target array. XOR is its own inverse. If you XOR the same data into a byte an even number of times, then the resulting byte will be unchanged. This will have the effect of reducing the distribution of the resulting data for some inputs. This is not suitable for secure purposes.

param1 source
Source array to XOR from. Must not be null. May be of any length.

param2 target
Target array to XOR into. Must not be null. May be of any length.

No data is returned.

CRC-16 Function

This implements a standard CRC-16 (also known as CRC-16-IBM or CRC-16-IBM) algorithm that uses a reversed polynomial of 0xA001 and swapped byte ordering. Using a reversed polynomial means it processes each byte's least significant bit first (it shifts the binary data to the right). Using swapped byte ordering means that the input data (which is being treated as an arbitrarily large binary number) must be processed from highest element to lowest element, whereas normally one might consider the most significant byte to be in the highest array index position, this algorithm assumes the opposite.

If this algorithm generated uniformly distributed hashes, then in a best case scenario the probability of a collision between any 2 items is (1 / 2^16) or .00152588 %. That seems good until one considers that between any 300 items there is a 50% chance of collisions and between any 430 items there is a 75% chance of a collision! This is the well known birthday problem (see http://en.wikipedia.org/wiki/Birthday_attack).

Please note that the CRC-16 algorithm is not suitable for cryptographic hashing. It does NOT generate uniformly distributed hashes. This means that the collision rate is not even as good as the best case scenario. Even if it did have uniform distribution, the small number of bits means that collisions are not very rare. CRC is more suitable for error detection. It should NOT be used for secure purposes.

The input parameters should be:

param1 data
The bytes of data to CRC.

param2 crc
The initial CRC value into which each element of the data will be XOR'd.

The return value is the calculated CRC value after factoring in all data bytes and binary dividing the polynomial.

Output Translation Function

Convert the source array into an array of bytes with the same number of elements, but where each byte may only contain uppercase or lowercase English alphabetic characters (a-z and A-Z).

The source array will have each element translated into a corresponding element in the output array. Only the least significant byte of the source array element is considered in the translation algorithm. The order of the elements in the output array will be the same order as the source array (e.g. the first source element translates to the first output element and so forth).

The following algorithm is used to translate the 256 possible source byte values into the 52 possible output byte values:

If the least significant 7 bits of the source byte are one of the 52 possible English alphabetic characters, then that is the resulting byte. This will yield a byte with 0x41 ('A') through 0x5A ('Z') or 0x61 ('a') - 0x7A ('z'), inclusive.
Otherwise, the most significant (upper) nibble of the source byte will be used to select a character from 'a' through 'q'. This nibble can only have one of 16 possible values (0 through 15). This nibble value is used as a direct index into the 16 possible lowercase output letters. Thus, a nibble value of 0 (0x0) will yield 'a' (0x61), 1 (0x1) will yield 'b' (0x62) and so on, with the last possible value of 15 (0xF) yielding 'p' (0x70).

Assuming the source bytes are uniformly distributed, this translation approach is guaranteed to result in a highly UNEVEN distribution of output bytes. This is cryptographically BAD and should NOT be used for secure purposes. To understand why this occurs:

Source Byte Range    Output Byte                               Distribution Notes
-----------------    -----------    ----------------------------------------------------------------------------
0x00 - 0x40          'a' - 'e'      'a' through 'd' are 16X more likely than 'e'
0x41 - 0x5A          'A' - 'Z'      all equally likely (1 to 1 mapping of input and output)
0x5B - 0x60          'f' - 'g'      'f' 5X more likely than 'g'
0x61 - 0x7A          'a' - 'z'      all equally likely (1 to 1 mapping of input and output)
0x7B - 0xC0          'h' - 'm'      'i' through 'l' are 16X more likely than 'm', 'h' is 5X more likely than 'm'
0xC1 - 0xDA          'A' - 'Z'      all equally likely (1 to 1 mapping of input and output)
0xDB - 0xE0          'n' - 'o'      'n' 5X more likely than 'o'
0xE1 - 0xFA          'a' - 'z'      all equally likely (1 to 1 mapping of input and output)
0xFB - 0xFF          'p'            only 'p' is possible

More graphically, here is the exact distribution that will occur for perfectly distributed input:

Character   Frequency          Histogram
---------   ---------   -----------------------
A           2           ++
B           2           ++
C           2           ++
D           2           ++
E           2           ++
F           2           ++
G           2           ++
H           2           ++
I           2           ++
J           2           ++
K           2           ++
L           2           ++
M           2           ++
N           2           ++
O           2           ++
P           2           ++
Q           2           ++
R           2           ++
S           2           ++
T           2           ++
U           2           ++
V           2           ++
W           2           ++
X           2           ++
Y           2           ++
Z           2           ++
a           18          ++++++++++++++++++
b           18          ++++++++++++++++++
c           18          ++++++++++++++++++
d           18          ++++++++++++++++++
e           3           +++
f           7           +++++++
g           3           +++
h           7           +++++++
i           18          ++++++++++++++++++
j           18          ++++++++++++++++++
k           18          ++++++++++++++++++
l           18          ++++++++++++++++++
m           3           +++
n           7           +++++++
o           3           +++
p           7           +++++++
q           2           ++
r           2           ++
s           2           ++
t           2           ++
u           2           ++
v           2           ++
w           2           ++
x           2           ++
y           2           ++
z           2           ++

This non-uniform distribution is purely due to the fallback approach of using the upper nibble of some bytes as a selector into a subset of the possible output values. Since the only a subset of the output values are targeted, it makes those values more likely to occur. In addition, because there are some upper nibble values that are less likely to be encountered (because they rarely trigger the fallback mechanism in the first place), this causes an additional level of non-uniformity even within the subset that is possible to be selected.

param1 source
The bytes to convert. Must not be null. Only the least significant byte of each element will be used.

The return value is the converted array of bytes, with one element for each corresponding element of the source array.

#46 Updated by Greg Shah over 9 years ago

File CustomCRC16Encoder.java added

I have written a Java implementation based on the spec. I also have tested/fixed this implementation with over 750K password values, all of which now get identical output. This implementation is ready for integration with P2J which is the next step.

The only additional finding that wasn't previously documented is the handling of null characters. In the 4GL it is possible to get a null character (or more than one) into a character variable using the 3 parameter version of GET-STRING(). When this occurs, the resulting ENCODE processing of that string truncates the data at the first null character. It treats the string as if it ends at that location, even if there is additional data following the null character.

Attached are the test programs, the program used for generating most of the password inputs and the compatible Java encode implementation.

#47 Updated by Greg Shah over 9 years ago

File StringGenerator.java added
File encode_binary_data_input.p added
File encode_text_input.p added

Here are the other files.

#48 Updated by Greg Shah over 9 years ago

File ges_upd20141107a.zip added

This is the P2J update that contains the new compatible ENCODE implementation in a fully integrated state. The compatible ENCODE (SecurityOps._encode()) is the version that is used by default, but if the following value is added to the directory under /server/default/, then the old incompatible ENCODE (SecurityOps._incompatible_encode()) will be used instead:

        <node class="boolean" name="encode-compatibility">
          <node-attribute name="value" value="FALSE"/>
        </node>

Also included in this update are many unrelated changes:

FINALLY block support (conversion and runtime).
LOG-MANAGER and DSLOG-MANAGER system handle conversion support and stubs including all possible attributes and methods.
A new map-based approach to specifying conversion for attributes and methods. See the load_descriptors function in methods_attributes.rules.
UnimplementedFeature support (see below).
Some improvements for null character handling in both BinaryData and in the character/longchar/Text/TextOps classes. It turns out that it is possible to get a null character into a character var (but not into a longchar as far as we know). Our null character handling is still incomplete, but this does make the current code better. See #2429.
I/O options support for conversion of NO-CONVERT and CONVERT SOURCE/TARGET. The runtime stubs are there too.
CPINTERNAL support (it is not complete yet, but it will work for current needs).
COPY-LOB support for additional variants, both conversion and some incomplete but working first passes at runtime function.
Report fixes and improvements.

In regard to the UnimplementedFeature class, the idea is that this is a better way to specify TODOs in the code. The idea is that if you specify a todo this way and the functionality is actually in use, then you will get a nice log message in the server log (including some descriptive text AND a stack trace) to show where the usage occurred. This makes it easier to debug problems that are caused by missing functionality. There are 3 possible static methods to call missing(), partial() and todo(). missing() is appropriate for methods that are completely unimplemented. partial() may be useful for something that is there but has some significant portion missing. todo() is useful for everything else. Please note that it makes sense to make the partial() and todo() usage controlled by conditional logic so that they only emit when the actual problem/missing code would be needed. The idea is to (over time) replace all TODO comments with this usage.

I could see 2 potential improvements:

1. Enable this for use on the client side (writing to the server log). This is something we need for logging in general I think.
2. Make it possible that only the first invocation (of a particular call to one of these methods) is logged, but not subsequent ones. An example is in Utils.getCharsetOverride() where I initially put a call to todo(), but it just filled up the logfile. So I commented it out. But I would like 1 of the calls to actually be made and the others to be suppressed.

#49 Updated by Greg Shah over 9 years ago

Assignee set to Greg Shah
Status changed from WIP to Review

#50 Updated by Constantin Asofiei over 9 years ago

About 1107a.zip; beside the following issues, I don't see any other problems in the logic.

about the UnimplementedFeature: if we want to track only the fact that a portion of code which is not yet complete has been reached, then we can go down the stacktrace and stop on the first entry which is for a code not in the UnimplementedFeature class; we can go even further and collect all entries which are in the P2J source code (from this point downward). Add this StackTraceElement(s) - their combined hash-code can be used - to a set and log only the first occurrence.
Text.java has the line terminator changed to windows-style
about the FINALLY block: in 4GL, does a condition raised in the body get overwritten by a condition raised in the FINALLY block?
the interface for the new resource (LegacyLogManager) needs to be added to the HandleCommon interface
character.java - you have added assumptions that a null BDT means an unknown BDT... until now, my assumption was that if a BDT parameter exists, then it must always be not-null. If it is null, IMO then this is a programming error on the conversion/developer side.
SecurityOps.compatible - this is a static field and will be initialized during the import process, too, as the class is used by the ImportWorker. The directory access needed for this field might break the import.

#51 Updated by Greg Shah over 9 years ago

about the UnimplementedFeature: if we want to track only the fact that a portion of code which is not yet complete has been reached, then we can go down the stacktrace and stop on the first entry which is for a code not in the UnimplementedFeature class; we can go even further and collect all entries which are in the P2J source code (from this point downward). Add this StackTraceElement(s) - their combined hash-code can be used - to a set and log only the first occurrence.

My primary concern with this is that it is costly to implement, even on the subsequent calls. The one place that I had to disable use of UnimplementedFeature is called so often that this stack processing approach is something I wouldn't want to enable.

We could always implement a token based approach (similar to how we use tokens to manage calls to RemoteObject. For most use cases, we wouldn't store or care about the token. For those cases we use the same API as today. But in those cases where the logging is too much, we save off the returned token and use an overloaded variant of the method that takes an additional token parameter. If the token is in a set, then we don't log.

What do you think?

Text.java has the line terminator changed to windows-style

I edit exclusively on Linux. I'm not sure when this was added, but it was already there in bzr when I started. I have fixed this.

about the FINALLY block: in 4GL, does a condition raised in the body get overwritten by a condition raised in the FINALLY block?

I'm not sure. I haven't fully tested the 4GL functionality yet. I doubt that we need full support yet. The following are the open TODOs:

flow of control statement effects (inside the "associated" block and the finally block)
references to labels from within the finally block
nested blocks in a finally
error handling
transaction state

The 4GL docs state that you can have arbitrarily complex/nested code inside finally blocks, so I suspect there will be some trickiness to find/fix. But for now the implementation should be sufficient.

the interface for the new resource (LegacyLogManager) needs to be added to the HandleCommon interface

I'll fix this.

character.java - you have added assumptions that a null BDT means an unknown BDT... until now, my assumption was that if a BDT parameter exists, then it must always be not-null.

That is true.

If it is null, IMO then this is a programming error on the conversion/developer side.

I agree, but I think this is a safer approach considering that the API is going to be used more widely than just with code that we generate. Since an increasing number of customers will have their programmers writing code that uses our APIs, I think it is safer to err on the side of safety, even if it allows some sloppiness or caller-bugs to occur without a hard failure.

SecurityOps.compatible - this is a static field and will be initialized during the import process, too, as the class is used by the ImportWorker. The directory access needed for this field might break the import.

I considered this when I made the change. I believe the use of the directory is already there for import. The static member customSecurityOps is unconditionally initialized using instantiateUserHandler() which uses the directory. It is safe because there is a default specified. I have likewise put a default value in the new lookup, such that during import processing this should fall-back gracefully as well.

#52 Updated by Constantin Asofiei over 9 years ago

Greg Shah wrote:

My primary concern with this is that it is costly to implement, even on the subsequent calls. The one place that I had to disable use of UnimplementedFeature is called so often that this stack processing approach is something I wouldn't want to enable.

We could always implement a token based approach (similar to how we use tokens to manage calls to RemoteObject. For most use cases, we wouldn't store or care about the token. For those cases we use the same API as today. But in those cases where the logging is too much, we save off the returned token and use an overloaded variant of the method that takes an additional token parameter. If the token is in a set, then we don't log.

What do you think?

In either case, the runtime will have an additional overhead if this is in place; the "unimplemented" parts will never reach production code, but we might need to add a flag to disable the UnimplementedFeature for production use (unless the overhead is proven to be minimum...). Your approach using a token should be OK, although it will take some trial-end-error to determine which calls need to use the token and which not.

If it is null, IMO then this is a programming error on the conversion/developer side.

I agree, but I think this is a safer approach considering that the API is going to be used more widely than just with code that we generate. Since an increasing number of customers will have their programmers writing code that uses our APIs, I think it is safer to err on the side of safety, even if it allows some sloppiness or caller-bugs to occur without a hard failure.

I understand; but I think we should make this part of our standard, so all APIs exposed to developers/emitted by conversion rules will work in the same way.

#53 Updated by Greg Shah over 9 years ago

I think we should make this part of our standard, so all APIs exposed to developers/emitted by conversion rules will work in the same way.

Agreed.

#54 Updated by Greg Shah over 9 years ago

File ges_upd20141110a.zip added

This is the update with the suggested changes.

#55 Updated by Constantin Asofiei over 9 years ago

Greg Shah wrote:

This is the update with the suggested changes.

The only comment I have is that if UnimplementedFeature is used multiple times in a file, the tokens can't be shared; it might get tricky maintaining it at some point. We need some guidelines as to not share the tokens, remove them after the UnimplementedFeature was removed, etc.

#56 Updated by Greg Shah over 9 years ago

File ges_upd20141110b.zip added

Here is a version with enhanced javadoc that provides more detailed guidance for using UnimplementedFeature.

#57 Updated by Greg Shah over 9 years ago

This version is merged with 10651. I'm going to do a regression testing run, but there are conflicting changes in #2425 which will be checked in first, so I expect to run another round after my final merge.

#58 Updated by Greg Shah over 9 years ago

File ges_upd20141111a.zip added

And the update...

#59 Updated by Greg Shah over 9 years ago

This is merged up with 10652 and I have started regression testing. Conversion and runtime regression testing on the previous update did pass, so I expect this to pass as well.

With this update, I decided to disable the use of Unimplemented.todo() in Utils. Since we don't have support for remote logging yet, each client log gets a log entry with a runtime exception. Even the conversion output contains this:

     [java] Nov 11, 2014 1:16:11 PM UnimplementedFeature TODO
     [java] SEVERE: {main} TODO: This code doesn't properly handle UTF-8.
     [java] java.lang.RuntimeException
     [java]     at com.goldencode.p2j.util.UnimplementedFeature.todo(UnimplementedFeature.java:187)
     [java]     at com.goldencode.p2j.util.UnimplementedFeature.todo(UnimplementedFeature.java:217)
     [java]     at com.goldencode.p2j.util.Utils.getCharsetOverride(Utils.java:566)
     [java]     at com.goldencode.p2j.util.Text.<clinit>(Text.java:44)
     [java]     at com.goldencode.p2j.schema.SchemaParser.valExp(SchemaParser.java:1776)
     [java]     at com.goldencode.p2j.schema.SchemaParser.field(SchemaParser.java:1498)
     [java]     at com.goldencode.p2j.schema.SchemaParser.table(SchemaParser.java:666)
     [java]     at com.goldencode.p2j.schema.SchemaParser.schema(SchemaParser.java:398)
     [java]     at com.goldencode.p2j.schema.SchemaLoader.importSchema(SchemaLoader.java:332)
     [java]     at com.goldencode.p2j.schema.SchemaLoader.importAll(SchemaLoader.java:259)
     [java]     at com.goldencode.p2j.convert.ConversionDriver.runSchemaLoader(ConversionDriver.java:363)
     [java]     at com.goldencode.p2j.convert.ConversionDriver.front(ConversionDriver.java:283)
     [java]     at com.goldencode.p2j.convert.ConversionDriver.main(ConversionDriver.java:1733)

The remote logging is not simple to implement, since there are some operations (like generating stack traces) which inherently should occur on the system generating the log entry and other operations that really need to be remoted. In addition, since the J2SE Logger is a class and not an interface, implementing a dynamic proxy for it doesn't work. When combined with the UnimplementedFeature need for some kind of remotable token processing, the overall solution is something that needs to be done later (I have deliverables in the short term that can't be delayed).

#60 Updated by Greg Shah over 9 years ago

File ges_upd20141112a.zip added

I forgot the update again...

#61 Updated by Greg Shah over 9 years ago

Both conversion and main runtime regression testing has passed. Only spurious issues in tc_job_clock_005 and tc_job_matlcron_002 remain other than the other 3 expected failures. I didn't re-run CTRL-C testing since it had passed previously. I've committed the update as bzr 10654.

#62 Updated by Greg Shah over 9 years ago

Status changed from Review to Closed

#63 Updated by Greg Shah over 7 years ago

Target version changed from 23 to Cleanup and Stablization for Server Features

Also available in: Atom PDF

Project

General

Profile

FWD » Core Development » Base Language

Issues

Custom queries

Feature #27

implement a compatible ENCODE built-in function

History

#1 Updated by Nick Saxon almost 19 years ago

#2 Updated by Nick Saxon almost 19 years ago

#3 Updated by Nick Saxon almost 19 years ago

#4 Updated by Nick Saxon almost 19 years ago

#5 Updated by Nick Saxon almost 19 years ago

#6 Updated by Nick Saxon almost 19 years ago

#7 Updated by Nick Saxon almost 19 years ago

#8 Updated by Nick Saxon almost 19 years ago

#9 Updated by Nick Saxon almost 19 years ago

#10 Updated by Nick Saxon almost 19 years ago

#11 Updated by Nick Saxon almost 19 years ago

#12 Updated by Nick Saxon almost 19 years ago

#13 Updated by Nick Saxon almost 19 years ago

#14 Updated by Nick Saxon almost 19 years ago

#15 Updated by Nick Saxon almost 19 years ago

#16 Updated by Nick Saxon almost 19 years ago

#17 Updated by Nick Saxon almost 19 years ago

#18 Updated by Nick Saxon almost 19 years ago

#19 Updated by Nick Saxon almost 19 years ago

#20 Updated by Nick Saxon almost 19 years ago

#21 Updated by Nick Saxon almost 19 years ago

#22 Updated by Nick Saxon almost 19 years ago

#23 Updated by Nick Saxon almost 19 years ago

#24 Updated by Nick Saxon almost 19 years ago

#25 Updated by Nick Saxon almost 19 years ago

#26 Updated by Nick Saxon almost 19 years ago

#27 Updated by Nick Saxon almost 19 years ago

#28 Updated by Nick Saxon almost 19 years ago

#29 Updated by Nick Saxon almost 19 years ago

#30 Updated by Nick Saxon almost 19 years ago

#31 Updated by Redmine Admin almost 19 years ago

#32 Updated by Redmine Admin almost 13 years ago

#33 Updated by Greg Shah about 12 years ago

#34 Updated by Greg Shah over 11 years ago

#35 Updated by Greg Shah over 11 years ago

#36 Updated by Greg Shah over 11 years ago

#37 Updated by Greg Shah about 11 years ago

#38 Updated by Greg Shah about 11 years ago

#39 Updated by Greg Shah about 11 years ago

#40 Updated by Greg Shah about 11 years ago

#41 Updated by Greg Shah about 11 years ago

#42 Updated by Greg Shah about 10 years ago

#43 Updated by Greg Shah almost 10 years ago

#44 Updated by Greg Shah almost 10 years ago

#45 Updated by Greg Shah almost 10 years ago

#46 Updated by Greg Shah over 9 years ago

#47 Updated by Greg Shah over 9 years ago

#48 Updated by Greg Shah over 9 years ago

#49 Updated by Greg Shah over 9 years ago

#50 Updated by Constantin Asofiei over 9 years ago

#51 Updated by Greg Shah over 9 years ago

#52 Updated by Constantin Asofiei over 9 years ago

#53 Updated by Greg Shah over 9 years ago

#54 Updated by Greg Shah over 9 years ago

#55 Updated by Constantin Asofiei over 9 years ago

#56 Updated by Greg Shah over 9 years ago

#57 Updated by Greg Shah over 9 years ago

#58 Updated by Greg Shah over 9 years ago

#59 Updated by Greg Shah over 9 years ago

#60 Updated by Greg Shah over 9 years ago

#61 Updated by Greg Shah over 9 years ago

#62 Updated by Greg Shah over 9 years ago

#63 Updated by Greg Shah over 7 years ago