CustomCRC16Encoder.java

compatible Java ENCODE implementation (NOT integrated into P2J) - Greg Shah, 11/04/2014 03:01 PM

       import java.io.*;
       import java.util.*;
       /**
        * Progress 4GL compatible <code>ENCODE</code> algorithm.
        */
       public class CustomCRC16Encoder
       {
          /** Initial CRC value. */
          public static final int INITIAL_CRC_VALUE = 0x11;
          /** Number of bytes in the encoded output. */
          public static final int OUTPUT_ARRAY_SIZE = 16;
          /** Number of iterations in the core encoding loop. */
          public static final int CORE_LOOP_PASSES = 5;
          /** Reversed polynomial used with the CRC-16 algorithm. */
          public static final int REVERSED_POLYNOMIAL = 0xA001;
          /**
           * Progress 4GL compatible <code>ENCODE</code> algorithm which takes a source byte array
           * (of any size) and generates a 16-byte one-way hash using a proprietary approach
           * combined with a variant of CRC-16.  The basic approach is to spread out the source data
           * in an intermediate accumulator array of 16 unsigned bytes and to repeatedly CRC that data
           * and copy the resulting 2-byte CRC into successive elements of this intermediate
           * accumulator, while building the next CRC results on the accumulated CRC value and the
           * recently modified intermediate accumulator.  This set of CRCs is calculated 8 times
           * (since that will generate 16 bytes).  This basic approach is executed 5 times in a row.
           * Each byte of the resulting hashed intermediate accumulator array is then translated into
           * one of the 52 possible uppercase or lowercase English alphabetic characters. That
           * translated result is returned.
           * <p>
           * The detailed algorithm:
           * <p>
           * <ol>
           *   <li> Initialize storage that can hold at least 2 unsigned bytes to 0x11. This is the
           *        accumulator for the calculated CRC.
           *   <li> Allocate storage that can hold 16 unsigned bytes.  This will be used as an
           *        intermediate output accumulator which will combine the input data and the
           *        calculated CRC data in a manner that will result in each of the 16 bytes having a
           *        value between 0 and 255 inclusive.
           *   <li> Iterate the following steps 5 times:
           *      <ol>
           *         <li>Walk forward through the array of source data provided and XOR each byte into
           *             the intermediate output accumulator (target array), but with the index of the
           *             target matching a backwards walk. Since the source array can be of any length,
           *             the lower index positions in the target may not ever get source data XOR'd.
           *             Source arrays longer than the target array cause wrapping and that means that
           *             the higher index positions in the target array will have data from multiple
           *             source positions XOR'd. See the {@link #sourceToTargetXor} method for details
           *             as well as the significant cryptographic limitations this causes.
           *         <li>Iterate 8 times, doing the following:
           *            <ol>
           *               <li> Call the {@link #crc16} method, passing in both the intermediate output
           *                    accumulator as well as the CRC accumulator.  This is the core CRC
           *                    algorithm and it will calculate the CRC of the current state of the
           *                    intermediate output accumulator while including the initial state of
           *                    the CRC accumulator.  The resulting returned CRC result is assigned
           *                    back to the CRC accumulator.  The CRC-16 algorithm used is not
           *                    cryptographically sound.  It generates a non-uniform distribution of
           *                    hashed data and collisions are not rare.
           *               <li> The least significant byte (byte 0) of the CRC accumulator will be
           *                    directly stored into an index position in the intermediate output
           *                    accumulator. This index position starts at 0 and increments by 2 for
           *                    every iteration of the nearest containing loop.  Because the
           *                    intermediate output accumulator has 16 elements and each iteration
           *                    stores 2 of them, the containing loop iterates only 8 times.  This
           *                    byte must be treated as an unsigned value.
           *               <li> The next to least significant byte (byte 1) of the CRC accumulator will
           *                    be directly stored into an index position in the intermediate output
           *                    accumulator. This index position starts at 1 and increments by 2 for
           *                    every iteration of the nearest containing loop. This byte must be
           *                    treated as an unsigned value.
           *            </ol>
           *      </ol>
           *   <li> At this point the intermediate output accumulator contains 16 poorly distributed
           *        hashed bytes.  Translate each byte of the intermediate output accumulator into
           *        a byte from the valid output character set.  Valid output characters include all
           *        English alphabetic letters (both uppercase and lowercase), for 52 possible output
           *        characters.  See the {@link #translateToAlpha} method for details on this process
           *        and for the severe limitations of the resulting non-uniform distribution.
           * </ol>
           * <p>
           * The result of this algorithm is NOT cryptographically sound.  It should not be used for
           * secure purposes.  The flaws in this algorithm come from the processing associated with
           * {@link #sourceToTargetXor}, {@link #crc16} and {@link #translateToAlpha}.
           *
           * @param    data
           *           The data to be encoded.  This must not be <code>null</code>.  The array may be
           *           of any length.  The data is truncated at the first element whose value is 0
           *           (the null character), if there is one.
           *
           * @return   16 encoded bytes where each byte is one of the 52 possible output characters.
           */
          public static byte[] encode(int[] data)
          {
             short[] storage = new short[OUTPUT_ARRAY_SIZE];
             int crc    = INITIAL_CRC_VALUE;
             int passes = CORE_LOOP_PASSES;
             // the input data is truncated at the first encountered null character
             for (int n = 0; n < data.length; n++)
             {
                if (data[n] == 0)
                {
                   data = Arrays.copyOf(data, n);
                   break;
                }
             }
             while (passes != 0)
             {
                sourceToTargetXor(data, storage);
                int idx = 0;
                while (idx < storage.length)
                {
                   crc = crc16(storage, crc);
                   // store the CRC low byte, then the high byte, as unsigned values (0 - 255);
                   // a cast to byte here would sign-extend values above 0x7F
                   storage[idx++] = (short)(crc & 0xFF);
                   storage[idx++] = (short)((crc >> 8) & 0xFF);
                }
                passes--;
             }
             return translateToAlpha(storage);
          }
          /**
           * Walk forward through the array of source data provided and XOR each byte into the target
           * array, but with the index of the target matching a backwards walk.  The source data can
           * be an array of any length and the target length is not expected to match.  The target
           * indexes are calculated modulo the length of the target array, which means that for a source
           * array longer than the target array, one or more target array elements will store data XOR'd
           * from more than one source array element (i.e. the XOR processing will wrap around).
           * For a source array smaller than the target array, there will be some low index elements
           * of the target array which will not receive any XOR'd data.  Only in the case where the
           * two arrays are the same length will there be no wrapping and no unmerged elements in the
           * target array.  Only in the case where the source array length is a non-zero multiple of
           * the target array size will all elements of the target array be modified the same number
           * of times.
           * <p>
           * From a cryptographic perspective, the algorithm is quite poor.  Source arrays smaller
           * than the target arrays will leave target array bytes unmodified.  Source arrays larger
           * than the target array size will have some (or all) elements modified multiple times.
           * Both cases are causes of concern.  Not modifying elements means that there is less
           * input provided for distribution of results.  Modifying elements more than once can
           * cause issues as well.  Consider the case where the same character appears in the source
           * array in the first byte (index 0) and in the byte whose index equals the size of the
           * target (index 16 for a 16-element target).  Because of wrapping, this character will
           * be XOR'd twice into the same element of the target
           * array.  XOR is its own inverse.  If you XOR the same data into a byte an even number of
           * times, then the resulting byte will be unchanged. This will have the effect of reducing
           * the distribution of the resulting data for some inputs. This is not suitable for secure
           * purposes.
           *
           * @param    source
           *           Source array to XOR from. Must not be <code>null</code>. May be of any length.
           * @param    target
           *           Target array to XOR into. Must not be <code>null</code>. May be of any length.
           */
          public static void sourceToTargetXor(int[] source, short[] target)
          {
             int len = target.length;
             int max = len - 1;
             for (int i = 0; i < source.length; i++)
             {
                target[max - (i % len)] ^= source[i] & 0xFF;
             }
          }
          /**
           * This implements a standard CRC-16 (also known as CRC-16-IBM or CRC-16-ANSI) algorithm that
           * uses a reversed polynomial of 0xA001 and swapped byte ordering. Using a reversed polynomial
           * means it processes each byte's least significant bit first (it shifts the binary data to
           * the right). Using swapped byte ordering means that the input data (which is being treated
           * as an arbitrarily large binary number) is processed from the highest array index down to
           * the lowest.  One might normally expect processing to start at index 0; this algorithm
           * does the opposite.
           * <p>
           * If this algorithm generated uniformly distributed hashes, then in a best case scenario the
           * probability of a collision between any 2 items is (1 / 2^16) or .00152588 %.  That seems
           * good until one considers that between any 300 items there is a 50% chance of collisions
           * and between any 430 items there is a 75% chance of a collision!  This is the well known
           * birthday problem (see http://en.wikipedia.org/wiki/Birthday_attack).
           * <p>
           * Please note that the CRC-16 algorithm is not suitable for cryptographic hashing.  It does
           * NOT generate uniformly distributed hashes.  This means that the collision rate is not even
           * as good as the best case scenario.  Even if it did have uniform distribution, the small
           * number of bits means that collisions are not very rare.  CRC is more suitable for error
           * detection.  It should NOT be used for secure purposes.
           *
           * @param    data
           *           The bytes of data to CRC.
           * @param    crc
           *           The initial CRC value into which each element of the data will be XOR'd.
           *
           * @return   The calculated CRC value after factoring in all data bytes and binary dividing
           *           the polynomial.
           */
          public static int crc16(short[] data, int crc)
          {
             // iterate from the top of the input array to the bottom
             for (int idx = (data.length - 1); idx >= 0; idx--)
             {
                // XOR the input data into the crc
                crc ^= (int) (data[idx] & 0xFF);
                int bit = 7;
                // process each bit
                while (bit >= 0)
                {
                   // check if the least significant bit is set (must be done before shifting)
                   boolean lsb = ((crc & 0x01) == 1);
                   // shift the data right by one bit
                   crc >>= 1;
                   if (lsb)
                   {
                      // the least significant bit was set, XOR the polynomial into the crc
                      crc ^= REVERSED_POLYNOMIAL;
                   }
                   bit--;
                }
             }
             return crc;
          }
          /**
           * Convert the source array into an array of bytes with the same number of elements, but
           * where each byte may only contain uppercase or lowercase English alphabetic characters
           * (a-z and A-Z).
           * <p>
           * The source array will have each element translated into a corresponding element in the
           * output array.  Only the least significant byte of the source array element is considered
           * in the translation algorithm.  The order of the elements in the output array will be
           * the same order as the source array (e.g. the first source element translates to the
           * first output element and so forth).
           * <p>
           * The following algorithm is used to translate the 256 possible source byte values into
           * the 52 possible output byte values:
           * <p>
           * <ol>
           *   <li> If the least significant 7 bits of the source byte are one of the 52 possible
           *        English alphabetic characters, then that is the resulting byte.  This will
           *        yield a byte with 0x41 ('A') through 0x5A ('Z') or 0x61 ('a') - 0x7A ('z'),
           *        inclusive.
           *   <li> Otherwise, the most significant (upper) nibble of the source byte will be used
           *        to select a character from 'a' through 'p'.  This nibble can only have one of
           *        16 possible values (0 through 15).  This nibble value is used as a direct
           *        index into the 16 possible lowercase output letters.  Thus, a nibble value of
           *        0 (0x0) will yield 'a' (0x61), 1 (0x1) will yield 'b' (0x62) and so on, with
           *        the last possible value of 15 (0xF) yielding 'p' (0x70).
           * </ol>
           * <p>
           * Assuming the source bytes are uniformly distributed, this translation approach is
           * guaranteed to result in a highly UNEVEN distribution of output bytes.  This is
           * cryptographically BAD and should NOT be used for secure purposes. To understand
           * why this occurs:
           * <p>
           * <pre>
           * Source Byte Range    Output Byte                               Distribution Notes
           * -----------------    -----------    ----------------------------------------------------------------------------
           * 0x00 - 0x40          'a' - 'e'      'a' through 'd' are 16X more likely than 'e'
           * 0x41 - 0x5A          'A' - 'Z'      all equally likely (1 to 1 mapping of input and output)
           * 0x5B - 0x60          'f' - 'g'      'f' 5X more likely than 'g'
           * 0x61 - 0x7A          'a' - 'z'      all equally likely (1 to 1 mapping of input and output)
           * 0x7B - 0xC0          'h' - 'm'      'i' through 'l' are 16X more likely than 'm', 'h' is 5X more likely than 'm'
           * 0xC1 - 0xDA          'A' - 'Z'      all equally likely (1 to 1 mapping of input and output)
           * 0xDB - 0xE0          'n' - 'o'      'n' 5X more likely than 'o'
           * 0xE1 - 0xFA          'a' - 'z'      all equally likely (1 to 1 mapping of input and output)
           * 0xFB - 0xFF          'p'            only 'p' is possible
           * </pre>
           * <p>
           * More graphically, here is the exact distribution that will occur for perfectly distributed input:
           * <p>
           * <pre>
           * Character   Frequency          Histogram
           * ---------   ---------   -----------------------
           * A           2           ++
           * B           2           ++
           * C           2           ++
           * D           2           ++
           * E           2           ++
           * F           2           ++
           * G           2           ++
           * H           2           ++
           * I           2           ++
           * J           2           ++
           * K           2           ++
           * L           2           ++
           * M           2           ++
           * N           2           ++
           * O           2           ++
           * P           2           ++
           * Q           2           ++
           * R           2           ++
           * S           2           ++
           * T           2           ++
           * U           2           ++
           * V           2           ++
           * W           2           ++
           * X           2           ++
           * Y           2           ++
           * Z           2           ++
           * a           18          ++++++++++++++++++
           * b           18          ++++++++++++++++++
           * c           18          ++++++++++++++++++
           * d           18          ++++++++++++++++++
           * e           3           +++
           * f           7           +++++++
           * g           3           +++
           * h           7           +++++++
           * i           18          ++++++++++++++++++
           * j           18          ++++++++++++++++++
           * k           18          ++++++++++++++++++
           * l           18          ++++++++++++++++++
           * m           3           +++
           * n           7           +++++++
           * o           3           +++
           * p           7           +++++++
           * q           2           ++
           * r           2           ++
           * s           2           ++
           * t           2           ++
           * u           2           ++
           * v           2           ++
           * w           2           ++
           * x           2           ++
           * y           2           ++
           * z           2           ++
           * </pre>
           * <p>
           * This non-uniform distribution is purely due to the fallback approach of using the
           * upper nibble of some bytes as a selector into a subset of the possible output values.
           * Since only a subset of the output values are targeted, it makes those values
           * more likely to occur.  In addition, because there are some upper nibble values that
           * are less likely to be encountered (because they rarely trigger the fallback mechanism
           * in the first place), this causes an additional level of non-uniformity even within
           * the subset that is possible to be selected.
           *
           * @param    source
           *           The bytes to convert.  Must not be <code>null</code>.  Only the least
           *           significant byte of each element will be used.
           *
           * @return   The converted array of bytes, with one element for each corresponding
           *           element of the source array.
           */
          public static byte[] translateToAlpha(short[] source)
          {
             byte[] target = new byte[source.length];
             for (int idx = 0; idx < source.length; idx++)
             {
                target[idx] = (byte)(source[idx] & 0x7F);
                if ((target[idx] < 'A' || (target[idx] > 'Z' && target[idx] < 'a') || target[idx] > 'z'))
                {
                   target[idx] = (byte)(((source[idx] & 0xFF) >> 4) + 'a');
                }
             }
             return target;
          }
          /**
           * Print the command line syntax help.
           */
          private static void syntax()
          {
             System.out.println("Syntax: java CustomCRC16Encoder <mode> " +
                                "{<filename_of_binary_encoded_input_strings> | " +
                                "<input_string_to_hash>} [<input_string_to_hash> ...]");
             System.out.println("Where: <mode> is B (for binary file input), " +
                                "N (no-nulls binary file input), " +
                                "T (text file input) or A (for argument mode)");
          }
          /**
           * Command line test program.
           */
          public static void main(String[] args)
          {
             if (args.length < 2)
             {
                syntax();
                System.exit(-1);
             }
             boolean honorNull = true;
             if ("n".equalsIgnoreCase(args[0]))
             {
                args[0] = "b";
                honorNull = false;
             }
             if ("b".equalsIgnoreCase(args[0]))
             {
                // read strings from a binary file
                InputStream in = null;
                try
                {
                   File file = new File(args[1]);
                   if (!file.exists() || file.isDirectory())
                   {
                      syntax();
                      System.exit(-2);
                   }
                   in = new BufferedInputStream(new FileInputStream(file));
                   int    remain = (int) file.length();
                   int    idx    = 0;
                   while (idx < remain)
                   {
                      // read 1 byte that tells us the size of the following binary string
                      int sz = in.read();
                      idx++;
                      int[] data = new int[sz];
                      int nulls = 0;
                      // empty string is size 0
                      for (int i = 0; i < sz; i++)
                      {
                         // read in each byte (the value needs to be unsigned so we use int)
                         int next = in.read();
                         // encode input in the 4GL is a character type which can only have embedded
                         // nulls chars using get-string() in a specific way, if you are using chr()
                         // or other forms then nulls won't be there and those null characters result
                         // in an empty string; this behavior is controlled with a flag to allow
                         // testing both ways
                         if (!honorNull && next == 0)
                         {
                            nulls++;
                            continue;
                         }
                         data[i - nulls] = next;
                      }
                      // remove the unused array elements
                      if (nulls > 0)
                      {
                         data = Arrays.copyOf(data, (sz - nulls));
                      }
                      idx += sz;
                      byte[] result = encode(data);
                      System.out.println(new String(result));
                   }
                }
                catch (IOException ioe)
                {
                   ioe.printStackTrace();
                }
                finally
                {
                   try
                   {
                      if (in != null)
                         in.close();
                   }
                   catch (IOException ioe)
                   {
                      // ignore
                   }
                }
             }
             else if ("t".equalsIgnoreCase(args[0]))
             {
                // read strings from a text file
                BufferedReader reader = null;
                try
                {
                   File file = new File(args[1]);
                   if (!file.exists() || file.isDirectory())
                   {
                      syntax();
                      System.exit(-2);
                   }
                   reader = new BufferedReader(new FileReader(file));
                   String next = reader.readLine();
                   while (next != null)
                   {
                      int len = next.length();
                      int[] data = new int[len];
                      for (int i = 0; i < len; i++)
                      {
                         data[i] = (int) next.charAt(i);
                      }
                      byte[] result = encode(data);
                      System.out.println(new String(result));
                      // read 1 line w/o CR or LF
                      next = reader.readLine();
                   }
                }
                catch (IOException ioe)
                {
                   ioe.printStackTrace();
                }
                finally
                {
                   try
                   {
                      if (reader != null)
                         reader.close();
                   }
                   catch (IOException ioe)
                   {
                      // ignore
                   }
                }
             }
             else
             {
                // strings are encoded in args
                for (int i = 1; i < args.length; i++)
                {
                   byte[] txt  = args[i].getBytes();
                   int[]  data = new int[txt.length];
                   for (int k = 0; k < txt.length; k++)
                   {
                      data[k] = txt[k];
                   }
                   byte[] hashed = encode(data);
                   System.out.printf("%60s = %16s\n", "'" + args[i] + "'", new String(hashed));
                }
             }
          }
       }
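
The frequency table in the translateToAlpha Javadoc can be checked mechanically. The following is a minimal, standalone sketch, not part of the attached class: the class name AlphaHistogramSketch and its helper methods are hypothetical, and translate() is a re-statement of the documented mapping rule for a single value rather than the original method. It runs all 256 possible byte values through that rule and counts how often each output letter occurs.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper (not in the original file) that tallies the translateToAlpha mapping. */
public class AlphaHistogramSketch
{
   /** Re-statement of the documented rule for a single value in the range 0 - 255. */
   public static char translate(int value)
   {
      int low7 = value & 0x7F;
      boolean alpha = (low7 >= 'A' && low7 <= 'Z') || (low7 >= 'a' && low7 <= 'z');

      // alphabetic low 7 bits pass through, otherwise the upper nibble selects 'a' - 'p'
      return alpha ? (char) low7 : (char) (((value & 0xFF) >> 4) + 'a');
   }

   /** Count how many of the 256 byte values map to each output letter. */
   public static Map<Character, Integer> histogram()
   {
      Map<Character, Integer> counts = new TreeMap<>();
      for (int v = 0; v < 256; v++)
      {
         counts.merge(translate(v), 1, Integer::sum);
      }
      return counts;
   }

   public static void main(String[] args)
   {
      // one line per output letter: the letter followed by its frequency
      histogram().forEach((ch, n) -> System.out.println(ch + " " + n));
   }
}
```

Under a uniform input assumption the counts should match the Javadoc histogram: for example 'a' occurs 18 times (16 fallback values plus 0x61 and 0xE1), while 'A' occurs only twice (0x41 and 0xC1).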

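The wrapping and cancellation behaviour described in the sourceToTargetXor Javadoc can also be seen in isolation. This is a small sketch under stated assumptions: the class XorWrapSketch and the method onePass are hypothetical names, the target is a zeroed 16-element array, and only a single XOR pass is performed (no CRC step), so the values shown are not ENCODE output.

```java
import java.util.Arrays;

/** Hypothetical demo (not in the original file) of the backward, wrapping XOR step. */
public class XorWrapSketch
{
   /** Same indexing as the documented step: walk the source forward, the target backward. */
   public static void xorInto(int[] source, short[] target)
   {
      int len = target.length;
      for (int i = 0; i < source.length; i++)
      {
         target[(len - 1) - (i % len)] ^= source[i] & 0xFF;
      }
   }

   /** Run one XOR pass of the text's characters over a zeroed 16-element target. */
   public static short[] onePass(String text)
   {
      int[] source = text.chars().toArray();
      short[] target = new short[16];
      xorInto(source, target);
      return target;
   }

   public static void main(String[] args)
   {
      // 3 characters: only indexes 15, 14 and 13 receive data, the rest stay 0
      System.out.println(Arrays.toString(onePass("abc")));

      // 17 characters with 'x' at source index 0 and 16: both land on target index 15
      // and cancel each other out, leaving that element at 0
      System.out.println(Arrays.toString(onePass("xbcdefghijklmnopx")));
   }
}
```

In the second call the leading and trailing 'x' are XOR'd into the same element, so that element ends up exactly as if neither character had been present.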