TextLexer (P2J - Progress 4GL to Java Conversion and Runtime)

java.lang.Object
- antlr.CharScanner
- - com.goldencode.p2j.preproc.TextLexer

All Implemented Interfaces:

antlr.TokenStream, PreprocTokenTypes
```
public class TextLexer
extends antlr.CharScanner
implements PreprocTokenTypes, antlr.TokenStream
```
Tokenizes the input stream of characters from the Progress source file and returns tokens to the caller according to the needs of the preprocessor.
The lexer recognizes the current context such as comments, strings or code and keeps track of the context switches using Environment.setInComment(boolean) and Environment.setInString(boolean) methods.

See Also:

TextParser, Environment

Field Summary

Fields
Modifier and Type	Field and Description
`static antlr.collections.impl.BitSet`	`_tokenSet_0`
`static antlr.collections.impl.BitSet`	`_tokenSet_1`
`static antlr.collections.impl.BitSet`	`_tokenSet_2`
`static antlr.collections.impl.BitSet`	`_tokenSet_3`
`static antlr.collections.impl.BitSet`	`_tokenSet_4`
`static antlr.collections.impl.BitSet`	`_tokenSet_5`
`private boolean`	`brokenStringMatching` Control matched quote processing in strings.
`private int`	`commentNesting` Nesting level for comments.
`private Environment`	`env` Keeps the reference to the shared environment.

Fields inherited from class antlr.CharScanner
_returnToken, caseSensitive, caseSensitiveLiterals, commitToPath, EOF_CHAR, hashString, inputState, literals, saveConsumedInput, tabsize, text, tokenObjectClass, traceDepth

Fields inherited from interface com.goldencode.p2j.preproc.PreprocTokenTypes
AELSE, AELSEIF, AENDIF, AGLOBAL, AIF, ALPHA, AMESSAGE, AMPER, APOST, ARESUME, ASCOPED, ASTMT, ASTRING, ASUSPEND, ATHEN, AUNDEFINE, CODE, COMM_CLOSE, COMM_OPEN, COMMENT, DIGIT, DIGITS, EOF, EQUALS, LBRACE, NL, NULL_TREE_LOOKAHEAD, PPNAME, QSTRING, QUOTE, RBRACE, SPECIAL, STAR, STRING, TAB, WS, XAPOST, XQUOTE

Constructor Summary

Constructors
Constructor and Description
`TextLexer(Environment env)` Constructor.
`TextLexer(antlr.InputBuffer ib)`
`TextLexer(java.io.InputStream in)`
`TextLexer(antlr.LexerSharedInputState state)`
`TextLexer(java.io.Reader in)`

Method Summary

Modifier and Type	Method and Description
`void`	`mASTMT(boolean _createToken)` Matches any ampersand prefaced symbolic name including all recognized or unrecognized preprocessor directives.
`protected void`	`mASTRING(boolean _createToken)` Matches an opening single quote, arbitrary contents and an ending single quote.
`void`	`mCODE(boolean _createToken)` Matches a preprocessor input that has no other interpretation.
`protected void`	`mCOMM_CLOSE(boolean _createToken)` Matches the closing sequence for Progress 4GL comments.
`protected void`	`mCOMM_OPEN(boolean _createToken)` Matches the opening sequence for Progress 4GL comments.
`protected void`	`mCOMMENT(boolean _createToken)` Matches Progress language comments, possibly nested.
`private static long[]`	`mk_tokenSet_0()`
`private static long[]`	`mk_tokenSet_1()`
`private static long[]`	`mk_tokenSet_2()`
`private static long[]`	`mk_tokenSet_3()`
`private static long[]`	`mk_tokenSet_4()`
`private static long[]`	`mk_tokenSet_5()`
`void`	`mNL(boolean _createToken)` Matches a single newline character.
`protected void`	`mQSTRING(boolean _createToken)` Matches an opening double quote, arbitrary contents and an ending double quote.
`void`	`mSTRING(boolean _createToken)` Matches Progress language strings.
`void`	`mWS(boolean _createToken)` Matches any amount of contiguous whitespace (spaces and tabs) in a program.
`antlr.Token`	`nextToken()`
`void`	`tab()` Expands tabs with spaces right in the ANTLRStringBuilder text.
`int`	`testLiteralsTable(int ttype)` Tests the token text against the literals table and provides correct token types for the abbreviated preprocessor statement keywords.

Methods inherited from class antlr.CharScanner
append, append, commit, consume, consumeUntil, consumeUntil, getCaseSensitive, getCaseSensitiveLiterals, getColumn, getCommitToPath, getFilename, getInputBuffer, getInputState, getLine, getTabSize, getText, getTokenObject, LA, makeToken, mark, match, match, match, matchNot, matchRange, newline, panic, panic, reportError, reportError, reportWarning, resetText, rewind, setCaseSensitive, setColumn, setCommitToPath, setFilename, setInputState, setLine, setTabSize, setText, setTokenObjectClass, testLiteralsTable, toLower, traceIn, traceIndent, traceOut, uponEOF

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - env
```
private Environment env
```
    Keeps the reference to the shared environment.
  - commentNesting
```
private int commentNesting
```
    Nesting level for comments.
  - brokenStringMatching
```
private boolean brokenStringMatching
```
    Control matched quote processing in strings.
  - _tokenSet_0
```
public static final antlr.collections.impl.BitSet _tokenSet_0
```
  - _tokenSet_1
```
public static final antlr.collections.impl.BitSet _tokenSet_1
```
  - _tokenSet_2
```
public static final antlr.collections.impl.BitSet _tokenSet_2
```
  - _tokenSet_3
```
public static final antlr.collections.impl.BitSet _tokenSet_3
```
  - _tokenSet_4
```
public static final antlr.collections.impl.BitSet _tokenSet_4
```
  - _tokenSet_5
```
public static final antlr.collections.impl.BitSet _tokenSet_5
```
- Constructor Detail
  - TextLexer
```
public TextLexer(Environment env)
```
    Constructor. Creates a lexer attached to the input stream taken from the environment. Saves the environment for future needs.
    
    Parameters:
    
    env - Shared preprocessor environment.
  - TextLexer
```
public TextLexer(java.io.InputStream in)
```
  - TextLexer
```
public TextLexer(java.io.Reader in)
```
  - TextLexer
```
public TextLexer(antlr.InputBuffer ib)
```
  - TextLexer
```
public TextLexer(antlr.LexerSharedInputState state)
```
- Method Detail
  - tab
```
public void tab()
```
    Expands tabs with spaces right in the ANTLRStringBuilder text. This version is used here because, according to experiments, tab expansion for the preprocessor variables is immediate.
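    The documentation does not say how many spaces replace a tab. The following is a minimal sketch that assumes padding to the next tab stop; the class TabExpansionSketch, the method expandTab and the use of a plain StringBuilder in place of ANTLRStringBuilder are all hypothetical and not part of P2J.
```java
// Hypothetical illustration only; not the P2J implementation.
final class TabExpansionSketch {
    /**
     * Appends spaces to the text until the current line reaches the
     * next multiple of tabSize (assumed tab-stop behavior).
     */
    static void expandTab(StringBuilder text, int tabSize) {
        // 0-based column of the insertion point within the current line
        int column = text.length() - (text.lastIndexOf("\n") + 1);
        int pad = tabSize - (column % tabSize);
        for (int i = 0; i < pad; i++) {
            text.append(' ');
        }
    }
}
```
    Under this assumption, expanding a tab after the two characters "ab" with a tab size of 8 appends six spaces, so the text ends at column 8.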
    
    Overrides:
    
    tab in class antlr.CharScanner
  - testLiteralsTable
```
public int testLiteralsTable(int ttype)
```
    Tests the token text against the literals table and provides correct token types for the abbreviated preprocessor statement keywords.
    
    Overrides:
    
    testLiteralsTable in class antlr.CharScanner
  - nextToken
```
public antlr.Token nextToken()
                      throws antlr.TokenStreamException
```
    Specified by:
    
    nextToken in interface antlr.TokenStream
    
    Throws:
    
    antlr.TokenStreamException
  - mWS
```
public final void mWS(boolean _createToken)
               throws antlr.RecognitionException,
                      antlr.CharStreamException,
                      antlr.TokenStreamException
```
    Matches any amount of contiguous whitespace (spaces and tabs) in a program. Newlines are NOT matched.
    This is a top level lexer rule which means that there is an associated WS token.
    
    Throws:
    
    antlr.RecognitionException
    
    antlr.CharStreamException
    
    antlr.TokenStreamException
  - mNL
```
public final void mNL(boolean _createToken)
               throws antlr.RecognitionException,
                      antlr.CharStreamException,
                      antlr.TokenStreamException
```
    Matches a single newline character.
    This is a top level lexer rule which means that there is an associated NL token.
    
    Throws:
    
    antlr.RecognitionException
    
    antlr.CharStreamException
    
    antlr.TokenStreamException
  - mSTRING
```
public final void mSTRING(boolean _createToken)
                   throws antlr.RecognitionException,
                          antlr.CharStreamException,
                          antlr.TokenStreamException
```
    Matches Progress language strings. Brackets the string with a pair of calls to Environment.setInString(boolean) so the preprocessor always knows the context.
    Depending on the -keeptildes preprocessor option, newlines may have to be deleted from strings. This requirement does not extend to characters that are coded as escape sequences, though. The final action in this rule postprocesses the string contents so they comply with the requirement.
    To tell the original newlines from the escaped ones, ClearStream uses the following technique. For every converted character that becomes a newline, an escape character is inserted in front of it in the stream. The postprocessing action is supposed to detect this escape character and discard it, but leave the newline that follows untouched. The original newlines come without escape characters and are deleted.
    For this technique to work, the escape character has to be one that normally cannot be seen in the stream. CR is such a character, because ClearStream strips off all original CRs and replaces them with NLs. The same problem still exists for converted CRs, but it is solved in a similar way. As a result, the escape character, CR, is inserted in front of every converted CR or NL. The valid escape sequences that ClearStream produces are CR CR and CR NL. In both cases, the first CR has to be removed from the stream and the second character left untouched.
    One more complication is that there is no on-time state-change signalling between the lexer and the ClearStream code. Although this rule signals the end of string by calling Environment.setInString(false), the call is one character late because of the k=2 lookahead in the lexer. Late signalling leaves a hole in the algorithm: an escape character can be wrongly inserted just outside the string, which should not happen. Consider the following stream:
```
    "..."~n~n
 
```
    Due to the late signalling, ClearStream reads and converts the ~n Progress escape sequence before this rule signals the end of string. As a result, ClearStream produces the following stream:
```
    "..." CR NL NL
 
```
    Notice that only the first NL gets escaped with CR; the second NL arrives after the end of string signal has been received. This issue is fixed by checking the next character in the lookahead buffer, LA(1), in the postprocessing action. If it is CR, it has to be discarded, as it can only be there as a result of this late signalling.
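    The CR escape convention described above can be sketched as follows. This is a minimal illustration, assuming the string contents are already available as a Java String; the class StringNewlineSketch and the method stripUnescapedNewlines are hypothetical names, and the LA(1) check for a stray CR after the closing quote is not modeled.
```java
// Hypothetical illustration only; not the P2J implementation.
final class StringNewlineSketch {
    /**
     * Drops the CR escape in front of a converted CR or NL and keeps the
     * character that follows it; deletes newlines that carry no escape.
     */
    static String stripUnescapedNewlines(String contents) {
        StringBuilder out = new StringBuilder(contents.length());
        for (int i = 0; i < contents.length(); i++) {
            char c = contents.charAt(i);
            if (c == '\r' && i + 1 < contents.length()) {
                char next = contents.charAt(i + 1);
                if (next == '\r' || next == '\n') {
                    out.append(next); // keep the escaped character
                    i++;              // skip it; the escape CR is discarded
                    continue;
                }
            }
            if (c == '\n') {
                continue;             // original newline: delete
            }
            out.append(c);            // anything else is copied unchanged
        }
        return out.toString();
    }
}
```
    In this sketch the sequence CR NL collapses to a single NL, CR CR collapses to a single CR, and a bare NL disappears.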
    This is a top level lexer rule which means that there is an associated STRING token.
    If there is an escaped null character in the string literal, it must be "space-ified" by converting it and all following characters into spaces (including the proper reduction of other following escape sequences into a single space character). One can encode a null escape sequence in a string literal in Progress BUT a null character will NEVER end up in the result! This method duplicates that processing.
    Throws:
    
    antlr.RecognitionException
    
    antlr.CharStreamException
    
    antlr.TokenStreamException
  - mASTRING
```
protected final void mASTRING(boolean _createToken)
                       throws antlr.RecognitionException,
                              antlr.CharStreamException,
                              antlr.TokenStreamException
```
    Matches an opening single quote, arbitrary contents and an ending single quote. Two contiguous single quote characters and an escape prefixed single quote character are accepted as contents (they do not terminate the string).
    Any newlines inside the string are identified and the lexer's internal newline counter is properly maintained. Note that all such newlines are maintained in the output string because these are the escaped chars that have been left behind by the preprocessor. If such characters are not escaped, then the preprocessor removes such chars (carriage returns and line feeds). This is how Progress 4GL handles these characters. This means that in a raw source file that has not been preprocessed, a string literal can be split across any number of lines and the Progress preprocessor will put the string back together, ignoring the carriage returns and newlines.
    Tilde and the backslash can BOTH be Progress escape characters and as such, they need special attention. Backslash is only honored if UNIX escapes mode is set on. In particular, this rule must separately match 3 or 6 constructs as part of a string depending on which escape sequences are honored.
```
    ~'
    \'
    ~~
    ~\
    \\
    \~
    ~
    \
 
```
    The first 2 are a way of embedding a quote in a string. The next 4 are important because if this rule were to encounter '~~', '\\', '~\' or '\~' (which are valid strings) the rule would consume the first escape and then encounter an escaped quote which would not terminate the string properly, leading to a non-ending string. So this rule matches on any escaped escape char to eliminate this situation. Finally, a single ~ or \ that is not followed by a ' or a duplicate escape char is matched as a single character. This is required (one may think that the closure rule should handle this case) because when ANTLR sees a ~ or \ in the leftmost position of the alternatives, it DOES NOT include it in the list of 'everything that is not a closing quote'. Likewise \n is not included. Of course, these tilde constructions must be placed in a specific order and if so, the ambiguity warnings that ANTLR reports can be disabled.
    The greedy option does not need to be disabled here as the closure rule termination is built into the subrule itself: it accepts anything that isn't an unescaped single quote character (see above).
    Tabs and spaces are maintained inside strings.
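    The matching order described above can be approximated with a hand-written scanner. This is a rough sketch rather than the generated ANTLR code; the class QuotedStringSketch, the method scanString and the unixEscapes parameter are hypothetical names. The quote character is a parameter, so the same logic applies to double-quoted strings.
```java
// Hypothetical illustration only; not the generated lexer code.
final class QuotedStringSketch {
    /**
     * Scans a quoted string that begins at start. Returns the index just
     * past the closing quote, or -1 if the string is not terminated.
     */
    static int scanString(String in, int start, char quote, boolean unixEscapes) {
        if (start >= in.length() || in.charAt(start) != quote) {
            return -1;
        }
        int i = start + 1;
        while (i < in.length()) {
            char c = in.charAt(i);
            if (isEscape(c, unixEscapes)) {
                if (i + 1 < in.length()) {
                    char next = in.charAt(i + 1);
                    // escaped quote or escaped escape: consume both characters
                    if (next == quote || isEscape(next, unixEscapes)) {
                        i += 2;
                        continue;
                    }
                }
                i++; // lone escape character, matched as a single character
            } else if (c == quote) {
                if (i + 1 < in.length() && in.charAt(i + 1) == quote) {
                    i += 2; // two contiguous quotes are contents
                    continue;
                }
                return i + 1; // unescaped quote terminates the string
            } else {
                i++;
            }
        }
        return -1;
    }

    // Backslash counts as an escape only when UNIX escapes mode is on.
    private static boolean isEscape(char c, boolean unixEscapes) {
        return c == '~' || (unixEscapes && c == '\\');
    }
}
```
    Because an escaped escape such as ~~ is consumed as a pair, the quote that follows it still terminates the string, which is the situation the ordering of alternatives is meant to protect.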
    Throws:
    
    antlr.RecognitionException
    
    antlr.CharStreamException
    
    antlr.TokenStreamException
  - mQSTRING
```
protected final void mQSTRING(boolean _createToken)
                       throws antlr.RecognitionException,
                              antlr.CharStreamException,
                              antlr.TokenStreamException
```
    Matches an opening double quote, arbitrary contents and an ending double quote. Two contiguous double quote characters and an escape prefixed double quote character are accepted as contents (they do not terminate the string).
    Any newlines inside the string are identified and the lexer's internal newline counter is properly maintained. Note that all such newlines are maintained in the output string because these are the escaped chars that have been left behind by the preprocessor. If such characters are not escaped, then the preprocessor removes such chars (carriage returns and line feeds). This is how Progress 4GL handles these characters. This means that in a raw source file that has not been preprocessed, a string literal can be split across any number of lines and the Progress preprocessor will put the string back together, ignoring the carriage returns and newlines.
    Tilde and the backslash can BOTH be Progress escape characters and as such, they need special attention. Backslash is only honored if UNIX escapes mode is set on. In particular, this rule must separately match 3 or 6 constructs as part of a string depending on which escape sequences are honored.
```
    ~"
    \"
    ~~
    ~\
    \\
    \~
    ~
    \
 
```
    The first 2 are a way of embedding a quote in a string. The next 4 are important because if this rule were to encounter '~~', '\\', '~\' or '\~' (which are valid strings) the rule would consume the first escape and then encounter an escaped quote which would not terminate the string properly, leading to a non-ending string. So this rule matches on any escaped escape char to eliminate this situation. Finally, a single ~ or \ that is not followed by a " or a duplicate escape char is matched as a single character. This is required (one may think that the closure rule should handle this case) because when ANTLR sees a ~ or \ in the leftmost position of the alternatives, it DOES NOT include it in the list of 'everything that is not a closing quote'. Likewise \n is not included. Of course, these tilde constructions must be placed in a specific order and if so, the ambiguity warnings that ANTLR reports can be disabled.
    The greedy option does not need to be disabled here as the closure rule termination is built into the subrule itself: it accepts anything that isn't an unescaped double quote character (see above).
    Tabs and spaces are maintained inside strings.
    Throws:
    
    antlr.RecognitionException
    
    antlr.CharStreamException
    
    antlr.TokenStreamException
  - mASTMT
```
public final void mASTMT(boolean _createToken)
                  throws antlr.RecognitionException,
                         antlr.CharStreamException,
                         antlr.TokenStreamException
```
    Matches any ampersand prefaced symbolic name including all recognized or unrecognized preprocessor directives. Such a statement starts with an '&' followed by letters and possibly the hyphen character.
    The resulting token's text will be compared with the literals table by the testLiteralsTable method of this class. The matching is done case-insensitively and a subset of the symbols can be matched with an abbreviated form. The token type replacement occurs as follows:
```
 Token Type   Matched Text        Minimum Abbreviation Chars
 -----------  ------------------  --------------------------
 AGLOBAL      &global-define      5
 ASCOPED      &scoped-define      5
 AMESSAGE     &message            n/a
 AUNDEFINE    &undefine           6
 AIF          &if                 n/a
 ATHEN        &then               n/a
 AELSEIF      &elseif             n/a
 AELSE        &else               n/a
 AENDIF       &endif              n/a
 ASUSPEND     &analyze-suspend    n/a
 ARESUME      &analyze-resume     n/a
 
```
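    The table above can be read as a lookup with a minimum prefix length. The sketch below assumes that the minimum abbreviation count includes the leading ampersand (so &glob is the shortest form of &global-define); that reading, the class DirectiveAbbrevSketch and the method classify are assumptions for illustration. Unrecognized text is reported as ASTMT here for simplicity.
```java
import java.util.Locale;

// Hypothetical illustration only; not the P2J testLiteralsTable override.
final class DirectiveAbbrevSketch {
    // token type, full text, minimum abbreviation chars (0 = no abbreviation)
    private static final String[][] TABLE = {
        {"AGLOBAL",   "&global-define",   "5"},
        {"ASCOPED",   "&scoped-define",   "5"},
        {"AMESSAGE",  "&message",         "0"},
        {"AUNDEFINE", "&undefine",        "6"},
        {"AIF",       "&if",              "0"},
        {"ATHEN",     "&then",            "0"},
        {"AELSEIF",   "&elseif",          "0"},
        {"AELSE",     "&else",            "0"},
        {"AENDIF",    "&endif",           "0"},
        {"ASUSPEND",  "&analyze-suspend", "0"},
        {"ARESUME",   "&analyze-resume",  "0"},
    };

    /** Returns the token type name for the given token text. */
    static String classify(String tokenText) {
        String text = tokenText.toLowerCase(Locale.ROOT); // case-insensitive match
        for (String[] row : TABLE) {
            String full = row[1];
            int min = Integer.parseInt(row[2]);
            if (text.equals(full)) {
                return row[0];
            }
            // abbreviated form: a prefix of the full text, at least min chars long
            if (min > 0 && text.length() >= min && full.startsWith(text)) {
                return row[0];
            }
        }
        return "ASTMT";
    }
}
```
    With this reading, &GLOB and &Scoped-Define are both recognized, while &glo is too short and &mess is rejected because &message has no abbreviated form.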
    This is a top level lexer rule which means that in the case where the statement is unrecognized, there will be an associated ASTMT token.
    If the leading ampersand is not followed by a recognized statement, then the ampersand and any following alphabetic characters will be returned as a CODE token. It is also possible to return the ampersand by itself as a CODE token, if there are no following alphabetic characters. Both of these combinations are possible in certain 4GL parsing cases, such as shell commands, where the ampersand character is perfectly valid. Another example is that ampersands can appear inside a variable name like my-p&l-var. The 4GL is just insane.
    Throws:
    
    antlr.RecognitionException
    
    antlr.CharStreamException
    
    antlr.TokenStreamException
  - mCODE
```
public final void mCODE(boolean _createToken)
                 throws antlr.RecognitionException,
                        antlr.CharStreamException,
                        antlr.TokenStreamException
```
    Matches a preprocessor input that has no other interpretation. Combines into one token as many characters as possible without breaking other rules. Breaks at &, /, ", ', * and whitespace to avoid lexical non-determinism with other top-level rules.
    This is a top level lexer rule which means that there is an associated CODE token.
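    The break characters listed above can be illustrated with a small scanner. This is a sketch under the assumption that whitespace means space, tab, CR and NL; the class CodeRunSketch and the method scanCode are hypothetical names, and the handling of a break character that no other rule claims is not modeled.
```java
// Hypothetical illustration only; not the generated lexer code.
final class CodeRunSketch {
    /**
     * Returns the index just past the longest run of characters starting
     * at start that contains none of the break characters.
     */
    static int scanCode(String in, int start) {
        int i = start;
        while (i < in.length() && !isBreak(in.charAt(i))) {
            i++;
        }
        return i;
    }

    // Characters at which a CODE run stops so other top-level rules can match.
    private static boolean isBreak(char c) {
        switch (c) {
            case '&': case '/': case '"': case '\'': case '*':
            case ' ': case '\t': case '\r': case '\n':
                return true;
            default:
                return false;
        }
    }
}
```
    For the input x/*c*/ the run stops after x, which leaves the comment opener available to the comment rule.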
    
    Throws:
    
    antlr.RecognitionException
    
    antlr.CharStreamException
    
    antlr.TokenStreamException
  - mCOMMENT
```
protected final void mCOMMENT(boolean _createToken)
                       throws antlr.RecognitionException,
                              antlr.CharStreamException,
                              antlr.TokenStreamException
```
    Matches Progress language comments, possibly nested. Brackets the comments with a pair of calls to Environment.setInComment(boolean) so the preprocessor always knows the context.
    This is a top level lexer rule which means that there is an associated COMMENT token.
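    Nested comment matching can be sketched with a counter similar in spirit to the commentNesting field. The sketch assumes the usual Progress 4GL delimiters /* and */; the class NestedCommentSketch and the method scanComment are hypothetical names, and the calls to Environment.setInComment(boolean) are omitted.
```java
// Hypothetical illustration only; not the generated lexer code.
final class NestedCommentSketch {
    /**
     * Scans a possibly nested comment that begins at start. Returns the
     * index just past the matching closing sequence, or -1 if the input
     * does not start with a comment or the comment is not closed.
     */
    static int scanComment(String in, int start) {
        if (!in.startsWith("/*", start)) {
            return -1;
        }
        int nesting = 0;
        int i = start;
        while (i < in.length()) {
            if (in.startsWith("/*", i)) {
                nesting++;          // opening sequence: one level deeper
                i += 2;
            } else if (in.startsWith("*/", i)) {
                nesting--;          // closing sequence: one level up
                i += 2;
                if (nesting == 0) {
                    return i;       // outermost comment is closed
                }
            } else {
                i++;
            }
        }
        return -1;
    }
}
```
    An inner closing sequence only lowers the counter, so the outer comment continues until the counter returns to zero.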
    
    Throws:
    
    antlr.RecognitionException
    
    antlr.CharStreamException
    
    antlr.TokenStreamException
  - mCOMM_OPEN
```
protected final void mCOMM_OPEN(boolean _createToken)
                         throws antlr.RecognitionException,
                                antlr.CharStreamException,
                                antlr.TokenStreamException
```
    Matches the opening sequence for Progress 4GL comments.
    
    Throws:
    
    antlr.RecognitionException
    
    antlr.CharStreamException
    
    antlr.TokenStreamException
  - mCOMM_CLOSE
```
protected final void mCOMM_CLOSE(boolean _createToken)
                          throws antlr.RecognitionException,
                                 antlr.CharStreamException,
                                 antlr.TokenStreamException
```
    Matches the closing sequence for Progress 4GL comments.
    
    Throws:
    
    antlr.RecognitionException
    
    antlr.CharStreamException
    
    antlr.TokenStreamException
  - mk_tokenSet_0
```
private static final long[] mk_tokenSet_0()
```
  - mk_tokenSet_1
```
private static final long[] mk_tokenSet_1()
```
  - mk_tokenSet_2
```
private static final long[] mk_tokenSet_2()
```
  - mk_tokenSet_3
```
private static final long[] mk_tokenSet_3()
```
  - mk_tokenSet_4
```
private static final long[] mk_tokenSet_4()
```
  - mk_tokenSet_5
```
private static final long[] mk_tokenSet_5()
```
