public class TextLexer extends antlr.CharScanner implements PreprocTokenTypes, antlr.TokenStream
The lexer recognizes the current context such as comments, strings or code and keeps track of the context switches using the Environment.setInComment(boolean) and Environment.setInString(boolean) methods.
See Also: TextParser, Environment
Modifier and Type | Field and Description |
---|---|
static antlr.collections.impl.BitSet | _tokenSet_0 |
static antlr.collections.impl.BitSet | _tokenSet_1 |
static antlr.collections.impl.BitSet | _tokenSet_2 |
static antlr.collections.impl.BitSet | _tokenSet_3 |
static antlr.collections.impl.BitSet | _tokenSet_4 |
static antlr.collections.impl.BitSet | _tokenSet_5 |
private boolean | brokenStringMatching: Control matched quote processing in strings. |
private int | commentNesting: Nesting level for comments. |
private Environment | env: Keeps the reference to the shared environment. |
_returnToken, caseSensitive, caseSensitiveLiterals, commitToPath, EOF_CHAR, hashString, inputState, literals, saveConsumedInput, tabsize, text, tokenObjectClass, traceDepth
AELSE, AELSEIF, AENDIF, AGLOBAL, AIF, ALPHA, AMESSAGE, AMPER, APOST, ARESUME, ASCOPED, ASTMT, ASTRING, ASUSPEND, ATHEN, AUNDEFINE, CODE, COMM_CLOSE, COMM_OPEN, COMMENT, DIGIT, DIGITS, EOF, EQUALS, LBRACE, NL, NULL_TREE_LOOKAHEAD, PPNAME, QSTRING, QUOTE, RBRACE, SPECIAL, STAR, STRING, TAB, WS, XAPOST, XQUOTE
Constructor and Description |
---|
TextLexer(Environment env): Constructor. |
TextLexer(antlr.InputBuffer ib) |
TextLexer(java.io.InputStream in) |
TextLexer(antlr.LexerSharedInputState state) |
TextLexer(java.io.Reader in) |
Modifier and Type | Method and Description |
---|---|
void | mASTMT(boolean _createToken): Matches any ampersand prefaced symbolic name including all recognized or unrecognized preprocessor directives. |
protected void | mASTRING(boolean _createToken): Matches an opening single quote, arbitrary contents and an ending single quote. |
void | mCODE(boolean _createToken): Matches a preprocessor input that has no other interpretation. |
protected void | mCOMM_CLOSE(boolean _createToken): Matches the closing sequence for Progress 4GL comments. |
protected void | mCOMM_OPEN(boolean _createToken): Matches the opening sequence for Progress 4GL comments. |
protected void | mCOMMENT(boolean _createToken): Matches Progress language comments, possibly nested. |
private static long[] | mk_tokenSet_0() |
private static long[] | mk_tokenSet_1() |
private static long[] | mk_tokenSet_2() |
private static long[] | mk_tokenSet_3() |
private static long[] | mk_tokenSet_4() |
private static long[] | mk_tokenSet_5() |
void | mNL(boolean _createToken): Matches a single newline character. |
protected void | mQSTRING(boolean _createToken): Matches an opening double quote, arbitrary contents and an ending double quote. |
void | mSTRING(boolean _createToken): Matches Progress language strings. |
void | mWS(boolean _createToken): Matches any amount of contiguous whitespace (spaces and tabs) in a program. |
antlr.Token | nextToken() |
void | tab(): Expands tabs with spaces right in the ANTLRStringBuilder text. |
int | testLiteralsTable(int ttype): Tests the token text against the literals table and provides correct token types for the abbreviated preprocessor statement keywords. |
append, append, commit, consume, consumeUntil, consumeUntil, getCaseSensitive, getCaseSensitiveLiterals, getColumn, getCommitToPath, getFilename, getInputBuffer, getInputState, getLine, getTabSize, getText, getTokenObject, LA, makeToken, mark, match, match, match, matchNot, matchRange, newline, panic, panic, reportError, reportError, reportWarning, resetText, rewind, setCaseSensitive, setColumn, setCommitToPath, setFilename, setInputState, setLine, setTabSize, setText, setTokenObjectClass, testLiteralsTable, toLower, traceIn, traceIndent, traceOut, uponEOF
private Environment env
private int commentNesting
private boolean brokenStringMatching
public static final antlr.collections.impl.BitSet _tokenSet_0
public static final antlr.collections.impl.BitSet _tokenSet_1
public static final antlr.collections.impl.BitSet _tokenSet_2
public static final antlr.collections.impl.BitSet _tokenSet_3
public static final antlr.collections.impl.BitSet _tokenSet_4
public static final antlr.collections.impl.BitSet _tokenSet_5
public TextLexer(Environment env)
env - Shared preprocessor environment.
public TextLexer(java.io.InputStream in)
public TextLexer(java.io.Reader in)
public TextLexer(antlr.InputBuffer ib)
public TextLexer(antlr.LexerSharedInputState state)
public void tab()
Overrides: tab in class antlr.CharScanner
public int testLiteralsTable(int ttype)
Overrides: testLiteralsTable in class antlr.CharScanner
public antlr.Token nextToken() throws antlr.TokenStreamException
Specified by: nextToken in interface antlr.TokenStream
antlr.TokenStreamException
public final void mWS(boolean _createToken) throws antlr.RecognitionException, antlr.CharStreamException, antlr.TokenStreamException
This is a top level lexer rule which means that there is an associated WS token.
antlr.RecognitionException
antlr.CharStreamException
antlr.TokenStreamException
public final void mNL(boolean _createToken) throws antlr.RecognitionException, antlr.CharStreamException, antlr.TokenStreamException
This is a top level lexer rule which means that there is an associated NL token.
antlr.RecognitionException
antlr.CharStreamException
antlr.TokenStreamException
public final void mSTRING(boolean _createToken) throws antlr.RecognitionException, antlr.CharStreamException, antlr.TokenStreamException
Matches Progress language strings. Context switches are signalled through Environment.setInString(boolean) so the preprocessor always knows the context.
Depending on the -keeptildes preprocessor option, newlines may have to be deleted from strings. This requirement does not extend to characters that are coded as escape sequences, though. The final action in this rule postprocesses the string contents so that they comply with the requirement.
To tell the original newlines from the escaped ones, ClearStream uses the following technique. For every converted character that becomes a newline, an escape character is inserted in front of it in the stream. The postprocessing action detects this escape character and discards it, but leaves the newline that follows untouched. The original newlines come without escape characters and are deleted.
For this technique to work, the escape character has to be one that cannot normally appear in the stream. CR is such a character, because ClearStream strips off all original CRs and replaces them with NLs. The same problem arises for converted CRs, however, and it is solved in a similar way. As a result, the escape character, CR, is inserted in front of every converted CR or NL. The valid escape sequences that ClearStream produces are CR CR and CR NL. In both cases, the first CR has to be removed from the stream and the second character left untouched.
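The escape technique described above can be sketched in a few lines. This is only an illustration, not the actual postprocessing action of the rule: the names StringNewlineCleanup and postprocess are hypothetical, and the sketch assumes that newline deletion is in effect and that the string body is already available as a Java String.

```java
final class StringNewlineCleanup {
    private static final char CR = '\r';
    private static final char NL = '\n';

    /**
     * Hypothetical helper: deletes original (unescaped) newlines and
     * unwraps the CR CR and CR NL escape pairs.
     */
    static String postprocess(String body) {
        StringBuilder out = new StringBuilder(body.length());
        int i = 0;
        while (i < body.length()) {
            char c = body.charAt(i);
            boolean hasNext = i + 1 < body.length();
            if (c == CR && hasNext
                    && (body.charAt(i + 1) == CR || body.charAt(i + 1) == NL)) {
                // Escape pair: discard the leading CR, keep the second character.
                out.append(body.charAt(i + 1));
                i += 2;
            } else if (c == NL) {
                // Newline without an escape in front of it: an original newline, delete it.
                i++;
            } else {
                // Any other character is copied unchanged.
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }
}
```

For example, a body of `a NL b CR NL c CR CR d` would come out as `a b NL c CR d` (shown with spaces only for readability): the bare NL is dropped and each escape pair collapses to its second character.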
One more complication is that there is no timely state change signalling between the lexer and the ClearStream code. Although this rule signals the end of string by calling inString(false), the call is one character late because of the k=2 lookahead in the lexer. Late signalling leaves a hole in the algorithm: an escape character can be wrongly inserted just outside the string, which should not happen. Consider the following stream:
    "..."~n~n
Due to the late signalling, ClearStream reads and converts the ~n Progress escape sequence before this rule signals the end of string. As a result, ClearStream produces the following stream:
    "..." CR NL NL
Notice that only the first NL gets escaped with CR; the second NL arrives after the end of string signal has been received. This issue is fixed by checking the next character in the lookahead buffer, LA(1), in the postprocessing action. If it is CR, it has to be discarded, as it can only be there as the result of this late signalling.
This is a top level lexer rule which means that there is an associated STRING token.
If there is an escaped null character in the string literal, it must be "space-ified" by converting it and all following characters into spaces (including the proper reduction of other following escape sequences into a single space character). One can encode a null escape sequence in a string literal in Progress BUT a null character will NEVER end up in the result! This method duplicates that processing.
antlr.RecognitionException
antlr.CharStreamException
antlr.TokenStreamException
protected final void mASTRING(boolean _createToken) throws antlr.RecognitionException, antlr.CharStreamException, antlr.TokenStreamException
Any newlines inside the string are identified and the lexer's internal newline counter is properly maintained. Note that all such newlines are maintained in the output string because these are the escaped chars that have been left behind by the preprocessor. If such characters are not escaped, then the preprocessor removes such chars (carriage returns and line feeds). This is how Progress 4GL handles these characters. This means that in a raw source file that has not been preprocessed, a string literal can be split across any number of lines and the Progress preprocessor will put the string back together, ignoring the carriage returns and newlines.
Tilde and the backslash can BOTH be Progress escape characters and as such, they need special attention. Backslash is only honored if UNIX escapes mode is set on. In particular, this rule must separately match 3 or 6 constructs as part of a string depending on which escape sequences are honored.
~' \' ~~ ~\ \\ \~ ~ \
The first 2 are a way of embedding a quote in a string. The next 4 are important because if this rule were to encounter '~~', '\\', '~\' or '\~' (which are valid strings) the rule would consume the first escape and then encounter an escaped quote which would not terminate the string properly, leading to a non-ending string. So this rule matches any escaped escape char to eliminate this situation. Finally, a single ~ or \ that is not followed by a ' or a duplicate escape char is matched as a single character. This is required (although one might think that the closure rule should handle this case) because when ANTLR sees a ~ or \ in the leftmost position of the alternatives, it DOES NOT include it in the list of 'everything that is not a closing quote'. Likewise \n is not included. These tilde constructions must be placed in a specific order; once they are, the ambiguity warnings that ANTLR reports can be disabled.
The greedy option does not need to be disabled here as the closure rule termination is built into the subrule itself: it accepts anything that isn't an unescaped single quote character (see above).
Tabs and spaces are maintained inside strings.
antlr.RecognitionException
antlr.CharStreamException
antlr.TokenStreamException
protected final void mQSTRING(boolean _createToken) throws antlr.RecognitionException, antlr.CharStreamException, antlr.TokenStreamException
Any newlines inside the string are identified and the lexer's internal newline counter is properly maintained. Note that all such newlines are maintained in the output string because these are the escaped chars that have been left behind by the preprocessor. If such characters are not escaped, then the preprocessor removes such chars (carriage returns and line feeds). This is how Progress 4GL handles these characters. This means that in a raw source file that has not been preprocessed, a string literal can be split across any number of lines and the Progress preprocessor will put the string back together, ignoring the carriage returns and newlines.
Tilde and the backslash can BOTH be Progress escape characters and as such, they need special attention. Backslash is only honored if UNIX escapes mode is set on. In particular, this rule must separately match 3 or 6 constructs as part of a string depending on which escape sequences are honored.
~" \" ~~ ~\ \\ \~ ~ \
The first 2 are a way of embedding a quote in a string. The next 4 are important because if this rule were to encounter '~~', '\\', '~\' or '\~' (which are valid strings) the rule would consume the first escape and then encounter an escaped quote which would not terminate the string properly, leading to a non-ending string. So this rule matches any escaped escape char to eliminate this situation. Finally, a single ~ or \ that is not followed by a " or a duplicate escape char is matched as a single character. This is required (although one might think that the closure rule should handle this case) because when ANTLR sees a ~ or \ in the leftmost position of the alternatives, it DOES NOT include it in the list of 'everything that is not a closing quote'. Likewise \n is not included. These tilde constructions must be placed in a specific order; once they are, the ambiguity warnings that ANTLR reports can be disabled.
The greedy option does not need to be disabled here as the closure rule termination is built into the subrule itself: it accepts anything that isn't an unescaped double quote character (see above).
Tabs and spaces are maintained inside strings.
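A hand-written scan may help show why the escape pairs have to be matched as units. The sketch below is not the ANTLR-generated rule: QuotedStringScanner, endOfString and the unixEscapes flag are hypothetical names, and the quote character is a parameter so that the same logic covers both single and double quoted strings.

```java
final class QuotedStringScanner {
    /**
     * Hypothetical helper: returns the index just past the closing quote of
     * the string whose opening quote is at {@code start}, or -1 if the
     * string never terminates.
     */
    static int endOfString(String src, int start, char quote, boolean unixEscapes) {
        int i = start + 1; // skip the opening quote
        while (i < src.length()) {
            char c = src.charAt(i);
            boolean isEscape = c == '~' || (unixEscapes && c == '\\');
            if (isEscape && i + 1 < src.length()) {
                char next = src.charAt(i + 1);
                boolean nextIsEscape = next == '~' || (unixEscapes && next == '\\');
                if (next == quote || nextIsEscape) {
                    // Escaped quote or escaped escape char: consume both as one unit.
                    i += 2;
                } else {
                    // Lone escape char: matched as a single character.
                    i++;
                }
                continue;
            }
            if (c == quote) {
                return i + 1; // unescaped quote terminates the string
            }
            i++;
        }
        return -1;
    }
}
```

Without the "escaped escape" branch, a literal such as "a~~" would be scanned as a lone tilde followed by an escaped quote, and the string would never end.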
antlr.RecognitionException
antlr.CharStreamException
antlr.TokenStreamException
public final void mASTMT(boolean _createToken) throws antlr.RecognitionException, antlr.CharStreamException, antlr.TokenStreamException
Matches any ampersand prefaced symbolic name, including all recognized or unrecognized preprocessor directives: '&' followed by letters and possibly the hyphen character.
The resulting token's text will be compared with the literals table by the testLiteralsTable method of this class. The matching is done case-insensitively and a subset of the symbols can be matched with an abbreviated form. The token type replacement occurs as follows:
Token Type | Matched Text | Minimum Abbreviation Chars |
---|---|---|
AGLOBAL | &global-define | 5 |
ASCOPED | &scoped-define | 5 |
AMESSAGE | &message | n/a |
AUNDEFINE | &undefine | 6 |
AIF | &if | n/a |
ATHEN | &then | n/a |
AELSEIF | &elseif | n/a |
AELSE | &else | n/a |
AENDIF | &endif | n/a |
ASUSPEND | &analyze-suspend | n/a |
ARESUME | &analyze-resume | n/a |
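The abbreviation rule in the table can be illustrated with a small lookup. This is a sketch under assumptions, not the real testLiteralsTable: DirectiveMatcher and tokenTypeFor are hypothetical names, token types are shown as strings rather than the integer constants from PreprocTokenTypes, and the minimum length is assumed to count the leading ampersand.

```java
import java.util.Locale;
import java.util.Optional;

final class DirectiveMatcher {
    private static final String[] TYPES = {
        "AGLOBAL", "ASCOPED", "AMESSAGE", "AUNDEFINE", "AIF", "ATHEN",
        "AELSEIF", "AELSE", "AENDIF", "ASUSPEND", "ARESUME"
    };
    private static final String[] TEXTS = {
        "&global-define", "&scoped-define", "&message", "&undefine", "&if", "&then",
        "&elseif", "&else", "&endif", "&analyze-suspend", "&analyze-resume"
    };
    // Minimum abbreviation length; 0 stands for "n/a" (no abbreviation allowed).
    private static final int[] MIN = {5, 5, 0, 6, 0, 0, 0, 0, 0, 0, 0};

    /** Hypothetical helper: resolves a directive's text to a token type name. */
    static Optional<String> tokenTypeFor(String text) {
        String t = text.toLowerCase(Locale.ROOT); // matching is case-insensitive
        for (int i = 0; i < TEXTS.length; i++) {
            boolean exact = TEXTS[i].equals(t);
            boolean abbreviated = MIN[i] > 0
                    && t.length() >= MIN[i]
                    && TEXTS[i].startsWith(t);
            if (exact || abbreviated) {
                return Optional.of(TYPES[i]);
            }
        }
        return Optional.empty(); // unrecognized: the caller decides what to emit
    }
}
```

Under these assumptions, `&GLOB` and `&Scoped-Def` resolve to AGLOBAL and ASCOPED, while `&glo` is too short and is left unrecognized.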
This is a top level lexer rule which means that in the case where the statement is unrecognized, there will be an associated ASTMT token.
If the leading ampersand is not followed by a recognized statement, then the ampersand and any following alphabetic characters will be returned as a CODE token. It is also possible to return the ampersand by itself as a CODE token, if there are no following alphabetic characters. Both of these combinations are possible in certain 4GL parsing cases. In shell commands, for example, the ampersand character is perfectly valid. Another example is that ampersands can appear inside a variable name like my-p&l-var. The 4GL is just insane.
antlr.RecognitionException
antlr.CharStreamException
antlr.TokenStreamException
public final void mCODE(boolean _createToken) throws antlr.RecognitionException, antlr.CharStreamException, antlr.TokenStreamException
This is a top level lexer rule which means that there is an associated CODE token.
antlr.RecognitionException
antlr.CharStreamException
antlr.TokenStreamException
protected final void mCOMMENT(boolean _createToken) throws antlr.RecognitionException, antlr.CharStreamException, antlr.TokenStreamException
Matches Progress language comments, possibly nested. Context switches are signalled through Environment.setInComment(boolean) so the preprocessor always knows the context.
This is a top level lexer rule which means that there is an associated COMMENT token.
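Nested comment matching comes down to keeping a nesting counter, much like the commentNesting field. The following is a minimal sketch rather than the generated rule: NestedCommentScanner and endOfComment are hypothetical names, the /* and */ delimiters are assumed, and the calls to Environment.setInComment(boolean) that the real rule makes are left out.

```java
final class NestedCommentScanner {
    /**
     * Hypothetical helper: returns the index just past the close that matches
     * the comment opening at {@code start}, or -1 if there is no comment at
     * that position or the comment is never closed.
     */
    static int endOfComment(String src, int start) {
        if (!src.startsWith("/*", start)) {
            return -1;
        }
        int nesting = 0;
        int i = start;
        while (i + 1 < src.length()) {
            char c = src.charAt(i);
            char next = src.charAt(i + 1);
            if (c == '/' && next == '*') {
                nesting++;          // comment open: one level deeper
                i += 2;
            } else if (c == '*' && next == '/') {
                nesting--;          // comment close: one level up
                i += 2;
                if (nesting == 0) {
                    return i;       // outermost comment is closed
                }
            } else {
                i++;
            }
        }
        return -1;
    }
}
```

Only the close that brings the counter back to zero ends the token, so an inner comment does not terminate the outer one.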
antlr.RecognitionException
antlr.CharStreamException
antlr.TokenStreamException
protected final void mCOMM_OPEN(boolean _createToken) throws antlr.RecognitionException, antlr.CharStreamException, antlr.TokenStreamException
antlr.RecognitionException
antlr.CharStreamException
antlr.TokenStreamException
protected final void mCOMM_CLOSE(boolean _createToken) throws antlr.RecognitionException, antlr.CharStreamException, antlr.TokenStreamException
antlr.RecognitionException
antlr.CharStreamException
antlr.TokenStreamException
private static final long[] mk_tokenSet_0()
private static final long[] mk_tokenSet_1()
private static final long[] mk_tokenSet_2()
private static final long[] mk_tokenSet_3()
private static final long[] mk_tokenSet_4()
private static final long[] mk_tokenSet_5()