public final class Contains
extends java.lang.Object
Construct | Precedence | Description |
---|---|---|
(...) | 3 | Parentheses to change precedence of a subexpression operation. |
word | 3 | A full word to match. |
prefix | 3 | A prefix of a word to match, followed by an asterisk (*), indicating a wildcard. The asterisk may only be at the end of the word. |
& | 2 | Logical AND operator. |
\| or ! or ^ | 1 | Logical OR operator. |
Subexpressions may be nested arbitrarily deeply. An empty expression is silently tolerated, but does not produce a match.
Match tests are performed by constructing an instance of this class and invoking the evaluate(String) method against it. The match expression is parsed into an AST, and the AST is walked to evaluate the expression for the data passed to the evaluate method.
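The parse-then-evaluate flow described above can be illustrated with a minimal, self-contained sketch. It is not the actual Contains implementation: the class name ContainsSketch, the use of plain strings as tokens (rather than LogicalExpressionConverter.EToken), and the whitespace-based tokenizing are all assumptions made for illustration. The sketch converts an infix expression to RPN with the precedences listed in the table (AND = 2, OR = 1) and then evaluates the RPN against a set of words.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative sketch only; hypothetical names, not the real Contains class. */
final class ContainsSketch {

    /** Precedence per the table: AND (2) binds tighter than OR (1). */
    private static int precedence(char op) {
        return op == '&' ? 2 : 1;
    }

    private static boolean isOperator(char c) {
        return c == '&' || c == '|' || c == '!' || c == '^';
    }

    /** Convert an infix match expression to RPN tokens (shunting-yard). */
    static List<String> toRpn(String expr) {
        List<String> out = new ArrayList<>();
        Deque<Character> ops = new ArrayDeque<>();
        StringBuilder word = new StringBuilder();
        // The appended space guarantees that the last word is flushed.
        for (char c : (expr + " ").toCharArray()) {
            if (!isOperator(c) && c != '(' && c != ')' && !Character.isWhitespace(c)) {
                word.append(c);              // part of a word or prefix*
                continue;
            }
            if (word.length() > 0) {         // a word just ended
                out.add(word.toString());
                word.setLength(0);
            }
            if (c == '(') {
                ops.push(c);
            } else if (c == ')') {
                while (!ops.isEmpty() && ops.peek() != '(') {
                    out.add(String.valueOf(ops.pop()));
                }
                if (ops.isEmpty()) {
                    throw new IllegalArgumentException("Unbalanced ')'");
                }
                ops.pop();                   // discard the matching '('
            } else if (isOperator(c)) {
                // Left-associative: pop operators of equal or higher precedence.
                while (!ops.isEmpty() && ops.peek() != '('
                        && precedence(ops.peek()) >= precedence(c)) {
                    out.add(String.valueOf(ops.pop()));
                }
                ops.push(c);
            }
        }
        while (!ops.isEmpty()) {
            char op = ops.pop();
            if (op == '(') {
                throw new IllegalArgumentException("Unbalanced '('");
            }
            out.add(String.valueOf(op));
        }
        return out;
    }

    /** Evaluate RPN tokens against the set of words found in the data. */
    static boolean evaluate(List<String> rpn, Set<String> words) {
        Deque<Boolean> stack = new ArrayDeque<>();
        for (String tok : rpn) {
            if (tok.length() == 1 && isOperator(tok.charAt(0))) {
                if (stack.size() < 2) {
                    throw new IllegalArgumentException("Missing operand for " + tok);
                }
                boolean right = stack.pop();
                boolean left = stack.pop();
                stack.push(tok.equals("&") ? (left && right) : (left || right));
            } else if (tok.endsWith("*")) {
                String prefix = tok.substring(0, tok.length() - 1);
                stack.push(words.stream().anyMatch(w -> w.startsWith(prefix)));
            } else {
                stack.push(words.contains(tok));
            }
        }
        if (stack.isEmpty()) {
            return false;                    // empty expression: tolerated, never matches
        }
        if (stack.size() > 1) {
            throw new IllegalArgumentException("Missing operator between words");
        }
        return stack.pop();
    }

    public static void main(String[] args) {
        Set<String> words = new HashSet<>(Arrays.asList("quick", "brown", "fox"));
        List<String> rpn = toRpn("(quick | slow) & fo*");
        System.out.println(rpn + " -> " + evaluate(rpn, words));
    }
}
```

Case handling and real word-break rules are deliberately left out of this sketch. Adjacent words with no operator between them are rejected here, which is a simplification and may differ from the behavior of the actual class.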
In practice, a separate instance of this class is constructed for every data match. This design is necessary because this class is used from within a database user-defined function (UDF), which the database drives statelessly as part of a query. Thus, we have no control over when or how often the feature is used. To amortize the expense of parsing the match expression, a cache permits reuse of previously parsed expression ASTs. Since ASTs may be shared across threads, they are treated as immutable.
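The caching idea can be sketched as follows. This is an assumption-laden illustration, not the class's actual code: the name RpnCacheSketch, the parsed method, the use of ConcurrentHashMap, and the string token type are all hypothetical (the real cache field maps expressions to lists of LogicalExpressionConverter.EToken, and its concurrency strategy is not documented here). Each expression is parsed at most once, and the stored result is wrapped as an unmodifiable list so that it can be shared safely across threads.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Illustrative sketch only; hypothetical names, not the real Contains cache. */
final class RpnCacheSketch {

    /** Shared cache: match expression text to its parsed, immutable token list. */
    private static final Map<String, List<String>> CACHE = new ConcurrentHashMap<>();

    /**
     * Return the parsed form of expr, invoking the parser only on a cache miss.
     * The result is copied and wrapped so that callers cannot modify it.
     */
    static List<String> parsed(String expr, Function<String, List<String>> parser) {
        return CACHE.computeIfAbsent(expr,
                e -> Collections.unmodifiableList(new ArrayList<>(parser.apply(e))));
    }

    public static void main(String[] args) {
        AtomicInteger parses = new AtomicInteger();
        // Stand-in parser: splits on whitespace and counts how often it runs.
        Function<String, List<String>> parser = e -> {
            parses.incrementAndGet();
            return Arrays.asList(e.trim().split("\\s+"));
        };
        List<String> first = parsed("quick fo* &", parser);
        List<String> second = parsed("quick fo* &", parser);
        System.out.println("same instance: " + (first == second));
        System.out.println("parser invocations: " + parses.get());
    }
}
```

An unbounded map is used here for brevity. A UDF that sees many distinct expressions would likely need an eviction policy, which this sketch does not attempt.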
NOTE: since this implementation does not leverage true word break tables and joins, it is not suitable for use with large tables or with tables that have very large data entries. Because it is used with a UDF that is opaque to a database's query planner, this implementation will often result in full table scans, unless other components of a query allow the query planner to make a more efficient plan. In addition, each data entry is scanned on each invocation. Thus, this implementation is suitable for lightweight uses where the primary objective is to properly honor the syntax of the CONTAINS operator's match expression.
TODO: properly handle the match expression characters that Progress simply ignores when matching. Do these have to do with word break rules?
TODO: at this time, word break rules are not used to delimit words. A very simple single-character delimiter is used to separate words in the data string.
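A minimal sketch of this kind of single-character delimiter splitting is shown below. It is an illustration under stated assumptions, not the class's own words or nextWord code: the name WordSplitSketch, the extra delims parameter, and the choice to fold words to lower case for the case-insensitive variant are all hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Illustrative sketch only; hypothetical names, not the real Contains code. */
final class WordSplitSketch {

    private static boolean isDelimiter(char c, char... delims) {
        for (char d : delims) {
            if (c == d) {
                return true;
            }
        }
        return false;
    }

    /**
     * Split data into a set of words using single-character delimiters.
     * When forCaseInsensitive is true, words are folded to lower case
     * (an assumption) so that "The" and "the" collapse to one entry.
     */
    static Set<String> words(String data, boolean forCaseInsensitive, char... delims) {
        Set<String> result = new LinkedHashSet<>();
        StringBuilder word = new StringBuilder();
        // Iterate one position past the end so the final word is flushed.
        for (int i = 0; i <= data.length(); i++) {
            boolean atEnd = i == data.length();
            if (atEnd || isDelimiter(data.charAt(i), delims)) {
                if (word.length() > 0) {     // skip empty runs between delimiters
                    String w = word.toString();
                    result.add(forCaseInsensitive ? w.toLowerCase(Locale.ROOT) : w);
                    word.setLength(0);
                }
            } else {
                word.append(data.charAt(i));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("The quick,the  fox", true, ' ', ','));
        System.out.println(words("The quick,the  fox", false, ' ', ','));
    }
}
```

Because no word break rules are applied, punctuation that is not listed as a delimiter stays attached to the neighboring word, which is the limitation the TODO above describes.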
Modifier and Type | Field and Description |
---|---|
private static java.util.Map<java.lang.String,java.util.List<LogicalExpressionConverter.EToken>> | cache - Cache of match expressions to parsed ASTs |
private boolean | caseSensitive - Should matches be checked case-sensitively? |
private java.util.Set<java.lang.String> | delim - Single character word delimiter for data strings |
private java.util.List<LogicalExpressionConverter.EToken> | exprRPN - Parsed match expression RPN, used to evaluate matches with string data |
Constructor and Description |
---|
Contains(java.lang.String expr, boolean caseSensitive, char... delim) - Constructor. |
Contains(java.lang.String expr, char... delim) - Constructor. |
Modifier and Type | Method and Description |
---|---|
boolean | evaluate(java.lang.String data) - Given a data string, evaluate the expression to determine whether the data matches. |
static void | main(java.lang.String[] args) - Simple test harness for command line testing. |
private java.lang.String | nextWord(java.io.Reader reader) - Get the next word from the match string, using this object's delimiter. |
java.util.Set<java.lang.String> | words(java.lang.String data, boolean forCaseInsensitive) - Split a string to a set of words |
private static final java.util.Map<java.lang.String,java.util.List<LogicalExpressionConverter.EToken>> cache
private final java.util.Set<java.lang.String> delim
private final boolean caseSensitive
private final java.util.List<LogicalExpressionConverter.EToken> exprRPN
public Contains(java.lang.String expr, char... delim)
Parameters:
expr - Match expression which uses CONTAINS operator syntax.
delim - A list of single-character word delimiters for data strings.
Throws:
ErrorConditionException - if there is an I/O error parsing the match expression.

public Contains(java.lang.String expr, boolean caseSensitive, char... delim)
Parameters:
expr - Match expression which uses CONTAINS operator syntax.
caseSensitive - true to perform word match comparisons case-sensitively, false to ignore case.
delim - A list of single-character word delimiters for data strings.
Throws:
ErrorConditionException - if there is a syntax or an I/O error parsing the match expression.

public boolean evaluate(java.lang.String data)
Parameters:
data - A data string with words delimited by this object's specified delimiter.
Returns:
true if the words in the data match this object's expression, else false.
Throws:
ErrorConditionException - if there is an I/O error reading words from the data stream.

public java.util.Set<java.lang.String> words(java.lang.String data, boolean forCaseInsensitive)
Parameters:
data - string to be parsed
forCaseInsensitive - return words which are unique in a case-insensitive sense

private java.lang.String nextWord(java.io.Reader reader) throws java.io.IOException
TODO: implement proper word break logic.
Parameters:
reader - String reader.
Returns:
null if no more words are available.
Throws:
java.io.IOException - if there is an I/O error reading the stream.

public static void main(java.lang.String[] args)
Parameters:
args - Expects the first argument to be the match expression and the second to be the data to be matched.