Word Index and CONTAINS Operator¶
In 4GL, the CONTAINS
operator is implemented using word indices
. There is no direct counterpart for such indices in the relational databases, so in FWD, the operator is implemented using word tables
. In this section, the details about this are provided.
Word tables and related database objects¶
For the table {T} and field {f} with word index on it the word table is defined as (hereafter we provide SQL statements for the PostreSQL dialect):
create table {T}__{f} ( parent__id int8 not null, word text not null );
for non-extent {f}.
For extent {f}:
create table {T}__{f} ( parent__id int8 not null, list__index int4 not null, word text not null );
The updates if the word tables on inserts/updates of {T} is maintained by triggers that use the following auxiliary functions:
create or replace function words ( recid int8, txt text, toUpperCase boolean ) returns table ( parent int8, word text ) language 'plpgsql' as $$ declare arr text[]; declare w text; begin arr = words(txt, false, toUpperCase); foreach w in array arr loop parent = recid; word = case when touppercase then UPPER(w) else w end; return next; end loop; end $$; create or replace function words ( recid int8, listidx int4, txt text, toUpperCase boolean ) returns table ( parent int8, idx int4, word text ) language 'plpgsql' as $$ declare arr text[]; declare w text; begin arr = words(txt, false, toUpperCase); foreach w in array arr loop parent = recid; idx = listidx; word = case when touppercase then UPPER(w) else w end; return next; end loop; end $$;
Here
words(String data, boolean toUpperCase, boolean forCaseInsensitive)
is a PL/Java UDF for splitting text into words.
The trigger function for a table {T} and field {f} is:
create or replace function {T}__{f}__trg() returns trigger language plpgsql as $$ begin delete from {T}__{f} where {T}__{f}.parent__id = new.recid; insert into {T}__{f} select * from words(new.recid, new.{f}, true); return new; end; $$;
This function is used in the following triggers:
create trigger {T}__{f}__upd after update of {f} on {T} for each row execute procedure {T}__{f}__trg(); create trigger {T}__{f}__ins after insert on {T} for each row execute procedure {T}__{f}__trg();
If {f} is a normalized extent field, the trigger function is:
create or replace function ttwi__f_ext__trg() returns trigger language plpgsql as $$ begin delete from {T}__{f} where {T}__{f}.parent__id = new.parent__id and {T}__{f}.list__index = new.list__index; insert into {T}__{f} select * from words(new.parent__id, new.list__index, new.{f}, true); return new; end; $$;
The triggers are defined for the corresponding extent table {TE} = {T}__<extent size>
:
create trigger {T}__{f}__upd after update of {f} on {TE} for each row execute procedure {T}__{f}__trg(); create trigger {T}__{f}__ins after insert on {TE} for each row execute procedure {T}__{f}__trg();
For the denormalized extent field {f}, the situation is a little more complicated.
Let {i} be a position of the extent component and {fi} is the corresponding field name:
The trigger function for this field is:
create or replace function {T}__{fi}__trg() returns trigger language plpgsql as $$ begin delete from {T}__{f} where {T}__{f}.parent__id = new.recid and {T}__{f}.list__index = {i|; insert into {T}__{f} select * from words(new.recid, {i}, new.{T}__{fi}, true); return new; end; $$;
The UPDATE
trigger is defined for every {fi} field:
create trigger {T}__{fi}__upd after update of {fi} on {T} for each row execute procedure {T}__{fi}__trg();
The INSERT
trigger is one:
create trigger {T}__{f}__ins after insert on {T} for each row execute procedure {T}__{f}__trg();
It uses trigger function:
create or replace function {T}__{f}__trg() returns trigger language plpgsql as $$ begin delete from {T}__{f} where {T}__{f}.parent__id = new.recid; insert into {T}__{f} select * from words(new.recid, 1, new.{f1}, true); insert into {T}__{f} select * from words(new.recid, 2, new.{f2}, true); .... insert into {T}__{f} select * from words(new.recid, {n}, new.{fn}, true); return new; end; $$;
where
n
is the size of the extent.
We define the following indices/constrains for the word table _{f}.
For non-extent {f}:
alter table {T}_{f} add constraint pk__{T}_{f} primary key (parent__id, word); create index fkidx__{T}_{f} on {T}_{f} (parent__id); create index idx__{T}_{f} on {T}_{f} (word); alter table {T}_{f} add constraint fk__{T}_{f} foreign key (parent__id) references {T} on delete cascade;
For extent {f} (both normalized and denormalized):
alter table {T}_{f} add constraint pk__{T}_{f} primary key (parent__id, list__index, word); create index fkidx__{T}_{f} on {T}_{f} (parent__id); create index idx__{T}_{f} on {T}_{f} (word); alter table {T}_{f} add constraint fk__{T}__{f} foreign key (parent__id) references {T} on delete cascade;
The definition of triggers, indices, and constraints for all word tables in the database are located in a separate SQL script schema_word_tables_{db}_{dialect}.sql
.
On the database import, this script is applied after the population of tables. However, the script is idempotent and can be applied multiple times.
SQL re-writing.¶
FWD converts theCONTAINS
operator to the invocation of the CONTAINS
UDF. This solution has two drawbacks:
- In many situations, it results in a full table scan.
- It is difficult (if possible at all) to achieve the 4GL-compatible implicit sorting, which depends on terms satisfying the
CONTAINS
condition. The more close the term to the beginning of the condition expression, the more 'important' it is for sorting.
SQL-rewriting of the CONTAINS
UDF invocation using word tables addresses both these problems. Below we describe the details of the algorithm.
Consider the following SELECT
query:
SELECT <flist> FROM <from> WHERE <where> ORDER BY <order>
Where
<flist>
is the fields' list, <from>
is the the FROM
expression, <where>
is the WHERE
predicate, and <order>
is a list of fields used for ordering. Hereafter expressions in curly brackets are tables/fields names/aliases.
Let's assume that <where> contains a predicate
C = CONTAINS({T}.{f}, <expr>)
And the corresponding word index was selected for implicut ordering (at this moment it means the in the orofinal statement <order> is
{T}.recid
and C is the first CONTAINS
in <where> for {T}.{f}).
Using DNF, we can re-write <where>
as C[AND W1] [OR (NOT C)[AND W2]] [OR W3]
, where predicates W1, W2 W3 do not contain C.
The ordering described in #1587-407 is correct if W3 is absent or depends only on the other CONTAINS
. Otherwise, it can result in duplicated records in the result set since we add {cte}
to <from>, which means CROSS JOIN.
It is possible to add the analysis and re-writing of the <where> predicate, but it can take time. So I suggest the following algorithm.
Let <expr>
= >P(t1, t2, ..., tn)>
where terms ti
are ordered according to the first position where that found in <expr>
.
If the field {f}
is not an extent one the re-written query is
WITH {cte} AS ( SELECT <we1> AS {w1}, <we2> AS {w2}, ..., <wn> AS {wn}, recid FROM {T} t JOIN {wtable} AS {walias1} ON ({walias1}.parent__id = t.recid AND (<or group1>)) JOIN {wtable} AS {walias2} ON ({walias2}.parent__id = t.recid AND (<or group2>)) .... JOIN {wtable} AS {waliask} ON (<walias2>.parent__id = t.recid AND (<or groupk>)) GROUP BY recid ), {mcte} AS ( SELECT <flist>, recid FROM <from> WHERE {where'}> ) SELECT <flist'> FROM {mcte} LEFT JOIN <cte> ON ({mcte}.recid = <cte>.recid) ORDER BY coalesce({cte}.{w1}, -1) desc, coalesce({cte}.{w2}, -1) desc, ..., coalesce({cte}.{wn}, -1) desc, {mcte}.recid
Here
<where'>
is <where>
predicate with C replaced by C' = {mcte}.recid IN (SELECT recid FROM {cte})
and <flist'>
is <flist>
with table aliases replaced with {mcte}.
<or groupi>
and <wej> are defined as following:
Let P(t1, t2,...) = (t11 | t12 | ...) & (t21 | t22 | ...) ... & (tk1 | tk2 | ...) is CNF Of P. Terms tij
an are one of ti
. Sub-terms tij
in every AND term are ordered according the ordering of ti
and AND terms are ordered lexicographically.<or groupi>
is ti1' OR ti2' OR ...
where tij'
is either waliasi}.word = tij
or waliasi}.word LIKE tij
.
The weight expression <wei> = SUM(CAST tpq' as int))
where tpq'
is the first term in CNF which corresponds to ti
.
If the field {f}
is an extent one the re-written query is:
WITH {cte} AS ( SELECT <we1> AS {w1}, <we2> AS {w2}, ..., <wn> AS {wn}, recid FROM {T} t JOIN {ET} e ON (e.parent__id = .recid) JOIN {wtable} AS {walias1} ON ({walias1}.parent__id = t.recid AND {walias1}.list__index = e.list__index AND (<or group1>)) JOIN {wtable} AS {walias2} ON ({walias2}.parent__id = t.recid AND {walias2}.list__index = e.list__index AND (<or group2>)) .... JOIN {wtable} AS {waliask} ON (<walias2>.parent__id = t.recid AND {waliask}.list__index = e.list__index AND (<or groupk>)) GROUP BY recid ), {mcte} AS ( SELECT <flist>, recid FROM <from> WHERE {where'}> ) SELECT <flist'> FROM {mcte} LEFT JOIN <cte> ON ({mcte}.recid = <cte>.recid) ORDER BY coalesce({cte}.{w1}, -1) desc, coalesce({cte}.{w2}, -1) desc, ..., coalesce({cte}.{wn}, -1) desc, {mcte}.recid
Here
{ET}
is the extent table.
For _temp
and 'dirty' databases where CONTAINS
UDF is used the re-written query looks the same, but with different {cte}
. For non-extent field we use:
{cte} AS ( SELECT <we1> AS {w1}, <we2> AS {w2}, ..., <wn> AS {wn}, recid FROM {T} t WHERE CONTAINS(t.{f}, expr) )
Here <wei>
= CONTAINS(t.{f}, ti)
.
For extent field:
{cte} AS ( SELECT <we1> AS {w1}, <we2> AS {w2}, ..., <wn> AS {wn}, recid FROM {T} t JOIN {ET} e ON (e.parent__id = t.recid AND CONTAINS(e.{f}, expr)) GROUP BY recid )
Here <wei>
= SUM(CAST(CONTAINS(e.{f}, ti)) as int))
.
Sample re-written queries.¶
Consider the following 4GL table:
ADD TABLE "ttwi"
AREA "Schema Area"
DUMP-NAME "ttwi"
ADD FIELD "if1" OF "ttwi" AS character
FORMAT "x(128)"
INITIAL ""
POSITION 2
MAX-WIDTH 256
ORDER 10
ADD FIELD "if2" OF "ttwi" AS character
FORMAT "x(8)"
INITIAL ""
POSITION 3
MAX-WIDTH 16
ORDER 20
CASE-SENSITIVE
ADD FIELD "f-ext" OF "ttwi" AS character
FORMAT "x(50)"
INITIAL ""
POSITION 4
MAX-WIDTH 510
EXTENT 5
ORDER 30
ADD INDEX "wi1" ON "ttwi"
AREA "Schema Area"
WORD
INDEX-FIELD "if1" ASCENDING
ADD INDEX "wi2" ON "ttwi"
AREA "Schema Area"
INACTIVE
WORD
INDEX-FIELD "if2" ASCENDING
ADD INDEX "wiext" ON "ttwi"
AREA "Schema Area"
INACTIVE
WORD
INDEX-FIELD "f-ext" ASCENDING
Consider also the table ttwid
with the same structure but that is converted using denormalized extent field f-ext
.
Below we provide sample 4GL queries and corresponding re-written SQL statements:
for each ttwi where (if2 contains 'big & elephant | small & rabbit') or (if1 '(cat* | ant*)'):
¶
with wcte1 as ( select sum(cast(w1.word = 'big' as int)) as w11, sum(cast(w3.word = 'elephant' as int)) as w32, sum(cast(w1.word = 'small' as int)) as w13, sum(cast(w2.word = 'rabbit' as int)) as w24, recid from ttwi t join ttwi__if2 as w1 on (w1.parent__id = t.recid and (w1.word = 'big' or w1.word = 'small')) join ttwi__if2 as w2 on (w2.parent__id = t.recid and (w2.word = 'big' or w2.word = 'rabbit')) join ttwi__if2 as w3 on (w3.parent__id = t.recid and (w3.word = 'elephant' or w3.word = 'small')) join ttwi__if2 as w4 on (w4.parent__id = t.recid and (w4.word = 'elephant' or w4.word = 'rabbit')) group by recid ) , wcte2 as ( select distinct recid from ttwi t join ttwi__if1 as w1 on (w1.parent__id = t.recid and (w1.word like UPPER('cat%') or w1.word like UPPER('ant%'))) ) , mcte3 as ( select ttwi__impl0_.recid as col_0_0_, ttwi__impl0_.recid as col_recid_ from ttwi ttwi__impl0_ where (ttwi__impl0_.recid in (select recid from wcte1)) or (ttwi__impl0_.recid in (select recid from wcte2)) ) select mcte3.col_0_0_ from mcte3 left join wcte1 on (mcte3.col_recid_ = wcte1.recid) order by coalesce(wcte1.w11, -1) desc,coalesce(wcte1.w32, -1) desc,coalesce(wcte1.w13, -1) desc,coalesce(wcte1.w24, -1) desc,mcte3.col_recid_
for each ttwi where (f-ext contains ''big & elephant | small & rabbit') or (f-ext contains 'word22a & word22b | word23a & word23b'):
¶
with wcte1 as ( select sum(cast(w1.word like UPPER('word1%') as int)) as w11, sum(cast(w1.word like UPPER('word3%') as int)) as w12, sum(cast(w2.word = UPPER('word12c') as int)) as w23, sum(cast(w2.word = UPPER('word33c') as int)) as w24, recid from ttwi t join ttwi__5 e on (e.parent__id = t.recid) join ttwi__f_ext as w1 on (w1.parent__id = e.parent__id and w1.list__index = e.list__index and (w1.word like UPPER('word1%') or w1.word like UPPER('word3%'))) join ttwi__f_ext as w2 on (w2.parent__id = e.parent__id and w2.list__index = e.list__index and (w2.word = UPPER('word12c') or w2.word = UPPER('word33c'))) group by recid ) , mcte3 as ( select ttwi__impl0_.recid as col_0_0_, ttwi__impl0_.recid as col_recid_ from ttwi ttwi__impl0_ where (ttwi__impl0_.recid in (select recid from wcte1)) or (ttwi__impl0_.recid in ( select distinct recid from ttwi t join ttwi__5 e on (e.parent__id = t.recid) join ttwi__f_ext as w1 on (w1.parent__id = e.parent__id and w1.list__index = e.list__index and (w1.word = UPPER('word22a') or w1.word = UPPER('word23a'))) join ttwi__f_ext as w2 on (w2.parent__id = e.parent__id and w2.list__index = e.list__index and (w2.word = UPPER('word22a') or w2.word = UPPER('word23b'))) join ttwi__f_ext as w3 on (w3.parent__id = e.parent__id and w3.list__index = e.list__index and (w3.word = UPPER('word22b') or w3.word = UPPER('word23a'))) join ttwi__f_ext as w4 on (w4.parent__id = e.parent__id and w4.list__index = e.list__index and (w4.word = UPPER('word22b') or w4.word = UPPER('word23b'))) )) ) select mcte3.col_0_0_ from mcte3 left join wcte1 on (mcte3.col_recid_ = wcte1.recid) order by coalesce(wcte1.w11, -1) desc,coalesce(wcte1.w12, -1) desc,coalesce(wcte1.w23, -1) desc,coalesce(wcte1.w24, -1) desc,mcte3.col_recid_
for each ttwid where (f-ext[1] contains '(word1* | word3*) & (word12c | word33c)') or (f-ext contains 'word22a & word22b | word23a & word23b'):
¶
with wcte1 as ( select sum(cast(w1.word like UPPER('word1%') as int)) as w11, sum(cast(w1.word like UPPER('word3%') as int)) as w12, sum(cast(w2.word = UPPER('word12c') as int)) as w23, sum(cast(w2.word = UPPER('word33c') as int)) as w24, recid from ttwid t join (select distinct parent__id, list__index from ttwid__f_ext) e on (e.parent__id = t.recid) join ttwid__f_ext as w1 on (w1.parent__id = t.recid and (w1.word like UPPER('word1%') or w1.word like UPPER('word3%'))) join ttwid__f_ext as w2 on (w2.parent__id = t.recid and (w2.word = UPPER('word12c') or w2.word = UPPER('word33c'))) group by recid ) , mcte3 as ( select ttwid__imp0_.recid as col_0_0_, ttwid__imp0_.recid as col_recid_ from ttwid ttwid__imp0_ where (ttwid__imp0_.recid in (select recid from wcte1)) or (ttwid__imp0_.recid in ( select distinct recid from ttwid t join (select distinct parent__id, list__index from ttwid__f_ext) e on (e.parent__id = t.recid) join ttwid__f_ext as w1 on (w1.parent__id = t.recid and (w1.word = UPPER('word22a') or w1.word = UPPER('word23a'))) join ttwid__f_ext as w2 on (w2.parent__id = t.recid and (w2.word = UPPER('word22a') or w2.word = UPPER('word23b'))) join ttwid__f_ext as w3 on (w3.parent__id = t.recid and (w3.word = UPPER('word22b') or w3.word = UPPER('word23a'))) join ttwid__f_ext as w4 on (w4.parent__id = t.recid and (w4.word = UPPER('word22b') or w4.word = UPPER('word23b'))) )) ) select mcte3.col_0_0_ from mcte3 left join wcte1 on (mcte3.col_recid_ = wcte1.recid) order by coalesce(wcte1.w11, -1) desc,coalesce(wcte1.w12, -1) desc,coalesce(wcte1.w23, -1) desc,coalesce(wcte1.w24, -1) desc,mcte3.col_recid_
Sample re-written queries with UDF.¶
Consider a temporary table ttwi
with the same structure as before. We cannot use word tables since the H2 support for temporary tables' triggers (which FWD uses for the _temp
database) is broken. However, the re-writing is still needed to support extent fields and implicit ordering. Below we provide a sample re-written SQL statement with UDF.
for each ttwi where (ifext[2] contains "wordc* | word*") or (if1 contains "x* | wrdc & word*"):
¶
with wcte1 as ( select sum(cast(contains_1(e.ifext, UPPER('wordc*')) as int)) as ww1, sum(cast(contains_1(e.ifext, UPPER('word*')) as int)) as ww2, recid from tt1 t join tt1__5 e on (e.parent__id = t.recid) where contains_1(e.ifext, UPPER('wordc* | word*'))group by t.recid) , mcte3 as ( select ttwi_1_1__0_.recid as id0_, ttwi_1_1__0_._multiplex as column1_0_, ttwi_1_1__0_._errorFlag as column2_0_, ttwi_1_1__0_._originRowid as column3_0_, ttwi_1_1__0_._errorString as column4_0_, ttwi_1_1__0_._peerRowid as column5_0_, ttwi_1_1__0_._rowState as column6_0_, ttwi_1_1__0_.if1 as if7_0_, ttwi_1_1__0_.recid as col_recid_ from tt1 ttwi_1_1__0_ where ttwi_1_1__0_._multiplex = ? and ((ttwi_1_1__0_.recid in (select recid from wcte1)) or contains_1(upper(if1), upper('x* | wrdc & word*'))) ) select mcte3.id0_, mcte3.column1_0_, mcte3.column2_0_, mcte3.column3_0_, mcte3.column4_0_, mcte3.column5_0_, mcte3.column6_0_, mcte3.if7_0_ from mcte3 left join wcte1 on (mcte3.col_recid_ = wcte1.recid) order by coalesce(wcte1.ww1, -1) desc,coalesce(wcte1.ww2, -1) desc,mcte3.col_recid_
for each ttwi where (if1 contains "x* | wrdc & word*") or (ifext[2] contains "wordc* | word*"):
¶
with wcte1 as ( select sum(cast(contains_1(t.if1, UPPER('x*')) as int)) as ww1, sum(cast(contains_1(t.if1, UPPER('wrdc')) as int)) as ww2, sum(cast(contains_1(t.if1, UPPER('word*')) as int)) as ww3, recid from tt1 t where contains_1(t.if1, UPPER('x* | wrdc & word*'))group by t.recid) , mcte3 as ( select ttwi_1_1__0_.recid as id0_, ttwi_1_1__0_._multiplex as column1_0_, ttwi_1_1__0_._errorFlag as column2_0_, ttwi_1_1__0_._originRowid as column3_0_, ttwi_1_1__0_._errorString as column4_0_, ttwi_1_1__0_._peerRowid as column5_0_, ttwi_1_1__0_._rowState as column6_0_, ttwi_1_1__0_.if1 as if7_0_, ttwi_1_1__0_.recid as col_recid_ from tt1 ttwi_1_1__0_ where ttwi_1_1__0_._multiplex = ? and ((ttwi_1_1__0_.recid in (select recid from wcte1)) or (recid in (select distinct parent__id from tt1__5 where contains_1(upper(ifext), upper('wordc* | word*'))))) ) select mcte3.id0_, mcte3.column1_0_, mcte3.column2_0_, mcte3.column3_0_, mcte3.column4_0_, mcte3.column5_0_, mcte3.column6_0_, mcte3.if7_0_ from mcte3 left join wcte1 on (mcte3.col_recid_ = wcte1.recid) order by coalesce(wcte1.ww1, -1) desc,coalesce(wcte1.ww2, -1) desc,coalesce(wcte1.ww3, -1) desc,mcte3.col_recid_