Project

General

Profile

Word Index and CONTAINS Operator

In 4GL, the CONTAINS operator is implemented using word indices. There is no direct counterpart for such indices in the relational databases, so in FWD, the operator is implemented using word tables. In this section, the details about this are provided.

Word tables and related database objects

For the table {T} and field {f} with word index on it the word table is defined as (hereafter we provide SQL statements for the PostreSQL dialect):

create table {T}__{f} (
   parent__id int8 not null,
   word text not null
);

for non-extent {f}.
For extent {f}:
create table {T}__{f} (
   parent__id int8 not null,
   list__index int4 not null,
   word text not null
);

The updates if the word tables on inserts/updates of {T} is maintained by triggers that use the following auxiliary functions:

create or replace function words (
      recid int8, txt text, toUpperCase boolean
   )
   returns table ( parent int8, word text ) language 'plpgsql' as
   $$
   declare arr text[];
   declare w text;
   begin
       arr = words(txt, false, toUpperCase);
       foreach w in array arr loop
           parent = recid;
           word = case when touppercase then UPPER(w) else w end;
           return next;
       end loop;
   end
   $$;

create or replace function words (
      recid int8, listidx int4, txt text, toUpperCase boolean
   )
   returns table ( parent int8, idx int4, word text ) language 'plpgsql' as
   $$
   declare arr text[];
   declare w text;
   begin
       arr = words(txt, false, toUpperCase);
       foreach w in array arr loop
           parent = recid;
           idx = listidx;
           word = case when touppercase then UPPER(w) else w end;
           return next;
       end loop;
   end
   $$;


Here words(String data, boolean toUpperCase, boolean forCaseInsensitive) is a PL/Java UDF for splitting text into words.

The trigger function for a table {T} and field {f} is:

create or replace function {T}__{f}__trg()
returns trigger
language plpgsql
as
$$
begin
    delete from {T}__{f} where {T}__{f}.parent__id = new.recid;
    insert into {T}__{f} select * from words(new.recid, new.{f}, true);
    return new;
end;
$$;

This function is used in the following triggers:

create trigger {T}__{f}__upd after
update of {f}
    on
    {T} for each row execute procedure {T}__{f}__trg();

create trigger {T}__{f}__ins after
insert
    on
    {T} for each row execute procedure {T}__{f}__trg();

If {f} is a normalized extent field, the trigger function is:

create or replace function ttwi__f_ext__trg()
returns trigger
language plpgsql
as
$$
begin
    delete from {T}__{f} where {T}__{f}.parent__id = new.parent__id and {T}__{f}.list__index = new.list__index;
    insert into {T}__{f} select * from words(new.parent__id, new.list__index, new.{f}, true);
    return new;
end;
$$;

The triggers are defined for the corresponding extent table {TE} = {T}__<extent size>:

create trigger {T}__{f}__upd after
update of {f}
    on
    {TE} for each row execute procedure {T}__{f}__trg();

create trigger {T}__{f}__ins after
insert
    on
    {TE} for each row execute procedure {T}__{f}__trg();

For the denormalized extent field {f}, the situation is a little more complicated.

Let {i} be a position of the extent component and {fi} is the corresponding field name:
The trigger function for this field is:

create or replace function {T}__{fi}__trg()
returns trigger
language plpgsql
as
$$
begin
    delete from {T}__{f} where {T}__{f}.parent__id = new.recid and {T}__{f}.list__index = {i|;
    insert into {T}__{f} select * from words(new.recid, {i}, new.{T}__{fi}, true);
    return new;
end;
$$;

The UPDATE trigger is defined for every {fi} field:

create trigger {T}__{fi}__upd after
update of {fi}
    on
    {T} for each row execute procedure {T}__{fi}__trg();

The INSERT trigger is one:

create trigger {T}__{f}__ins after
insert
    on
    {T} for each row execute procedure {T}__{f}__trg();

It uses trigger function:

create or replace function {T}__{f}__trg()
returns trigger
language plpgsql
as
$$
begin
    delete from {T}__{f} where {T}__{f}.parent__id = new.recid;
    insert into {T}__{f} select * from words(new.recid, 1, new.{f1}, true);
    insert into {T}__{f} select * from words(new.recid, 2, new.{f2}, true);
    ....
    insert into {T}__{f} select * from words(new.recid, {n}, new.{fn}, true);
    return new;
end;
$$;

where n is the size of the extent.

We define the following indices/constrains for the word table _{f}.
For non-extent {f}:

alter table {T}_{f}
   add constraint pk__{T}_{f}
   primary key (parent__id, word);

create index fkidx__{T}_{f} on {T}_{f} (parent__id);

create index idx__{T}_{f} on {T}_{f} (word);

alter table {T}_{f}
   add constraint fk__{T}_{f}
   foreign key (parent__id)
   references {T}
   on delete cascade;

For extent {f} (both normalized and denormalized):

alter table {T}_{f}
   add constraint pk__{T}_{f}
   primary key (parent__id, list__index, word);

create index fkidx__{T}_{f} on {T}_{f} (parent__id);

create index idx__{T}_{f} on {T}_{f} (word);

alter table {T}_{f}
   add constraint fk__{T}__{f}
   foreign key (parent__id)
   references {T}
   on delete cascade;

The definition of triggers, indices, and constraints for all word tables in the database are located in a separate SQL script schema_word_tables_{db}_{dialect}.sql.
On the database import, this script is applied after the population of tables. However, the script is idempotent and can be applied multiple times.

SQL re-writing.

FWD converts the CONTAINS operator to the invocation of the CONTAINS UDF. This solution has two drawbacks:
  • In many situations, it results in a full table scan.
  • It is difficult (if possible at all) to achieve the 4GL-compatible implicit sorting, which depends on terms satisfying the CONTAINS condition. The more close the term to the beginning of the condition expression, the more 'important' it is for sorting.

SQL-rewriting of the CONTAINS UDF invocation using word tables addresses both these problems. Below we describe the details of the algorithm.

Consider the following SELECT query:

SELECT <flist>
FROM <from>
WHERE <where>
ORDER BY <order>

Where <flist> is the fields' list, <from> is the the FROM expression, <where> is the WHERE predicate, and <order> is a list of fields used for ordering.
Hereafter expressions in curly brackets are tables/fields names/aliases.
Let's assume that <where> contains a predicate
C = CONTAINS({T}.{f}, <expr>)

And the corresponding word index was selected for implicut ordering (at this moment it means the in the orofinal statement <order> is {T}.recid and C is the first CONTAINS in <where> for {T}.{f}).

Using DNF, we can re-write <where> as C[AND W1] [OR (NOT C)[AND W2]] [OR W3], where predicates W1, W2 W3 do not contain C.
The ordering described in #1587-407 is correct if W3 is absent or depends only on the other CONTAINS. Otherwise, it can result in duplicated records in the result set since we add {cte} to <from>, which means CROSS JOIN.

It is possible to add the analysis and re-writing of the <where> predicate, but it can take time. So I suggest the following algorithm.

Let <expr> = >P(t1, t2, ..., tn)> where terms ti are ordered according to the first position where that found in <expr>.

If the field {f} is not an extent one the re-written query is

WITH 
{cte} AS (
SELECT <we1> AS {w1}, <we2> AS {w2}, ..., <wn>  AS {wn}, recid
FROM {T} t
    JOIN {wtable} AS {walias1} ON ({walias1}.parent__id = t.recid AND (<or group1>)) 
    JOIN {wtable} AS {walias2} ON ({walias2}.parent__id = t.recid AND (<or group2>)) 
    ....
    JOIN {wtable} AS {waliask} ON (<walias2>.parent__id = t.recid AND (<or groupk>)) 
GROUP BY recid
),
{mcte} AS (
SELECT <flist>, recid
FROM <from>
WHERE {where'}>
)
SELECT <flist'> FROM {mcte}
               LEFT JOIN <cte> ON ({mcte}.recid = <cte>.recid)
ORDER BY coalesce({cte}.{w1}, -1) desc, coalesce({cte}.{w2}, -1) desc, ..., coalesce({cte}.{wn}, -1) desc,
         {mcte}.recid

Here <where'> is <where> predicate with C replaced by C' = {mcte}.recid IN (SELECT recid FROM {cte}) and <flist'> is <flist> with table aliases replaced with {mcte}.

<or groupi> and <wej> are defined as following:
Let P(t1, t2,...) = (t11 | t12 | ...) & (t21 | t22 | ...) ... & (tk1 | tk2 | ...) is CNF Of P. Terms tij an are one of ti. Sub-terms tij in every AND term are ordered according the ordering of ti and AND terms are ordered lexicographically.
<or groupi> is ti1' OR ti2' OR ... where tij' is either waliasi}.word = tij or waliasi}.word LIKE tij.
The weight expression <wei> = SUM(CAST tpq' as int)) where tpq' is the first term in CNF which corresponds to ti.

If the field {f} is an extent one the re-written query is:

WITH 
{cte} AS (
SELECT <we1> AS {w1}, <we2> AS {w2}, ..., <wn>  AS {wn}, recid
FROM {T} t
    JOIN {ET} e ON (e.parent__id = .recid)
    JOIN {wtable} AS {walias1} ON ({walias1}.parent__id = t.recid AND {walias1}.list__index  = e.list__index AND (<or group1>)) 
    JOIN {wtable} AS {walias2} ON ({walias2}.parent__id = t.recid AND {walias2}.list__index  = e.list__index AND (<or group2>)) 
    ....
    JOIN {wtable} AS {waliask} ON (<walias2>.parent__id = t.recid AND {waliask}.list__index  = e.list__index  AND (<or groupk>)) 
GROUP BY recid
),
{mcte} AS (
SELECT <flist>, recid
FROM <from>
WHERE {where'}>
)
SELECT <flist'> FROM {mcte}
               LEFT JOIN <cte> ON ({mcte}.recid = <cte>.recid)
ORDER BY coalesce({cte}.{w1}, -1) desc, coalesce({cte}.{w2}, -1) desc, ..., coalesce({cte}.{wn}, -1) desc,
         {mcte}.recid

Here {ET} is the extent table.

For _temp and 'dirty' databases where CONTAINS UDF is used the re-written query looks the same, but with different {cte}. For non-extent field we use:

{cte} AS (
SELECT <we1> AS {w1}, <we2> AS {w2}, ..., <wn>  AS {wn}, recid
FROM {T} t
    WHERE CONTAINS(t.{f}, expr) 
)

Here <wei> = CONTAINS(t.{f}, ti).

For extent field:

{cte} AS (
SELECT <we1> AS {w1}, <we2> AS {w2}, ..., <wn>  AS {wn}, recid
FROM {T} t
    JOIN {ET} e ON (e.parent__id = t.recid AND CONTAINS(e.{f}, expr))
GROUP BY recid
)

Here <wei> = SUM(CAST(CONTAINS(e.{f}, ti)) as int)).

Sample re-written queries.

Consider the following 4GL table:
ADD TABLE "ttwi"
AREA "Schema Area"
DUMP-NAME "ttwi"

ADD FIELD "if1" OF "ttwi" AS character
FORMAT "x(128)"
INITIAL ""
POSITION 2
MAX-WIDTH 256
ORDER 10

ADD FIELD "if2" OF "ttwi" AS character
FORMAT "x(8)"
INITIAL ""
POSITION 3
MAX-WIDTH 16
ORDER 20
CASE-SENSITIVE

ADD FIELD "f-ext" OF "ttwi" AS character
FORMAT "x(50)"
INITIAL ""
POSITION 4
MAX-WIDTH 510
EXTENT 5
ORDER 30

ADD INDEX "wi1" ON "ttwi"
AREA "Schema Area"
WORD
INDEX-FIELD "if1" ASCENDING

ADD INDEX "wi2" ON "ttwi"
AREA "Schema Area"
INACTIVE
WORD
INDEX-FIELD "if2" ASCENDING

ADD INDEX "wiext" ON "ttwi"
AREA "Schema Area"
INACTIVE
WORD
INDEX-FIELD "f-ext" ASCENDING

Consider also the table ttwid with the same structure but that is converted using denormalized extent field f-ext.
Below we provide sample 4GL queries and corresponding re-written SQL statements:

for each ttwi where (if2 contains 'big & elephant | small & rabbit') or (if1 '(cat* | ant*)'):

with
wcte1 as (
     select sum(cast(w1.word = 'big' as int)) as w11, sum(cast(w3.word = 'elephant' as int)) as w32, sum(cast(w1.word = 'small' as int)) as w13, sum(cast(w2.word = 'rabbit' as int)) as w24, recid
    from ttwi t
        join ttwi__if2 as w1 on (w1.parent__id = t.recid and (w1.word = 'big' or w1.word = 'small'))
        join ttwi__if2 as w2 on (w2.parent__id = t.recid and (w2.word = 'big' or w2.word = 'rabbit'))
        join ttwi__if2 as w3 on (w3.parent__id = t.recid and (w3.word = 'elephant' or w3.word = 'small'))
        join ttwi__if2 as w4 on (w4.parent__id = t.recid and (w4.word = 'elephant' or w4.word = 'rabbit'))
    group by recid
)
,
wcte2 as (
     select distinct recid
    from ttwi t
        join ttwi__if1 as w1 on (w1.parent__id = t.recid and (w1.word like UPPER('cat%') or w1.word like UPPER('ant%')))
)
,
mcte3 as (
select 
    ttwi__impl0_.recid as col_0_0_, ttwi__impl0_.recid as col_recid_ 
from
    ttwi ttwi__impl0_ 
where
    (ttwi__impl0_.recid in (select recid from wcte1)) or (ttwi__impl0_.recid in (select recid from wcte2))
)
select mcte3.col_0_0_
from mcte3
    left join wcte1 on (mcte3.col_recid_ = wcte1.recid)

order by coalesce(wcte1.w11, -1)  desc,coalesce(wcte1.w32, -1)  desc,coalesce(wcte1.w13, -1)  desc,coalesce(wcte1.w24, -1)  desc,mcte3.col_recid_

for each ttwi where (f-ext contains ''big & elephant | small & rabbit') or (f-ext contains 'word22a & word22b | word23a & word23b'):

with
wcte1 as (
     select sum(cast(w1.word like UPPER('word1%') as int)) as w11, sum(cast(w1.word like UPPER('word3%') as int)) as w12, sum(cast(w2.word = UPPER('word12c') as int)) as w23, sum(cast(w2.word = UPPER('word33c') as int)) as w24, recid
    from ttwi t
    join ttwi__5 e on (e.parent__id = t.recid)
        join ttwi__f_ext as w1 on (w1.parent__id = e.parent__id and w1.list__index  = e.list__index and (w1.word like UPPER('word1%') or w1.word like UPPER('word3%')))
        join ttwi__f_ext as w2 on (w2.parent__id = e.parent__id and w2.list__index  = e.list__index and (w2.word = UPPER('word12c') or w2.word = UPPER('word33c')))
    group by recid
)
,
mcte3 as (
select 
    ttwi__impl0_.recid as col_0_0_, ttwi__impl0_.recid as col_recid_ 
from
    ttwi ttwi__impl0_ 
where
    (ttwi__impl0_.recid in (select recid from wcte1)) or (ttwi__impl0_.recid in (
     select distinct recid
    from ttwi t
    join ttwi__5 e on (e.parent__id = t.recid)
        join ttwi__f_ext as w1 on (w1.parent__id = e.parent__id and w1.list__index  = e.list__index and (w1.word = UPPER('word22a') or w1.word = UPPER('word23a')))
        join ttwi__f_ext as w2 on (w2.parent__id = e.parent__id and w2.list__index  = e.list__index and (w2.word = UPPER('word22a') or w2.word = UPPER('word23b')))
        join ttwi__f_ext as w3 on (w3.parent__id = e.parent__id and w3.list__index  = e.list__index and (w3.word = UPPER('word22b') or w3.word = UPPER('word23a')))
        join ttwi__f_ext as w4 on (w4.parent__id = e.parent__id and w4.list__index  = e.list__index and (w4.word = UPPER('word22b') or w4.word = UPPER('word23b')))
))
)
select mcte3.col_0_0_
from mcte3
    left join wcte1 on (mcte3.col_recid_ = wcte1.recid)
order by coalesce(wcte1.w11, -1)  desc,coalesce(wcte1.w12, -1)  desc,coalesce(wcte1.w23, -1)  desc,coalesce(wcte1.w24, -1)  desc,mcte3.col_recid_

for each ttwid where (f-ext[1] contains '(word1* | word3*) & (word12c | word33c)') or (f-ext contains 'word22a & word22b | word23a & word23b'):

with
wcte1 as (
     select sum(cast(w1.word like UPPER('word1%') as int)) as w11, sum(cast(w1.word like UPPER('word3%') as int)) as w12, sum(cast(w2.word = UPPER('word12c') as int)) as w23, sum(cast(w2.word = UPPER('word33c') as int)) as w24, recid
    from ttwid t
    join (select distinct parent__id, list__index from ttwid__f_ext) e on (e.parent__id = t.recid)
        join ttwid__f_ext as w1 on (w1.parent__id = t.recid and (w1.word like UPPER('word1%') or w1.word like UPPER('word3%')))
        join ttwid__f_ext as w2 on (w2.parent__id = t.recid and (w2.word = UPPER('word12c') or w2.word = UPPER('word33c')))
    group by recid
)
,
mcte3 as (
select 
    ttwid__imp0_.recid as col_0_0_, ttwid__imp0_.recid as col_recid_ 
from
    ttwid ttwid__imp0_ 
where
    (ttwid__imp0_.recid in (select recid from wcte1)) or (ttwid__imp0_.recid in (
     select distinct recid
    from ttwid t
    join (select distinct parent__id, list__index from ttwid__f_ext) e on (e.parent__id = t.recid)
        join ttwid__f_ext as w1 on (w1.parent__id = t.recid and (w1.word = UPPER('word22a') or w1.word = UPPER('word23a')))
        join ttwid__f_ext as w2 on (w2.parent__id = t.recid and (w2.word = UPPER('word22a') or w2.word = UPPER('word23b')))
        join ttwid__f_ext as w3 on (w3.parent__id = t.recid and (w3.word = UPPER('word22b') or w3.word = UPPER('word23a')))
        join ttwid__f_ext as w4 on (w4.parent__id = t.recid and (w4.word = UPPER('word22b') or w4.word = UPPER('word23b')))
))
)
select mcte3.col_0_0_
from mcte3
    left join wcte1 on (mcte3.col_recid_ = wcte1.recid)
order by coalesce(wcte1.w11, -1)  desc,coalesce(wcte1.w12, -1)  desc,coalesce(wcte1.w23, -1)  desc,coalesce(wcte1.w24, -1)  desc,mcte3.col_recid_

Sample re-written queries with UDF.

Consider a temporary table ttwi with the same structure as before. We cannot use word tables since the H2 support for temporary tables' triggers (which FWD uses for the _temp database) is broken. However, the re-writing is still needed to support extent fields and implicit ordering. Below we provide a sample re-written SQL statement with UDF.

for each ttwi where (ifext[2] contains "wordc* | word*") or (if1 contains "x* | wrdc & word*"):

with
wcte1 as (
    select sum(cast(contains_1(e.ifext,  UPPER('wordc*')) as int)) as ww1, sum(cast(contains_1(e.ifext,  UPPER('word*')) as int)) as ww2, recid
    from tt1 t
    join tt1__5 e on (e.parent__id = t.recid)
    where contains_1(e.ifext,  UPPER('wordc* | word*'))group by t.recid)
,
mcte3 as (
select 
    ttwi_1_1__0_.recid as id0_, ttwi_1_1__0_._multiplex as column1_0_, ttwi_1_1__0_._errorFlag as column2_0_, ttwi_1_1__0_._originRowid as column3_0_, ttwi_1_1__0_._errorString as column4_0_, ttwi_1_1__0_._peerRowid as column5_0_, ttwi_1_1__0_._rowState as column6_0_, ttwi_1_1__0_.if1 as if7_0_, ttwi_1_1__0_.recid as col_recid_ 
from
    tt1 ttwi_1_1__0_ 
where
    ttwi_1_1__0_._multiplex = ? and ((ttwi_1_1__0_.recid in (select recid from wcte1)) or contains_1(upper(if1), upper('x* | wrdc & word*')))
)
select mcte3.id0_, mcte3.column1_0_, mcte3.column2_0_, mcte3.column3_0_, mcte3.column4_0_, mcte3.column5_0_, mcte3.column6_0_, mcte3.if7_0_
from mcte3
    left join wcte1 on (mcte3.col_recid_ = wcte1.recid)
order by coalesce(wcte1.ww1, -1)  desc,coalesce(wcte1.ww2, -1)  desc,mcte3.col_recid_

for each ttwi where (if1 contains "x* | wrdc & word*") or (ifext[2] contains "wordc* | word*"):

with
wcte1 as (
    select sum(cast(contains_1(t.if1,  UPPER('x*')) as int)) as ww1, sum(cast(contains_1(t.if1,  UPPER('wrdc')) as int)) as ww2, sum(cast(contains_1(t.if1,  UPPER('word*')) as int)) as ww3, recid
    from tt1 t
    where contains_1(t.if1,  UPPER('x* | wrdc & word*'))group by t.recid)
,
mcte3 as (
select 
    ttwi_1_1__0_.recid as id0_, ttwi_1_1__0_._multiplex as column1_0_, ttwi_1_1__0_._errorFlag as column2_0_, ttwi_1_1__0_._originRowid as column3_0_, ttwi_1_1__0_._errorString as column4_0_, ttwi_1_1__0_._peerRowid as column5_0_, ttwi_1_1__0_._rowState as column6_0_, ttwi_1_1__0_.if1 as if7_0_, ttwi_1_1__0_.recid as col_recid_ 
from
    tt1 ttwi_1_1__0_ 
where
    ttwi_1_1__0_._multiplex = ? and ((ttwi_1_1__0_.recid in (select recid from wcte1)) or (recid in (select distinct parent__id  from tt1__5 where contains_1(upper(ifext), upper('wordc* | word*')))))
)
select mcte3.id0_, mcte3.column1_0_, mcte3.column2_0_, mcte3.column3_0_, mcte3.column4_0_, mcte3.column5_0_, mcte3.column6_0_, mcte3.if7_0_
from mcte3
    left join wcte1 on (mcte3.col_recid_ = wcte1.recid)
order by coalesce(wcte1.ww1, -1)  desc,coalesce(wcte1.ww2, -1)  desc,coalesce(wcte1.ww3, -1)  desc,mcte3.col_recid_