Feature #1587

Feature #1585: add conversion and runtime support for word indexes

implement full support for word indexes

Added by Eric Faulhaber over 11 years ago. Updated almost 2 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Igor Skornyakov
Start date:
Due date:
% Done:

100%

billable:
No
vendor_id:
GCD

fwd1.df (7.21 KB) Igor Skornyakov, 12/03/2020 07:30 AM

schema_table_fwd1_postgresql.sql (3.45 KB) Igor Skornyakov, 12/03/2020 07:30 AM

schema_trigger_fwd1_postgresql.sql (1.29 KB) Igor Skornyakov, 12/03/2020 07:30 AM

schema_index_fwd1_postgresql.sql (1.05 KB) Igor Skornyakov, 12/03/2020 07:30 AM

build_db.xml (9.43 KB) Igor Skornyakov, 12/24/2020 04:32 PM

build_db.xml (9.74 KB) Igor Skornyakov, 01/05/2021 08:30 AM


Related issues

Related to Database - Feature #4397: add database attrs, methods and options Closed
Related to Database - Bug #5085: uppercasing and string comparisons are incorrect for some character sets New
Follows Database - Feature #1586: add quick and dirty support for word indexes using LIKE operator Closed

History

#1 Updated by Eric Faulhaber over 11 years ago

  • Estimated time set to 104.00

A "true" implementation of word indexes will be necessary if (when) the quick and dirty implementation using LIKE is found to not perform well enough. The likely candidate for the base technology behind this feature is Hibernate Search.

We anticipate that the LIKE implementation will suffice for the dirty database. Not sure about temp tables in general, though.

#2 Updated by Eric Faulhaber over 11 years ago

  • Start date deleted (10/16/2012)
  • Due date deleted (10/16/2012)

#3 Updated by Greg Shah over 11 years ago

  • Target version set to Milestone 7

#4 Updated by Greg Shah over 11 years ago

  • Target version changed from Milestone 7 to Milestone 17

#5 Updated by Igor Skornyakov over 7 years ago

Here is the (expanded) summary of the recent e-mail discussion between Greg, Eric, Ovidiu and myself regarding the support of the CONTAINS operator in P2J.

  • The 4GL semantics of the CONTAINS operator is described in OpenEdge Development: Programming Interfaces. Section 1-41 : Database Access : The CONTAINS operator.
  • The CONTAINS operator semantics is based on the notion of a word index, which uses word-break tables describing how a sequence of characters is split into an (unordered) set of words. The word-break tables assign attributes to the characters of a particular code page. These attributes can be context-sensitive, such as BEFORE_LET_DIG, which means that a character is treated as part of a word only if it is followed by a character with the LETTER or DIGIT attribute. This makes using regular expressions, even if they are supported in some form by the underlying RDBMS, problematic if possible at all, regardless of the performance considerations.
  • At this moment we consider the following approaches:
    - (Java) UDF-based
    - Using a text search engine such as Apache Lucene (directly or via Hibernate Search)
    - Using custom word-break tables
  • The UDF-based approach requires minimal development effort but has serious performance problems. UDFs are opaque to the query analyser and require a full table scan. According to Eric:

There are 10 word indexes on 5 tables in <db_name>. The largest of these contains 13,220,238 rows. The indexed field is char_data, which is case-insensitive. The longest value in this column in the table is 118 characters, at least in this data set (there are many others for different end customers of our customer).

The next biggest table containing a word index has a relatively small 17,883 rows in this data set.

Whether we use LIKE with leading and trailing wildcards or a UDF, we are doing a sequential scan, so a UDF approach will only be slower by the actual speed of the function execution (plus any PL/Java overhead), multiplied by the number of rows in the table. This additional, per row overhead apparently is very significant.

So, my curiosity got the better of me, and I just did a quick test:

<db_name>_12_00=# explain analyze select char_data from my_field where upper(char_data) like '%COMPLETED%';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
Seq Scan on my_field (cost=0.00..364743.85 rows=1322 width=1) (actual time=60.583..2442.854 rows=157 loops=1)
Filter: (upper(char_data) ~~ '%COMPLETED%'::text)
Total runtime: 2442.916 ms
(3 rows)

<db_name>_12_00=# explain analyze select char_data from my_field where matches(upper(char_data), '*COMPLETED*', false);
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Seq Scan on my_field (cost=0.00..3636691.38 rows=4406663 width=1) (actual time=2102.047..100264.943 rows=157 loops=1)
Filter: matches(upper(char_data), '*COMPLETED*'::text, false)
Total runtime: 100265.069 ms
(3 rows)

I used the existing "matches" UDF for this test, as one that might be close in performance (actually, probably faster than a real regex UDF would be). In fact, it is functionally equivalent for this particular case (both find the same 157 rows). So, the LIKE expression is ~41x faster in this case, even though no index is used.

These results were after running each search several times, so this performance discrepancy was not due to a cold database. Based on this data point at least, a UDF approach is not looking very good at all. This is generally supported by our numerous experiences over time that it is always faster to use native database features in preference to PL/Java UDFs, where we can make the result functionally equivalent. In the past, I always thought it was because the query plan changed, but in this case, it seems to be due purely to the overhead of PL/Java and our UDF implementations.

Just as another data point, I tried invoking the trimws UDF in a table scan, as a potentially lighter-weight UDF implementation:

<db_name>_12_00=# explain analyze select char_data from my_field where char_data is not null and trimws(upper(char_data)) is not null;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Seq Scan on my_field  (cost=0.00..3636691.38 rows=13153890 width=1) (actual time=0.043..36450.914 rows=13220233 loops=1)
Filter: ((char_data IS NOT NULL) AND (trimws(upper(char_data)) IS NOT NULL))
Total runtime: 37460.664 ms
(3 rows)

It is significantly faster (note that the "char_data is not null" check doesn't cut out many rows, but was necessary to prevent an NPE; 13,220,233 still pass this filter), but still nowhere near the LIKE plan. This suggests that a large portion of the overhead is in our UDF implementation. We would need a do-nothing UDF which returns immediately to measure the PL/Java-specific overhead, but I anticipate it is still considerable.

  • The use of Lucene requires additional investigation. At least it requires an upgrade to Hibernate 5.x (if Hibernate Search is to be used). It is also unclear to me whether the 4GL word-break table semantics can be supported and (more importantly) whether the CONTAINS operator supported in this way can be mixed with other operators in a single SQL query.
  • The custom word-break table can be implemented pretty straightforwardly. It will contain the primary key columns of the original table and a column containing one of the words. CONTAINS will be mapped to a logical expression containing EQ and LIKE operators. The problem is with updating this table when the source table is updated. How well it will perform is not clear.

#6 Updated by Igor Skornyakov over 7 years ago

Igor Skornyakov wrote:

Here is the (expanded) summary of the recent e-mail discussion between Greg, Eric, Ovidiu and myself regarding the support of the CONTAINS operator in P2J.

  • The 4GL semantics of the CONTAINS operator is described in OpenEdge Development: Programming Interfaces. Section 1-41 : Database Access : The CONTAINS operator.
  • The CONTAINS operator semantics is based on the notion of a word index, which uses word-break tables describing how a sequence of characters is split into an (unordered) set of words. The word-break tables assign attributes to the characters of a particular code page. These attributes can be context-sensitive, such as BEFORE_LET_DIG, which means that a character is treated as part of a word only if it is followed by a character with the LETTER or DIGIT attribute. This makes using regular expressions, even if they are supported in some form by the underlying RDBMS, problematic if possible at all, regardless of the performance considerations.
  • At this moment we consider the following approaches:
    - (Java) UDF-based
    - Using a text search engine such as Apache Lucene (directly or via Hibernate Search)
    - Using custom word-break tables
  • The UDF-based approach requires minimal development effort but has serious performance problems. UDFs are opaque to the query analyser and require a full table scan. According to Eric:

There are 10 word indexes on 5 tables in <db_name>. The largest of these contains 13,220,238 rows. The indexed field is char_data, which is case-insensitive. The longest value in this column in the table is 118 characters, at least in this data set (there are many others for different end customers of our customer).

The next biggest table containing a word index has a relatively small 17,883 rows in this data set.

Whether we use LIKE with leading and trailing wildcards or a UDF, we are doing a sequential scan, so a UDF approach will only be slower by the actual speed of the function execution (plus any PL/Java overhead), multiplied by the number of rows in the table. This additional, per row overhead apparently is very significant.

So, my curiosity got the better of me, and I just did a quick test:

<db_name>_12_00=# explain analyze select char_data from my_field where upper(char_data) like '%COMPLETED%';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
Seq Scan on my_field (cost=0.00..364743.85 rows=1322 width=1) (actual time=60.583..2442.854 rows=157 loops=1)
Filter: (upper(char_data) ~~ '%COMPLETED%'::text)
Total runtime: 2442.916 ms
(3 rows)

<db_name>_12_00=# explain analyze select char_data from my_field where matches(upper(char_data), '*COMPLETED*', false);
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Seq Scan on my_field (cost=0.00..3636691.38 rows=4406663 width=1) (actual time=2102.047..100264.943 rows=157 loops=1)
Filter: matches(upper(char_data), '*COMPLETED*'::text, false)
Total runtime: 100265.069 ms
(3 rows)

I used the existing "matches" UDF for this test, as one that might be close in performance (actually, probably faster than a real regex UDF would be). In fact, it is functionally equivalent for this particular case (both find the same 157 rows). So, the LIKE expression is ~41x faster in this case, even though no index is used.

These results were after running each search several times, so this performance discrepancy was not due to a cold database. Based on this data point at least, a UDF approach is not looking very good at all. This is generally supported by numerous of our experiences over time that it is always faster to use native database features in preference to PL/Java UDFs, where we can make the result functionally equivalent. In the past, I always thought it was because the query plan changed, but in this case, it seems to be due purely to the overhead of PL/Java and our UDF implementations.

Just as another data point, I tried invoking the trimws UDF in a table scan, as a potentially lighter-weight UDF implementation:

[...]
It is significantly faster (note that the "char_data is not null" check doesn't cut out many rows, but was necessary to prevent an NPE; 13,220,233 still pass this filter), but still nowhere near the LIKE plan. This suggests that a large portion of the overhead is in our UDF implementation. We would need a do-nothing UDF which returns immediately to measure the PL/Java-specific overhead, but I anticipate it is still considerable.

  • The use of Lucene requires additional investigation. At least it requires an upgrade to Hibernate 5.x (if Hibernate Search is to be used). It is also unclear to me whether the 4GL word-break table semantics can be supported and (more importantly) whether the CONTAINS operator supported in this way can be mixed with other operators in a single SQL query.
  • The custom word-break table can be implemented pretty straightforwardly. It will contain the primary key columns of the original table and a column containing one of the words. CONTAINS will be mapped to a logical expression containing EQ and LIKE operators. The problem is with updating this table when the source table is updated. How well it will perform is not clear.

Later edit: GES removed the database name in this entry.

#7 Updated by Igor Skornyakov over 7 years ago

Regarding the queries that use a word-break table: in the case of a complex expression in the CONTAINS, it can be converted to disjunctive normal form (DNF) and several queries (one for each disjunctive term) can be submitted in parallel. Individual queries map perfectly onto inner JOIN syntax. Of course, such a transformation can be performed by a DB query analyser, but not necessarily.
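
As a hypothetical illustration (my own example, not from the original discussion): f CONTAINS '(foo | bar) & baz*' has the DNF (foo & baz*) | (bar & baz*), so two JOIN-based queries (one per disjunctive term) would be built and their result sets combined.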

#8 Updated by Eric Faulhaber over 7 years ago

Of course, we want as few round trips to the database as possible, ideally only one. Please provide an example of what queries you would intend to execute for a sample, complex expression.

#9 Updated by Igor Skornyakov over 7 years ago

Eric Faulhaber wrote:

Of course, we want as few round trips to the database as possible, ideally only one. Please provide an example of what queries you would intend to execute for a sample, complex expression.

For a single disjunctive term:
Let t be an original table with primary key pk, f a multi-word field, wb a word-break table, and w its word field.
Then the query SELECT <fields> FROM t WHERE P AND f CONTAINS '(foo & bar*)' (P is a predicate w/o CONTAINS) can be converted to

SELECT <fields> FROM t 
   JOIN wb AS wb1 ON (wb1.pk = t.pk AND wb1.w = 'foo')
   JOIN wb AS wb2 ON (wb2.pk = t.pk AND wb2.w LIKE 'bar%')
WHERE P

For multiple disjunctive terms a UNION ALL can be used or several queries can be submitted in parallel.

#10 Updated by Eric Faulhaber over 7 years ago

I like the directness of this approach, and in fact I have been trying to move as much HQL processing as possible to SQL for performance reasons. However, this may be somewhat problematic, as all legacy (i.e., 4GL) queries are converted to HQL, not SQL. This type of join is not supported by HQL directly, unless we represent the word-break table(s) as first-class data model objects and add ORM configuration for them, expressing that they are associated with the primary table containing the word-indexed field (t.f in your example). Furthermore, UNION ALL is not a supported construct in HQL.

While we have APIs which execute SQL directly, these are not used for application-level queries currently. We would need some special facility to do this, which does not exist today. This would have to be integrated to work with the various query types, in order to maintain the Progress compatible behavior for these legacy queries.

#11 Updated by Igor Skornyakov over 7 years ago

The use of UNION ALL can be avoided by using CNF (Conjunctive Normal Form). For example,
SELECT <fields> FROM t WHERE P AND f CONTAINS '(one | two) & (three | four*)' (see note 9) can be converted to

SELECT <fields> FROM t 
   JOIN wb AS wb1 ON (wb1.pk = t.pk AND (wb1.w = 'one' OR wb1.w = 'two'))
   JOIN wb AS wb2 ON (wb2.pk = t.pk AND (wb2.w = 'three' OR wb2.w LIKE 'four%'))
WHERE P

I would prefer to use the approach based on the non-recursive WITH clause, but I'm not sure that it is supported by Hibernate:

WITH keys AS (
   SELECT pk FROM t 
      JOIN wb AS wb1 ON (wb1.pk = t.pk AND (wb1.w = 'one' OR wb1.w = 'two'))
      JOIN wb AS wb2 ON (wb2.pk = t.pk AND (wb2.w = 'three' OR wb2.w LIKE 'four%'))
)
SELECT <fields> FROM t JOIN keys ON (t.pk = keys.pk)
WHERE P

or if pk is a single field
WITH keys AS (
   SELECT pk FROM t 
      JOIN wb AS wb1 ON (wb1.pk = t.pk AND (wb1.w = 'one' OR wb1.w = 'two'))
      JOIN wb AS wb2 ON (wb2.pk = t.pk AND (wb2.w = 'three' OR wb2.w LIKE 'four%'))
)
SELECT <fields> FROM t WHERE P and t.pk IN (SELECT pk FROM keys)

The last form requires minimal modifications of the original query. We just add a WITH clause and replace t.f CONTAINS p with t.pk IN (SELECT pk FROM keys)

#12 Updated by Eric Faulhaber over 7 years ago

You have come up with some good potential solutions to deal with the CONTAINS syntax as SQL, but the impedance mismatch with HQL remains. I don't know of any support for the WITH keyword, as HQL takes a very object-centric approach. Please see the documentation on supported HQL syntax: http://docs.jboss.org/hibernate/orm/4.1/manual/en-US/html/ch16.html.

As I see it, the primary problem remains integrating with the non-CONTAINS portions of the query and the overall integration of your SQL-based solutions into the existing legacy query infrastructure, which is completely HQL-based. We've dealt with converting arbitrarily complex 4GL where clause expressions into HQL, and integrating these converted where clauses into full HQL query statements. Hibernate then converts these HQL statements into the appropriate, dialect-specific SQL.

Perhaps it makes sense to take a quick look at Hibernate Search to see how/if it deals with these issues, whether it solves any problems for us or creates any new ones. This will help inform how aggressively we try to address these mismatches.

#13 Updated by Igor Skornyakov over 7 years ago

Eric Faulhaber wrote:

You have come up with some good potential solutions to deal with the CONTAINS syntax as SQL, but the impedance mismatch with HQL remains. I don't know of any support for the WITH keyword, as HQL takes a very object-centric approach. Please see the documentation on supported HQL syntax: http://docs.jboss.org/hibernate/orm/4.1/manual/en-US/html/ch16.html.

As I see it, the primary problem remains integrating with the non-CONTAINS portions of the query and the overall integration of your SQL-based solutions into the existing legacy query infrastructure, which is completely HQL-based. We've dealt with converting arbitrarily complex 4GL where clause expressions into HQL, and integrating these converted where clauses into full HQL query statements. Hibernate then converts these HQL statements into the appropriate, dialect-specific SQL.

Perhaps it makes sense to take a quick look at Hibernate Search to see how/if it deals with these issues, whether it solves any problems for us or creates any new ones. This will help inform how aggressively we try to address these mismatches.

OK. I will take a closer look at HQL and Hibernate search at the weekend.

#14 Updated by Eric Faulhaber over 7 years ago

We've already made several changes to the HQL->SQL generator in Hibernate. I could envision placing a marker keyword or function of some kind in the HQL containing the minimal necessary information for the CONTAINS-specific syntax, then expanding this into the word-break-specific portions of the SQL you expressed above (basically the WITH keys AS( ... ) portion), and injecting that into the SQL which normally would be generated. This would let us leverage the HQL generation logic we've already got for everything else, while taking advantage of your new ideas.

But let's still explore Hibernate Search to see what it offers, so we better understand all our options.

#15 Updated by Eric Faulhaber over 7 years ago

Some other things to consider:
  • How and when does Progress update its word break tables? We need to match this behavior, so we're not surprised by query results.
  • What are the implications for dirty database? Progress does not use an MVCC model, so when "regular" (i.e., non-word) indexes are updated in an uncommitted Progress transaction, this information leaks to other sessions, such that the order of records visited walking that index in another session can change. Is there any such behavior associated with word indexes?

I expect Hibernate Search, even though it deals with the core issue of full text search, will not match our legacy requirements in these areas.

#16 Updated by Igor Skornyakov over 7 years ago

Eric Faulhaber wrote:

  • How and when does Progress update its word break tables? We need to match this behavior, so we're not surprised by query results.

According to http://knowledgebase.progress.com/articles/Article/P68611

Every time a record is added to a word-indexed field, the OpenEdge examines the contents of the field, breaks it down into individual words, and, for each individual word, creates or modifies a word index entry. To break down the contents of a field into individual words, OpenEdge must know which characters act as word delimiters. To get this information, OpenEdge consults the database’s word-break table, which lists characters and describes the word-delimiting properties of each.

I have not yet found an explicit description of what happens when a record containing a word-indexed field is deleted or updated.

#17 Updated by Igor Skornyakov over 7 years ago

Regarding HQL compliance: as a hack, it is possible to create auxiliary views for n = 1, ..., N which are the "fibered Cartesian power" of wb over t:

CREATE VIEW wbn AS
SELECT t.pk, wb1.w AS w1, wb2.w AS w2, ..., wbn.w AS wn
FROM t
   JOIN wb AS wb1 ON (wb1.pk = t.pk)
   JOIN wb AS wb2 ON (wb2.pk = t.pk)
   .......
   JOIN wb AS wbn ON (wbn.pk = t.pk)

and use a corresponding wbk view when a CNF of the CONTAINS expression has k conjunctive terms.
For example (see note 11):
SELECT <fields> FROM t WHERE P and t.pk IN (SELECT pk FROM wb2 WHERE (wb2.w1 = 'one' OR wb2.w1 = 'two') AND (wb2.w2 = 'three' OR wb2.w2 LIKE 'four%'))

I think that a dozen such views will be sufficient for all real-life queries.

#18 Updated by Eric Faulhaber over 7 years ago

Igor Skornyakov wrote:

Eric Faulhaber wrote:

  • How and when does Progress update its word break tables? We need to match this behavior, so we're not surprised by query results.

According to http://knowledgebase.progress.com/articles/Article/P68611
[...]

I have not yet found an explicit description of what happens when a record containing a word-indexed field is deleted or updated.

The documentation is clearly a good place to start, but ultimately, we need to confirm any documentation with test cases. We've found on many occasions that the primary Progress documentation is lacking in detail or simply wrong. OTOH, it seems the knowledgebase articles like this tend to be much better.

#19 Updated by Igor Skornyakov over 7 years ago

Eric Faulhaber wrote:

The documentation is clearly a good place to start, but ultimately, we need to confirm any documentation with test cases. We've found on many occasions that the primary Progress documentation is lacking in detail or simply wrong. OTOH, it seems the knowledgebase articles like this tend to be much better.

I understand this. A recent example is the description of the MATCHES and BEGINS semantics with respect to the case-sensitivity of arguments - it is correct for BEGINS but wrong for MATCHES.

#20 Updated by Igor Skornyakov over 7 years ago

Eric Faulhaber wrote:

  • What are the implications for dirty database? Progress does not use an MVCC model, so when "regular" (i.e., non-word) indexes are updated in an uncommitted Progress transaction, this information leaks to other sessions, such that the order of records visited walking that index in another session can change.

Can application logic use such dirty reads in any reliable and consistent way? I can hardly imagine how it could be done. In other words, does Progress provide any usable guarantees regarding transactions in place of ACID? If it does not, I do not understand why it is necessary to mimic such behavior.

#21 Updated by Eric Faulhaber over 7 years ago

Igor Skornyakov wrote:

Can an application logic use such dirty reads in any reliable and consistent way? I can hardly imagine how it can be done.

Yes. We thought the very same thing when we first discovered this behavior, which we considered more a bug than a feature: surely no one would ever rely upon it for any kind of intentional application behavior.

Well, we underestimated how creative Progress developers could be in the face of a requirement which has no backing support in the 4GL environment. The reason we had to implement support for this in the first place was a feature in the first application we converted: before sequences were available, a developer found a way to use this quirk to mimic a sequence across sessions using two buffers on the same table, an index, a loop, and record locking.

In other words - does Progress provide any usable guarantees regarding transactions instead of ACID? If it does not I do not understand why it is necessary to mimic such a behavior.

I don't know of any formal guarantee regarding this behavior. We have never found it documented anywhere. However, it does seem to be consistent and we have found dependencies upon it. I recently had to fix a deadlock issue due to some record sorting that relied in part on this behavior. So, yes, unfortunately, it is necessary to mimic this behavior.

However, I don't know whether word indexes behave the same way. They seem to be a different animal, but we need to know for sure.

#22 Updated by Igor Skornyakov over 7 years ago

Eric Faulhaber wrote:

Igor Skornyakov wrote:

Can an application logic use such dirty reads in any reliable and consistent way? I can hardly imagine how it can be done.

Yes. We thought the very same thing when we first discovered this behavior, which we considered more a bug than a feature: surely no one would ever rely upon it for any kind of intentional application behavior.

Well, we underestimated how creative Progress developers could be in the face of a requirement which has no backing support in the 4GL environment. The reason we had to implement support for this in the first place was a feature in the first application we converted: before sequences were available, a developer found a way to use this quirk to mimic a sequence across sessions using two buffers on the same table, an index, a loop, and record locking.

In other words - does Progress provide any usable guarantees regarding transactions instead of ACID? If it does not I do not understand why it is necessary to mimic such a behavior.

I don't know of any formal guarantee regarding this behavior. We have never found it documented anywhere. However, it does seem to be consistent and we have found dependencies upon it. I recently had to fix a deadlock issue due to some record sorting that relied in part on this behavior. So, yes, unfortunately, it is necessary to mimic this behavior.

However, I don't know whether word indexes behave the same way. They seem to be a different animal, but we need to know for sure.

Wow! I see. Thank you. I will check.

#23 Updated by Igor Skornyakov over 7 years ago

As far as I can see, Hibernate Search is just a wrapper on top of Apache Lucene. It keeps its indexes in separate storage, so full-text search requests cannot be mixed (at least not in any efficient way) with normal SQL queries.

#24 Updated by Eric Faulhaber over 7 years ago

OK, then it looks like we will take the approach in note 14.

#25 Updated by Igor Skornyakov over 7 years ago

Eric Faulhaber wrote:

OK, then it looks like we will take the approach in note 14.

Well, the only remaining question is how to update the index when the master table is updated. For PostgreSQL and Oracle this can be done with a Java UDF that splits the text field into a set of words and a trigger which uses it. I worked with MS SQL Server too long ago to suggest a solution for it.

#26 Updated by Eric Faulhaber over 7 years ago

Before we decide on an implementation for the word index updates (and deletions), we need to understand the way Progress does this. I didn't see any further details on this after note 19. Our implementation cannot differ in the timing or behavior.

#27 Updated by Igor Skornyakov over 7 years ago

Eric Faulhaber wrote:

Before we decide on an implementation for the word index updates (and deletions), we need to understand the way Progress does this. I didn't see any further details on this after note 19. Our implementation cannot differ in the timing or behavior.

I understand this. I will start a detailed investigation after the currently assigned task is finished, as you wrote.

#28 Updated by Igor Skornyakov over 7 years ago

With 3197b rev. 11126 the issue with the case-sensitivity of the MATCHES/BEGINS arguments is almost solved. The problem still exists with left-side expressions. For example, 'a' + cs matches '*atest' is converted to concat('A', upper(tt.cs)) like '%atest'.
There are also issues with leading/trailing whitespace.

#29 Updated by Igor Skornyakov over 7 years ago

In fact, the problem with case-sensitive expressions in the WHERE clause is more general. For example, for each tt no-lock where "a" + tt.ci = exps, where exps is a case-sensitive variable, is converted to

         forEach("loopLabel0", new Block((Init) () -> 
         {
            query0.initialize(tt, "concat('A', upper(tt.ci)) = ?", null, "tt.id asc", new Object[]
            {
               toUpperCase(exps)
            }, LockType.NONE);
         }, 

which may provide extra records.

#30 Updated by Ovidiu Maxiniuc over 7 years ago

Igor Skornyakov wrote:

In fact the problem with case-sensitive expressions in the WHERE clause is more general. For example for each tt no-lock where "a" + tt.ci = exps where exps is case-sensitive variable is converted to
[...]
which may provide extra records.

Indeed, this seems to be a conversion-time flaw. The string literal "a" should not have been uppercased. The TextOps.concat() method is case-sensitive aware. I guess you will want to check the literals.rules (prog.string nodes and caseInsensitive annotation) for fixing it.

#31 Updated by Igor Skornyakov over 7 years ago

Ovidiu Maxiniuc wrote:

Indeed, this seems to be a conversion-time flaw. The string literal "a" should not have been uppercased. The TextOps.concat() method is case-sensitive aware. I guess you will want to check the literals.rules (prog.string nodes and caseInsensitive annotation) for fixing it.

The problem is not only with the literal. In fact, I've put notes 28 and 29 in the wrong task. Copied to #3187

#32 Updated by Greg Shah over 7 years ago

  • Target version changed from Milestone 17 to Performance and Scalability Improvements

#33 Updated by Eric Faulhaber over 5 years ago

As an intermediate measure to produce more correct query results short of the full complexity of the word index implementation, I have implemented a contains UDF to run within the database. This uses com.goldencode.p2j.persist.pl.Contains to parse the CONTAINS match expression syntax and evaluate each record to determine whether it matches the given expression. This solution does not implement word indexes, nor does it currently properly honor word break rules. The first pass implementation delimits words in the target field by the space character only.

The words in the target field are delimited each time the contains UDF is invoked. As a UDF, the contains function is opaque to the database query planner. However, since the previous implementation using the SQL LIKE operator required a leading wildcard in its match expression, the query planner could not use it to select an index anyway (though executing LIKE is considerably faster than invoking a Java UDF). So, this is not an efficient implementation for large tables or large data fields. As such, it does not replace the ultimate need for proper word index support; it is only suitable for lightweight use.

All instances of the CONTAINS operator are now converted to use the new contains UDF. The LIKE implementation, while it often produced the same results as the original 4GL, was flawed. Thus, it is no longer used.
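
To make the mechanics above concrete, here is a minimal, hypothetical sketch of the kind of per-row evaluation such a UDF performs. It delimits words by spaces only, understands only un-parenthesized & / | combinations with trailing * wildcards, and is an illustration only, not the actual com.goldencode.p2j.persist.pl.Contains implementation (class and method names here are invented):

import java.util.Arrays;
import java.util.Locale;

/** Hypothetical illustration of per-row CONTAINS evaluation; not the actual FWD UDF. */
public final class ContainsSketch
{
   /** Returns true if the space-delimited words of text satisfy the match expression. */
   public static boolean contains(String text, String expr)
   {
      String[] words = text.toUpperCase(Locale.ROOT).trim().split("\\s+");

      // Treat the expression as a disjunction of conjunctions, e.g. "acct* & overdue | closed".
      for (String orTerm : expr.toUpperCase(Locale.ROOT).split("\\|"))
      {
         boolean allMatch = true;
         for (String andTerm : orTerm.split("&"))
         {
            if (!matchesAnyWord(words, andTerm.trim()))
            {
               allMatch = false;
               break;
            }
         }
         if (allMatch)
         {
            return true;
         }
      }
      return false;
   }

   /** A term with a trailing '*' matches any word starting with that prefix. */
   private static boolean matchesAnyWord(String[] words, String term)
   {
      boolean prefix = term.endsWith("*");
      String base = prefix ? term.substring(0, term.length() - 1) : term;
      return Arrays.stream(words)
                   .anyMatch(w -> prefix ? w.startsWith(base) : w.equals(base));
   }
}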

#34 Updated by Eric Faulhaber about 5 years ago

It should be noted that error handling for the UDF version of CONTAINS support is not quite right. Any error in the search pattern syntax that raises an error condition causes an ErrorConditionException to be raised. Since the UDF operates inside the database server, normal error handling which would catch this exception and deal with it in the FWD server does not occur. Instead, this exception is wrapped in a SQL exception by its Java container (e.g., PL/Java) and is handled by the database server as an internal error, reported back through JDBC in a vendor-specific way. We would need to unwrap the JDBC exception on the FWD server, again in a database vendor-specific way, and handle it appropriately. This infrastructure is still missing.

#36 Updated by Greg Shah over 4 years ago

The following are notes about "word break tables" in the 4GL. The tables are a simple mapping between a character and a code (a.k.a. a word delimiter attribute) which specifies some behavior in the process of breaking arbitrary text values into a list of words.

  • BEFORE_DIGIT: If a character marked with this code occurs before a character marked as a DIGIT, then it is part of a word. Otherwise it is treated as a delimiter. Default for English "Locale": - , .
  • BEFORE_LET_DIG: If a character marked with this code occurs before a character marked as a LETTER or as a DIGIT, then it is part of a word. Otherwise it is treated as a delimiter. Default for English "Locale": n/a
  • BEFORE_LETTER: If a character marked with this code occurs before a character marked as a LETTER, then it is part of a word. Otherwise it is treated as a delimiter. Default for English "Locale": n/a
  • DIGIT: Characters marked with this code are considered digits, which are always part of a word and never treated as delimiters. Default for English "Locale": 0 - 9
  • IGNORE: Characters marked with this code are removed from the text before words are delimited. Presumably, this also occurs in the comparison text. We need to check this. Default for English "Locale": '
  • LETTER: Characters marked with this code are considered letters, which are always part of a word and never treated as delimiters. Default for English "Locale": A - Z, a - z
  • TERMINATOR: If a character is not specified as one of the other codes, then it is treated as this code, which means it is a delimiter. Constantin has described this as 'if the character can't be part of a word, then it must be a delimiter'. This suggests that for English, delimiters would include all control characters, most punctuation characters including asterisk, plus, equals, ampersand, caret, colon, semi-colon, double quote, less than, greater than, slash, backslash, pipe, tilde, backtick, parenthesis, curly and square brackets and all extended ASCII chars. We need to check this. Default for English "Locale": n/a
  • USE_IT: Characters marked with this code are considered part of a word and never treated as delimiters. Default for English "Locale": _ @ # % $

My intention is to implement the equivalent for specific locales, but to embed the rules inside our code. We might want to implement custom tables by subclassing these base classes, and overriding the parent class methods as needed. This approach has the advantage that the subclasses can be built into a jar (maybe even the p2jpl.jar) that is accessible from PL/Java.

Constantin has also suggested that the configuration for these could be passed as a JVM option in PL/Java.
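
As a rough illustration of this embedded-rules idea (class and method names are hypothetical, not actual FWD classes; the classification is a simplified version of the English defaults from the table above, and the lookahead does not skip IGNORE characters as a real implementation might):

import java.util.ArrayList;
import java.util.List;

/** Hypothetical base class holding word-break rules; locale-specific tables would subclass it. */
public class WordBreakTable
{
   /** Word-delimiter attributes from the 4GL word-break table. */
   public enum Attr { LETTER, DIGIT, USE_IT, BEFORE_LETTER, BEFORE_DIGIT, BEFORE_LET_DIG, IGNORE, TERMINATOR }

   /** English-locale defaults; a subclass would override this for other locales/code pages. */
   public Attr classify(char c)
   {
      if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')) return Attr.LETTER;
      if (c >= '0' && c <= '9')                             return Attr.DIGIT;
      if ("_@#%$".indexOf(c) >= 0)                          return Attr.USE_IT;
      if ("-,.".indexOf(c) >= 0)                            return Attr.BEFORE_DIGIT;
      if (c == '\'')                                        return Attr.IGNORE;
      return Attr.TERMINATOR;
   }

   /** Splits text into words, honoring the context-sensitive BEFORE_* attributes. */
   public List<String> split(String text)
   {
      List<String> words = new ArrayList<>();
      StringBuilder cur = new StringBuilder();
      for (int i = 0; i < text.length(); i++)
      {
         char c = text.charAt(i);
         Attr a = classify(c);
         Attr next = (i + 1 < text.length()) ? classify(text.charAt(i + 1)) : Attr.TERMINATOR;
         boolean partOfWord;
         switch (a)
         {
            case LETTER:
            case DIGIT:
            case USE_IT:         partOfWord = true;                                        break;
            case BEFORE_LETTER:  partOfWord = (next == Attr.LETTER);                       break;
            case BEFORE_DIGIT:   partOfWord = (next == Attr.DIGIT);                        break;
            case BEFORE_LET_DIG: partOfWord = (next == Attr.LETTER || next == Attr.DIGIT); break;
            case IGNORE:         continue;   // removed from the text before words are delimited
            default:             partOfWord = false;                                       // TERMINATOR
         }
         if (partOfWord)
         {
            cur.append(c);
         }
         else if (cur.length() > 0)
         {
            words.add(cur.toString());
            cur.setLength(0);
         }
      }
      if (cur.length() > 0)
      {
         words.add(cur.toString());
      }
      return words;
   }
}

Such a class, packaged in a jar accessible from PL/Java (e.g. p2jpl.jar, as mentioned above), could be shared by the import code and by any UDF or trigger that needs to break field values into words.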

#37 Updated by Greg Shah over 4 years ago

Some useful references:

  • ElasticSearch - this is the most popular full text distributed search engine, implemented using Lucene and Java; it is used by Amazon for its ElasticSearch offering; please note that the licensing is not all open-source based, some modules have proprietary licenses so this may not be a real option.
  • Apache Solr - this is the second-most popular full text search, also implemented using Lucene and Java (Solr is actually part of the same Apache project as Lucene)
  • Database Specific - these have the supreme advantage of being fully integrated into SQL queries, which is probably a critical feature for our implementation (or at least a feature that will greatly simplify the implementation).

#38 Updated by Igor Skornyakov over 3 years ago

Created task branch 1587a.

#39 Updated by Ovidiu Maxiniuc over 3 years ago

The TEMP-TABLE:ADD-NEW-INDEX method depends on this, as noted in #4397. When this task is finished, the word indexes must be tested with dynamic temp-tables, too, before updating the gaps.

#40 Updated by Ovidiu Maxiniuc over 3 years ago

  • Related to Feature #4397: add database attrs, methods and options added

#41 Updated by Igor Skornyakov over 3 years ago

Greg Shah wrote:

Some useful references:

  • ElasticSearch - this is the most popular full text distributed search engine, implemented using Lucene and Java; it is used by Amazon for its ElasticSearch offering; please note that the licensing is not all open-source based, some modules have proprietary licenses so this may not be a real option.
  • Apache Solr - this is the second-most popular full text search, also implemented using Lucene and Java (Solr is actually part of the same Apache project as Lucene)
  • Database Specific - these have the supreme advantage of being fully integrated into SQL queries, which is probably a critical feature for our implementation (or at least a feature that will greatly simplify the implementation).
  • As mentioned before, search engines (such as Solr or Lucene) have the huge disadvantage that they cannot (at least not easily) be integrated with SQL queries.
  • I have no experience with full-text search support in SQL databases, but after a quick look at the documentation, I've got the impression that this functionality is much more high-level than word indexes in OE. I expect that it will be very tricky (if possible at all) to achieve compatibility. Please also note that full-text support is 100% proprietary and looks to be based on very different ideas in different databases. This means that for each database, we will have different corner cases, and I'm afraid that we'll drown in these details.
  • With the above in mind, I again suggest the purely "hand-made" approach described in #1587-11 and #1587-17. Since we no longer depend on Hibernate, implementing this should be more comfortable. With this approach (especially in the version based on auxiliary VIEWs), the database-specific part will be minimal (at this moment, I do not see any), and everything will be under our control.

#42 Updated by Greg Shah over 3 years ago

I have no experience with full-text search support in SQL databases, but after a quick look at the documentation, I've got the impression that this functionality is much more high-level than word indexes in OE. I expect that it will be very tricky (if possible at all) to achieve compatibility.

Please make a list of the features which you see as difficult.

If it wasn't clear in #1587-37, my intuition was that the database-specific full text search would be the best fit, since we could integrate everything right into the rest of the SQL query. We definitely care about compatibility. But it must also be very fast. The entire reason customers use this is that it was faster in OE than normal queries. I would similarly expect that this is the same reason that PostgreSQL and other databases implement direct support for full text search. Eric will have to comment on your thoughts in #1587-11 and #1587-17, but my "gut reaction" is that if it were easy to achieve the proper performance using this approach, then full text search wouldn't have needed to be built into the databases.

#43 Updated by Greg Shah over 3 years ago

The specific solution does not matter to me so long as it is compatible and fast. Anything (Igor's idea or the built-in database support) that does not require a 3rd-party extra technology/server is also better.

#44 Updated by Igor Skornyakov over 3 years ago

Greg Shah wrote:

I have no experience with full-text search support in SQL databases, but after a quick look at the documentation, I've got the impression that this functionality is much more high-level than word indexes in OE. I expect that it will be very tricky (if possible at all) to achieve compatibility.

Please make a list of the features which you see as difficult.

As I wrote before, I have no experience with full-text search support in RDBMSs, but the first thing that looks very questionable (regarding PostgreSQL):
  • I understand that the standard way of using full-text search in PostgreSQL is built around the notion of the tsvector datatype and the to_tsvector and to_tsquery functions. They have their own non-trivial semantics, based in particular on the notion of a word stem. This is advanced logic, but for our needs we have to make it much more "primitive", and at this moment I do not see an obvious way to do it. 4GL uses the much more low-level notion of word-break tables, which have no obvious counterpart in PostgreSQL. MS SQL full-text search looks even more advanced.

If it wasn't clear in #1587-37, my intuition was that the database-specific full text search would be the best fit since we could integrate everything right into the rest of the SQL query. We definitely care about compatibility. But it must also be very fast. The entire reason customers use this is because it was faster in OE than normal queries. I would similarly expect that this is the same reason that PostgreSQL and other databases implement direct support for full text search. Eric will have to comment on your thoughts in #1587-11 and #1587-17, but my "gut reaction" is that if it was easy to achieve the proper performance using this approach then full text search wouldn't be needed to be built into the databases.

What I suggest should be fast, as it uses standard indexes on auxiliary "word tables" with a very simple structure. As I wrote, the full-text search implemented in popular engines and databases is a much more advanced thing than the 4GL CONTAINS operator. Using it to implement such logically simple functionality looks a little bit like what in Russian is called "shooting a cannon at sparrows". And I would like to emphasize again that this support is very different in different databases, which means that we'll have to implement the CONTAINS support almost from scratch for every supported database.

#45 Updated by Greg Shah over 3 years ago

I like this Russian saying "shooting a cannon at sparrows".

As of this moment, we don't need to implement our own word break tables. We can avoid this feature for now but we must consider how we would support it. Can you explain how your extra word tables would get populated? I presume we must maintain these manually. That means every edit of the "field" turns into some sort of operation like:

1. break the new text into words
2. remove all old + insert new OR calculate differences and only remove/update/insert as needed OR something else?

Do I understand this correctly?

#46 Updated by Igor Skornyakov over 3 years ago

Greg Shah wrote:

As of this moment, we don't need to implement our own word break tables. We can avoid this feature for now but we must consider how we would support it. Can you explain how your extra word tables would get populated? I presume we must maintain these manually. That means every edit of the "field" turns into some sort of operation like:

1. break the new text into words
2. remove all old + insert new OR calculate differences and only remove/update/insert as needed OR something else?

Do I understand this correctly?

Yes, this is correct. I think that it will be more efficient to drop the old entries and add the new ones in a batch insert. This can be done in the scope of a single transaction. I also think that it can be done in a database AFTER trigger. I hope that fields with word indexes are not updated too frequently in real 4GL applications, as such an update is not free (even in 4GL) anyway.
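
A minimal sketch of this drop-and-batch-insert refresh using plain JDBC; the table and column names (customer__name_words, parent__id, word) are purely illustrative, and whether the logic runs inside a PL/Java AFTER trigger or on the FWD server side, the shape would be roughly the same:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

/** Hypothetical refresh of an auxiliary word table after its master row changes. */
public final class WordTableRefresh
{
   /** Replaces all word rows for the given primary key in a single transaction. */
   public static void refresh(Connection con, long pk, List<String> words)
   throws SQLException
   {
      boolean oldAutoCommit = con.getAutoCommit();
      con.setAutoCommit(false);
      try (PreparedStatement del = con.prepareStatement(
              "delete from customer__name_words where parent__id = ?");
           PreparedStatement ins = con.prepareStatement(
              "insert into customer__name_words (parent__id, word) values (?, ?)"))
      {
         // drop the old entries for this record ...
         del.setLong(1, pk);
         del.executeUpdate();

         // ... and add the new ones in one batch insert
         for (String w : words)
         {
            ins.setLong(1, pk);
            ins.setString(2, w);
            ins.addBatch();
         }
         ins.executeBatch();
         con.commit();
      }
      catch (SQLException e)
      {
         con.rollback();
         throw e;
      }
      finally
      {
         con.setAutoCommit(oldAutoCommit);
      }
   }
}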

#47 Updated by Eric Faulhaber over 3 years ago

Greg Shah wrote:

The specific solution does not matter to me so long as it is compatible and fast. Anything (Igor's idea or the built-in database support) that does not require a 3rd party extra technology/server, is also better.

I completely agree.

As to whether to go with the 3 (or more) database-specific implementations, which must be made to work seamlessly under an abstraction layer in FWD, I will note that all previous attempts to integrate existing technologies which are almost, but not quite exactly, what we need have gone badly for us in the long run. So, like Igor, I am leaning toward a custom approach.

That being said, before we go this route, I'd like to know if we are ignoring something which might actually be useful, without proper vetting. Igor, please elaborate on what appears to be difficult about using the database-specific solutions. This does not have to be an exhaustive study, just enough to know whether we should consider these technologies further.

#48 Updated by Igor Skornyakov over 3 years ago

Eric Faulhaber wrote:

That being said, before we go this route, I'd like to know if we are ignoring something which might actually be useful, without proper vetting. Igor, please elaborate on what appears to be difficult about using the database-specific solutions. This does not have to be an exhaustive study, just enough to know whether we should consider these technologies further.

OK, but to provide a more detailed analysis I will need some time to learn more about full-text support, at least for PostgreSQL. As I've already mentioned, I've never worked with it before.
What I understand now (if my understanding is correct) is that full-text search in the databases is oriented toward searching "real" human-readable texts (like, e.g., Web search). So it is at least "too smart" for CONTAINS support (it considers possible misspellings, word similarity, some "common" words are ignored, etc.). Maybe it is possible to suppress this "cleverness", but even in this case I think that the overhead will be significant. "Shooting a cannon at sparrows" is not only pointless, it is also expensive.
But my main objection is still that these are highly proprietary solutions which are very different in different databases. And there is a risk that we'll have to adjust our hacks and workarounds for every new release of every supported database.

#49 Updated by Greg Shah over 3 years ago

What is your estimate of the time needed to implement your proposed solution? If it is small enough, then it may make sense to go ahead with that and see the results.

#50 Updated by Igor Skornyakov over 3 years ago

Greg Shah wrote:

What is your estimate of the time needed to implement your proposed solution? If it is small enough, then it may make sense to go ahead with that and see the results.

I think that it will take about a week for the runtime support. I'm not sure about the conversion (generation of the word tables) and import.db changes at this moment, but this should not be a big deal. And this is without word-break table support.

#51 Updated by Igor Skornyakov over 3 years ago

Well, it looks like FTS support in PostgreSQL is flexible enough that it would be possible to use it for CONTAINS support. Please note that this will most likely require the implementation of a custom native parser (https://www.postgresql.org/docs/10/sql-createtsparser.html) in the C language. Such a parser will be necessary for word-break table support.
I have not analyzed the situation with H2 and MS SQL Server so far.

#52 Updated by Eric Faulhaber over 3 years ago

  • Status changed from New to WIP
  • Assignee set to Igor Skornyakov

Considering that you think it will take around a week, please go ahead with your proposed solution, Igor. Thanks.

#53 Updated by Igor Skornyakov over 3 years ago

Thank you. I will start with the conversion of the logical expression (in a format required by the CONTAINS operator) to CNF (Conjunctive Normal Form). See #1587-11.

#54 Updated by Eric Faulhaber over 3 years ago

Currently, we convert the ABL WHERE clause to FQL which uses the CONTAINS UDF. Is there any reason to refactor this? It seems to have all the needed information. What is the plan to go from this to your expanded SQL form? I imagine you will need to make changes to the FqlToSqlConverter, and more specifically, to the FQL parser. Ovidiu can advise in this regard, if necessary.

#55 Updated by Igor Skornyakov over 3 years ago

Eric Faulhaber wrote:

Currently, we convert the ABL WHERE clause to FQL which uses the CONTAINS UDF. Is there any reason to refactor this? It seems to have all the needed information. What is the plan to go from this to your expanded SQL form? I imagine you will need to make changes to the FqlToSqlConverter, and more specifically, to the FQL parser. Ovidiu can advise in this regard, if necessary.

Eric.
Of course we need to refactor this if we want to use indexes instead of applying the CONTAINS UDF to the complete result set. Effectively, what I suggest is that the FWD counterpart of the 4GL word index will be a normal index on an auxiliary word table. The converter will be used for processing the CONTAINS argument into the WITH clause (or for selection of the appropriate auxiliary view, as described in #1587-11 and #1587-17). The changes in the FQL/SQL will be minimal. Of course, this will be used instead of the CONTAINS UDF. I plan to ask Ovidiu for advice regarding FqlToSqlConverter shortly, after the converter is implemented.

#56 Updated by Greg Shah over 3 years ago

Perhaps it is better to avoid changes at conversion time and instead to implement the rewriting at runtime. This would allow us to change the implementation later without changing the conversion.

#57 Updated by Igor Skornyakov over 3 years ago

Greg Shah wrote:

Perhaps it is better to avoid changes at conversion time and instead to implement the rewriting at runtime. This would allow us to change the implementation later without changing the conversion.

Greg,
I do not plan to make any changes in conversion, apart from generation of DDL for word tables. All will be done at runtime.

#58 Updated by Eric Faulhaber over 3 years ago

Igor Skornyakov wrote:

Greg Shah wrote:

Perhaps it is better to avoid changes at conversion time and instead to implement the rewriting at runtime. This would allow us to change the implementation later without changing the conversion.

Greg,
I do not plan to make any changes in conversion, apart from generation of DDL for word tables. All will be done at runtime.

My question was about whether we needed to refactor the conversion of ABL WHERE clauses to FQL WHERE clauses. Of course, refactoring is necessary to get from the current FQL we have at runtime to the SQL you are proposing, but if we can start the runtime generation of SQL with FQL input like:

... where ... contains(<dmo.property>, <expression>) ...

that would avoid having to refactor the conversion of ABL WHERE clauses. I think we're on the same page now, but let me know if we are not.

#59 Updated by Igor Skornyakov over 3 years ago

Eric Faulhaber wrote:

My question was about whether we needed to refactor the conversion of ABL WHERE clauses to FQL WHERE clauses. Of course, refactoring is necessary to get from the current FQL we have at runtime to the SQL you are proposing, but if we can start the runtime generation of SQL with FQL input like:

[...]

that would avoid having to refactor the conversion of ABL WHERE clauses. I think we're on the same page now, but let me know if we are not.

Ah, sorry, I had not understood you. I do not know all the details at this moment, but, if my understanding of Ovidiu's explanations is correct, there is no need to change the ABL to FQL conversion. It can all be done more or less at step 5 in Ovidiu's document. Please note again that at this moment I may be missing some details.

#60 Updated by Ovidiu Maxiniuc over 3 years ago

Igor, here is how I would tackle this issue.
First, I would create a simple testcase like:

define temp-table ttwi
    field if1 as character
    index wi1 is word-index if1 asc.

create ttwi.
if1 = "small caterpillar".
create ttwi.
if1 = "big elephant".
create ttwi.
if1 = "deer with antler".

for each ttwi where if1 contains "cat*|ant*":
    message if1.
end.

And run it against ABL and current FWD. On the FWD side, the conversion is most important, to see the current flaws and identify the missing or, better said, wrongly implemented parts. For the sake of simplicity, my testcase uses temp-tables; you might want to use permanent tables first, in order to have a SQL console to test your hand-made queries. Use the PSQL console to find a query that best emulates the behaviour for a simple testcase (maybe simpler than my testcase) and then add more complex syntax. Keep a table with the SQL queries you find (we can give you a hand with their validation). Then compare these with what FWD does at the moment in order to identify how the query should be adjusted/generated and at which level. Maybe it's best to start the changes only after you have a good grasp of the final query structure for more complex testcases: multiple contains operators in the same query combined by AND, OR operators, with extent fields (!!) and joined queries (optimized, server-side queries).
Do not forget that the H2 dialect must also be supported at a minimum, besides PSQL, because TEMP-TABLEs will always use H2. Luckily the syntax will be identical, or at least not too different.

I might not get this right, but I understand that the DDL must also be enhanced, by adding the special tables/indexes for each word-index. How are these populated? I expect that their management will be done at insert/update/delete. Maybe using some triggers? Please correct me if I am wrong here.

#61 Updated by Igor Skornyakov over 3 years ago

Ovidiu Maxiniuc wrote:

Igor, here is a how I would tackle this issue.
First, I would create a simple testcase like:
[...]
And run it against ABL and current FWD. On the FWD side, the conversion is most important, to see the current flaws and identify the missing or, better said, wrongly implemented parts. For the sake of simplicity, my testcase uses temp-tables; you might want to use permanent tables first, in order to have a SQL console to test your hand-made queries. Use the PSQL console to find a query that best emulates the behaviour for a simple testcase (maybe simpler than my testcase) and then add more complex syntax. Keep a table with the SQL queries you find (we can give you a hand with their validation). Then compare these with what FWD does at the moment in order to identify how the query should be adjusted/generated and at which level. Maybe it's best to start the changes only after you have a good grasp of the final query structure for more complex testcases: multiple contains operators in the same query combined by AND, OR operators, with extent fields (!!) and joined queries (optimized, server-side queries).
Do not forget that the H2 dialect must also be supported at a minimum, besides PSQL, because TEMP-TABLEs will always use H2. Luckily the syntax will be identical, or at least not too different.

I might not get this right, but I understand that the DDL must also be enhanced, by adding the special tables/indexes for each word-index. How are these populated? I expect that their management will be done at insert/update/delete. Maybe using some triggers? Please correct me if I am wrong here.

Thank you, Ovidiu.
I'm afraid that with FWD your test will fail because of the known issue with the CONTAINS UDF. I plan to fix it as soon as my expression-to-CNF converter is ready (to avoid double work).
Otherwise, the plan you suggest looks fine and I had something similar in mind, but I planned to use permanent databases (PostgreSQL and H2) with DBeaver as a client (creating the word tables by hand as a first step). At this moment I'm looking at how CONTAINS is converted in a large customer app, and so far I do not see problems with the FQL rewriting - it should be more or less straightforward.
Yes, we will need to generate DDL for the auxiliary word tables and their primary keys. At this moment I am considering using triggers for insert/update/delete and (maybe) additional logic on import.

#62 Updated by Igor Skornyakov over 3 years ago

A strange question: we need to generate unique names for the word tables. I would prefer to have them more or less meaningful and easily derived from the table name and the name of the field with the word index. Do we have any standard approach for this or similar situations?
I understand that it is a minor issue, but for some reason it keeps coming back to my head ))
Thank you.

#63 Updated by Ovidiu Maxiniuc over 3 years ago

Are these helper tables specific to a table we already have? You might add a specific suffix, following the normalized field example. For instance, the primary table customer might have customer__5 and customer__10 if there are some fields defined as extent 5 and extent 10. We can use customer__col1..customer__colN or something similar. The name should be unique, their names are easy to compose and, since the column names are not numbers, they will not overlap with the extent secondary tables.

#64 Updated by Igor Skornyakov over 3 years ago

Ovidiu Maxiniuc wrote:

Are these helper tables specific to a table we already have? You might add a specific suffix, following the normalized field example. For instance, the primary table customer might have customer__5 and customer__10 if there are some fields defined as extent 5 and extent 10. We can use customer__col1..customer__colN or something similar. The name should be unique, their names are easy to compose and, since the column names are not numbers, they will not overlap with the extent secondary tables.

Thank you, Ovidiu. This was what I was thinking about. The problem is that, AFAIK, all databases have limits on identifier length. Are you sure that such concatenations will not be too long?
Thank you,

#65 Updated by Ovidiu Maxiniuc over 3 years ago

I did some quick research and it looks like H2 does not have any limitation on maximum identifier length (table name, column name, and so on). OTOH, PSQL is limited to 63 bytes (if this is a double-byte encoding, it means 31 chars, but this needs to be tested), which can be increased only by recompiling PostgreSQL. We will most likely not do that. But I guess 30 characters should be enough for almost all cases of the <table>__<col><N> identifier. For the rare cases when this limit is reached, we can use hints to shorten the table or column SQL name.
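As an illustration (the helper below is hypothetical, not the FWD naming code), deriving the word-table name as <table>__<field> and falling back to a truncated name plus a short hash keeps the identifier under the 63-byte limit for ASCII names:

// Sketch: derive a word-table name and keep it under PostgreSQL's 63-byte identifier limit.
public class WordTableNameSketch
{
   private static final int MAX_LEN = 63;   // assumes single-byte (ASCII) identifiers

   public static String wordTableName(String table, String field)
   {
      String name = table + "__" + field;
      if (name.length() <= MAX_LEN)
      {
         return name;
      }
      String hash = Integer.toHexString(name.hashCode());
      return name.substring(0, MAX_LEN - hash.length() - 1) + "_" + hash;
   }

   public static void main(String[] args)
   {
      System.out.println(wordTableName("ttwi", "if1"));   // ttwi__if1
   }
}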

#66 Updated by Igor Skornyakov over 3 years ago

Ovidiu Maxiniuc wrote:

I did some quick research and it looks like H2 does not have any limitation on maximum identifier length (table name, column name, and so on). OTOH, PSQL is limited to 63 bytes (if this is a double-byte encoding, it means 31 chars, but this needs to be tested), which can be increased only by recompiling PostgreSQL. We will most likely not do that. But I guess 30 characters should be enough for almost all cases of the <table>__<col><N> identifier. For the rare cases when this limit is reached, we can use hints to shorten the table or column SQL name.

I see. Thank you, Ovidiu.

#67 Updated by Igor Skornyakov over 3 years ago

Implemented conversion of the logical expression accepted by the CONTAINS operator to Reverse Polish Notation (RPN), Conjunctive Normal Form (CNF), or Disjunctive Normal Form (DNF).
The conversion to RPN is used in a re-worked version of the CONTAINS UDF which is working now (I hope). The conversions to CNF and DNF are based on RPN.
Please note that at this moment the CONTAINS operator doesn't support negation. The converter supports it, but the corresponding parts in the lexer code are commented out for now. The negation support was tested and can be easily enabled.
Committed to 1587a rev.11353.
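As an illustration of how the RPN form can be evaluated (a minimal sketch only; the class and method names are hypothetical and this is not the actual FWD UDF code), a CONTAINS expression such as "(CAT* | ANT*) & (SMALL | DEER)" reduced to RPN can be matched against the set of words extracted from a field value like this:

import java.util.*;

// Minimal sketch: evaluate a CONTAINS expression already converted to RPN against
// the words of a field value. '&' = AND, '|' = OR, a trailing '*' means prefix match.
public class RpnContainsSketch
{
   public static boolean matches(List<String> rpn, Set<String> words)
   {
      Deque<Boolean> stack = new ArrayDeque<>();
      for (String tok : rpn)
      {
         if ("&".equals(tok) || "|".equals(tok))
         {
            boolean b = stack.pop();
            boolean a = stack.pop();
            stack.push("&".equals(tok) ? (a && b) : (a || b));
         }
         else if (tok.endsWith("*"))
         {
            String prefix = tok.substring(0, tok.length() - 1);
            stack.push(words.stream().anyMatch(w -> w.startsWith(prefix)));
         }
         else
         {
            stack.push(words.contains(tok));
         }
      }
      return stack.pop();
   }

   public static void main(String[] args)
   {
      // "(CAT* | ANT*) & (SMALL | DEER)" in RPN
      List<String> rpn = Arrays.asList("CAT*", "ANT*", "|", "SMALL", "DEER", "|", "&");
      Set<String> words = new HashSet<>(Arrays.asList("DEER", "WITH", "ANTLER"));
      System.out.println(matches(rpn, words));   // true: "ANT*" and "DEER" both match
   }
}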

#68 Updated by Igor Skornyakov over 3 years ago

As per Eric's request created a task branch 1587b from 3821c rev. 11831.

#69 Updated by Igor Skornyakov over 3 years ago

Committed changes from #1587-67 to 1587b rev.11832.

#70 Updated by Igor Skornyakov over 3 years ago

Created words Java UDFs for splitting a string into an array of (distinct) words. Committed to 1587b rev. 11833
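For reference, a minimal sketch of such a splitter (the real word-break rules are table-driven and code-page specific; the class below is hypothetical and only treats letters and digits as word characters):

import java.util.*;

// Minimal sketch of a words() helper: split a field value into distinct words,
// using letters/digits as word characters and upper-casing for case-insensitive fields.
public class WordsSketch
{
   public static String[] words(String text, boolean caseSensitive)
   {
      if (text == null)
      {
         return new String[0];
      }
      String src = caseSensitive ? text : text.toUpperCase();
      Set<String> result = new LinkedHashSet<>();
      for (String w : src.split("[^\\p{L}\\p{Nd}]+"))
      {
         if (!w.isEmpty())
         {
            result.add(w);
         }
      }
      return result.toArray(new String[0]);
   }

   public static void main(String[] args)
   {
      System.out.println(Arrays.toString(words("Deer with antler, deer.", false)));
      // [DEER, WITH, ANTLER]
   }
}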

#71 Updated by Igor Skornyakov over 3 years ago

Based on Ovidiu's example (#1587-60) I've created a test with a Postgres database.
  • The 4GL table definition is:
    ADD TABLE "ttwi" 
      AREA "Schema Area" 
      DUMP-NAME "ttwi" 
    
    ADD FIELD "if1" OF "ttwi" AS character 
      FORMAT "x(128)" 
      INITIAL "" 
      POSITION 2
      MAX-WIDTH 256
      ORDER 10
    
    ADD INDEX "wi1" ON "ttwi" 
      AREA "Schema Area" 
      WORD
      INDEX-FIELD "if1" ASCENDING 
    
  • The definitions of the auxiliary Postgres UDFs are:
    create or replace function public.words(text)
     returns text[]
     language java
    as $function$java.lang.String[]=com.goldencode.p2j.persist.pl.Functions.words(java.lang.String)$function$
    ;
    
    create or replace function public.words(text, bool)
     returns text[]
     language java
    as $function$java.lang.String[]=com.goldencode.p2j.persist.pl.Functions.words(java.lang.String, boolean)$function$
    ;
    
    create or replace function words (
       recid int8, txt text
    ) 
    returns table ( parent int8, word text ) language 'plpgsql' as
    $$
    declare arr text[];
    declare w text;
    begin
        arr = words(txt);
        foreach w in array arr loop
            parent = recid;
            word = w;
            return next;
        end loop;
    end
    $$
    ;
    
  • The DDL for a manually created words' table is:
    CREATE TABLE public.ttwi_if1 (
        parent int8 NOT NULL,
        word text NOT NULL,
        CONSTRAINT ttwi_if1_pk PRIMARY KEY (parent, word),
        CONSTRAINT ttwi_if1_fk FOREIGN KEY (parent) REFERENCES ttwi(recid) ON UPDATE CASCADE ON DELETE CASCADE
    );
    CREATE INDEX ttwi_if1_word_idx ON public.ttwi_if1 USING btree (word);
    
  • For each word index the following triggers should be defined (generated):
    create or replace function ttwi_if1_trg()
      returns trigger 
      language plpgsql
      as
    $$
    begin
        delete from ttwi_if1 where ttwi_if1.parent = new.recid;
        insert into ttwi_if1 select * from public.words(new.recid, new.if1);
        return new;
    END;
    $$;
    
    create trigger ttwi_if1_ins after
    insert
        on
        public.ttwi for each row execute procedure ttwi_if1_trg();
    
    create trigger ttwi_if1_upd after
    update of if1
        on
        public.ttwi for each row execute procedure ttwi_if1_trg();    
    
  • The test is simple:
    for each ttwi where if1 contains "(cat* | ant*) & (small | deer)":
        message if1.
    end.
    
  • The table was populated with data from the #1587-60 sample.
  • The (fixed) CONTAINS UDF in the SQL statement
    SELECT ttwi.recid, ttwi.if1 FROM public.ttwi
    where CONTAINS(if1, '(cat* | ant*) & (small | deer)')
    

    returns a correct result set as well as the statement
    select ttwi.recid, ttwi.if1 from public.ttwi
    join ttwi_if1 w1 on (w1.parent = ttwi.recid and (w1.word like 'cat%' or w1.word like 'ant%'))
    join ttwi_if1 w2 on (w2.parent = ttwi.recid and (w2.word = 'small' or w2.word = 'deer'))
    
This means that the rewritten SQL statement with CONTAINS should look like this:
  1. All CONTAINS(...) substrings should be replaced with "1 = 1"
  2. The CONTAINS(...) arguments (the patterns bound to the '?' placeholders) should be converted to CNF, and the corresponding values should be removed from the arguments' array,
  3. A JOIN <words table> ON (<words table>.parent = <table>.recid AND (<or clause from CNF>)) should be appended to the SQL string for each "AND" clause of the CNF form of every CONTAINS operator in the initial query.

BTW: I've noticed that in some cases a CONTAINS operator with a literal argument is converted to the CONTAINS UDF with a literal argument. I think that the re-writing logic described above will be a little simpler if we avoid such a shortcut and always use a '?' placeholder for the second CONTAINS UDF argument.

#72 Updated by Igor Skornyakov over 3 years ago

Well, the last section of #1587-71 regarding SQL re-writing is wrong. It works only for some simple queries and does not work if, e.g., CONTAINS is part of a logical expression that is more complicated than just a conjunction of several terms.
A more universal version of the re-writing of

select ttwi.recid, ttwi.if1 from public.ttwi
where contains(if1, '(cat* | ant*) & (small | deer)')

is
with keys as (
   select recid from public.ttwi t 
      join ttwi_if1 as w1 on (w1.parent = t.recid and (w1.word like 'cat%' or w1.word like 'ant%'))
      join ttwi_if1 as w2 on (w2.parent = t.recid and (w2.word = 'small' or w2.word = 'deer'))
)
select ttwi.recid, ttwi.if1 from public.ttwi, keys
where ttwi.recid = keys.recid

or
select ttwi.recid, ttwi.if1 from public.ttwi
where ttwi.recid in (
   select recid from public.ttwi t 
      join ttwi_if1 as w1 on (w1.parent = t.recid and (w1.word like 'cat%' or w1.word like 'ant%'))
      join ttwi_if1 as w2 on (w2.parent = t.recid and (w2.word = 'small' or w2.word = 'deer'))
)

So we either prepend the final SQL statement with a WITH clause, add the corresponding temporary table to the FROM list, and replace the CONTAINS UDF invocation with <table>.recid = <with table>.recid, or we replace the CONTAINS UDF invocation with <table>.recid IN <select>.

#73 Updated by Igor Skornyakov over 3 years ago

From the mails:
Igor:

BTW: what is the rule for generating the aliases? I would like to avoid parsing the FROM clause to figure out the table name which I need to create a words' table name. Can I just drop the suffix starting from '__impl'?

Ovidiu:

As noted in a previous chat, the generation of the SQL is a multi-stage process. The last one is the FqlToSqlConverter which uses the found aliases, but adds a supplementary numeric suffix just to be sure they are unique (see the sqlTableAliasMap map). The "__Impl" was added in a previous stage, when the DMO was replaced by the implementation class.

The suffix actually cannot be dropped (it makes aliases unique in case of joins/subqueries). However, I think you should do the processing at a much earlier stage, most likely in HQLPreprocessor, guessing from your intention. How do you expect the query below to look when it is functional?

Igor:
Initially I was thinking about re-writing at the HQL generation step, but found it too complicated. In the final SQL, contains(...) should be replaced with something like this:

ttwi.recid in (
   select recid from ttwi t 
      join ttwi_if1 as w1 on (w1.parent = t.recid and (w1.word like 'cat%' or w1.word like 'ant%'))
      join ttwi_if1 as w2 on (w2.parent = t.recid and (w2.word = 'small' or w2.word = 'deer'))
)

Another option is to prepend a WITH clause
with keys as (
   select recid from public.ttwi t 
      join ttwi_if1 as w1 on (w1.parent = t.recid and (w1.word like 'cat%' or w1.word like 'ant%'))
      join ttwi_if1 as w2 on (w2.parent = t.recid and (w2.word = 'small' or w2.word = 'deer'))
)

, append keys to the FROM clause and replace contains(...) with ttwi.recid = keys.recid

See #1587-72.
I know how to generate the replacement once I know the names of the table and the field and the value of the argument.

#74 Updated by Ovidiu Maxiniuc over 3 years ago

I think the best moment to intercept the contains function and replace it with a subtree is in HQLPreprocessor. At this moment (in HQLPreprocessor.preprocess(), before calling mainWalk()) you will have, for my initial example, a tree like:

contains [FUNCTION]:null @1:1
   upper [FUNCTION]:null @1:10
      ttwi [ALIAS]:null @1:16
         if1 [PROPERTY]:null @1:21
   'CAT*|ANT*' [STRING]:null @1:27

So, you have access at:
  • contains function itself to trigger the replacement;
  • ttwi alias and if1 property you need to use to create the replacement subquery;
  • the 'CAT*|ANT*' string, i.e. the original pattern you need to parse to create the subquery.

What do you need to do? First, intercept the event: contains [FUNCTION]. From my experience, this should be done when going up the tree, after the children have been processed. The contains string will probably need to be parsed recursively (if it has parentheses), carefully constructing a subtree that you will finally graft in place of the old contains node.
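To illustrate the idea only (a generic sketch with a hypothetical AstNode class, not the FWD AST API), the replacement on ascent amounts to a post-order walk that grafts a new subtree where the contains node used to be:

import java.util.*;

// Hypothetical minimal tree node, for illustration only.
class AstNode
{
   String type;                                   // e.g. "FUNCTION", "PROPERTY", "STRING"
   String text;                                   // e.g. "contains", "ttwi.if1", "'CAT*|ANT*'"
   List<AstNode> children = new ArrayList<>();

   AstNode(String type, String text)
   {
      this.type = type;
      this.text = text;
   }
}

public class ContainsGraftSketch
{
   // Children are processed first (descent); the contains FUNCTION node is handled
   // on the way back up (ascent) and replaced by a subtree built from its children.
   static AstNode rewrite(AstNode node)
   {
      for (int i = 0; i < node.children.size(); i++)
      {
         node.children.set(i, rewrite(node.children.get(i)));
      }
      if ("FUNCTION".equals(node.type) && node.text.startsWith("contains"))
      {
         AstNode property = node.children.get(0);   // possibly wrapped in upper/rtrim
         AstNode pattern  = node.children.get(1);   // the word-index search pattern
         AstNode graft = new AstNode("SUBQUERY", "<alias>.recid in (select ...)");
         graft.children.add(property);
         graft.children.add(pattern);
         return graft;
      }
      return node;
   }

   public static void main(String[] args)
   {
      AstNode contains = new AstNode("FUNCTION", "contains");
      contains.children.add(new AstNode("PROPERTY", "ttwi.if1"));
      contains.children.add(new AstNode("STRING", "'CAT*|ANT*'"));
      System.out.println(rewrite(contains).text);   // <alias>.recid in (select ...)
   }
}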

#75 Updated by Ovidiu Maxiniuc over 3 years ago

(continued, I pressed Submit instead of Preview)
If this is done at this stage, you really do not have to care about the suffixes you noticed. FWD will continue to process the query predicate and construct the final query as usual. If you introduce some exotic syntax (like with keys, which I do not think works now), we will need to add support in the final FqlToSqlConverter. But do not worry about this at this time.

#76 Updated by Igor Skornyakov over 3 years ago

Ovidiu Maxiniuc wrote:

(continued, I pressed Submit instead of Preview)
If this is done at this stage, you really do not have to care about the suffixes you noticed. FWD will continue to process the query predicate and construct the final query as usual. If you introduce some exotic syntax (like with keys, which I do not think works now), we will need to add support in the final FqlToSqlConverter. But do not worry about this at this time.

Thank you. Ovidiu.
I've started to work on the late step because I see that the HQL statement contains DMO class names and we actually do not need to generate DMOs for the auxiliary words' tables. Will it be a problem?
Thank you.

#77 Updated by Ovidiu Maxiniuc over 3 years ago

No, it is probably OK, too. My idea was to keep the FqlToSql converter cleaner (only doing conversion, not enhancements of code). But since the rest of FWD will not be aware of the auxiliary words' tables and their DMOs, it turns out that FqlToSqlConverter is probably the best location for this work.

So, after parsing, the tree will contain something like the above, but a bit 'preprocessed'. I put a breakpoint in FqlToSqlConverter.toSQL(), after parsing, and I have the following:

               contains_1 [FUNCTION]:null @1:66
                  upper [FUNCTION]:null @1:77
                     rtrim [FUNCTION]:null @1:83
                        ttwi.if1 [PROPERTY]:null @1:89
                  'CAT*|ANT*' [STRING]:null @1:101

There is no Impl suffix at this place, so it should be easier for you. However, the property text must be parsed to extract the table and the property. I think you have all you need to construct the tree as noted above. I expect your processing will go in generateExpression(), also on ascent, but you must be careful to ignore the subtree on descent. If you construct the replacement query/queries string instead of contains_1 [FUNCTION] and write it directly to the global StringBuilder sb, everything should be fine.

#78 Updated by Igor Skornyakov over 3 years ago

Ovidiu Maxiniuc wrote:

No, it is probably OK, too. My idea was to keep the FqlToSql converter cleaner (only doing conversion, not enhancements of code). But since the rest of FWD will not be aware of the auxiliary words' tables and their DMOs, it turns out that FqlToSqlConverter is probably the best location for this work.

So, after parsing, the tree will contain something like the above, but a bit 'preprocessed'. I put a breakpoint in FqlToSqlConverter.toSQL(), after parsing, and I have the following:
[...]

There is no Impl suffix at this place, so it should be easier for you. However, the property text must be parsed to extract the table and the property. I think you have all you need to construct the tree as noted above. I expect your processing will go in generateExpression(), also on ascent, but you must be careful to ignore the subtree on descent. If you construct the replacement query/queries string instead of contains_1 [FUNCTION] and write it directly to the global StringBuilder sb, everything should be fine.

Got it. Thanks a lot, Ovidiu!

#79 Updated by Igor Skornyakov over 3 years ago

Implemented the SQL re-write for CONTAINS. According to Ovidiu, this is not done in the right place, but it should work in most cases.

Committed to 1587b revision 11834.

Working now on the move to the right place (see #1587-77).

Please note that the current implementation works only if the second argument of the CONTAINS UDF in the generated code is a ? placeholder, not a string literal. I believe that it makes sense to change the conversion so that this is always the case.
I also think that it makes sense to make another small change in the conversion so that the first argument of the CONTAINS UDF is always just a reference to a field. I do not understand the reason why we wrap this argument in the upper/trim calls. If the implementation is UDF-based, it is more natural to do such a conversion in the UDF. For any kind of word index emulation such wrapping simply doesn't make sense and doesn't match the 4GL semantics of the CONTAINS operator as I understand it.
At this moment I do not understand the details of the CONTAINS operator conversion. Where should I look first?
Thank you.

#80 Updated by Eric Faulhaber over 3 years ago

Igor Skornyakov wrote:

At this moment I do not understand the details of the CONTAINS operator conversion. Where should I look first?

annotations/where_clause.rules converts the operator to the contains UDF syntax.

I would suggest writing the simplest possible test case using CONTAINS, converting it, and reviewing the AST.

#81 Updated by Igor Skornyakov over 3 years ago

Eric Faulhaber wrote:

Igor Skornyakov wrote:

At this moment I do not understand the details of the CONTAINS operator conversion. Where should I look first?

annotations/where_clause.rules converts the operator to the contains UDF syntax.

I would suggest writing the simplest possible test case using CONTAINS, converting it, and reviewing the AST.

Thank you, Eric. I already have such a test. Will try to debug the conversion.

#82 Updated by Eric Faulhaber over 3 years ago

It may also be helpful to review the specific diffs in:

  • rev 10173: this added the initial conversion infrastructure of the CONTAINS operator to use the SQL LIKE operator;
  • rev 11282.1.32: this adjusted the conversion support added in rev 10173 to instead convert the CONTAINS operator to the contains UDF.

#83 Updated by Igor Skornyakov over 3 years ago

Eric Faulhaber wrote:

It may also be helpful to review the specific diffs in:

  • rev 10173: this added the initial conversion infrastructure of the CONTAINS operator to use the SQL LIKE operator;
  • rev 11282.1.32: this adjusted the conversion support added in rev 10173 to instead convert the CONTAINS operator to the contains UDF.

Oh, this is very useful information indeed. Thank you!

#84 Updated by Ovidiu Maxiniuc over 3 years ago

Igor, I did a quick review of 1587b/11834.

Igor Skornyakov wrote:

Implemented the SQL re-write for CONTAINS. According to Ovidiu, this is not done in the right place, but it should work in most cases. Committed to 1587b revision 11834.

Indeed, intercepting the contains function with regular expressions is not the best solution, for multiple reasons:
  • FWD might receive a hard-coded string parameter with similar content which will trigger a false positive result (think of this like a SQL injection attack);
  • at this moment the query syntax might have become quite complex and your code is clearly not complex enough to handle it;
  • last but not least, it is a performance hit which affects all queries!

Working now on the move to the right place (see #1587-77).

Just let me know when you commit and I will review it.

Please note that the current implementation works only if the second argument of the CONTAINS UDF in the generated code is a ? placeholder, not a string literal. I believe that it makes sense to change the conversion so that this is always the case.

Probably this is possible but I do not think it is justified. If the function node is replaced in the FQL tree you have access to the pattern as well. However, I wonder whether this second parameter can be another property/field or the result of a server-side function/operator, in which case the value is not available yet for processing.

I also think that it makes sense to make another small change in the conversion so that the first argument of the CONTAINS UDF is always just a reference to a field. I do not understand the reason why we wrap this argument in the upper/trim calls. If the implementation is UDF-based, it is more natural to do such a conversion in the UDF. For any kind of word index emulation such wrapping simply doesn't make sense and doesn't match the 4GL semantics of the CONTAINS operator as I understand it.

The upper/trim wrapping is done to match the semantics of P4GL string operations. By default (if not expressly requested by the programmer), string operations are case-insensitive; that is why the upper is injected, except for case-sensitive fields. The trim is required because the spaces on the right are ignored. If you observed that these are not required with this operator, then you can drop them when rewriting the contains node.

#85 Updated by Igor Skornyakov over 3 years ago

Ovidiu Maxiniuc wrote:

Probably this is possible but I do not think it is justified. If the function node is replaced in the FQL tree you have access to the pattern as well. However, I wonder whether this second parameter can be another property/field or the result of a server-side function/operator, in which case the value is not available yet for processing.

This is an interesting question. The 4GL documentation is not clear enough in this regard, but in all samples the argument of CONTAINS involves only literals or client-side variables. Obviously, the approach based on parsing the expression and re-writing the SQL statement will not work if the argument involves server-side data (functions or other fields). In this case, I think we have no choice other than to leave the statement as-is and use the CONTAINS UDF.
Please note that I've not seen this case in the large customer application I have analyzed.
In addition, if I use a field name as the CONTAINS argument, 4GL reports a syntax error, although a slightly strange one:

CONTAINS allowed only for word indexed field references. (3398)
**  Could not understand line 10. (196)

#86 Updated by Greg Shah over 3 years ago

In addition, if I use a field name as the CONTAINS argument, 4GL reports a syntax error, although a slightly strange one:

This is the documented limitation in the 4GL. CONTAINS can only search word indexes. It cannot be used on fields that are not a word index.

#87 Updated by Igor Skornyakov over 3 years ago

Greg Shah wrote:

This is the documented limitation in the 4GL. CONTAINS can only search word indexes. It cannot be used on fields that are not a word index.

Yes, but the error I've mentioned was for

def var expr as char init "(cat* | ant*) & (small | deer)".
for each ttwi where (if1 contains if1):
    message if1.
end.

while
def var expr as char init "(cat* | ant*) & (small | deer)".
for each ttwi where (if1 contains expr):
    message if1.
end.

works fine. The field ttwi.if1 has word index.

#88 Updated by Greg Shah over 3 years ago

I see. Yes, that is weird. Do you get the same result with some_other_word_index_field contains if1? In other words, does this only occur because if1 contains if1 must always be true?

#89 Updated by Igor Skornyakov over 3 years ago

Greg Shah wrote:

I see. Yes, that is weird. Do you get the same result with some_other_word_index_field contains if1? In other words, does this only occur because if1 contains if1 must always be true?

I've not tested this but I will. On the other hand, I cannot imagine how the word index can help if the CONTAINS argument value is not the same for all records. It can be a constant server-side expression, and in this case the approach with the SQL re-write can be applicable (at least if we ignore transaction-related issues). However, I do not know 4GL well enough to imagine how such a query should look in order to test whether it is acceptable for 4GL.

#90 Updated by Igor Skornyakov over 3 years ago

Ovidiu Maxiniuc wrote:

Indeed, intercepting the contains function with regular expressions is not the best solution, for multiple reasons:

I do not insist on this solution, however:

  • FWD might receive a hard-coded string parameter with similar content which will trigger a false positive result (think of this like a SQL injection attack);

The following regex '(?:[^']|'')*'|"(?:[^"]|"")*" will find all SQL string literals in the query, so we can exclude this case (see the sketch at the end of this note). Please note also that using string literals in generated SQL statements as parameter values is often considered a bad idea (again, SQL injection!).

  • at this moment the query syntax might have become quite complex and your code is clearly not complex enough to handle it;

Sorry, I do not understand what you mean. The re-write is local, affects only the CONTAINS UDF call, and does not care about the other parts of the statement at all. Do you mean that the logic for extracting the table name from the alias is not 100% reliable?

  • last but not least, it is a performance hit which affects all queries!

Actually, what affects all queries is just the attempt to find contains(..., ?) inside the SQL string. I do not think that this overhead is even comparable with the cost of executing the SQL query.
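For what it is worth, here is a minimal sketch of the literal-masking idea from the first point above (the helper class is hypothetical, not FWD code): mask string literals first, then look for the contains( call, so a literal that merely looks like the UDF cannot trigger a false positive.

import java.util.regex.*;

// Sketch: mask SQL string literals before searching for the contains UDF call.
public class ContainsDetectSketch
{
   private static final Pattern LITERALS =
      Pattern.compile("'(?:[^']|'')*'|\"(?:[^\"]|\"\")*\"");

   public static boolean hasContainsUdf(String sql)
   {
      String masked = LITERALS.matcher(sql).replaceAll("?");
      return masked.contains("contains(");
   }

   public static void main(String[] args)
   {
      System.out.println(hasContainsUdf(
         "select recid from ttwi where contains(if1, ?)"));           // true
      System.out.println(hasContainsUdf(
         "select recid from ttwi where if1 = 'contains(if1, ?)'"));   // false
   }
}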

#91 Updated by Ovidiu Maxiniuc over 3 years ago

The way containsQuery extracts the tblName does not seem sound to me. The "__impl" might not be part of the tblAlias. It will be difficult to maintain in the future.
Then, we went a long way with #4011 to improve database accesses. Even if the matching for the CONTAINS pattern is fast relative to the execution of its query, the way it is written now is a burden for all queries. 99% of them don't even have this function. The effort to create the matcher and scan the already built query will be in vain. At hundreds or thousands of queries creating a UI screen, it will be noticeable.

#92 Updated by Igor Skornyakov over 3 years ago

Ovidiu Maxiniuc wrote:

The way containsQuery extracts the tblName does not seem sound to me. The "__impl" might not be part of the tblAlias. It will be difficult to maintain in the future.

I understand, thank you.

Then, we went a long way with #4011 to improve database accesses. Even if the matching for the CONTAINS pattern is fast relative to the execution of its query, the way it is written now is a burden for all queries. 99% of them don't even have this function. The effort to create the matcher and scan the already built query will be in vain. At hundreds or thousands of queries creating a UI screen, it will be noticeable.

Well, since we decided not to touch the conversion at this moment, we will need to discover the presence of the CONTAINS UDF call in the query string anyway. And this will affect all queries. And I understand that the Java regex engine is pretty fast. But, as I said, I do not insist on the existing implementation and will now follow your advice.

#93 Updated by Ovidiu Maxiniuc over 3 years ago

Igor,
if you work at the AST level, you do not need to parse the full SQL query as it is already represented as a tree. Simply check for a FUNCTION type node with contains text. We have the contains() method overloaded in Functions so, in my example from #1587-77, it is decorated because H2 cannot handle method overloading. However, you will remove these methods from Functions anyway, so the function name will not be affected by the preprocessor.

#94 Updated by Igor Skornyakov over 3 years ago

Ovidiu Maxiniuc wrote:

Igor,
if you work at the AST level, you do not need to parse the full SQL query as it is already represented as a tree. Simply check for a FUNCTION type node with contains text. We have the contains() method overloaded in Functions so, in my example from #1587-77, it is decorated because H2 cannot handle method overloading. However, you will remove these methods from Functions anyway, so the function name will not be affected by the preprocessor.

I see. Thank you, Ovidiu.

#95 Updated by Greg Shah over 3 years ago

Well, since we decided not to touch the conversion at this moment

I think you've already resolved this. But I want to be clear: if there is even a slight improvement in runtime performance we will gladly modify the conversion.

#96 Updated by Igor Skornyakov over 3 years ago

Greg Shah wrote:

Well, since we decided not to touch the conversion at this moment

I think you've already resolved this. But I want to be clear: if there is even a slight improvement in runtime performance we will gladly modify the conversion.

Greg,
It was your suggestion to do everything at runtime (see #1587-56).
I plan to make minor changes in the conversion (see #1587-79) but the generation of the CONTAINS UDF will still be in place.

#97 Updated by Greg Shah over 3 years ago

I understand. But we are optimizing performance here, and we will spend extra development time and make conversion-level changes if that obtains better performance.

#98 Updated by Igor Skornyakov over 3 years ago

Greg Shah wrote:

I understand. But we are optimizing performance here, and we will spend extra development time and make conversion-level changes if that obtains better performance.

I hope that with the approach suggested by Ovidiu the runtime overhead will be minimal and there will be no need for serious conversion changes. I think I will finish the implementation today.

#99 Updated by Igor Skornyakov over 3 years ago

I've finished the CONTAINS UDF call re-writing as per Ovidiu's advice.
Committed to 1587b revision 11835.
However, during thorough testing, I've noticed a strange thing:
Consider the following 4GL program:

def var expr as char init "(cat* | ant*) & (small | deer)".
for each ttwi where (if1 contains expr):
    message if1.
    n = n + 1.
end.

During the re-writing, the original parameters' list containing '(CAT* | ANT*) & (SMALL | DEER)' was modified to ['DEER', 'SMALL', 'ANT%', 'CAT%'].
However, the query is executed multiple times with different modifications, and at the final steps I see '(CAT* | ANT*) & (SMALL | DEER)' in the parameters' list again.
Here is an excerpt from the PostgreSQL log:
2020-11-26 15:58:18.488 MSK [32764] fwd_user@fwd1 LOG:  duration: 0.757 ms  parse <unnamed>: 
    select 
        ttwi__impl0_.recid as col_0_0_ 
    from
        ttwi ttwi__impl0_ 
    where
        (ttwi__impl0_.recid in (
        select recid from ttwi t
            join ttwi_if1 as w1 on (w1.parent = t.recid and (w1.word = $1 or w1.word = $2))
            join ttwi_if1 as w2 on (w2.parent = t.recid and (w2.word like $3 or w2.word like $4))
    ))
    order by
        upper(rtrim(ttwi__impl0_.if1)) asc, ttwi__impl0_.recid asc
     limit $5
2020-11-26 15:58:18.489 MSK [32764] fwd_user@fwd1 LOG:  duration: 0.999 ms  bind <unnamed>: 
    select 
        ttwi__impl0_.recid as col_0_0_ 
    from
        ttwi ttwi__impl0_ 
    where
        (ttwi__impl0_.recid in (
        select recid from ttwi t
            join ttwi_if1 as w1 on (w1.parent = t.recid and (w1.word = $1 or w1.word = $2))
            join ttwi_if1 as w2 on (w2.parent = t.recid and (w2.word like $3 or w2.word like $4))
    ))
    order by
        upper(rtrim(ttwi__impl0_.if1)) asc, ttwi__impl0_.recid asc
     limit $5
2020-11-26 15:58:18.489 MSK [32764] fwd_user@fwd1 DETAIL:  parameters: $1 = 'DEER', $2 = 'SMALL', $3 = 'ANT%', $4 = 'CAT%', $5 = '1'
2020-11-26 15:58:18.489 MSK [32764] fwd_user@fwd1 LOG:  duration: 0.057 ms  execute <unnamed>: 
    select 
        ttwi__impl0_.recid as col_0_0_ 
    from
        ttwi ttwi__impl0_ 
    where
        (ttwi__impl0_.recid in (
        select recid from ttwi t
            join ttwi_if1 as w1 on (w1.parent = t.recid and (w1.word = $1 or w1.word = $2))
            join ttwi_if1 as w2 on (w2.parent = t.recid and (w2.word like $3 or w2.word like $4))
    ))
    order by
        upper(rtrim(ttwi__impl0_.if1)) asc, ttwi__impl0_.recid asc
     limit $5
2020-11-26 15:58:18.489 MSK [32764] fwd_user@fwd1 DETAIL:  parameters: $1 = 'DEER', $2 = 'SMALL', $3 = 'ANT%', $4 = 'CAT%', $5 = '1'
2020-11-26 15:58:18.492 MSK [32764] fwd_user@fwd1 LOG:  duration: 0.047 ms  parse <unnamed>: select if1 from ttwi where recid=$1
2020-11-26 15:58:18.492 MSK [32764] fwd_user@fwd1 LOG:  duration: 0.045 ms  bind <unnamed>: select if1 from ttwi where recid=$1
2020-11-26 15:58:18.492 MSK [32764] fwd_user@fwd1 DETAIL:  parameters: $1 = '10003'
2020-11-26 15:58:18.492 MSK [32764] fwd_user@fwd1 LOG:  duration: 0.009 ms  execute <unnamed>: select if1 from ttwi where recid=$1
2020-11-26 15:58:18.492 MSK [32764] fwd_user@fwd1 DETAIL:  parameters: $1 = '10003'
2020-11-26 15:58:18.546 MSK [32764] fwd_user@fwd1 LOG:  duration: 0.166 ms  parse <unnamed>: 
    select 
        ttwi__impl0_.recid as col_0_0_ 
    from
        ttwi ttwi__impl0_ 
    where
        (ttwi__impl0_.recid in (
        select recid from ttwi t
            join ttwi_if1 as w1 on (w1.parent = t.recid and (w1.word = $1 or w1.word = $2))
            join ttwi_if1 as w2 on (w2.parent = t.recid and (w2.word like $3 or w2.word like $4))
    ))
    order by
        upper(rtrim(ttwi__impl0_.if1)) asc, ttwi__impl0_.recid asc
     limit $5
2020-11-26 15:58:18.547 MSK [32764] fwd_user@fwd1 LOG:  duration: 0.745 ms  bind <unnamed>: 
    select 
        ttwi__impl0_.recid as col_0_0_ 
    from
        ttwi ttwi__impl0_ 
    where
        (ttwi__impl0_.recid in (
        select recid from ttwi t
            join ttwi_if1 as w1 on (w1.parent = t.recid and (w1.word = $1 or w1.word = $2))
            join ttwi_if1 as w2 on (w2.parent = t.recid and (w2.word like $3 or w2.word like $4))
    ))
    order by
        upper(rtrim(ttwi__impl0_.if1)) asc, ttwi__impl0_.recid asc
     limit $5
2020-11-26 15:58:18.547 MSK [32764] fwd_user@fwd1 DETAIL:  parameters: $1 = '(CAT* | ANT*) & (SMALL | DEER)', $2 = 'SMALL', $3 = 'ANT%', $4 = 'CAT%', $5 = '25'
2020-11-26 15:58:18.547 MSK [32764] fwd_user@fwd1 LOG:  duration: 0.046 ms  execute <unnamed>: 
    select 
        ttwi__impl0_.recid as col_0_0_ 
    from
        ttwi ttwi__impl0_ 
    where
        (ttwi__impl0_.recid in (
        select recid from ttwi t
            join ttwi_if1 as w1 on (w1.parent = t.recid and (w1.word = $1 or w1.word = $2))
            join ttwi_if1 as w2 on (w2.parent = t.recid and (w2.word like $3 or w2.word like $4))
    ))
    order by
        upper(rtrim(ttwi__impl0_.if1)) asc, ttwi__impl0_.recid asc
     limit $5
2020-11-26 15:58:18.547 MSK [32764] fwd_user@fwd1 DETAIL:  parameters: $1 = '(CAT* | ANT*) & (SMALL | DEER)', $2 = 'SMALL', $3 = 'ANT%', $4 = 'CAT%', $5 = '25'
2020-11-26 15:58:18.557 MSK [32764] fwd_user@fwd1 LOG:  duration: 0.279 ms  parse <unnamed>: 
    select 
        ttwi__impl0_.recid as id0_, ttwi__impl0_.if1 as if1_0_ 
    from
        ttwi ttwi__impl0_ 
    where
        (ttwi__impl0_.recid in (
        select recid from ttwi t
            join ttwi_if1 as w1 on (w1.parent = t.recid and (w1.word = $1 or w1.word = $2))
            join ttwi_if1 as w2 on (w2.parent = t.recid and (w2.word like $3 or w2.word like $4))
    )) and upper(rtrim(ttwi__impl0_.if1)) = upper($5) and ttwi__impl0_.recid > $6
    order by
        upper(rtrim(ttwi__impl0_.if1)) asc, ttwi__impl0_.recid asc
     limit $7
2020-11-26 15:58:18.558 MSK [32764] fwd_user@fwd1 LOG:  duration: 0.821 ms  bind <unnamed>: 
    select 
        ttwi__impl0_.recid as id0_, ttwi__impl0_.if1 as if1_0_ 
    from
        ttwi ttwi__impl0_ 
    where
        (ttwi__impl0_.recid in (
        select recid from ttwi t
            join ttwi_if1 as w1 on (w1.parent = t.recid and (w1.word = $1 or w1.word = $2))
            join ttwi_if1 as w2 on (w2.parent = t.recid and (w2.word like $3 or w2.word like $4))
    )) and upper(rtrim(ttwi__impl0_.if1)) = upper($5) and ttwi__impl0_.recid > $6
    order by
        upper(rtrim(ttwi__impl0_.if1)) asc, ttwi__impl0_.recid asc
     limit $7
2020-11-26 15:58:18.558 MSK [32764] fwd_user@fwd1 DETAIL:  parameters: $1 = 'DEER', $2 = 'SMALL', $3 = 'ANT%', $4 = 'CAT%', $5 = 'DEER WITH ANTLER', $6 = '10003', $7 = '1'
2020-11-26 15:58:18.558 MSK [32764] fwd_user@fwd1 LOG:  duration: 0.020 ms  execute <unnamed>: 
    select 
        ttwi__impl0_.recid as id0_, ttwi__impl0_.if1 as if1_0_ 
    from
        ttwi ttwi__impl0_ 
    where
        (ttwi__impl0_.recid in (
        select recid from ttwi t
            join ttwi_if1 as w1 on (w1.parent = t.recid and (w1.word = $1 or w1.word = $2))
            join ttwi_if1 as w2 on (w2.parent = t.recid and (w2.word like $3 or w2.word like $4))
    )) and upper(rtrim(ttwi__impl0_.if1)) = upper($5) and ttwi__impl0_.recid > $6
    order by
        upper(rtrim(ttwi__impl0_.if1)) asc, ttwi__impl0_.recid asc
     limit $7
2020-11-26 15:58:18.558 MSK [32764] fwd_user@fwd1 DETAIL:  parameters: $1 = 'DEER', $2 = 'SMALL', $3 = 'ANT%', $4 = 'CAT%', $5 = 'DEER WITH ANTLER', $6 = '10003', $7 = '1'
2020-11-26 15:58:18.560 MSK [32764] fwd_user@fwd1 LOG:  duration: 0.167 ms  parse <unnamed>: 
    select 
        ttwi__impl0_.recid as id0_, ttwi__impl0_.if1 as if1_0_ 
    from
        ttwi ttwi__impl0_ 
    where
        (ttwi__impl0_.recid in (
        select recid from ttwi t
            join ttwi_if1 as w1 on (w1.parent = t.recid and (w1.word = $1 or w1.word = $2))
            join ttwi_if1 as w2 on (w2.parent = t.recid and (w2.word like $3 or w2.word like $4))
    )) and upper(rtrim(ttwi__impl0_.if1)) > upper($5)
    order by
        upper(rtrim(ttwi__impl0_.if1)) asc, ttwi__impl0_.recid asc
     limit $6
2020-11-26 15:58:18.560 MSK [32764] fwd_user@fwd1 LOG:  duration: 0.469 ms  bind <unnamed>: 
    select 
        ttwi__impl0_.recid as id0_, ttwi__impl0_.if1 as if1_0_ 
    from
        ttwi ttwi__impl0_ 
    where
        (ttwi__impl0_.recid in (
        select recid from ttwi t
            join ttwi_if1 as w1 on (w1.parent = t.recid and (w1.word = $1 or w1.word = $2))
            join ttwi_if1 as w2 on (w2.parent = t.recid and (w2.word like $3 or w2.word like $4))
    )) and upper(rtrim(ttwi__impl0_.if1)) > upper($5)
    order by
        upper(rtrim(ttwi__impl0_.if1)) asc, ttwi__impl0_.recid asc
     limit $6
2020-11-26 15:58:18.560 MSK [32764] fwd_user@fwd1 DETAIL:  parameters: $1 = 'DEER', $2 = 'SMALL', $3 = 'ANT%', $4 = 'CAT%', $5 = 'DEER WITH ANTLER', $6 = '1'
2020-11-26 15:58:18.560 MSK [32764] fwd_user@fwd1 LOG:  duration: 0.040 ms  execute <unnamed>: 
    select 
        ttwi__impl0_.recid as id0_, ttwi__impl0_.if1 as if1_0_ 
    from
        ttwi ttwi__impl0_ 
    where
        (ttwi__impl0_.recid in (
        select recid from ttwi t
            join ttwi_if1 as w1 on (w1.parent = t.recid and (w1.word = $1 or w1.word = $2))
            join ttwi_if1 as w2 on (w2.parent = t.recid and (w2.word like $3 or w2.word like $4))
    )) and upper(rtrim(ttwi__impl0_.if1)) > upper($5)
    order by
        upper(rtrim(ttwi__impl0_.if1)) asc, ttwi__impl0_.recid asc
     limit $6
2020-11-26 15:58:18.560 MSK [32764] fwd_user@fwd1 DETAIL:  parameters: $1 = 'DEER', $2 = 'SMALL', $3 = 'ANT%', $4 = 'CAT%', $5 = 'DEER WITH ANTLER', $6 = '1'
2020-11-26 15:58:18.567 MSK [32764] fwd_user@fwd1 LOG:  duration: 0.189 ms  parse <unnamed>: 
    select 
        ttwi__impl0_.recid as col_0_0_ 
    from
        ttwi ttwi__impl0_ 
    where
        (ttwi__impl0_.recid in (
        select recid from ttwi t
            join ttwi_if1 as w1 on (w1.parent = t.recid and (w1.word = $1 or w1.word = $2))
            join ttwi_if1 as w2 on (w2.parent = t.recid and (w2.word like $3 or w2.word like $4))
    )) and upper(rtrim(ttwi__impl0_.if1)) > upper($5)
    order by
        upper(rtrim(ttwi__impl0_.if1)) asc, ttwi__impl0_.recid asc
     limit $6
2020-11-26 15:58:18.568 MSK [32764] fwd_user@fwd1 LOG:  duration: 0.825 ms  bind <unnamed>: 
    select 
        ttwi__impl0_.recid as col_0_0_ 
    from
        ttwi ttwi__impl0_ 
    where
        (ttwi__impl0_.recid in (
        select recid from ttwi t
            join ttwi_if1 as w1 on (w1.parent = t.recid and (w1.word = $1 or w1.word = $2))
            join ttwi_if1 as w2 on (w2.parent = t.recid and (w2.word like $3 or w2.word like $4))
    )) and upper(rtrim(ttwi__impl0_.if1)) > upper($5)
    order by
        upper(rtrim(ttwi__impl0_.if1)) asc, ttwi__impl0_.recid asc
     limit $6
2020-11-26 15:58:18.568 MSK [32764] fwd_user@fwd1 DETAIL:  parameters: $1 = 'DEER', $2 = 'SMALL', $3 = 'ANT%', $4 = 'CAT%', $5 = 'DEER WITH ANTLER', $6 = '1'
2020-11-26 15:58:18.568 MSK [32764] fwd_user@fwd1 LOG:  duration: 0.052 ms  execute <unnamed>: 
    select 
        ttwi__impl0_.recid as col_0_0_ 
    from
        ttwi ttwi__impl0_ 
    where
        (ttwi__impl0_.recid in (
        select recid from ttwi t
            join ttwi_if1 as w1 on (w1.parent = t.recid and (w1.word = $1 or w1.word = $2))
            join ttwi_if1 as w2 on (w2.parent = t.recid and (w2.word like $3 or w2.word like $4))
    )) and upper(rtrim(ttwi__impl0_.if1)) > upper($5)
    order by
        upper(rtrim(ttwi__impl0_.if1)) asc, ttwi__impl0_.recid asc
     limit $6
2020-11-26 15:58:18.568 MSK [32764] fwd_user@fwd1 DETAIL:  parameters: $1 = 'DEER', $2 = 'SMALL', $3 = 'ANT%', $4 = 'CAT%', $5 = 'DEER WITH ANTLER', $6 = '1'
2020-11-26 15:58:18.574 MSK [32764] fwd_user@fwd1 LOG:  duration: 0.187 ms  parse <unnamed>: 
    select 
        ttwi__impl0_.recid as col_0_0_ 
    from
        ttwi ttwi__impl0_ 
    where
        (ttwi__impl0_.recid in (
        select recid from ttwi t
            join ttwi_if1 as w1 on (w1.parent = t.recid and (w1.word = $1 or w1.word = $2))
            join ttwi_if1 as w2 on (w2.parent = t.recid and (w2.word like $3 or w2.word like $4))
    )) and upper(rtrim(ttwi__impl0_.if1)) > upper($5)
    order by
        upper(rtrim(ttwi__impl0_.if1)) asc, ttwi__impl0_.recid asc
     limit $6
2020-11-26 15:58:18.575 MSK [32764] fwd_user@fwd1 LOG:  duration: 0.785 ms  bind <unnamed>: 
    select 
        ttwi__impl0_.recid as col_0_0_ 
    from
        ttwi ttwi__impl0_ 
    where
        (ttwi__impl0_.recid in (
        select recid from ttwi t
            join ttwi_if1 as w1 on (w1.parent = t.recid and (w1.word = $1 or w1.word = $2))
            join ttwi_if1 as w2 on (w2.parent = t.recid and (w2.word like $3 or w2.word like $4))
    )) and upper(rtrim(ttwi__impl0_.if1)) > upper($5)
    order by
        upper(rtrim(ttwi__impl0_.if1)) asc, ttwi__impl0_.recid asc
     limit $6
2020-11-26 15:58:18.575 MSK [32764] fwd_user@fwd1 DETAIL:  parameters: $1 = '(CAT* | ANT*) & (SMALL | DEER)', $2 = 'DEER WITH ANTLER', $3 = 'ANT%', $4 = 'CAT%', $5 = 'DEER WITH ANTLER', $6 = '25'
2020-11-26 15:58:18.575 MSK [32764] fwd_user@fwd1 LOG:  duration: 0.030 ms  execute <unnamed>: 
    select 
        ttwi__impl0_.recid as col_0_0_ 
    from
        ttwi ttwi__impl0_ 
    where
        (ttwi__impl0_.recid in (
        select recid from ttwi t
            join ttwi_if1 as w1 on (w1.parent = t.recid and (w1.word = $1 or w1.word = $2))
            join ttwi_if1 as w2 on (w2.parent = t.recid and (w2.word like $3 or w2.word like $4))
    )) and upper(rtrim(ttwi__impl0_.if1)) > upper($5)
    order by
        upper(rtrim(ttwi__impl0_.if1)) asc, ttwi__impl0_.recid asc
     limit $6
2020-11-26 15:58:18.575 MSK [32764] fwd_user@fwd1 DETAIL:  parameters: $1 = '(CAT* | ANT*) & (SMALL | DEER)', $2 = 'DEER WITH ANTLER', $3 = 'ANT%', $4 = 'CAT%', $5 = 'DEER WITH ANTLER', $6 = '25'

Actually, it seems that something is going wrong, and not only regarding the parameters.
I'm investigating it now and any suggestions are highly appreciated.

#100 Updated by Igor Skornyakov over 3 years ago

In addition to the previous note: without the CONTAINS re-writing, the PostgreSQL log looks like the following:

    select 
        ttwi__impl0_.recid as col_0_0_ 
    from
        ttwi ttwi__impl0_ 
    where
        contains(upper(rtrim(ttwi__impl0_.if1)), $1)
    order by
        upper(rtrim(ttwi__impl0_.if1)) asc, ttwi__impl0_.recid asc
     limit $2
2020-11-26 16:41:01.962 MSK [8133] fwd_user@fwd1 LOG:  duration: 0.148 ms  bind <unnamed>: 
    select 
        ttwi__impl0_.recid as col_0_0_ 
    from
        ttwi ttwi__impl0_ 
    where
        contains(upper(rtrim(ttwi__impl0_.if1)), $1)
    order by
        upper(rtrim(ttwi__impl0_.if1)) asc, ttwi__impl0_.recid asc
     limit $2
2020-11-26 16:41:01.962 MSK [8133] fwd_user@fwd1 DETAIL:  parameters: $1 = '(CAT* | ANT*) & (SMALL | DEER)', $2 = '1'
2020-11-26 16:41:01.998 MSK [8133] fwd_user@fwd1 LOG:  duration: 36.574 ms  execute <unnamed>: 
    select 
        ttwi__impl0_.recid as col_0_0_ 
    from
        ttwi ttwi__impl0_ 
    where
        contains(upper(rtrim(ttwi__impl0_.if1)), $1)
    order by
        upper(rtrim(ttwi__impl0_.if1)) asc, ttwi__impl0_.recid asc
     limit $2
2020-11-26 16:41:01.998 MSK [8133] fwd_user@fwd1 DETAIL:  parameters: $1 = '(CAT* | ANT*) & (SMALL | DEER)', $2 = '1'
2020-11-26 16:41:02.002 MSK [8133] fwd_user@fwd1 LOG:  duration: 0.088 ms  parse <unnamed>: select if1 from ttwi where recid=$1
2020-11-26 16:41:02.002 MSK [8133] fwd_user@fwd1 LOG:  duration: 0.075 ms  bind <unnamed>: select if1 from ttwi where recid=$1
2020-11-26 16:41:02.002 MSK [8133] fwd_user@fwd1 DETAIL:  parameters: $1 = '10003'
2020-11-26 16:41:02.002 MSK [8133] fwd_user@fwd1 LOG:  duration: 0.013 ms  execute <unnamed>: select if1 from ttwi where recid=$1
2020-11-26 16:41:02.002 MSK [8133] fwd_user@fwd1 DETAIL:  parameters: $1 = '10003'
2020-11-26 16:41:02.047 MSK [8133] fwd_user@fwd1 LOG:  duration: 0.100 ms  parse <unnamed>: 
    select 
        ttwi__impl0_.recid as col_0_0_ 
    from
        ttwi ttwi__impl0_ 
    where
        contains(upper(rtrim(ttwi__impl0_.if1)), $1)
    order by
        upper(rtrim(ttwi__impl0_.if1)) asc, ttwi__impl0_.recid asc
     limit $2
2020-11-26 16:41:02.047 MSK [8133] fwd_user@fwd1 LOG:  duration: 0.092 ms  bind <unnamed>: 
    select 
        ttwi__impl0_.recid as col_0_0_ 
    from
        ttwi ttwi__impl0_ 
    where
        contains(upper(rtrim(ttwi__impl0_.if1)), $1)
    order by
        upper(rtrim(ttwi__impl0_.if1)) asc, ttwi__impl0_.recid asc
     limit $2
2020-11-26 16:41:02.047 MSK [8133] fwd_user@fwd1 DETAIL:  parameters: $1 = '(CAT* | ANT*) & (SMALL | DEER)', $2 = '25'
2020-11-26 16:41:02.048 MSK [8133] fwd_user@fwd1 LOG:  duration: 0.601 ms  execute <unnamed>: 
    select 
        ttwi__impl0_.recid as col_0_0_ 
    from
        ttwi ttwi__impl0_ 
    where
        contains(upper(rtrim(ttwi__impl0_.if1)), $1)
    order by
        upper(rtrim(ttwi__impl0_.if1)) asc, ttwi__impl0_.recid asc
     limit $2
2020-11-26 16:41:02.048 MSK [8133] fwd_user@fwd1 DETAIL:  parameters: $1 = '(CAT* | ANT*) & (SMALL | DEER)', $2 = '25'
2020-11-26 16:41:02.048 MSK [8133] fwd_user@fwd1 LOG:  duration: 0.042 ms  parse <unnamed>: select if1 from ttwi where recid=$1
2020-11-26 16:41:02.048 MSK [8133] fwd_user@fwd1 LOG:  duration: 0.068 ms  bind <unnamed>: select if1 from ttwi where recid=$1
2020-11-26 16:41:02.048 MSK [8133] fwd_user@fwd1 DETAIL:  parameters: $1 = '10001'
2020-11-26 16:41:02.048 MSK [8133] fwd_user@fwd1 LOG:  duration: 0.012 ms  execute <unnamed>: select if1 from ttwi where recid=$1
2020-11-26 16:41:02.048 MSK [8133] fwd_user@fwd1 DETAIL:  parameters: $1 = '10001'

This is less mysterious but doesn't look efficient anyway.

#101 Updated by Igor Skornyakov over 3 years ago

One more example.
Consider the 4GL program:

def var expr as char init "(cat* | ant*) & (small | deer)".
def var n as int init 0.
for each ttwi where (if1 contains expr) or  (if1 contains 'big & elephant | small & rabbit'):
    n = n + 1.
    message n if1.
end.

The PostgreSQL log from running it with FWD is:
    select 
        ttwi__impl0_.recid as col_0_0_ 
    from
        ttwi ttwi__impl0_ 
    where
        (ttwi__impl0_.recid in (
        select recid from ttwi t
            join ttwi_if1 as w1 on (w1.parent = t.recid and (w1.word = $1 or w1.word = $2))
            join ttwi_if1 as w2 on (w2.parent = t.recid and (w2.word like $3 or w2.word like $4))
    )) or (ttwi__impl0_.recid in (
        select recid from ttwi t
            join ttwi_if1 as w1 on (w1.parent = t.recid and (w1.word = $5 or w1.word = $6))
            join ttwi_if1 as w2 on (w2.parent = t.recid and (w2.word = $7 or w2.word = $8))
            join ttwi_if1 as w3 on (w3.parent = t.recid and (w3.word = $9 or w3.word = $10))
            join ttwi_if1 as w4 on (w4.parent = t.recid and (w4.word = $11 or w4.word = $12))
    ))
    order by
        upper(rtrim(ttwi__impl0_.if1)) asc, ttwi__impl0_.recid asc
     limit $13
2020-11-26 18:02:28.556 MSK [21696] fwd_user@fwd1 LOG:  duration: 1.304 ms  bind <unnamed>: 
    select 
        ttwi__impl0_.recid as col_0_0_ 
    from
        ttwi ttwi__impl0_ 
    where
        (ttwi__impl0_.recid in (
        select recid from ttwi t
            join ttwi_if1 as w1 on (w1.parent = t.recid and (w1.word = $1 or w1.word = $2))
            join ttwi_if1 as w2 on (w2.parent = t.recid and (w2.word like $3 or w2.word like $4))
    )) or (ttwi__impl0_.recid in (
        select recid from ttwi t
            join ttwi_if1 as w1 on (w1.parent = t.recid and (w1.word = $5 or w1.word = $6))
            join ttwi_if1 as w2 on (w2.parent = t.recid and (w2.word = $7 or w2.word = $8))
            join ttwi_if1 as w3 on (w3.parent = t.recid and (w3.word = $9 or w3.word = $10))
            join ttwi_if1 as w4 on (w4.parent = t.recid and (w4.word = $11 or w4.word = $12))
    ))
    order by
        upper(rtrim(ttwi__impl0_.if1)) asc, ttwi__impl0_.recid asc
     limit $13
2020-11-26 18:02:28.556 MSK [21696] fwd_user@fwd1 DETAIL:  parameters: $1 = 'DEER', $2 = 'SMALL', $3 = 'ANT%', $4 = 'CAT%', $5 = 'RABBIT', $6 = 'ELEPHANT', $7 = 'RABBIT', $8 = 'BIG', $9 = 'SMALL', $10 = 'ELEPHANT', $11 = 'SMALL', $12 = 'BIG', $13 = '1'
2020-11-26 18:02:28.556 MSK [21696] fwd_user@fwd1 LOG:  duration: 0.060 ms  execute <unnamed>: 
    select 
        ttwi__impl0_.recid as col_0_0_ 
    from
        ttwi ttwi__impl0_ 
    where
        (ttwi__impl0_.recid in (
        select recid from ttwi t
            join ttwi_if1 as w1 on (w1.parent = t.recid and (w1.word = $1 or w1.word = $2))
            join ttwi_if1 as w2 on (w2.parent = t.recid and (w2.word like $3 or w2.word like $4))
    )) or (ttwi__impl0_.recid in (
        select recid from ttwi t
            join ttwi_if1 as w1 on (w1.parent = t.recid and (w1.word = $5 or w1.word = $6))
            join ttwi_if1 as w2 on (w2.parent = t.recid and (w2.word = $7 or w2.word = $8))
            join ttwi_if1 as w3 on (w3.parent = t.recid and (w3.word = $9 or w3.word = $10))
            join ttwi_if1 as w4 on (w4.parent = t.recid and (w4.word = $11 or w4.word = $12))
    ))
    order by
        upper(rtrim(ttwi__impl0_.if1)) asc, ttwi__impl0_.recid asc
     limit $13
2020-11-26 18:02:28.556 MSK [21696] fwd_user@fwd1 DETAIL:  parameters: $1 = 'DEER', $2 = 'SMALL', $3 = 'ANT%', $4 = 'CAT%', $5 = 'RABBIT', $6 = 'ELEPHANT', $7 = 'RABBIT', $8 = 'BIG', $9 = 'SMALL', $10 = 'ELEPHANT', $11 = 'SMALL', $12 = 'BIG', $13 = '1'
2020-11-26 18:02:28.560 MSK [21696] fwd_user@fwd1 LOG:  duration: 0.054 ms  parse <unnamed>: select if1 from ttwi where recid=$1
2020-11-26 18:02:28.560 MSK [21696] fwd_user@fwd1 LOG:  duration: 0.055 ms  bind <unnamed>: select if1 from ttwi where recid=$1
2020-11-26 18:02:28.560 MSK [21696] fwd_user@fwd1 DETAIL:  parameters: $1 = '10002'
2020-11-26 18:02:28.560 MSK [21696] fwd_user@fwd1 LOG:  duration: 0.011 ms  execute <unnamed>: select if1 from ttwi where recid=$1
2020-11-26 18:02:28.560 MSK [21696] fwd_user@fwd1 DETAIL:  parameters: $1 = '10002'
2020-11-26 18:02:28.610 MSK [21696] fwd_user@fwd1 LOG:  duration: 0.300 ms  parse <unnamed>: 
    select 
        ttwi__impl0_.recid as col_0_0_ 
    from
        ttwi ttwi__impl0_ 
    where
        (ttwi__impl0_.recid in (
        select recid from ttwi t
            join ttwi_if1 as w1 on (w1.parent = t.recid and (w1.word = $1 or w1.word = $2))
            join ttwi_if1 as w2 on (w2.parent = t.recid and (w2.word like $3 or w2.word like $4))
    )) or (ttwi__impl0_.recid in (
        select recid from ttwi t
            join ttwi_if1 as w1 on (w1.parent = t.recid and (w1.word = $5 or w1.word = $6))
            join ttwi_if1 as w2 on (w2.parent = t.recid and (w2.word = $7 or w2.word = $8))
            join ttwi_if1 as w3 on (w3.parent = t.recid and (w3.word = $9 or w3.word = $10))
            join ttwi_if1 as w4 on (w4.parent = t.recid and (w4.word = $11 or w4.word = $12))
    ))
    order by
        upper(rtrim(ttwi__impl0_.if1)) asc, ttwi__impl0_.recid asc
     limit $13
2020-11-26 18:02:28.613 MSK [21696] fwd_user@fwd1 LOG:  duration: 2.556 ms  bind <unnamed>: 
    select 
        ttwi__impl0_.recid as col_0_0_ 
    from
        ttwi ttwi__impl0_ 
    where
        (ttwi__impl0_.recid in (
        select recid from ttwi t
            join ttwi_if1 as w1 on (w1.parent = t.recid and (w1.word = $1 or w1.word = $2))
            join ttwi_if1 as w2 on (w2.parent = t.recid and (w2.word like $3 or w2.word like $4))
    )) or (ttwi__impl0_.recid in (
        select recid from ttwi t
            join ttwi_if1 as w1 on (w1.parent = t.recid and (w1.word = $5 or w1.word = $6))
            join ttwi_if1 as w2 on (w2.parent = t.recid and (w2.word = $7 or w2.word = $8))
            join ttwi_if1 as w3 on (w3.parent = t.recid and (w3.word = $9 or w3.word = $10))
            join ttwi_if1 as w4 on (w4.parent = t.recid and (w4.word = $11 or w4.word = $12))
    ))
    order by
        upper(rtrim(ttwi__impl0_.if1)) asc, ttwi__impl0_.recid asc
     limit $13
2020-11-26 18:02:28.613 MSK [21696] fwd_user@fwd1 DETAIL:  parameters: $1 = '(CAT* | ANT*) & (SMALL | DEER)', $2 = 'SMALL', $3 = 'ANT%', $4 = 'CAT%', $5 = 'RABBIT', $6 = 'ELEPHANT', $7 = 'RABBIT', $8 = 'BIG', $9 = 'SMALL', $10 = 'ELEPHANT', $11 = 'SMALL', $12 = 'BIG', $13 = '25'
2020-11-26 18:02:28.613 MSK [21696] fwd_user@fwd1 LOG:  duration: 0.056 ms  execute <unnamed>: 
    select 
        ttwi__impl0_.recid as col_0_0_ 
    from
        ttwi ttwi__impl0_ 
    where
        (ttwi__impl0_.recid in (
        select recid from ttwi t
            join ttwi_if1 as w1 on (w1.parent = t.recid and (w1.word = $1 or w1.word = $2))
            join ttwi_if1 as w2 on (w2.parent = t.recid and (w2.word like $3 or w2.word like $4))
    )) or (ttwi__impl0_.recid in (
        select recid from ttwi t
            join ttwi_if1 as w1 on (w1.parent = t.recid and (w1.word = $5 or w1.word = $6))
            join ttwi_if1 as w2 on (w2.parent = t.recid and (w2.word = $7 or w2.word = $8))
            join ttwi_if1 as w3 on (w3.parent = t.recid and (w3.word = $9 or w3.word = $10))
            join ttwi_if1 as w4 on (w4.parent = t.recid and (w4.word = $11 or w4.word = $12))
    ))
    order by
        upper(rtrim(ttwi__impl0_.if1)) asc, ttwi__impl0_.recid asc
     limit $13
2020-11-26 18:02:28.613 MSK [21696] fwd_user@fwd1 DETAIL:  parameters: $1 = '(CAT* | ANT*) & (SMALL | DEER)', $2 = 'SMALL', $3 = 'ANT%', $4 = 'CAT%', $5 = 'RABBIT', $6 = 'ELEPHANT', $7 = 'RABBIT', $8 = 'BIG', $9 = 'SMALL', $10 = 'ELEPHANT', $11 = 'SMALL', $12 = 'BIG', $13 = '25'
2020-11-26 18:02:28.613 MSK [21696] fwd_user@fwd1 LOG:  duration: 0.056 ms  parse <unnamed>: select if1 from ttwi where recid=$1
2020-11-26 18:02:28.613 MSK [21696] fwd_user@fwd1 LOG:  duration: 0.063 ms  bind <unnamed>: select if1 from ttwi where recid=$1
2020-11-26 18:02:28.613 MSK [21696] fwd_user@fwd1 DETAIL:  parameters: $1 = '10001'
2020-11-26 18:02:28.613 MSK [21696] fwd_user@fwd1 LOG:  duration: 0.010 ms  execute <unnamed>: select if1 from ttwi where recid=$1
2020-11-26 18:02:28.613 MSK [21696] fwd_user@fwd1 DETAIL:  parameters: $1 = '10001'

Here we get only 2 records retrieved (instead of 3).
All this is really confusing to me.

#102 Updated by Ovidiu Maxiniuc over 3 years ago

At first look the queries look OK to me, nice work!

The queries will try to fetch rows incrementally, in increasing bucket sizes. If I am not mistaken, they are something like 1, 25, 577, 13k+ rows. This is why your query is executed multiple times and subsequent statements will contain additional clauses for continuing from where they left off: and upper(rtrim(ttwi__impl0_.if1)) > upper($5).

It seems that this adds some issues which you need to isolate and identify. One thing I did not understand in the logs from note 99 is that somehow, in some cases, the original pattern string leaked into the SQL: parameters: $1 = '(CAT* | ANT*) & (SMALL | DEER)' .... I do not understand why this happened. Maybe a flaw in parameter management?

#103 Updated by Igor Skornyakov over 3 years ago

Ovidiu Maxiniuc wrote:

At first look the queries look OK to me, nice work!

The queries will try to fetch rows incrementally, in increasing bucket sizes. If I am not mistaken, they are something like 1, 25, 577, 13k+ rows. This is why your query is executed multiple times and subsequent statements will contain additional clauses for continuing from where they left off: and upper(rtrim(ttwi__impl0_.if1)) > upper($5).

I understand the idea. What I do not understand is why the behavior is so different for different queries. I also think that the logic is not completely reliable, for at least two reasons:
  1. Imagine that the full result set contains records with identical values of if1. In this case the and upper(rtrim(ttwi__impl0_.if1)) > upper($5) condition will most likely return an incorrect subset.
  2. What if the table was updated between subsequent calls?
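
For illustration, a minimal sketch of a continuation predicate that also compares the primary key (the parameter numbers are invented); with PostgreSQL row-value comparison, ties on if1 can no longer cause rows to be skipped or repeated between batches:

   -- hypothetical continuation query: order by the sort column AND the primary key,
   -- and continue from the (value, recid) pair of the last row already returned
   select recid
     from ttwi
    where (upper(rtrim(if1)), recid) > (upper($1), $2)
    order by upper(rtrim(if1)) asc, recid asc
    limit $3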

It seems that this adds some issues which you need to isolate and identify. One thing I did not understand in the logs from note 99 is that somehow in some cases the original pattern string leaked to SQL: parameters: $1 = '(CAT* | ANT*) & (SMALL | DEER)' .... I do not understand why this happened. Maybe a flaw in parameter management?

Yes, this is what I'm trying to figure out.

#104 Updated by Igor Skornyakov over 3 years ago

Well, the re-written query parameters are overridden in Persistence.getQuery(). The query is retrieved from the cache and contains the re-written SQLQuery and parameters. The query is not converted again, but the parameters are partially overridden (although paramCount is not changed).
The simplest solution I can see is not to cache re-written queries (e.g. by adding a corresponding flag), but I'm not sure if that is OK.
Any suggestions?
Thank you.

#105 Updated by Igor Skornyakov over 3 years ago

I've implemented a fix that doesn't allow re-use of cached but re-written queries. This fixes the issue with the query parameters I reported before (#1587-99).

Committed to 1587b revision 11836.

Please review.
If it is OK, I will start working on the generation of the database objects and import for tables with word indexes.
Thank you.

#106 Updated by Igor Skornyakov over 3 years ago

Branch 1587b was rebased to 3821c revision 11839.
Committed revision 11844.

#107 Updated by Ovidiu Maxiniuc over 3 years ago

Review of 1587b/r11844.

General note: I like the amount of effort put into obtaining the conjunctive form of the final SQL statement. I only have some minor performance and code format issues:
  • Persistence.java: just as a remark: it would have been better if the query was not added to the cache at all if it features a contains operator. Unfortunately, at that moment the FQL was not yet converted to SQL, so we do not have this piece of information. However, the current approach will drop the fully processed query. I wonder if it is worth duplicating the query and working on a disposable instance when the contains is detected. I am a bit sceptical about the performance gain, as the SQLQuery is quite shallow; its heavy duty is delegated to the FQL-to-SQL converter and the actual database calls;
  • DataTypeHelper.java: please move the newly added ARRAY type to the "Standard Java wrapper types" block;
  • FqlToSqlConverter:
    • methods that return Strings to be appended (containsQuery() and onClause()) can receive the sb as a parameter and append their output directly instead of creating a new StringBuilder locally. Never use String.format in combination with StringBuilder. It is 100x slower than appending to an existing builder. Replace string concatenation with append, even if the code is not as readable.
    • I do not like the aux.tblAlias and aux.fldName detection at line 1816, especially the sb.delete(). Not very solid and difficult to maintain. I think we should add some parameters to generateProperty() if needed and process it there;
  • Query.java: please reduce the visibility of paramCount. It should not be public;
  • generally: some of the parameter lists and javadocs do not comply with the GCD code standards.

#108 Updated by Igor Skornyakov over 3 years ago

Ovidiu Maxiniuc wrote:

Review of 1587b/r11844.

General note: I like the amount of effort put into obtaining the conjunctive form of the final SQL statement. I only have some minor performance and code format issues:
  • Persistence.java: just as a remark: it would have been better if the query was not added to the cache at all if it features a contains operator. Unfortunately, at that moment the FQL was not yet converted to SQL, so we do not have this piece of information. However, the current approach will drop the fully processed query. I wonder if it is worth duplicating the query and working on a disposable instance when the contains is detected. I am a bit sceptical about the performance gain, as the SQLQuery is quite shallow; its heavy duty is delegated to the FQL-to-SQL converter and the actual database calls;
  • DataTypeHelper.java: please move the newly added ARRAY type to the "Standard Java wrapper types" block;
  • FqlToSqlConverter:
    • methods that return Strings to be appended (containsQuery() and onClause()) can receive the sb as a parameter and append their output directly instead of creating a new StringBuilder locally. Never use String.format in combination with StringBuilder. It is 100x slower than appending to an existing builder. Replace string concatenation with append, even if the code is not as readable.
    • I do not like the aux.tblAlias and aux.fldName detection at line 1816, especially the sb.delete(). Not very solid and difficult to maintain. I think we should add some parameters to generateProperty() if needed and process it there;
  • Query.java: please reduce the visibility of paramCount. It should not be public;
  • generally: some of the parameter lists and javadocs do not comply with the GCD code standards.

Thank you. Re-working...

#109 Updated by Igor Skornyakov over 3 years ago

Ovidiu Maxiniuc wrote:

Review of 1587b/r11844.

General note: I like the amount of effort put into obtaining the conjunctive form of the final SQL statement. I only have some minor performance and code format issues:
  • Persistence.java: just as a remark: it would have been better if the query was not added to the cache at all if it features a contains operator. Unfortunately, at that moment the FQL was not yet converted to SQL, so we do not have this piece of information. However, the current approach will drop the fully processed query. I wonder if it is worth duplicating the query and working on a disposable instance when the contains is detected. I am a bit sceptical about the performance gain, as the SQLQuery is quite shallow; its heavy duty is delegated to the FQL-to-SQL converter and the actual database calls;

I see another problem with this. Because of the synchronized block in Persistence.Context.getQuery I understand that the staticQueryCache can be accessed from different threads. If two threads start to concurrently convert the same query with CONTAINS, the result will be disastrous. On the other hand, I do not see much reason to put a query into the cache so early. Maybe it makes sense to create the Query instance with a Consumer which puts it into the cache and is called by the Query after conversion, if that is possible? We may have other cases where the conversion result depends on the query parameters and/or modifies the parameters' list.
What do you think?

  • DataTypeHelper.java: please move the newly added ARRAY type to "Standard Java wrapper types" block;

Fixed.

  • FqlToSqlConverter:
    • methods that return Strings to be appended (containsQuery() and onClause()) can receive the sb as a parameter and append their output directly instead of creating a new StringBuilder locally. Never use String.format in combination with StringBuilder. It is 100x slower than appending to an existing builder. Replace string concatenation with append, even if the code is not as readable.

Fixed.

  • I do not like the aux.tblAlias and aux.fldName detection at line 1816 especially the sb.delete(). Not very solid and difficult to maintain. I think we should add some parameters to generateProperty() if needed and process it there;
  • Query.java: Please reduce the visibility of paramCount. It should not be public;

Fixed.

  • generally: some of the parameter lists and javadocs do not comply with the GCD code standards.

I've reviewed my code and made some fixes. Unfortunately, some formatting issues may remain unnoticed - my eyes simply do not see such things, sorry.

Committed to 1587b revision 11845.

#110 Updated by Igor Skornyakov over 3 years ago

Igor Skornyakov wrote:

I see another problem with this. Because of the synchronized block in Persistence.Context.getQuery I understand that the staticQueryCache can be accessed from different threads. If two threads start to concurrently convert the same query with CONTAINS, the result will be disastrous. On the other hand, I do not see much reason to put a query into the cache so early. Maybe it makes sense to create the Query instance with a Consumer which puts it into the cache and is called by the Query after conversion, if that is possible? We may have other cases where the conversion result depends on the query parameters and/or modifies the parameters' list.

I have implemented this trick. See 1587b rev.11846.
BTW: we can make the caching strategy more elaborate now. For example, cache only those queries whose conversion took more time than a threshold. The threshold value could be dynamically adjusted based on statistics from the already converted queries.

#111 Updated by Igor Skornyakov over 3 years ago

As we discussed before (#1587-65) we may have to use hints for the names of the database objects associated with word indexes (word table name, trigger function, and two triggers). I suggest using just a simple flat file with the following information:

<table name>.<field name>=<word table name>,<trigger function name>,<AFTER INSERT trigger name>,<AFTER UPDATE trigger name>

If there is no record for a table/field or a particular entry, then the default name (based on concatenation as in #1587-71) will be used.
This file will be used in database object generation, data import, and at runtime (e.g. it will be loaded and published by MetadataManager).
Is it OK?
Thank you.

#112 Updated by Eric Faulhaber over 3 years ago

Are there real cases of the converted names exceeding the PostgreSQL-imposed limits, or do limits also exist in the 4GL which make such a case impossible? In other words, is this only a theoretical problem, or a practical one? If it is a practical problem...

Is there a compelling reason to not use the existing schema hint infrastructure? We would have to define new XML syntax to cover this case, but I don't want to maintain yet another special-purpose file, if we don't have to.

The hints could be read from <schema-name>.schema.hints during conversion. The non-default, hinted names would be stored as annotations in the DMO interface, like we do for all other non-schema metadata, and added to DMOMeta. If the default convention is used, I don't think it is necessary to store an annotation, since the names can be derived during DMOMeta creation at runtime.

Schema conversion should warn if a converted identifier name (using the default naming convention) exceeds the known limit for a particular database dialect, so we know which identifiers require hints. In fact, we probably should be doing this for converted identifiers in general (not just in the context of word indices). I was not previously aware of this limit for PostgreSQL.
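
(For reference, the PostgreSQL limit in question is 63 bytes per identifier, i.e. NAMEDATALEN - 1 with the default build. Longer names are silently truncated with only a NOTICE, so two distinct generated names can collide; the table names below are invented purely to illustrate this.)

   -- both identifiers share the same first 63 bytes, so after truncation the
   -- second statement fails with "relation ... already exists"
   create table some_table_with_a_rather_long_name_for_a_word_indexed_field_number_one_if1 (recid bigint);
   create table some_table_with_a_rather_long_name_for_a_word_indexed_field_number_two_if1 (recid bigint);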

#113 Updated by Igor Skornyakov over 3 years ago

Eric Faulhaber wrote:

Are there real cases of the converted names exceeding the PostgreSQL-imposed limits, or do limits also exist in the 4GL which make such a case impossible? In other words, is this only a theoretical problem, or a practical one? If it is a practical problem...

Is there a compelling reason to not use the existing schema hint infrastructure? We would have to define new XML syntax to cover this case, but I don't want to maintain yet another special-purpose file, if we don't have to.

The hints could be read from <schema-name>.schema.hints during conversion. The non-default, hinted names would be stored as annotations in the DMO interface, like we do for all other non-schema metadata, and added to DMOMeta. If the default convention is used, I don't think it is necessary to store an annotation, since the names can be derived during DMOMeta creation at runtime.

Schema conversion should warn if a converted identifier name (using the default naming convention) exceeds the known limit for a particular database dialect, so we know which identifiers require hints. In fact, we probably should be doing this for converted identifiers in general (not just in the context of word indices). I was not previously aware of this limit for PostgreSQL.

I'm not sure whether there are any limitations on table/field names in 4GL, but the PostgreSQL limitations look restrictive enough that this situation must be considered. I was not aware of our standard approach with schema hints, sorry for my ignorance. Of course, it is better not to "multiply entities without necessity", and I will use this approach.
Thank you.

#114 Updated by Greg Shah over 3 years ago

Here is another point to consider.

We must avoid adding extra files to the deployment requirements for the FWD server. We have flexibility for anything that is only conversion time BUT we also know that WHERE clause conversion is also done at runtime for dynamic queries. So if this would potentially be a runtime requirement, then we would not emit a flat file for this usage.

Options that are acceptable include:

  • emitting configuration in code (e.g. annotations in a DMO) preferred
  • add configuration to the directory
  • adding data to the converted application jar that is loaded as a resource (this should be avoided but is possible; XML should be considered to make this well-structured)

#115 Updated by Igor Skornyakov over 3 years ago

Finished the generation of the database objects for word index support for the PostgreSQL dialect.
Committed to 1587b rev. 11847.
Please see a sample database and generated objects attached.
Please note that an additional SQL file (schema_trigger*.sql) is generated. To create the corresponding objects, please add the following at the end of the create.db.pg task:

      <exec executable="psql">
         <env key="PGPASSWORD" value="${sql.user.pass}" />
         <arg value="-U" />
         <arg value="${sql.user}" />
         <arg value="-h" />
         <arg value="${db.host}" />
         <arg value="-p" />
         <arg value="${db.port}" />
         <arg value="-f" />
         <arg value="ddl/schema_trigger_${db.name}_postgresql.sql" />
         <arg value="${db.name}" />
      </exec>
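
For readers without the attachments, the generated objects have roughly the following shape (a simplified, hypothetical sketch for the single word-indexed field of the ttwi test table; the column types, the object names beyond the word table itself, and the words(...) word-breaking helper are assumptions, and the actual generated schema_trigger*.sql may differ):

   -- word table: one row per (record, word) pair for the indexed field
   create table ttwi_if1 (parent bigint not null, word text not null);
   create index idx__ttwi_if1 on ttwi_if1 (word, parent);

   -- trigger function that keeps the word table in sync with ttwi.if1
   create or replace function ttwi_if1_trg() returns trigger as $$
   begin
      delete from ttwi_if1 where parent = new.recid;
      insert into ttwi_if1 (parent, word)
      select new.recid, w from unnest(words(new.if1)) as w;
      return new;
   end;
   $$ language plpgsql;

   -- the two AFTER triggers keep the word table current
   create trigger ttwi_if1_ins after insert on ttwi
      for each row execute procedure ttwi_if1_trg();
   create trigger ttwi_if1_upd after update of if1 on ttwi
      for each row execute procedure ttwi_if1_trg();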

Please note that the generated objects' renaming via hints is not yet implemented. Working on it now. We really need it, as OpenEdge table/field names can contain up to 32 characters.

#116 Updated by Greg Shah over 3 years ago

Please note that the generated objects' renaming via hints is not yet implemented. Working on it now. We really need it, as OpenEdge table/field names can contain up to 32 characters.

To be clear: you are going to generate a "standard" XML schema hints file during conversion?

#117 Updated by Igor Skornyakov over 3 years ago

Greg Shah wrote:

Please note that the generated objects' renaming via hints is not yet implemented. Working on it now. We really need it, as OpenEdge table/field names can contain up to 32 characters.

To be clear: you are going to generate a "standard" XML schema hints file during conversion?

I'm going to use the approach suggested by Eric in #1587-112. I understand this is our standard way to deal with this kind of problem. I do not know the full details yet; analysing.

#118 Updated by Greg Shah over 3 years ago

My point is that we should not require manual intervention in order to convert something that just happens to exceed some arbitrary name length in a database. We should automatically calculate a safe name for these cases. If the customer wishes to override that with a better name using schema hints, that is fine. But it makes no sense to force the creation of a manual hint for each such case. Doing that would add more points of failure, extra work, a more complex process, more documentation and more cost for no benefit.

Surely we can come up with an algorithm to create a valid name. And we can add code to ensure it is unique.

#119 Updated by Igor Skornyakov over 3 years ago

Greg Shah wrote:

My point is that we should not require manual intervention in order to convert something that just happens to exceed some arbitrary name length in a database. We should automatically calculate a safe name for these cases. If the customer wishes to override that with a better name using schema hints, that is fine. But it makes no sense to force the creation of a manual hint for each such case. Doing that would add more points of failure, extra work, a more complex process, more documentation and more cost for no benefit.

Surely we can come up with an algorithm to create a valid name. And we can add code to ensure it is unique.

OK, I will think about this. Thank you.

#120 Updated by Eric Faulhaber over 3 years ago

Igor, I'm beginning to look over 1587b/11847.

Other than working out the name length issue discussed in the previous posting, are there any functional features left to implement? One question in this regard: I don't see any explicit update to the data import process to populate the word table(s) with initial values, based on imported, word-indexed fields. Is the idea to create the triggers during the initial schema creation, such that these are just populated automatically by the primary record inserts during import?

Do you have a sense of the performance of this implementation, compared to the initial UDF implementation? Things to consider:

  • the cost of the re-indexing work done by the trigger at insert/update (and delete);
  • small vs. large text data;
  • heavy read vs. heavy write.

#121 Updated by Igor Skornyakov over 3 years ago

Eric Faulhaber wrote:

Other than working out the name length issue discussed in the previous posting, are there any functional features left to implement? One question in this regard: I don't see any explicit update to the data import process to populate the word table(s) with initial values, based on imported, word-indexed fields. Is the idea to create the triggers during the initial schema creation, such that these are just populated automatically by the primary record inserts during import?

At this moment I've made no changes to the import, and the word tables are populated by the triggers during the import of the master table. I've tested it and it works. However, I think it may make sense to create the triggers after import and populate the word tables 'manually', roughly along the lines of the sketch below.
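
A minimal sketch of such a post-import population, assuming a words(text) helper UDF that returns the indexed words of a value (the actual UDF name and signature in FWD may differ):

   -- bulk-populate the word table from the already imported master table
   insert into ttwi_if1 (parent, word)
   select t.recid, w.word
     from ttwi t,
          unnest(words(t.if1)) as w(word);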

Do you have a sense of the performance of this implementation, compared to the initial UDF implementation? Things to consider:

  • the cost of the re-indexing work done by the trigger at insert/update (and delete);
  • small vs. large text data;
  • heavy read vs. heavy write.

I've not performed any performance testing so far (I've just finished the implementation). Based on my experience with relational databases, I'm sure that, at least for simple queries, the new approach should be much faster than the one based on a full scan with a UDF. For corner cases it is difficult to predict, as it depends on the PostgreSQL query optimizer logic.
Maybe it makes sense to rewrite the query using a WITH clause, as it may help the optimizer make a better decision (see the sketch at the end of this note).
Regarding the cost of re-indexing on insert/update/delete, I can only say that it will add some overhead, but I do not think it will be much more dramatic than the overhead 4GL itself incurs to support word indexes (regardless of how they are implemented). Of course, performance testing is required, but I'm not sure what would be better - to use real data from a big customer table or to generate test data.
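
As an illustration of the WITH idea, a hypothetical rewrite of one branch of the CONTAINS query shown earlier, with the intended literals substituted for the parameters. Note that PostgreSQL materializes CTEs before version 12 and inlines them afterwards, so the effect on the plan depends on the server version:

   with matches as
   (
      select t.recid
        from ttwi t
             join ttwi_if1 w1 on w1.parent = t.recid and (w1.word like 'CAT%' or w1.word like 'ANT%')
             join ttwi_if1 w2 on w2.parent = t.recid and (w2.word = 'SMALL' or w2.word = 'DEER')
   )
   select recid
     from ttwi
    where recid in (select recid from matches)
    order by upper(rtrim(if1)) asc, recid asc
    limit 25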

#122 Updated by Igor Skornyakov over 3 years ago

BTW: if I understand correctly, we create primary keys and foreign key indexes (for extent tables) before the import. How does this affect the speed of the import?
Thank you.

#123 Updated by Eric Faulhaber over 3 years ago

Igor Skornyakov wrote:

BTW: if I understand correctly, we create primary keys and foreign key indexes (for extent tables) before the import. How does this affect the speed of the import?

Yes, and all unique indices as well. Non-unique indices are created after the import of a table is complete. This very likely slows the import, though I've never created a baseline without doing this, so I have no baseline for comparison. We rely on the unique indices to identify duplicate records which are supposed to be unique. I don't know how, but sometimes these appear in the *.d files. If we had to detect these in the import code instead of letting the database find the errors, it certainly would be even slower. If we created the unique indices later and there were duplicate records, I'm not sure how we would recover without manual intervention.

As to the creation of the primary keys and the foreign keys from secondary (extent) tables to primary tables, we probably could defer this until after the records are imported, if the performance difference is worth it. We have complete control over the generation of the primary key values, so we know they should be safe.

My question about performance was primarily concerned with runtime performance, rather than import performance. Nevertheless, I appreciate that you are considering all aspects of performance. The reason we create the primary keys and the foreign keys before import is primarily historical: we originally relied on Hibernate to generate the DDL, and it would include the statements to create these keys in the same SQL script as the statements to create the tables. When we removed Hibernate, we continued to follow this practice by default, as it was known to work well. Ovidiu, correct me if I'm wrong in my assumptions, if you considered and rejected these different approaches when rewriting the DDL generation code.

In any event, now we have full control over the DDL generation and import, so we can do what makes the most sense.

#124 Updated by Igor Skornyakov over 3 years ago

Eric Faulhaber wrote:

Yes, and all unique indices as well. Non-unique indices are created after the import of a table is complete. This very likely slows the import, though I've never created a baseline without doing this, so I have no baseline for comparison. We rely on the unique indices to identify duplicate records which are supposed to be unique. I don't know how, but sometimes these appear in the *.d files. If we had to detect these in the import code instead of letting the database find the errors, it certainly would be even slower. If we created the unique indices later and there were duplicate records, I'm not sure how we would recover without manual intervention.

As to the creation of the primary keys and the foreign keys from secondary (extent) tables to primary tables, we probably could defer this until after the records are imported, if the performance difference is worth it. We have complete control over the generation of the primary key values, so we know they should be safe.

My question about performance was primarily concerned with runtime performance, rather than import performance. Nevertheless, I appreciate that you are considering all aspects of performance. The reason we create the primary keys and the foreign keys before import is primarily historical: we originally relied on Hibernate to generate the DDL, and it would include the statements to create these keys in the same SQL script as the statements to create the tables. When we removed Hibernate, we continued to follow this practice by default, as it was known to work well. Ovidiu, correct me if I'm wrong in my assumptions, if you considered and rejected these different approaches when rewriting the DDL generation code.

In any event, now we have full control over the DDL generation and import, so we can do what makes the most sense.

I see. Thank you for the clarification.

#125 Updated by Ovidiu Maxiniuc over 3 years ago

Eric Faulhaber wrote:

The reason we create the primary keys and the foreign keys before import is primarily historical: we originally relied on Hibernate to generate the DDL, and it would include the statements to create these keys in the same SQL script as the statements to create the tables. When we removed Hibernate, we continued to follow this practice by default, as it was known to work well. Ovidiu, correct me if I'm wrong in my assumptions, if you considered and rejected these different approaches when rewriting the DDL generation code.

True. Since this worked well with Hibernate, I kept it with the new proprietary solution. Also, if we import the full data set beforehand and only then attempt to impose the constraints, data corruption may already have occurred. Adding the constraints at the end might be impossible if the data is corrupted, and the problem is that you cannot decide at that point how to fix it. If the constraints are already set, the import will report (and possibly stop on) such errors when they occur.

In any event, now we have full control over the DDL generation and import, so we can do what makes the most sense.

Indeed. Although the general structure will probably stay the same, we can do slight changes or even enforce more constraints just to be sure the (new) data is sound/protected.

#126 Updated by Igor Skornyakov over 3 years ago

Ovidiu Maxiniuc wrote:

True. Since this worked well with Hibernate, I kept it with the new proprietary solution. Also, if we import the full data set beforehand and only then attempt to impose the constraints, data corruption may already have occurred. Adding the constraints at the end might be impossible if the data is corrupted, and the problem is that you cannot decide at that point how to fix it. If the constraints are already set, the import will report (and possibly stop on) such errors when they occur.

Maybe it makes sense to add an import option to control when to create these indexes? I understand that we import the same data multiple times in most cases, and if the first import was successful, this may speed up the next one.

#127 Updated by Ovidiu Maxiniuc over 3 years ago

Igor Skornyakov wrote:

Maybe it makes sense to add an import option to control when to create these indexes?

Why? All decisions at import are taken by FWD. What can change the workflow of the process?

I understand that we import the same data multiple times in most cases, and if the first import was successful, this may speed up the next one.

We do this only when something has changed, for example if we decided that another SQL type is a better match for a 4GL counterpart. But this usually means a new database schema and causes version incompatibilities, so there is nothing to speed up.

Indeed, the import process can be resumed instead, but only at the table level. The import will check whether a table is empty in the database and attempt to populate it. If there is at least one record in the table (e.g. left over from a failed import), the table is skipped.

#128 Updated by Eric Faulhaber over 3 years ago

Overall, 1587b seems to be a nice addition to the project! This is not a full code review, but I do have a few concerns/questions upon my first reading of 1587b/11847.

  • I am a bit confused by the words UDF. It is added as a PL/Java function in Functions and p2jpl.ddr, but it also seems to be added as a PL/PGSQL function in DDLGeneratorWorker. Doesn't this cause a conflict?
  • Was the contains UDF left in place as a backstop for non-PostgreSQL dialects? Ultimately, we have to make this work properly on all supported databases. Hopefully, the SQL features in use are very similar or the same across dialects.
  • DDLGeneratorWorker, FqlToSqlConverter, etc. (most runtime classes) should be dialect-neutral. Any dialect-specific work should be performed in one of the concrete subclasses of Dialect (as opposed to testing dialect instanceof P2JPostgreSQLDialect and then performing dialect-specific work inline in the "neutral" class). There should be no awareness of specific dialect implementations in this neutral layer.
  • Hard-coded use of the recid string for primary key name. We should use the Session.PK variable (at runtime) or schema.schemaConfig.primaryKeyName (at conversion, assuming the SchemaWorker is loaded with the namespace schema by the TRPL program).

Ovidiu, please look at dmo_common.rules. Based on Igor's change, it looks like we were omitting the index annotation from the DMO interfaces until now, if the index was a word index. How, then, are we getting all these "Word indices not supported" warnings from RecordMeta during server startup of customer projects? I am not remembering something correctly...maybe you can refresh my memory.

#129 Updated by Igor Skornyakov over 3 years ago

Eric Faulhaber wrote:

  • I am a bit confused by the words UDF. It is added as a PL/Java function in Functions and p2jpl.ddr, but it also seems to be added as a PL/PGSQL function in DDLGeneratorWorker. Doesn't this cause a conflict?

No, it doesn't as the PL/PGSQL function has a different signature from the Java UDFs.

  • Was the contains UDF left in place as a backstop for non-PostgreSQL dialects? Ultimately, we have to make this work properly on all supported databases. Hopefully, the SQL features in use are very similar or the same across dialects.

I understand this, but I've started from PostgreSQL as at this moment I have a better working knowledge of it.

  • DDLGeneratorWorker, FqlToSqlConverter, etc. (most runtime classes) should be dialect-neutral. Any dialect-specific work should be performed in one of the concrete subclasses of Dialect (as opposed to testing dialect instanceof P2JPostgreSQLDialect and then performing dialect-specific work inline in the "neutral" class). There should be no awareness of specific dialect implementations in this neutral layer.

I see. I will refactor my code accordingly.

  • Hard-coded use of the recid string for primary key name. We should use the Session.PK variable (at runtime) or schema.schemaConfig.primaryKeyName (at conversion, assuming the SchemaWorker is loaded with the namespace schema by the TRPL program).

Yes, this is my fault. Will be re-worked. Thank you.

#130 Updated by Igor Skornyakov over 3 years ago

Sorry, I forgot to add: at this moment, for the dialects other than PostgreSQL, the CONTAINS UDF will still be used.

#131 Updated by Igor Skornyakov over 3 years ago

I've re-worked the code to address Eric's comments (#1587-128).
There is a problem with renaming word tables - we need the new names at runtime for SQL re-writing. Looking for a solution.

#132 Updated by Igor Skornyakov over 3 years ago

Igor Skornyakov wrote:

I've re-worked the code to address Eric's comments (#1587-128).

Committed to 1587b revision 11848

#133 Updated by Eric Faulhaber over 3 years ago

Igor Skornyakov wrote:

I've re-worked the code to address Eric's comments (#1587-128).
There is a problem with renaming word tables - we need the new names at runtime for SQL re-writing. Looking for a solution.

For things like this, we usually compute the value at schema conversion time and write it as an annotation to the DMO interface, then load it into a DmoMeta instance at server startup. Every RecordBuffer has access to its associated DmoMeta instance, so if you have access to a RecordBuffer, you automatically have fast, direct access to its DmoMeta, without a map lookup.

#134 Updated by Igor Skornyakov over 3 years ago

Eric Faulhaber wrote:

For things like this, we usually compute the value at schema conversion time and write it as an annotation to the DMO interface, then load it into a DmoMeta instance at server startup. Every RecordBuffer has access to its associated DmoMeta instance, so if you have access to a RecordBuffer, you automatically have fast, direct access to its DmoMeta, without a map lookup.

Thank you. The problem is that at this moment I do not understand how to add an annotation from the DDLGeneratorWorker. It seems that I have to calculate the name before it is called.

#135 Updated by Ovidiu Maxiniuc over 3 years ago

Eric Faulhaber wrote:

Ovidiu, please look at dmo_common.rules. Based on Igor's change, it looks like we were omitting the index annotation from the DMO interfaces until now, if the index was a word index. How, then, are we getting all these "Word indices not supported" warnings from RecordMeta during server startup of customer projects? I am not remembering something correctly...maybe you can refresh my memory.

The word indexes were generated in the DMO interfaces, hence the "Word indices not supported" message, but the index definition was not generated as a DDL definition.
As a result, the runtime and the DMO itself were aware of these indexes, but in practice they were not usable because of the very different semantics. This is why the index was not created in the database at all. Our queries wouldn't benefit from it, so we skipped its DDL to avoid the additional work on the database side when records were inserted/updated.

#136 Updated by Eric Faulhaber over 3 years ago

Igor Skornyakov wrote:

The problem is that at this moment I do not understand how to add an annotation from the DDLGeneratorWorker. It seems that I have to calculate the name before it is called.

Correct. DDLGeneratorWorker would not be the producer of these names. Rather, it would be a consumer of this information, which could be produced earlier, during conversion. Likewise, any runtime facility (e.g., the query rewriter) would be a consumer of this information. To keep it simple, we could use a least-common-denominator approach for the names. That is, if a PostgreSQL (or other dialect) size limit requires a custom name, we could assume all dialects require it, so we wouldn't have to differentiate by dialect. This will keep both the name creation logic and the DMO interface annotations cleaner.

See the createCompileUnit TRPL function in rules/schema/dmo_common.rules for examples of the annotations created for a DMO interface. If needed, your name creation logic would be invoked from here and the result stored as a Java annotation in the DMO interface.

#137 Updated by Igor Skornyakov over 3 years ago

Eric Faulhaber wrote:

See the createCompileUnit TRPL function in rules/schema/dmo_common.rules for examples of the annotations created for a DMO interface. If needed, your name creation logic would be invoked from here and the result stored as a Java annotation in the DMO interface.

Thank you! I'm looking into rules/schema/dmo_common.rules now, but it seems to be the wrong place.

#138 Updated by Eric Faulhaber over 3 years ago

Igor Skornyakov wrote:

Eric Faulhaber wrote:

See the createCompileUnit TRPL function in rules/schema/dmo_common.rules for examples of the annotations created for a DMO interface. If needed, your name creation logic would be invoked from here and the result stored as a Java annotation in the DMO interface.

Thank you! I'm looking into rules/schema/dmo_common.rules now, but it seems to be the wrong place.

OK, I think this is the place the Java annotations are actually written into the Java AST which becomes the DMO interface file. The name would actually be calculated earlier and added to the P2O AST. A good place to add TRPL logic to invoke your name computation would be within this TRPL code (around line 729) in p2o_post.xml:

            <!-- if word, mark the peer as such -->
            <rule>this.isAnnotation("word")
               <action>peer.putAnnotation("word", true)</action>
            </rule>

You could integrate your name computation Java code into DataModelWorker, so it can be invoked from TRPL. Then invoke it from p2o_post.xml and use peer.putAnnotation to store an AST annotation with the computed name. Later, retrieve this annotation and write it out as a Java annotation from createCompileUnit in dmo_common.rules.

The thing to understand is that in p2o_post.xml, we are working with two ASTs: a P2O AST, which is derived from the 4GL schema, and a Java AST (the "peer" AST) which we are building up to be the DMO interface Java AST. You want to store an AST annotation with the computed name in the peer AST, so that it is available a few steps later in the pipeline, when we are writing out the Java annotations in dmo_common.rules.

Ovidiu: please reality check my statements and file references above, in case my recollection is flawed. I have not looked at this code in a while, and I know you made some significant changes recently when we dropped the generation of the implementation DMOs to go with the interface DMOs only. I just want to be sure I am not misleading Igor with my suggestions, or pointing him at recently deceased code.

#139 Updated by Igor Skornyakov over 3 years ago

Eric Faulhaber wrote:

OK, I think this is the place the Java annotations are actually written into the Java AST which becomes the DMO interface file. The name would actually be calculated earlier and added to the P2O AST. A good place to add TRPL logic to invoke your name computation would be within this TRPL code (around line 729) in p2o_post.xml:

[...]

You could integrate your name computation Java code into DataModelWorker, so it can be invoked from TRPL. Then invoke it from p2o_post.xml and use peer.putAnnotation to store an AST annotation with the computed name. Later, retrieve this annotation and write it out as a Java annotation from createCompileUnit in dmo_common.rules.

The thing to understand is that in p2o_post.xml, we are working with two ASTs: a P2O AST, which is derived from the 4GL schema, and a Java AST (the "peer" AST) which we are building up to be the DMO interface Java AST. You want to store an AST annotation with the computed name in the peer AST, so that it is available a few steps later in the pipeline, when we are writing out the Java annotations in dmo_common.rules.

Got it. Thank you!

#140 Updated by Ovidiu Maxiniuc over 3 years ago

Eric Faulhaber wrote:

Ovidiu: please reality check my statements and file references above, in case my recollection is flawed. I have not looked at this code in a while, and I know you made some significant changes recently when we dropped the generation of the implementation DMOs to go with the interface DMOs only. I just want to be sure I am not misleading Igor with my suggestions, or pointing him at recently deceased code.

Yes, the information is mainly correct. I would look back one step, looking for the name.convert() calls in p2o.xml. I assume the problem is the SQL table name so the second parameter is TYPE_TABLE. That call is further delegated to NameConverter.convert(). The problem is, at that level, you only have access to the table name itself, not the full sub-tree to inspect. Maybe, in NameConverterWorker you can access the resolver.getSourceAst() to obtain the current AST being processed.

Bottom line, I guess the best place to intercept is in NameConverterWorker$Library.convert(String, int, String, Map). After obtaining the result from the converter, test the type and whether its AST is involved in word indexes. If so, try to adjust the result obtained. Alternatively, you can do this for word index fields, too, if this helps avoid SQL naming limits.

#141 Updated by Igor Skornyakov over 3 years ago

Ovidiu Maxiniuc wrote:

Yes, the information is mainly correct. I would look back one step, looking for the name.convert() calls in p2o.xml. I assume the problem is the SQL table name so the second parameter is TYPE_TABLE. That call is further delegated to NameConverter.convert(). The problem is, at that level, you only have access to the table name itself, not the full sub-tree to inspect. Maybe, in NameConverterWorker you can access the resolver.getSourceAst() to obtain the current AST being processed.

Bottom line, I guess the best place to intercept is in NameConverterWorker$Library.convert(String, int, String, Map). After obtaining the result from the converter, test the type and whether its AST is involved in word indexes. If so, try to adjust the result obtained. Alternatively, you can do this for word index fields, too, if this helps avoid SQL naming limits.

Thank you Ovidiu!

#142 Updated by Igor Skornyakov over 3 years ago

Ovidiu Maxiniuc wrote:

Yes, the information is mainly correct. I would look back one step, looking for the name.convert() calls in p2o.xml. I assume the problem is the SQL table name so the second parameter is TYPE_TABLE. That call is further delegated to NameConverter.convert(). The problem is, at that level, you only have access to the table name itself, not the full sub-tree to inspect. Maybe, in NameConverterWorker you can access the resolver.getSourceAst() to obtain the current AST being processed.

Bottom line, I guess the best place to intercept is in NameConverterWorker$Library.convert(String, int, String, Map). After obtaining the result from the converter, test the type and whether its AST is involved in word indexes. If so, try to adjust the result obtained. Alternatively, you can do this for word index fields, too, if this helps avoid SQL naming limits.

Ovidiu.
I see the following problem with this approach. I understand that NameConverterWorker is dialect-neutral, while the naming limits depend on the dialect. It seems that we have to deal with the name limits at the DDL generation step(?). Am I right?
I also see another possible problem. In NameConverterWorker$Library.convert(String, int, String, Map) we check and resolve only conflicts between table names, while a table name also cannot be the same as e.g. an index name (at least for PostgreSQL). Or am I missing something?
Thank you.

#143 Updated by Igor Skornyakov over 3 years ago

Well, I think I've found a 'right' place to add information about word table names (there can be different names for different dialects) to the Index annotation of the table DMO interface.
Now I need to access this data from the FqlToSQLConverter. What is the best way to get the table DMO by the table name from this class?
Thank you.
BTW: I understand that it is possible that there is more than one word index for the same field (although it doesn't make sense). I'm not 100% sure that my code will work correctly in this case, but I do not think that it is a realistic scenario.

#144 Updated by Ovidiu Maxiniuc over 3 years ago

Igor Skornyakov wrote:

I see the following problem with this approach. I understand that NameConverterWorker is dialect-neutral, while the naming limits depend on the dialect. It seems that we have to deal with the name limits at the DDL generation step(?). Am I right?

I thought we used the least-common-denominator approach for the names (see Eric's entry #1587-136). It makes it easier to maintain and debug the code: you know exactly what to look for in the database. The DDL generation step may be too late; the values computed in p2o might have been used in other places, too. Nevertheless, I understand that what you need is a name for the new special hidden tables, only accessed by the new word-index related code and invisible to the rest of the code. In this case, the names can be selected at any moment, provided that they do not collide with the existing table names.

I also see another possible problem. In NameConverterWorker$Library.convert(String, int, String, Map) we check and resolve only conflicts between table names, while a table name also cannot be the same as e.g. an index name (at least for PostgreSQL). Or am I missing something?

The index names are prefixed by idx__ so the collision is excluded.

#145 Updated by Ovidiu Maxiniuc over 3 years ago

Igor Skornyakov wrote:

Well, I think I've found a 'right' place to add information about word table names (there can be different names for different dialects) to the Index annotation of the table DMO interface.
Now I need to access this data from the FqlToSQLConverter. What is the best way to get the table DMO by the table name from this class?

If I understand correctly, you need to map the DMOs (DmoMeta most likely) using SQL names as keys. There is no such option implemented, because SQL names (tables and columns) are the 'final' product of our processing.

BTW: I understand that it is possible that there is more than one word index for the same field (although it doesn't make sense). I'm not 100% sure that my code will work correctly in this case, but I do not think that it is a realistic scenario.

If it is possible in OE/4GL, we need to support it too, even if it does not make sense.

#146 Updated by Igor Skornyakov over 3 years ago

Ovidiu Maxiniuc wrote:

I thought we used the least-common-denominator approach for the names (see Eric's entry #1587-136). It makes it easier to maintain and debug the code: you know exactly what to look for in the database. The DDL generation step may be too late; the values computed in p2o might have been used in other places, too. Nevertheless, I understand that what you need is a name for the new special hidden tables, only accessed by the new word-index related code and invisible to the rest of the code. In this case, the names can be selected at any moment, provided that they do not collide with the existing table names.

Oh, I forgot about the idea to use the least-common-denominator approach for the names, sorry. Anyway, I've managed to add the info about the word table name to the DMO from DDLGeneratorWorker with only a minor change in dmo_common.rules. At this moment I do not use the least-common-denominator approach for the names, but it can be easily done, although it will not significantly simplify the logic.

I also see another possible problem. In the NameConverterWorker$Library.convert(String, int, String, Map) we check and resolve only conflicts between table names, while the table name cannot be the same as e.g. index name (at least for PostgreSQL). Or I miss something?

The index names are prefixed by idx__ so the collision is excluded.

Does it mean that, since a word table's name is a concatenation of the table name and the field name separated by '_', a collision can only occur with the name of another word table, and only after truncation?

#147 Updated by Eric Faulhaber over 3 years ago

Igor Skornyakov wrote:

At this moment I do not use the least-common-denominator approach for the names, but it can be easily done, although it will not significantly simplify the logic.

Please do make the change. Even if we don't simplify the logic much, we want to keep the converted code portable across database implementations.

#148 Updated by Igor Skornyakov over 3 years ago

Ovidiu Maxiniuc wrote:

If I understand correctly, you need to map the DMOs (DmoMeta most likely) using SQL names as keys. There is no such option implemented, because SQL names (tables and columns) are the 'final' product of our processing.

Yes, I need exactly this. I understand that the DMOs appear during processing. Maybe it is possible to save the link somewhere?

BTW: I understand that it is possible that there is more than one word index for the same field (although it doesn't make sense). I'm not 100% sure that my code will work correctly in this case, but I do not think that it is a realistic scenario.

If it is possible in OE/4GL, we need to support it too, even if it does not make sense.

OK, I will double-check and test.

#149 Updated by Igor Skornyakov over 3 years ago

Eric Faulhaber wrote:

Please do make the change. Even if we don't simplify the logic much, we want to keep the converted code portable across database implementations.

OK, I will do this. Please note however that it affects only DDL statements which are dialect-dependent anyway.

#150 Updated by Eric Faulhaber over 3 years ago

Igor Skornyakov wrote:

Eric Faulhaber wrote:

Please do make the change. Even if we don't simplify the logic much, we want to keep the converted code portable across database implementations.

OK, I will do this. Please note however that it affects only DDL statements which are dialect-dependent anyway.

What about the Java annotation in the DMO interface? My point is, I just want one name stored in one annotation per word index, which is safe to use with all supported database dialects, not one name/annotation per dialect.

#151 Updated by Igor Skornyakov over 3 years ago

Eric Faulhaber wrote:

What about the Java annotation in the DMO interface? My point is, I just want one name stored in one annotation per word index, which is safe to use with all supported database dialects, not one name/annotation per dialect.

At this moment the Index annotation looks like this:

   @Index(name = "ttwi_wi1", legacy = "wi1", word = true, components = 
   {
      @IndexComponent(name = "if1", legacy = "if1")
   }, wordtablenames = "h2:ttwi_if1;postgresql:ttwi_if1"),

But I will simplify this to just wordtablename = "ttwi_if1".

#152 Updated by Igor Skornyakov over 3 years ago

Generation of the short names committed to 1587b revision 11849.
Working on using the generated names in FqlToSqlConverter.

#153 Updated by Igor Skornyakov over 3 years ago

Branch 1587b was rebased to 3821c revision 11860.
Pushed up to revision 11870.

#154 Updated by Igor Skornyakov over 3 years ago

I've finally decided to augment MetadataManager with word table support. So the implementation for PostgreSQL is essentially finished and ready for code review.
Committed to 1587b revision 11872.

#155 Updated by Igor Skornyakov over 3 years ago

Branch 1587b was rebased to 3831c revision 11884. Pushed up to revision 11898.

#156 Updated by Eric Faulhaber over 3 years ago

Igor, is there a way I can test this new word index implementation, without having to re-convert a large application? A test case or set of cases?

#157 Updated by Igor Skornyakov over 3 years ago

Eric Faulhaber wrote:

Igor, is there a way I can test this new word index implementation, without having to re-convert a large application? A test case or set of cases?

Eric,
I used fwd/testcases/ias/words.p for testing. The fwd1 database definition used for the conversion and the corresponding .d files can be found in the data subfolder.

#158 Updated by Greg Shah over 3 years ago

Does the server startup still report (com.goldencode.p2j.persist.orm.RecordMeta:WARNING) Word indices are not supported in this version of FWD: ... in 1587b?

#159 Updated by Igor Skornyakov over 3 years ago

Greg Shah wrote:

Does the server startup still report (com.goldencode.p2j.persist.orm.RecordMeta:WARNING) Word indices are not supported in this version of FWD: ... in 1587b?

Yes, it does. But I will fix it shortly

#160 Updated by Eric Faulhaber over 3 years ago

Code review 1587b/11898:

Nice work, overall. Some questions/comments...

Currently, the H2 dialect specifies that the CONTAINS UDF should be used, and has a "TODO implement" comment for the generateTriggerDDLImpl method body. Is the implementation of this feature for the H2 dialect expected to be significantly different than for PostgreSQL? What about SQL Server? H2 is higher priority than SQL Server, but we must support both, ultimately.

FYI, we are replacing the Java-based UDF implementation with SQL or PL/pgSQL, so the CONTAINS UDF is not something we can rely on as a fallback long term. It is too complicated to re-implement this in PL/pgSQL, so we will need to make the word index support work on all dialects.

P2JSqlServer2012Dialect: this class extends P2JSqlServer2008Dialect. Is it necessary for both to override the new methods, or just 2008?

Until I test the changes, the remaining comments are for the most part about formatting, coding standards, missing doc. Most of the functional questions/concerns have been addressed previously. I have made some formatting updates myself, to save time. These were primarily about maintaining consistency within a file. They are saved in rev 11899.

Across all changed files, please remove the revision numbers from the file header entries. This branch will be merged into 3821c, which, for better or worse, is still all one (albeit massive) update.

dmo_common.rules: are the commented out conditions (e.g., !isWord) meant to be permanent removals? If so, please remove them and reformat the indented rules that are affected.

FqlType: missing javadoc. Why was the ARRAY type added to this and to DataTypeHelper? I think I missed where it is used.

Index: needs header entry.

The only other thing I had for the moment was the warning message that Greg already noted. We may want to leave that, but reword it, to reflect the fact that word indices are not supported (yet) by some dialects.

I will try to run your test cases shortly, and I may have more feedback as I review the implementation in action.

#161 Updated by Eric Faulhaber over 3 years ago

Eric Faulhaber wrote:

Across all changed files, please remove the revision numbers from the file header entries. This branch will be merged into 3821c, which, for better or worse, is still all one (albeit massive) update.

I should clarify: only remove the revision number, if a file already has been changed for 3821c. Of course, any file not otherwise touched by 3821c should have a new revision number for your 1587b changes.

#162 Updated by Igor Skornyakov over 3 years ago

Eric Faulhaber wrote:

Currently, the H2 dialect specifies that the CONTAINS UDF should be used, and has a "TODO implement" comment for the generateTriggerDDLImpl method body. Is the implementation of this feature for the H2 dialect expected to be significantly different than for PostgreSQL? What about SQL Server? H2 is higher priority than SQL Server, but we must support both, ultimately.

FYI, we are replacing the Java-based UDF implementation with SQL or PL/pgSQL, so the CONTAINS UDF is not something we can rely on as a fallback long term. It is too complicated to re-implement this in PL/pgSQL, so we will need to make the word index support work on all dialects.

I understand that we have to support all dialects; I've started with PostgreSQL as it seems more common and I'm most familiar with it. The CONTAINS UDF is not supposed to be used for PostgreSQL at all, but we will need a new words UDF (two signatures).

P2JSqlServer2012Dialect: this class extends P2JSqlServer2008Dialect. Is it necessary for both to override the new methods, or just 2008?

I have not worked with MS SQL Server for a long time, so I'm not aware of the differences between 2008 and 2012. I've added an override to both, just in case the overrides turn out to be different (at least for some time).

Until I test the changes, the remaining comments are for the most part about formatting, coding standards, missing doc. Most of the functional questions/concerns have been addressed previously. I have made some formatting updates myself, to save time. These were primarily about maintaining consistency within a file. They are saved in rev 11899.

I see. Thank you.

Across all changed files, please remove the revision numbers from the file header entries. This branch will be merged into 3821c, which, for better or worse, is still all one (albeit massive) update.

Will be done.

dmo_common.rules: are the commented out conditions (e.g., !isWord) meant to be permanent removals? If so, please remove them and reformat the indented rules that are affected.

Will be done.

FqlType: missing javadoc. Why was the ARRAY type added to this and to DataTypeHelper? I think I missed where it is used.

The ARRAY type was added since the words UDF returns an array and there was a compilation or runtime(?) error without it. The javadoc will be added.

Index: needs header entry.

Will be done.

The only other thing I had for the moment was the warning message that Greg already noted. We may want to leave that, but reword it, to reflect the fact that word indices are not supported (yet) by some dialects.

I will try to run your test cases shortly, and I may have more feedback as I review the implementation in action.

I see. Thank you.

#163 Updated by Igor Skornyakov over 3 years ago

Eric Faulhaber wrote:

Eric Faulhaber wrote:

Across all changed files, please remove the revision numbers from the file header entries. This branch will be merged into 3821c, which, for better or worse, is still all one (albeit massive) update.

I should clarify: only remove the revision number, if a file already has been changed for 3821c. Of course, any file not otherwise touched by 3821c should have a new revision number for your 1587b changes.

Got it. Thank you!

#164 Updated by Igor Skornyakov over 3 years ago

What is the best way to retrieve a field value by the field name from the BaseRecord instance?
Thank you.

#165 Updated by Eric Faulhaber over 3 years ago

Igor Skornyakov wrote:

What is the best way to retrieve a field value by the field name from the BaseRecord instance?

We try to avoid doing this, because it is expensive, preferring instead to get it by its index in the BaseRecord.data array. So there is no API for it at the moment, but this is not always convenient. Can you explain from where you are trying to get the value and under what circumstance? We will figure out a way to expose it in the most efficient way possible.

#166 Updated by Igor Skornyakov over 3 years ago

Eric Faulhaber wrote:

Igor Skornyakov wrote:

What is the best way to retrieve a field value by the field name from the BaseRecord instance?

We try to avoid doing this, because it is expensive, preferring instead to get it by its index in the BaseRecord.data array. So there is no API for it at the moment, but this is not always convenient. Can you explain from where you are trying to get the value and under what circumstance? We will figure out a way to expose it in the most efficient way possible.

Well, I need to get the values of fields with word indexes in ImportWorker.Library.importTable() after

                     session.save(dmo, false);

and
                     session.bulkSave(records);

#167 Updated by Eric Faulhaber over 3 years ago

See PropertyMapper.getIndex and BaseRecord.getData(PropertyMapper). You probably will need to do some refactoring, but you can use the information returned by these APIs to access a specific datum in the BaseRecord.data array. During import, we already are doing a map lookup at some level to get the PropertyMapper for a particular field. Can you piggy-back on this to get access to the data you need, without doing an additional map lookup?

#168 Updated by Igor Skornyakov over 3 years ago

Eric Faulhaber wrote:

See PropertyMapper.getIndex and BaseRecord.getData(PropertyMapper). You probably will need to do some refactoring, but you can use the information returned by these APIs to access a specific datum in the BaseRecord.data array. During import, we already are doing a map lookup at some level to get the PropertyMapper for a particular field. Can you piggy-back on this to get access to the data you need, without doing an additional map lookup?

I see, thank you. I was trying to do something like this just now. However, if I add a getter for SqlRecordLoader.dmoClass, I will be able to get the index of the field in question. Maybe that will be more efficient?
Thank you.

#169 Updated by Eric Faulhaber over 3 years ago

Igor Skornyakov wrote:

Eric Faulhaber wrote:

See PropertyMapper.getIndex and BaseRecord.getData(PropertyMapper). You probably will need to do some refactoring, but you can use the information returned by these APIs to access a specific datum in the BaseRecord.data array. During import, we already are doing a map lookup at some level to get the PropertyMapper for a particular field. Can you piggy-back on this to get access to the data you need, without doing an additional map lookup?

I see, thank you. I was trying to do something like this just now. However, if I add a getter for SqlRecordLoader.dmoClass, I will be able to get the index of the field in question. Maybe that will be more efficient?

If you mean by accessing the Java annotation, this is a relatively slow operation. Also, before you do any direct access of Java annotations from the DMO class, note that the information from these annotations already is read into a RecordMeta instance when the DMO is first loaded.

Is there a need to do this for every record? It seems like the determination of the indices you need (of elements in the data array, I mean, not word indices) need only be done once per table, not once per record. So, if you work this out once at the beginning of the table import, and store this information in an efficient data structure, it should be very fast to access the data for each record. In this case, it doesn't so much matter whether you use PropertyMapper.getIndex, or the RecordMeta instance, or even the Java annotation, since you only will be figuring this out once per table, potentially amortized over many thousands of records. Just do what requires the least refactoring and is fast.
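A rough sketch of what I mean, with hypothetical names standing in for the real FWD types (PropertyMapper, BaseRecord):

   import java.util.List;
   import java.util.Map;
   
   /** Sketch only: resolve the data-array offsets once per table, reuse them per record. */
   public final class WordIndexImportSketch
   {
      public static void importTable(List<String> wordIndexedFields,
                                     Map<String, Integer> fieldToOffset,
                                     List<Object[]> records)
      {
         // one-time work, done before iterating the records
         int[] offsets = new int[wordIndexedFields.size()];
         for (int i = 0; i < offsets.length; i++)
         {
            offsets[i] = fieldToOffset.get(wordIndexedFields.get(i));
         }
         
         // per-record work: plain array access, no map lookups
         for (Object[] data : records)
         {
            for (int offset : offsets)
            {
               Object value = data[offset];
               // ... split 'value' into words and write them to the word table ...
            }
         }
      }
   }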

#170 Updated by Igor Skornyakov over 3 years ago

I think that it is sufficient to add a lookup of the PropertyMapper by property name to the RecordLoader. This lookup will be done once per word-indexed field per table, and the result then reused for all of the table's records.

#171 Updated by Igor Skornyakov over 3 years ago

I've noticed that dmo.getData(PropertyMapper) returns an array of length 2 for a field with a word index, where the first element is the value of the field and the second is an empty string. Can I expect that this will always be the case?
Thank you.

#172 Updated by Eric Faulhaber over 3 years ago

BaseRecord.getData(PropertyMapper) returns the entire data array for the record, not for any particular field. See the class javadoc for the orm.Loader class for the layout of that data array. I think Ovidiu chose to use the PropertyMapper argument to BaseRecord.getData(PropertyMapper) just to ensure the method can be invoked only from import, since public access to the internal data array generally is dangerous. It does not imply the data is associated with a particular field. You should only call getData once per record.

#173 Updated by Igor Skornyakov over 3 years ago

Eric Faulhaber wrote:

BaseRecord.getData(PropertyMapper) returns the entire data array for the record, not for any particular field. See the class javadoc for the orm.Loader class for the layout of that data array. I think Ovidiu chose to use the PropertyMapper argument to BaseRecord.getData(PropertyMapper) just to ensure the method can be invoked only from import, since public access to the internal data array generally is dangerous. It does not imply the data is associated with a particular field. You should only call getData once per record.

I see. Thank you.

#174 Updated by Igor Skornyakov over 3 years ago

I need to add the "wordtablename" annotation to the word index AST in the .p2o file during conversion.
I would highly appreciate a clue about the best/right place for this.

#175 Updated by Eric Faulhaber over 3 years ago

Igor Skornyakov wrote:

I need to add the "wordtablename" annotation to the word index AST in the .p2o file during conversion.
I would highly appreciate a clue about the best/right place for this.

If, at the time the p2o.xml ruleset runs, you have the name calculated and stored in a variable wordTableName, add a walk rule that does something like this:

<rule>
   type == prog.index               and
   (parent.type == prog.table      or
    parent.type == prog.temp_table or
    parent.type == prog.work_table) and
    getNoteBoolean("word")

   <action>putNote("wordtablename", wordTableName)</action>

</rule>

#176 Updated by Igor Skornyakov over 3 years ago

Eric Faulhaber wrote:

If, at the time the p2o.xml ruleset runs, you have the name calculated and stored in a variable wordTableName, add a walk rule that does something like this:

[...]

Thank you!

#177 Updated by Igor Skornyakov over 3 years ago

Finished word tables' population on import for PostgreSQL.
Committed to 1587b revision 11900.
All objects related to the word tables (triggers and constraints) are created after import. See attached build_db.xml.

All changes mentioned in the code review (#1587-160) will be done tomorrow.

#178 Updated by Igor Skornyakov over 3 years ago

Branch 1587b was rebased to 3821c revision 11888. Pushed up to revision 11904.

#179 Updated by Igor Skornyakov over 3 years ago

Fixed issues mentioned in the code review (#1587-160) except the message about word indexes support.
Committed to 1587b rev 11904.

Since I have no other assignments at this moment I will work on the word tables support for H2.

#180 Updated by Igor Skornyakov over 3 years ago

I understand that to implement triggers for H2 one has to implement Java classes that depend on the H2 API.
What is the right way to do this without introducing new dependencies into the core FWD code? The only option I see now is to create a separate project for these classes. Is that OK? On the other hand, I understand that we use a customized version of H2. Maybe that is the right place? In that case, how can I access its sources?
Thank you.

#181 Updated by Greg Shah over 3 years ago

I understand that to implement triggers for H2 one has to implement Java classes that depend on H2 API.
What is the right way to do it without introducing new dependencies to the core FWD code?

We already have hard coded dependencies on H2. I will let Eric evaluate if the new dependencies are "too much" to include directly.

On the other side, I understand that we use a customized version of H2. Maybe it is the right place? In this case, how can I access the sources of this?

See H2 Database Fork.

#182 Updated by Igor Skornyakov over 3 years ago

Greg Shah wrote:

We already have hard coded dependencies on H2. I will let Eric evaluate if the new dependencies are "too much" to include directly.

The problem is that these dependencies will be in the jar which will be deployed to the database (like p2jpl.jar). I understand that we're planning to get rid of it for PostgreSQL, but I do not know if I can rely on this right now.

On the other side, I understand that we use a customized version of H2. Maybe it is the right place? In this case, how can I access the sources of this?

See H2 Database Fork.

Thank you!

#183 Updated by Eric Faulhaber over 3 years ago

I believe FWD will compile without the H2 jar (though I have not tried recently). So, I do not think we already have "hard-coded" dependencies at this time. However, this point is largely academic, because right now, FWD will not function properly without H2.

I do not foresee removing H2 from FWD; at least, at this time, there is no such plan. So, if you are concerned about implementing interfaces or extending classes which are intended for public use (e.g., org.h2.api.Trigger, org.h2.tools.TriggerAdapter), I think it is ok to do that directly in FWD.

Can you please describe the specific dependencies about which you are concerned?

#184 Updated by Igor Skornyakov over 3 years ago

Eric Faulhaber wrote:

I believe FWD will compile without the H2 jar (though I have not tried recently). So, I do not think we already have "hard-coded" dependencies at this time. However, this point is largely academic, because right now, FWD will not function properly without H2.

I do not foresee removing H2 from FWD; at least, at this time, there is no such plan. So, if you are concerned about implementing interfaces or extending classes which are intended for public use (e.g., org.h2.api.Trigger, org.h2.tools.TriggerAdapter), I think it is ok to do that directly in FWD.

Can you please describe the specific dependencies about which you are concerned?

Thank you! All I need is to extend org.h2.tools.TriggerAdapter in two classes (one for the AFTER UPDATE trigger and another for AFTER INSERT). I plan to put these classes in a separate FWD package and create a jar with them which will be deployed to H2 only.
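Roughly what I have in mind for the AFTER INSERT case (a sketch only; the table and column names are hypothetical, the word breaking is naive whitespace splitting rather than the real word-break rules, and error handling is omitted):

   import java.sql.Connection;
   import java.sql.PreparedStatement;
   import java.sql.ResultSet;
   import java.sql.SQLException;
   import org.h2.tools.TriggerAdapter;
   
   /** Sketch of an AFTER INSERT trigger that populates the companion word table. */
   public class WordIndexInsertTrigger
   extends TriggerAdapter
   {
      @Override
      public void fire(Connection conn, ResultSet oldRow, ResultSet newRow)
      throws SQLException
      {
         long recid = newRow.getLong("recid");
         String value = newRow.getString("char_data");   // the word-indexed column
         if (value == null)
         {
            return;
         }
         
         String sql = "insert into my_table__char_data (parent__id, word) values (?, ?)";
         try (PreparedStatement ps = conn.prepareStatement(sql))
         {
            // naive split on whitespace; the real code must apply the word-break rules
            for (String word : value.trim().split("\\s+"))
            {
               if (!word.isEmpty())
               {
                  ps.setLong(1, recid);
                  ps.setString(2, word);
                  ps.addBatch();
               }
            }
            ps.executeBatch();
         }
      }
   }

The AFTER UPDATE variant would first delete the existing word rows for the parent record and then insert the new words in the same way.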

#185 Updated by Eric Faulhaber over 3 years ago

Igor Skornyakov wrote:

On the other side, I understand that we use a customized version of H2. Maybe it is the right place? In this case, how can I access the sources of this?

The customized version of H2 is about making changes to the internals of H2 to address performance or functional issues that cannot be addressed any other way. Implementations of triggers, UDFs, or other areas for which H2 was designed to be extended with external classes, should be done outside this code base, preferably in the FWD project itself.

All I need is to extend org.h2.tools.TriggerAdapter in two classes (one for the AFTER UPDATE trigger and another for AFTER INSERT). I plan to put these classes in a separate FWD package and create a jar with them which will be deployed to H2 only.

H2 is used internally by FWD for legacy temp-table support, whether or not an external H2 database is ever used. Even when we have a primary H2 database, it usually is used in embedded mode, in the same JVM process as FWD. If it is possible to use a word index in a temp-table (I believe it is), these classes should be part of p2j.jar, I think, since the word index feature within a temp-table could not work without them. Is there a compelling reason to deploy them to a separate jar?

#186 Updated by Igor Skornyakov over 3 years ago

Eric Faulhaber wrote:

Igor Skornyakov wrote:

On the other side, I understand that we use a customized version of H2. Maybe it is the right place? In this case, how can I access the sources of this?

The customized version of H2 is about making changes to the internals of H2 to address performance or functional issues that cannot be addressed any other way. Implementations of triggers, UDFs, or other areas for which H2 was designed to be extended with external classes, should be done outside this code base, preferably in the FWD project itself.

Got it. Thank you!

All I need is to extend org.h2.tools.TriggerAdapter in two classes (one for the AFTER UPDATE trigger and another for AFTER INSERT). I plan to put these classes in a separate FWD package and create a jar with them which will be deployed to H2 only.

H2 is used internally by FWD for legacy temp-table support, whether or not an external H2 database is ever used. Even when we have a primary H2 database, it usually is used in embedded mode, in the same JVM process as FWD. If it is possible to use a word index in a temp-table (I believe it is), these classes should be part of p2j.jar, I think, since the word index feature within a temp-table could not work without them. Is there a compelling reason to deploy them to a separate jar?

I was not thinking about temp-tables yet (but word indexes are supported for them in 4GL). As far as I remember, at least in one place there is an explicit check that the database is not _TEMP. The triggers will be included in p2j.jar. My concern now is how to avoid deploying them to non-H2 databases.

#187 Updated by Igor Skornyakov over 3 years ago

I've encountered a strange error trying to connect to a FWD H2 database with my favorite database client, DBeaver. I've done this before many times (with H2 version 1.4.197), but now I get the following exception on connect:

2020-12-30 20:41:07.703 - EN_US_P2J
java.lang.RuntimeException: EN_US_P2J
    at org.h2.message.DbException.throwInternalError(DbException.java:293)
    at org.h2.value.CompareModeDefault.<init>(CompareModeDefault.java:27)
    at org.h2.value.CompareMode.getInstance(CompareMode.java:149)
    at org.h2.pagestore.PageStore.addMeta(PageStore.java:1698)
    at org.h2.pagestore.PageStore.readMetaData(PageStore.java:1622)
    at org.h2.pagestore.PageStore.recover(PageStore.java:1400)
    at org.h2.pagestore.PageStore.openExisting(PageStore.java:364)
    at org.h2.pagestore.PageStore.open(PageStore.java:290)
    at org.h2.engine.Database.getPageStore(Database.java:2674)
    at org.h2.engine.Database.open(Database.java:675)
    at org.h2.engine.Database.openDatabase(Database.java:307)
    at org.h2.engine.Database.<init>(Database.java:301)
    at org.h2.engine.Engine.openSession(Engine.java:74)
    at org.h2.engine.Engine.openSession(Engine.java:192)
    at org.h2.engine.Engine.createSessionAndValidate(Engine.java:171)
    at org.h2.engine.Engine.createSession(Engine.java:166)
    at org.h2.engine.Engine.createSession(Engine.java:29)
    at org.h2.engine.SessionRemote.connectEmbeddedOrServer(SessionRemote.java:340)
    at org.h2.jdbc.JdbcConnection.<init>(JdbcConnection.java:173)
    at org.h2.jdbc.JdbcConnection.<init>(JdbcConnection.java:152)
    at org.h2.Driver.connect(Driver.java:69)

I can connect with the H2 Console, so the problem is not critical, but DBeaver is much more convenient.
Any suggestions?
Thank you.

#188 Updated by Greg Shah over 3 years ago

FWD H2 databases are created with the p2jspi.jar provider that allows us to duplicate the sorting of the 4GL. From the error message I would guess that this provider is not found when the H2 database is initialized.

#189 Updated by Igor Skornyakov over 3 years ago

Greg Shah wrote:

FWD H2 databases are created with the p2jspi.jar provider that allows us to duplicate the sorting of the 4GL. From the error message I would guess that this provider is not found when the H2 database is initialized.

Thank you! A closer look reveals that DBeaver uses its own OpenJDK JRE, version 11.0.9.1. It seems that our standard way to add a text collator provider doesn't work with it. I will finish word table support for H2 using the H2 Console.

#190 Updated by Igor Skornyakov over 3 years ago

I've encountered the following problem with the H2 trigger when I try to insert multiple rows into the word table:
  • If I insert the rows one by one, only the first row is inserted and the database stops responding to any queries.
  • If I use a multi-row insert, e.g. VALUES(?,?),(?,?), nothing is inserted and, again, the database stops responding to any queries.
Has anybody used H2 triggers that insert multiple rows?
Thank you.

#191 Updated by Igor Skornyakov over 3 years ago

In addition to #1587-190:
after enabling debug mode, I see in the trace file that H2 inserts and removes the same records in the word table again and again.

#192 Updated by Igor Skornyakov over 3 years ago

I've noticed that for H2 the name of the CONTAINS UDF in the AST processed by FqlToSqlConverter is "contains_1".
Can I expect that it will always be the case for P2JH2Dialect?
Thank you.

#193 Updated by Igor Skornyakov over 3 years ago

Finished word tables' support for H2.
Committed to 1587b revision 11906.
See the required changes to the import of the database in the attached build_db.xml

#194 Updated by Greg Shah over 3 years ago

Eric: Please review and confirm if 1587b can be merged into 3821c.

#195 Updated by Ovidiu Maxiniuc over 3 years ago

Igor Skornyakov wrote:

I've noticed that for H2 the name of the CONTAINS UDF in the AST processed by FqlToSqlConverter is "contains_1".
Can I expect that it will always be the case for P2JH2Dialect?
Thank you.

The previous implementation used the contains() methods from pl/Functions.java in order to provide incomplete support for contains as a UDF. Because there are two overloaded variants of the method and H2 does not support calling such methods, we were forced to make the method names unique by appending numeric index suffixes. Note that MS SQL Server also has the same issue, but the suffix is slightly different (it reflects the method signature instead of a plain number). PostgreSQL supports method/function overloading, so the UDF will NOT be suffixed in this case.

The UDFs (function aliases) are declared when the database is prepared (see H2Helper.prepareDatabase(Database) and SQLServerHelper.preparePermanentDatabase(Database)). Before reaching the FqlToSqlConverter, the fql is altered by HQLPreprocessor which has a map of overloadedFunctions for each database and will make sure the function/operator overloaded name is replaced by the unique name required by the specific dialect.

In conclusion, the answer is no, at this moment. The name of the function will be different for each dialect, and even for the same dialect the name is different if different signature variants are used.

However, you can make it stay the same by removing the affected methods from pl/Functions and pl/Operators. Or at least dropping the @HQLFunction annotation, which is the marker for methods exposed to SQL.

#196 Updated by Igor Skornyakov over 3 years ago

Ovidiu Maxiniuc wrote:

The previous implementation used the contains() methods from pl/Functions.java in order to provide an incomplete support for contains as an UDF. Because there are two overloaded variants of the method and H2 does not support calling such methods, we were forced to make the method name unique by appending the numeric index suffixes. Note that MS SQL Server also has the same issue, but the suffix is slightly different (they reflect the method signature instead of dry numbers). PostgreSQL supports method/function overloading so the UDF will NOT be suffixed in this case.

The UDFs (function aliases) are declared when the database is prepared (see H2Helper.prepareDatabase(Database) and SQLServerHelper.preparePermanentDatabase(Database)). Before reaching the FqlToSqlConverter, the fql is altered by HQLPreprocessor which has a map of overloadedFunctions for each database and will make sure the function/operator overloaded name is replaced by the unique name required by the specific dialect.

In conclusion, the answer is no, at this moment. The name of the function will be different for each dialect and even for same dialect the name is different if different signature variants are used.

However, you can make it stay the same by removing the affected methods from pl/Functions and pl/Operators. Or at least dropping the @HQLFunction annotation, which is the marker for methods exposed to SQL.

I see. Thank you for the detailed explanations!

#197 Updated by Igor Skornyakov over 3 years ago

Fixed CONTAINS support for the temporary tables.
Committed to 1587b revision 11907.

Please note that for the temp tables the CONTAINS UDF is used at this moment.

Do we really need to use word tables in this case? After all, it is only a matter of performance; the result will be the same (apart from ordering in the absence of an ORDER BY clause). Temporary tables normally do not contain many records, and I doubt that word tables will provide noticeably better performance for small tables in the in-memory database, while the creation of temp tables with word indices, as well as updates of such tables, will be more expensive.
What do you think?
Thank you.

#198 Updated by Ovidiu Maxiniuc over 3 years ago

Igor Skornyakov wrote:

Please note that for the temp tables the CONTAINS UDF is used at this moment.
Do we really need to use word tables in this case? After all, it is about performance, the result will be the same (apart from ordering in the absence of the ORDER BY clause).

I think we need to add support in this case, too. The previous implementation of contains was not correct, although it covered 80-90 percent of the usual cases.

I understand that temporary tables normally contain not too many records and I doubt that the word tables will provide noticeably better performance for small tables in the in-memory database while the creation of the temp tables with word indices will be more expensive, as well as updates of such tables.

Indeed, you are right here. A specialized, highly efficient implementation might also be acceptable as long as it works correctly; correctness is the first reason we need to do it the new way. The second reason is consistency: we must have a consistent implementation on all the dialects we support, with the same predictable result regardless of the configured combination of persistent (H2 / PSQL, MSSQL) and temporary databases (only H2 here for the moment).

Eric, please let me/us know if you think otherwise.

#199 Updated by Igor Skornyakov over 3 years ago

Ovidiu Maxiniuc wrote:

I think we need to add support in this case, too. The previous implementation of contains was not implemented correctly, although it covered 80-90 percents of usual cases.

I've completely re-implemented the contains UDF. It is now based on the same conversion to RPN that is used by the word-table approach.

#200 Updated by Igor Skornyakov over 3 years ago

The import_db.xml file at sftp://ias@xfer.goldencode.com/opt/testcases/ contains tasks for the h2 and postgresql dialects only. Where can I find tasks for sqlserver2012?
It will save me some time and help to avoid incompatibilities with our standard approach.
Thank you.

#201 Updated by Eric Faulhaber over 3 years ago

Code review 1587b/11907:

The latest code changes look fine to me.

I would like you to please integrate the words.p test case into the test cases project more cleanly, however. Instead of storing this under an ias directory, please use a name that is indicative of the purpose of the test case, not the initials of the author. Also, do we really need a separate schema? Having a separate schema makes the setup and use of this one test case much more difficult than it needs to be. AFAICT, the changes are all additive. If so, please add the changes to the existing schema, and integrate the exported data files accordingly. This will eliminate the need to modify the configuration to run this one test case. Thanks.

The output I got from running the test case with FWD:

1 big elephant
2 deer with antler
3 small caterpillar
3

Is that last line expected and correct?

#202 Updated by Eric Faulhaber over 3 years ago

Igor Skornyakov wrote:

The import_db.xml file at sftp://ias@xfer.goldencode.com/opt/testcases/ contains tasks for the h2 and postgresql dialects only. Where can I find tasks for sqlserver2012?
It will save me some time and help to avoid incompatibilities with our standard approach.
Thank you.

I'm not sure this exists. Ovidiu?

#203 Updated by Igor Skornyakov over 3 years ago

Eric Faulhaber wrote:

I would like you to please integrate the words.p test case into the test cases project more cleanly, however. Instead of storing this under an ias directory, please use a name that is indicative of the purpose of the test case, not the initials of the author. Also, do we really need a separate schema? Having a separate schema makes the setup and use of this one test case much more difficult than it needs to be. AFAICT, the changes are all additive. If so, please add the changes to the existing schema, and integrate the exported data files accordingly. This will eliminate the need to modify the configuration to run this one test case. Thanks.

OK.

The output I got from running the test case with FWD:

[...]

Is that last line expected and correct?

Yes, this is correct. The last line is the total number of records.

#204 Updated by Greg Shah over 3 years ago

Has this been tested with any of the large customer applications which we are working on right now?

#205 Updated by Igor Skornyakov over 3 years ago

Greg Shah wrote:

Has this been tested with any of the large customer applications which we are working on right now?

No, it has not. For such a test, the application has to be re-converted, at least the database layer; the Java code will be the same.

#206 Updated by Greg Shah over 3 years ago

OK, please go ahead with such a test. If the results are good, then the changes can be merged back into 3821c.

#207 Updated by Igor Skornyakov over 3 years ago

Greg Shah wrote:

OK, please go ahead with such a test. If the results are good, then the changes can be merged back into 3821c.

OK. Should I suspend work on the MS SQL Server support?
Thank you.

#208 Updated by Ovidiu Maxiniuc over 3 years ago

Eric Faulhaber wrote:

Igor Skornyakov wrote:

The import_db.xml file at sftp://ias@xfer.goldencode.com/opt/testcases/ contains tasks for the h2 and postgresql dialects only. Where can I find tasks for sqlserver2012?

I'm not sure this exists. Ovidiu?

I cannot find any import_db.xml in the xfer project. Also, I always used a .bat script when importing on my Windows test machine; the automated script was not available yet. I do not have access at the moment.
However, here is how you can configure a project.
  1. p2j.cfg.xml: add sqlserver2012 to the list value of the ddl-dialects parameter in the required namespace(s), to be sure the needed artefacts are created at conversion time;
  2. run the import like this:
    java -server -Xmx3g -Duser.timezone=GMT -classpath p2j/build/lib/p2j.jar:build/lib/<app-dmo>.jar:cfg: -DP2J_HOME=. com.goldencode.p2j.pattern.PatternEngine -d 2 "dbName=\"<app-db>\"" "targetDb=\"sqlserver\"" "url=\"jdbc:sqlserver://localhost;instanceName=<instName>;databaseName=<dbName>\"" "uid=\"<user>\"" "pw=\"<pwd>\"" "maxThreads=3" schema/import data/namespace <app>.p2o
    

    Please update the bracketed placeholders, accordingly.
  3. if you need to run the server, use the following orm block:
                  <node class="string" name="dialect">
                    <node-attribute name="value" value="com.goldencode.p2j.persist.dialect.P2JSQLServer2012Dialect"/>
                  </node>
                  <node class="container" name="connection">
                    <node class="string" name="password">
                      <node-attribute name="value" value="<pwd>"/>
                    </node>
                    <node class="string" name="username">
                      <node-attribute name="value" value="<usr>"/>
                    </node>
                    <node class="string" name="driver_class">
                      <node-attribute name="value" value="com.microsoft.sqlserver.jdbc.SQLServerDriver"/>
                    </node>
                    <node class="string" name="url">
                      <node-attribute name="value" value="jdbc:sqlserver://localhost;instanceName=<instName>;databaseName=<dbName>"/>
                    </node>
                    <node class="integer" name="prepareThreshold">
                      <node-attribute name="value" value="1"/>
                    </node>
                  </node>
    

#209 Updated by Greg Shah over 3 years ago

OK, please go ahead with such a test. If the results are good, then the changes can be merged back into 3821c.

OK. Should I suspend work on the MS SQL Server support?

I think that the conversion and import will take quite some time, right? These are just long running batch processes. This means it can be running with a recent revision of your changes while (in parallel) you continue to work on SQLServer support.

#210 Updated by Igor Skornyakov over 3 years ago

Greg Shah wrote:

I think that the conversion and import will take quite some time, right? These are just long running batch processes. This means it can be running with a recent revision of your changes while (in parallel) you continue to work on SQLServer support.

I see. Thank you.

#211 Updated by Igor Skornyakov over 3 years ago

Ovidiu Maxiniuc wrote:

I'm not sure this exists. Ovidiu?

I cannot find any import_db.xml in xfer project. Also, I always used a .bat script when importing on my Windows test machine, the automated script was not available yet. I do not have access momentarily.

Sorry, I meant build_db.xml

However, here is how you can configure a project.
  1. p2j.cfg.xml: add sqlserver2012 to the list value of the ddl-dialects parameter in the required namespace(s), to be sure the needed artefacts are created at conversion time;
  2. run the import like this:
    [...]
    Please update the bracketed placeholders, accordingly.
  3. if you need to run the server, use the following orm block:
    [...]

Got it. Thank you.
BTW: I've never used C# code with MS SQL, and FWD uses it as far as I can see. I understand that it can be done only on Windows and will not work with the Linux version of MS SQL. Is this correct?
Thank you.

#212 Updated by Ovidiu Maxiniuc over 3 years ago

Actually, there is no need for a full reconversion. Only the database support is required (ddls and p2o), so running the middle part of it should be enough to obtain these artefacts.

#213 Updated by Igor Skornyakov over 3 years ago

Ovidiu Maxiniuc wrote:

Actually, there is no need for a full reconversion. Only the database support is required (ddls and p2o), so running the middle part of it should be enough to obtain these artefacts.

I see. Thank you.

#214 Updated by Ovidiu Maxiniuc over 3 years ago

Igor Skornyakov wrote:

Sorry, I meant build_db.xml

Nope. find /opt/testcases/ -iname build_db.xml still cannot find anything.

BTW: I've never used C# code with MS SQL and FWD uses it as I can see. I understand that it can be done only on Windows and will not work with the Linux version of MS SQL. Is this correct?

I cannot give you a straight answer here. I used a real Windows machine to run it (not in a VM). In the main configuration, the conversion and the FWD server were also located on the same box. I also ran the FWD application on Ubuntu, but configured with SQL Server remote, on a Windows box (over a 100MB link). I have zero experience with MSSQL on Linux.

#215 Updated by Igor Skornyakov over 3 years ago

Ovidiu Maxiniuc wrote:

Igor Skornyakov wrote:

Sorry, I meant build_db.xml

Nope. find /opt/testcases/ -iname build_db.xml still cannot find anything.

This is really strange:

ias@ias-gcdc-ws:/media/ias/ssd1T/_gcdc/fwd/testcases$ bzr status
added:
  ias/mouse.p
  ias/ttwi1.d
  ias/ttwi1.df
  ias/words-tmp.p
missing:
  db/user.e
modified:
  build.properties
  build.xml
  build_db.xml
  cfg/p2j.cfg.xml
  deploy/client-gui/client.xml
  deploy/server/directory.xml
  file-cvt-list.txt

BTW: I've never used C# code with MS SQL and FWD uses it as I can see. I understand that it can be done only on Windows and will not work with the Linux version of MS SQL. Is this correct?

I cannot give you a straight answer here. I used a real Windows machine to run it (not in a VM). In main configuration, the conversion and FWD server were also located in the same box. I also run FWD application on Ubuntu but configured with SQL server remotely, from Windows box (on a 100MB link). I have zero experience with MSSQL on Linux.

I see. Thank you. I've almost restored my Windows machine (Windows 7 x64). Hopefully, it is in good enough shape to host MS SQL and a C# compiler (I haven't used it for about two years).

#216 Updated by Ovidiu Maxiniuc over 3 years ago

Igor Skornyakov wrote:

This is really strange:
[...]

Where is your testcases bound to?

I see. Thank you. I've almost restored my Windows machine (Windows 7 x64). Hopefully, it is in a good shape enough to host MS SQL and C# compiler (I haven't used it for about two years).

According to bzr, my last encounter with C# was in 2015 :). However, the binaries I used were not compiled on a Windows machine; they were cross-compiled using mono, IIRC. You should locate the wiki; the whole process (configuration of the SQL instance, database, import) is documented in Redmine.

#217 Updated by Igor Skornyakov over 3 years ago

Ovidiu Maxiniuc wrote:

Igor Skornyakov wrote:
Where is your testcases bound to?

ias@ias-gcdc-ws:/media/ias/ssd1T/_gcdc/fwd/testcases$ bzr info
Checkout (format: 2a)
Location:
       checkout root: .
  checkout of branch: sftp://ias@xfer.goldencode.com/opt/testcases/

I see. Thank you. I've almost restored my Windows machine (Windows 7 x64). Hopefully, it is in a good shape enough to host MS SQL and C# compiler (I haven't used it for about two years).

According to bzr, my last encounter with C# was in 2015 :). However, the binaries I used were not compiled on a Windows machine, they were cross-compiled using mono, IIRC. You should locate the wiki, the whole process (configuration of SQL instance, database, import) is documented in redmine.

I see. Thank you.

#218 Updated by Igor Skornyakov over 3 years ago

The branch 1587b was rebased to 3821c revision 11901.
Pushed up to revision 11920.

#219 Updated by Igor Skornyakov over 3 years ago

The import of the large customer database finished OK. However, I see in the log multiple strange errors related to the word tables.
The errors are actually of two different types: a missing word table name and a missing PropertyMapper. The latter refers to a field which does not exist.
Analyzing...

#220 Updated by Igor Skornyakov over 3 years ago

Igor Skornyakov wrote:

The import of the large customer database finished OK. However, I see in the log multiple strange errors related to the word tables.
The errors are actually of two different types - the missing word table name and missed PropertyMapper. The latter refers to the field which does not exist.
Analyzing...

I suspect that my partial re-conversion was not done right. Started full re-conversion.

#221 Updated by Igor Skornyakov over 3 years ago

Igor Skornyakov wrote:

Igor Skornyakov wrote:

The import of the large customer database finished OK. However, I see in the log multiple strange errors related to the word tables.
The errors are actually of two different types - the missing word table name and missed PropertyMapper. The latter refers to the field which does not exist.
Analyzing...

I suspect that my partial re-conversion was not done right. Started full re-conversion.

After a full re-conversion of the large customer app, the database import finished OK, with a few errors. These errors are caused by word indexes on extent fields. The 4GL documentation I've seen does not mention that this is possible, and FWD does not support it at this moment. Additional development is required.

#222 Updated by Igor Skornyakov over 3 years ago

BTW: the CONTAINS operator support based on UDF is also incorrect. The generated SQL statement still contains (for H2)

contains_1(upper(rtrim(<field>)), <expression>)

and the client silently crashes.

#223 Updated by Igor Skornyakov over 3 years ago

It seems that in 4GL the value of the CONTAINS operator for an extent field is the OR of the CONTAINS values for all its components.

#224 Updated by Greg Shah over 3 years ago

Igor Skornyakov wrote:

It seems that in 4GL the value of the CONTAINS operator for the extent field is an OR of CONTAINS value for all its components.

Please implement whatever is missing.

For my understanding, are you referring to an unsubscripted reference (extent_field_name CONTAINS something)? Do we already properly handle the subscripted version (extent_field_name[index_expr] CONTAINS something)?

#225 Updated by Igor Skornyakov over 3 years ago

Greg Shah wrote:

Igor Skornyakov wrote:

It seems that in 4GL the value of the CONTAINS operator for the extent field is an OR of CONTAINS value for all its components.

Please implement whatever is missing.

Sure.

For my understanding, are you referring to an unsubscripted reference (extent_field_name CONTAINS something)? Do we already properly handle the subscripted version (extent_field_name[index_expr] CONTAINS something)?

I mean extent_field_name CONTAINS something. I understand that extent_field_name[index_expr] CONTAINS something is not supported by 4GL, but I will double-check.

#226 Updated by Igor Skornyakov over 3 years ago

I mean extent_field_name CONTAINS something. I understand that extent_field_name[index_expr] CONTAINS something is not supported by 4GL, but I will double-check.

Indeed, extent_field_name[index_expr] CONTAINS something is not supported by 4GL.
The compilation warning:

** WARNING -- subscript on array field in CONTAINS phrase ignored. (1688)

However, the code is executed with an ignored subscript.

FWD handles this incorrectly.
The 4GL code

for each ttwi where ifext[4] contains "word1|word2":
    message ifext[1].
end.

Is converted to
         forEach("loopLabel0", new Block((Init) () -> 
         {
            query0.initialize(ttwi, "contains(upper(ttwi.ifext[3]), 'WORD1|WORD2')", null, "ttwi.ifext asc");
         }, 
         (Body) () -> 
         {
            query0.next();
            message((character) new FieldReference(ttwi, "ifext", 0).getValue());
         }));

#227 Updated by Igor Skornyakov over 3 years ago

Is it OK if the discrepancy described in #1587-226, regarding the subscript on an array field in a CONTAINS phrase, is fixed at runtime (in the FqlToSqlConverter)?
Thank you.

#228 Updated by Greg Shah over 3 years ago

Yes. In fact I think it must be fixed at runtime because otherwise we would not generate a warning at the same place as the 4GL.

#229 Updated by Igor Skornyakov over 3 years ago

In the import of a large customer database I've got 4 errors for a single table on populating the related word table:

ERROR: character with byte sequence 0xce 0x9c in encoding "UTF8" has no equivalent in encoding "LATIN1" 

The corresponding SQL statement is very simple:
INSERT INTO <table name> VALUES(?,?),(?,?)

where the first column is int8 and the second one is text.
With the debugger, I see no 0xce 0x9c, neither in the SQL string nor in the arguments; however, some strings contain non-alphanumeric characters ('<' or '-').
The total number of records in the table is 246246 and only 4 of them cause this strange error.
Does anybody have any ideas?
I can reproduce this with a small test database.

#230 Updated by Ovidiu Maxiniuc over 3 years ago

There are some very similar messages posted in #3871.

However, this article has a short paragraph about iso8859-1 Client to 1252 conversion, if this is the problem. Most likely, it is the € (euro) sign, which is not present in WIN 1252 CP.

#231 Updated by Igor Skornyakov over 3 years ago

Ovidiu Maxiniuc wrote:

There are some very similar messages posted in #3871.

This was a result of my changes. I've rolled back them now.

However, this article has a short paragraph about iso8859-1 Client to 1252 conversion, if this is the problem. Most likely, it is the € (euro) sign, which is not present in WIN 1252 CP.

0xce 0x9c looks like the Greek uppercase M (https://vazor.com/unicode/c039C.html). The problem is that I do not see it in a debugger.

#232 Updated by Igor Skornyakov over 3 years ago

Igor Skornyakov wrote:

0xce 0x9c looks like Greek uppercase M (https://vazor.com/unicode/c039C.html) u. The problem is that I do not see it in a debugger.

Indeed, a closer look reveals that in all these records the field contains a Greek "mu" (lowercase). It is interesting that this field was saved to the master table (via the FWD ORM), but after being split into words it failed to be persisted to the word table via plain JDBC.

#233 Updated by Igor Skornyakov over 3 years ago

Igor Skornyakov wrote:

Indeed, a closer look reveals that in all these records the field contains greek "mu" (lowercase). It is interesting that this field was saved in the master table (via FWD orm), but after being split into words failed to be persisted to the word table via plain JDBC.

It seems that the problem is that µ (Greek 'mu' in lowercase) does exist in LATIN1 character set, while Μ (Greek 'mu' in uppercase) does not.
At this moment I do not see a solution. We need to convert words to uppercase in the word table if the corresponding field in the master table is not case-sensitive.
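A small illustration of the failure, using only standard JDK APIs:

   import java.nio.charset.StandardCharsets;
   
   public class MicroSignDemo
   {
      public static void main(String[] args)
      {
         String micro = "\u00B5";               // MICRO SIGN, valid in ISO-8859-1
         String upper = micro.toUpperCase();    // "\u039C", GREEK CAPITAL LETTER MU
         
         // true: the original character can be stored in a LATIN1 database
         System.out.println(StandardCharsets.ISO_8859_1.newEncoder().canEncode(micro));
         // false: the uppercased character cannot, hence the import error
         System.out.println(StandardCharsets.ISO_8859_1.newEncoder().canEncode(upper));
      }
   }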
Any ideas?
Thank you.

#234 Updated by Igor Skornyakov over 3 years ago

Igor Skornyakov wrote:

It seems that the problem is that µ (Greek 'mu' in lowercase) does exist in LATIN1 character set, while Μ (Greek 'mu' in uppercase) does not.
At this moment I do not see a solution. We need to convert words to uppercase in the word table if the corresponding field in the master table is not case-sensitive.

SHOW server_encoding for my PostgreSQL returns LATIN1.
Is this correct?
Thank you.

#235 Updated by Greg Shah over 3 years ago

It seems that the problem is that µ (Greek 'mu' in lowercase) does exist in LATIN1 character set, while Μ (Greek 'mu' in uppercase) does not.

Interesting.

Is the µ (Greek 'mu' in lowercase) character included in string literals in the source code? Or is it in the exported data (.d files)? If neither of these is the case, where does it come from?

One thing here which is inconsistent in the 4GL is about using uppercase or lowercase in case-insensitive comparisons. Long ago, we found that (at least for character types inside the 4GL code) the case insensitive comparisons used UPPERCASE. This can be seen in our comments from Text.compareTo() in the case insensitive path:

      // DO NOT use String.compareToIgnoreCase() since this lowercases and yields different
      // results for >, <, >= and <= forms when [ \ ^ _ ' characters are included in the operands
      return s1.toUpperCase().compareTo(s2.toUpperCase());

I think we need to understand why this µ (Greek 'mu' in lowercase) character is not causing issues in the 4GL but does cause issues for us. In other words, in the absence of conversion to UTF-8, comparisons of this character would be a problem in the 4GL if they indeed use uppercasing for case insensitive comparisons. So there is some processing here which we do not match.

On the other hand, the reason we only see this when processing data imports is because we are deliberately treating this data as LATIN1 whereas in the JVM we internally process all string comparison operations in UTF-16 where there is full support for the Greek alphabet. Whether we are processing the same as the 4GL is a different question, which would depend upon whether the Unicode lexicographical sorting matches the 4GL 8859-1 sorting.

#236 Updated by Igor Skornyakov over 3 years ago

Greg Shah wrote:

It seems that the problem is that µ (Greek 'mu' in lowercase) does exist in LATIN1 character set, while Μ (Greek 'mu' in uppercase) does not.

Interesting.

Is the µ (Greek 'mu' in lowercase) character included in string literals in the source code? Or is it in the exported data (.d files)? If neither of these is the case, where does it come from?

I have not found it in the source code, only in the .d file.

One thing here which is inconsistent in the 4GL is about using uppercase or lowercase in case-insensitive comparisons. Long ago, we found that (at least for character types inside the 4GL code) the case insensitive comparisons used UPPERCASE. This can be seen in our comments from Text.compareTo() in the case insensitive path:

[...]

I think we need to understand why this µ (Greek 'mu' in lowercase) character is not causing issues in the 4GL but does cause issues for us. In other words, in the absence of conversion to UTF-8, comparisons of this character would be a problem in the 4GL if they indeed use uppercasing for case insensitive comparisons. So there is some processing here which we do not match.

I do not know exactly how case-insensitive string comparison is implemented in the 4GL database, or how UPPERCASE works in all corner cases, so I have no explanation.

On the other hand, the reason we only see this when processing data imports is because we are deliberately treating this data as LATIN1 whereas in the JVM we internally process all string comparison operations in UTF-16 where there is full support for the Greek alphabet. Whether we are processing the same as the 4GL is a different question, which would depend upon whether the Unicode lexicographical sorting matches the 4GL 8859-1 sorting.

I'm not sure how often the result set order depends on character fields with non-ASCII chars at the beginning, but my guess is that it is a relatively rare situation.

BTW: in LATIN1, µ is called not "Greek mu" but "micro"; I think this is why its uppercase version is not in the codepage.

#237 Updated by Ovidiu Maxiniuc over 3 years ago

Igor Skornyakov wrote:

BTW: in LATIN1 µ is called not "Greek mu", but "micro", I think this is the reason why there is not its uppercase version in the codepage.

This is an interesting observation. The Latin micro sign (µ) is U+00B5. The Greek letters μ and Μ are U+03BC and U+039C respectively. Notice that they are not the same glyph. I do not know who decided this, but converting U+00B5 to uppercase U+039C is wrong.

#238 Updated by Igor Skornyakov over 3 years ago

Ovidiu Maxiniuc wrote:

This is an interesting observation. The latin micro (µ) is U00B5. The Greek letters μ and Μ are U03BC and U039C respectively. Notice that they are not the very same glyph. I do not know who does this but converting U00B5 to uppercase U039C is wrong.

The problem is that we convert to uppercase using the Java toUpperCase() method, which works with Unicode strings and knows nothing about the codepage. On the other hand, the PostgreSQL UPPER function for a LATIN1 database converts µ to µ (doesn't change it). This means that we can resolve the issue described in #1587-229 if, on import, we convert to uppercase on the database side. This is easy to implement. Unfortunately, we have to do the same with string literals used as the CONTAINS operator argument, which requires changes in the conversion (the generated toUpperCase for expressions can be fixed at runtime).
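A minimal sketch of the database-side uppercasing during import (the table and column names are hypothetical):

   import java.sql.Connection;
   import java.sql.PreparedStatement;
   import java.sql.SQLException;
   
   public final class WordTableWriter
   {
      /**
       * Insert one word into the word table, letting PostgreSQL apply upper() so the
       * uppercasing follows the server's LATIN1 rules (upper('µ') stays 'µ') instead
       * of Java's Unicode rules.
       */
      public static void insertWord(Connection conn, long recid, String word)
      throws SQLException
      {
         String sql = "insert into my_table__char_data (parent__id, word) values (?, upper(?))";
         try (PreparedStatement ps = conn.prepareStatement(sql))
         {
            ps.setLong(1, recid);
            ps.setString(2, word);
            ps.executeUpdate();
         }
      }
   }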

#239 Updated by Ovidiu Maxiniuc over 3 years ago

Igor Skornyakov wrote:

The problem is that we convert to uppercase using Java toUpperCase() method which works with Unicode strings and knows nothing about the codepage.

The codepage is not the problem. It seems to me like a flaw in Java's uppercase implementation. Here is why:

      char micro = '\u00B5';
      char mulc = '\u03BC';
      char muuc = '\u039C';
      System.out.println(micro + "" + mulc + "" + muuc + "->" + Character.toUpperCase(micro) + Character.toUpperCase(mulc) + Character.toUpperCase(muuc));
      System.out.println(Character.toUpperCase(micro) == Character.toUpperCase(mulc));

The output is:
µμΜ->ΜΜΜ
true

As you can see, there is no CP involved. Maybe we should add a filter for this micro character when uppercasing? I have not written/run the same code in 4GL yet.
Question: are there other characters which behave the same?

#240 Updated by Igor Skornyakov over 3 years ago

Igor Skornyakov wrote:

PostgreSQL UPPER function for the LATIN1 database converts µ to µ (doesn't change it).

BTW: 4GL UPPER function also leaves µ as-is, at least if SESSION:CHARSET is ISO8859-1.

#241 Updated by Igor Skornyakov over 3 years ago

Ovidiu Maxiniuc wrote:

Not the codepage is the problem. It seems to me like a flaw in Java's uppercase implementation. Here is why:
[...]
The output is:
[...]

Well, maybe you're right. I've always tried to avoid dealing with codepages (and time zones) :) I'm not sure how the conversion to uppercase is defined in Unicode. However, Java's toUpperCase is used in almost every Java program, and I doubt that a bug in its implementation could remain unnoticed for decades.

Question: are there other characters which behave the same?

At this moment I'm busy with word index support for extent fields. I will try to answer your question when I return to #3871.

#242 Updated by Ovidiu Maxiniuc over 3 years ago

Now that is funny.
I peeked at java.lang.CharacterDataLatin1, method int toUpperCase(int) which is called from Character.toUpperCase(int) and saw this (see line 152, it starts with else):

    int toUpperCase(int ch) {
        int mapChar = ch;
        int val = getProperties(ch);

        if ((val & 0x00010000) != 0) {
            if ((val & 0x07FC0000) != 0x07FC0000) {
                int offset = val  << 5 >> (5+18);
                mapChar =  ch - offset;
            } else if (ch == 0x00B5) {
                mapChar = 0x039C;
            }
        }
        return mapChar;
    }

Of all the characters in the Latin1 charset, this is the only one that is handled individually! The others are either offset bitwise or left unchanged.

#243 Updated by Igor Skornyakov over 3 years ago

Ovidiu Maxiniuc wrote:

Now that is funny.
I peeked at java.lang.CharacterDataLatin1, method int toUpperCase(int) which is called from Character.toUpperCase(int) and saw this (see line 152, it starts with else):
[...]

Of all characters in Latin1 charset, this is the only one that is handled individually! The others are either bitwise offset or let unchanged.

Interesting. So it is "not a bug but a feature". However, it means that 4GL UPPER and Java toUpperCase are not compatible. Should we re-work the FWD toUpperCase() function implementations to match the 4GL behavior?
Thank you.

#244 Updated by Greg Shah over 3 years ago

Should we re-work FWD toUpperCase() functions implementation to match 4GL behavior?

I've answered this in #5085.

#245 Updated by Greg Shah over 3 years ago

  • Related to Bug #5085: uppercasing and string comparisons are incorrect for some character sets added

#246 Updated by Igor Skornyakov over 3 years ago

Added word indices support for extent fields (conversion and import) for PostgreSQL and H2 dialects.
Committed to 1587b revision 11922.

#247 Updated by Igor Skornyakov over 3 years ago

Added runtime support for the word indices on extent fields.
Committed to 1587b revision 11923.
Please note that at this moment we still use CONTAINS UDF for temporary tables (see #1587-197, #1587-198, #1587-199). For these tables the runtime support for the word indices on extent fields is not fixed yet.
The issues described at #1587-226 and #1587-229 are also still not fixed at this moment.

#248 Updated by Igor Skornyakov over 3 years ago

Implementing the 4GL behavior for extent_field_name[index_expr] CONTAINS something (see #1587-226) appears to be tricky. The problem is that the presence of extent_field_name[index_expr] forces FqlToSqlConverter to generate an additional INNER JOIN and WHERE clause, which are already in the output buffer by the time we realize that we have to re-write the CONTAINS UDF call.
I have almost resolved this problem, but the final re-written SQL in the presence of extent_field_name[index_expr] will not be exactly the same as without it, albeit equivalent (producing the same result set).

#249 Updated by Igor Skornyakov over 3 years ago

I've found a clean solution for the extent_field_name[index_expr] CONTAINS something (see #1587-226). However, I have a problem with the 1688 warning.
As was discussed in #1587-99 and #1587-102, the query is executed multiple times, and since it is re-written by FqlToSqlConverter, including the parameters, it is not cached and is converted multiple times.
Because of this, the 1688 warning is also shown multiple times.

Please also see my comment regarding the approach with incremental fetching in #1587-103. Generally speaking, this approach (as well as the popular idea of "paging") works correctly only if all the work is done within a transaction with the REPEATABLE READ isolation level, which in most cases is prohibitively expensive.

#250 Updated by Igor Skornyakov over 3 years ago

Finished runtime word tables' support for extent fields (H2 and PostgreSQL).
Committed to 1587b revision 11924.

#251 Updated by Igor Skornyakov over 3 years ago

Igor Skornyakov wrote:

I've found a clean solution for the extent_field_name[index_expr] CONTAINS something (see #1587-226). However, I have a problem with the 1688 warning.
As was discussed in #1587-99, #1587-102 the query is executed multiple times, and since it is re-written by FqlToSqlConverter the including the parameters it is not cached and converted multiple times.
Because of this, the 1688 warning is also shown multiple times.

I've finally implemented logic which shows the 1688 warning once per fql string per session. This is not how 4GL works, but we cannot be 100% compatible in this regard anyway, since 4GL reports the warning at the compilation stage (a separate message for each case, even if the queries are identical) while we do it at runtime.
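A simplified sketch of that logic (hypothetical names, not the actual FWD code):

   import java.util.Set;
   import java.util.concurrent.ConcurrentHashMap;
   import java.util.logging.Logger;
   
   /** Sketch: report warning 1688 at most once per FQL string within a session. */
   public final class ContainsWarnings
   {
      private static final Logger LOG = Logger.getLogger(ContainsWarnings.class.getName());
      
      // one instance of this set is held per user session
      private final Set<String> warned = ConcurrentHashMap.newKeySet();
      
      public void warnIgnoredSubscript(String fql)
      {
         // add() returns false if this FQL string was already reported in the session
         if (warned.add(fql))
         {
            LOG.warning("subscript on array field in CONTAINS phrase ignored. (1688)");
         }
      }
   }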

#252 Updated by Eric Faulhaber over 3 years ago

Igor Skornyakov wrote:

Igor Skornyakov wrote:

I've found a clean solution for the extent_field_name[index_expr] CONTAINS something (see #1587-226). However, I have a problem with the 1688 warning.
As was discussed in #1587-99, #1587-102 the query is executed multiple times, and since it is re-written by FqlToSqlConverter the including the parameters it is not cached and converted multiple times.
Because of this, the 1688 warning is also shown multiple times.

I've finally implemented the logic which shows the 1688 warning once for every fql string per session. This is not how 4GL works but we cannot be 100% compatible in this regard anyway since 4GL reports warning at the compilation stage (a separate message for any case even if the queries are identical) while and we do it at runtime.

I just reread the notes for this where you mention the 1688 error is a compile time error in the original environment, but the extent value is ignored. What does this actually mean in practical terms? Are all of the elements of the extent field checked by the query? Can you please post the simplest example of original 4GL code that does this, the resulting converted code, and the SQL query that is passed to the database as a result? I see the first 2 aspects of this in #1587-226 (unless the conversion has changed in the meantime), but I'm not clear on what the SQL looks like now for an extent field used with CONTAINS.

As for the error message, if the 4GL never reports this at runtime, we shouldn't, either, OR better: we log an error but do not change the control flow of the program. A compilation error in the 4GL would equate to a conversion warning in FWD, but we have not implemented most 4GL compilation errors as conversion warnings. This one would warrant it, however, since the syntax is allowed, but the runtime behavior is implicitly altered.

I'm sorry I did not catch this being a compilation error earlier.

#253 Updated by Igor Skornyakov over 3 years ago

Eric Faulhaber wrote:

I just reread the notes for this where you mention the 1688 error is a compile time error in the original environment, but the extent value is ignored. What does this actually mean in practical terms? Are all of the elements of the extent field checked by the query? Can you please post the simplest example of original 4GL code that does this, the resulting converted code, and the SQL query that is passed to the database as a result? I see the first 2 aspects of this in #1587-226 (unless the conversion has changed in the meantime), but I'm not clear on what the SQL looks like now for an extent field used with CONTAINS.

See #1587-223 for the semantics of CONTAINS for an extent field.

As for the error message, if the 4GL never reports this at runtime, we shouldn't, either, OR better: we log an error but do not change the control flow of the program. A compilation error in the 4GL would equate to a conversion warning in FWD, but we have not implemented most 4GL compilation errors as conversion warnings. This one would warrant it, however, since the syntax is allowed, but the runtime behavior is implicitly altered.

I'm sorry I did not catch this being a compilation error earlier.

The re-written SQL query for e.g.

def var expre as char init "(word1* | word3*) & (word12c | word33c)".
def var n as int init 0.
for each ttwiabcdefghijklmnopqrstuvwxyxz where (f_ext[2] contains expre)  or  (f_ext contains 'word22a & word22b | word23a & word23b'):
...
end.

where f_ext is an extent field, looks like:
select 
    ttwiabcdef0_.recid as col_0_0_ 
from
    ttwiabcdefghijklmnopqrstuvwxyxz ttwiabcdef0_ 
where
    (ttwiabcdef0_.recid in (
     select distinct recid from ttwiabcdefghijklmnopqrstuvwxyxz t
        join ttwiabcdefghijklmnopqrstuvwxyxz__5 e on (e.parent__id = t.recid)
        join ttwiabcdefghijklmnopqrstuvwxyxz__f_ext as w1 on (w1.parent__id = e.parent__id and w1.list__index  = e.list__index and (w1.word = 'WORD13C' or w1.word = 'WORD12C'))
        join ttwiabcdefghijklmnopqrstuvwxyxz__f_ext as w2 on (w2.parent__id = e.parent__id and w2.list__index  = e.list__index and (w2.word like 'WORD13%' or w2.word like 'WORD12%'))
)) or (ttwiabcdef0_.recid in (
     select distinct recid from ttwiabcdefghijklmnopqrstuvwxyxz t
        join ttwiabcdefghijklmnopqrstuvwxyxz__5 e on (e.parent__id = t.recid)
        join ttwiabcdefghijklmnopqrstuvwxyxz__f_ext as w1 on (w1.parent__id = e.parent__id and w1.list__index  = e.list__index and (w1.word = 'WORD23B' or w1.word = 'WORD22B'))
        join ttwiabcdefghijklmnopqrstuvwxyxz__f_ext as w2 on (w2.parent__id = e.parent__id and w2.list__index  = e.list__index and (w2.word = 'WORD23B' or w2.word = 'WORD22A'))
        join ttwiabcdefghijklmnopqrstuvwxyxz__f_ext as w3 on (w3.parent__id = e.parent__id and w3.list__index  = e.list__index and (w3.word = 'WORD23A' or w3.word = 'WORD22B'))
        join ttwiabcdefghijklmnopqrstuvwxyxz__f_ext as w4 on (w4.parent__id = e.parent__id and w4.list__index  = e.list__index and (w4.word = 'WORD23A' or w4.word = 'WORD22A'))
))
order by
    upper(rtrim(ttwiabcdef0_.if1abcdefghijklmnopqrstuvwxyxz)) asc, ttwiabcdef0_.recid asc

As for the error message, if the 4GL never reports this at runtime, we shouldn't, either, OR better: we log an error but do not change the control flow of the program. A compilation error in the 4GL would equate to a conversion warning in FWD, but we have not implemented most 4GL compilation errors as conversion warnings. This one would warrant it, however, since the syntax is allowed, but the runtime behavior is implicitly altered.

It was my suggestion to deal with warning 1688 at runtime and Greg agreed with this (see #1587-227, #1587-228). I was thinking about writing a warning message to the log instead of showing a modal dialog. This can be done easily, but I believe that we still have to avoid repeating this warning many times. In any case, the main problem (for me) was in fixing the FAST to process the query while ignoring the subscript.

#254 Updated by Eric Faulhaber about 3 years ago

Thanks for the details.

Is the case of the f_ext field being denormalized handled? If so, is it handled above the FQL->SQL conversion layer, so that this layer can be unaware of normalized vs. denormalized extent fields?

#255 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

Thanks for the details.

Is the case of the f_ext field being denormalized handled? If so, is it handled above the FQL->SQL conversion layer, so that this layer can be unaware of normalized vs. denormalized extent fields?

No, the denormalized extents are not handled yet. What is the right place to add this?
Thank you.

#256 Updated by Eric Faulhaber about 3 years ago

Code review 1587b/11924:

I've made various, minor formatting and javadoc fixes and committed these as rev 11925. Other than this, just a few questions/comments:

  • I really don't like that we have awareness of the 1688 warning (or in fact even the Persistence-level query cache) at the ORM Session level. These are implementation details that belong at a higher level, not in the ORM code. There must be a cleaner way to implement this.
  • This point may become moot when the first item is resolved, but why use a concurrent set for Persistence$Context.warn1688 when the only access already is within a synchronized block on a static resource?
  • Please add javadoc for DDLGeneratorWorker$Helper.wordTablesKeys.

Please address these issues and then focus on the performance test case.

#257 Updated by Igor Skornyakov about 3 years ago

Igor Skornyakov wrote:

Eric Faulhaber wrote:

Thanks for the details.

Is the case of the f_ext field being denormalized handled? If so, is it handled above the FQL->SQL conversion layer, so that this layer can be unaware of normalized vs. denormalized extent fields?

No, the denormalized extents are not handled yet. What is the right place to add this?
Thank you.

There are three places affected by the word tables' support
  1. DDL generation
  2. Data import
  3. SQL re-writing
    As I can see now, in all cases all I need is a single flag indicating whether the extent is normalized or not. The rest is very simple.

#258 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

Code review 1587b/11924:

I've made various, minor formatting and javadoc fixes and committed these as rev 11925. Other than this, just a few questions/comments:

Thank you.

  • I really don't like that we have awareness of the 1688 warning (or in fact even the Persistence-level query cache) at the ORM Session level. These are implementation details that belong at a higher level, not in the ORM code. There must be a cleaner way to implement this.

I also do not like this, but I have found no other way to avoid multiple warnings on every step of the incremental result set retrieval.

  • This point may become moot when the first item is resolved, but why use a concurrent set for Persistence$Context.warn1688 when the only access already is within a synchronized block on a static resource?

I will take a look.

  • Please add javadoc for DDLGeneratorWorker$Helper.wordTablesKeys.

OK.

Please address these issues and then focus on the performance test case.

I'm working now on support for extent fields in the _temp database. Without this, the converted code may not work at all. I hope to finish it today.

#259 Updated by Eric Faulhaber about 3 years ago

Igor Skornyakov wrote:

There are three places affected by the word tables' support
  1. DDL generation
  2. Data import
  3. SQL re-writing
    As I can see now, in all cases all I need is a single flag indicating whether the extent is normalized or not. The rest is very simple.

Regarding the SQL rewriting, are you sure the FQL code coming from the HQLPreprocessor (we need to rename this, eventually) is not already appropriate for FQL->SQL conversion in this case? Off the top of my head, I'm not sure what the FQL query would look like in the denormalized case.

#260 Updated by Eric Faulhaber about 3 years ago

Igor Skornyakov wrote:

I'm working now on support for extent fields in the _temp database. Without this, the converted code may not work at all. I hope to finish it today.

What is missing in this implementation?

#261 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

Regarding the SQL rewriting, are you sure the FQL code coming from the HQLPreprocessor (we need to rename this, eventually) is not already appropriate for FQL->SQL conversion in this case? Off the top of my head, I'm not sure what the FQL query would look like in the denormalized case.

In the denormalized case, the re-written SQL should be exactly the same as for a non-extent field (the same applies to DDL generation and import). So the only thing I need is a flag indicating that the extent is denormalized.
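
To make the contrast concrete, here is a rough sketch based on the normalized sample in #1587-253; the table and column names are shortened and purely illustrative, not the exact generated ones:

-- normalized extent: the word table is correlated with the extent table
-- on both parent__id and list__index
select distinct t.recid
from ttwi t
     join ttwi__5 e on (e.parent__id = t.recid)
     join ttwi__f_ext w on (w.parent__id = e.parent__id
                            and w.list__index = e.list__index
                            and w.word like 'WORD1%');

-- denormalized extent: the rewrite should reduce to the same shape as for
-- a non-extent (scalar) field
select t.recid
from ttwi t
where t.recid in (select parent__id from ttwi__f_ext where word like 'WORD1%');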

#262 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

Igor Skornyakov wrote:

I'm working now on support for extent fields in the _temp database. Without this, the converted code may not work at all. I hope to finish it today.

What is missing in this implementation?

As you know, we use the CONTAINS UDF for _temp now. The corresponding SQL needs to be re-written for the case of an extent field, since the UDF should be applied to the field in the extent table (in the normalized case).

#263 Updated by Igor Skornyakov about 3 years ago

How can I run the conversion to use denormalized extent fields?
Thank you.

#264 Updated by Eric Faulhaber about 3 years ago

Igor Skornyakov wrote:

How can I run the conversion to use denormalized extent fields?
Thank you.

I just remembered this needs to be added to the documentation. It is done using schema hints. If you have a data/test.df file, create a hint file named data/namespace/test.schema.hints. The syntax for the hints is described here: #2134-86.

#265 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

Igor Skornyakov wrote:

How can I run the conversion to use denormalized extent fields?
Thank you.

I just remembered this needs to be added to the documentation. It is done using schema hints. If you have a data/test.df file, create a hint file named data/namespace/test.schema.hints. The syntax for the hints is described here: #2134-86.

I see. Thank you.

#266 Updated by Greg Shah about 3 years ago

It was my suggestion to deal with warning 1688 at runtime and Greg agreed with this (see #1587-227, #1587-228).

I missed the fact that this was a compilation warning. The warning should be raised at conversion time. In regard to the part where the code can still be executed in some cases, this suggests that the 4GL compiler drops the bad syntax (the subscript parts) and lets the rest compile. We can do the same thing.

If there is no runtime error or warning, then we should not have one either. By doing everything at compile time, can we get the same behavior as the 4GL? If so, then this also has the benefit of eliminating deep knowledge of this inside the persistence layer.

The only place where I could expect a runtime error/warning is for dynamic queries.

#267 Updated by Igor Skornyakov about 3 years ago

Greg Shah wrote:

It was my suggestion to deal with warning 1688 at runtime and Greg agreed with this (see #1587-227, #1587-228).

I missed the fact that this was a compilation warning. The warning should be raised at conversion time. In regard to the part where the code can still be executed in some cases, this suggests that the 4GL compiler drops the bad syntax (the subscript parts) and lets the rest compile. We can do the same thing.

If there is no runtime error or warning, then we should not have one either. By doing everything at compile time, can we get the same behavior as the 4GL? If so, then this also has the benefit of eliminating deep knowledge of this inside the persistence layer.

The only place where I could expect a runtime error/warning is for dynamic queries.

Got it. Thank you. It will be re-worked. In fact, I can add a warning at conversion time but still re-write the generated code at runtime as now, only silently. This will retain the initial idea that the converted Java code should not be affected. Actually, at this moment the conversion changes are a "pure add-on" - only new database objects are added, without changing the old ones. This will in particular greatly simplify the performance testing and the comparison with the UDF-based approach.
Is it OK?
Thank you.

#268 Updated by Eric Faulhaber about 3 years ago

From email...


From Igor:

Do we have any repository for the temporary tables' DMOs like MetadataManager for regular tables?
All I need is to figure out, by table and column names, whether the field is an extent one.
Thank you

From Ovidiu:

Each property/field of both temp and permanent tables is described by an orm.Property, which has the extent member that will let you know that. If it is 0, it was a scalar field; otherwise it was originally an extent. There is also the index member, which is set only for denormalized fields: if it is 0 then the field is scalar, otherwise it reflects the original extent of the property.

How do you obtain the Property? From many places:

dmoProperty of PropertyMeta;
DmoMeta of each buffer;
already as parameter.

From Igor:

I need this in the FqlToSqlConverter.toSQL(). There is no information about DMOs at this point (at least explicitly), only names. For the regular tables I use MetadataManager, but I do not see any counterpart for temporary tables. Do we have any?
Thank you

#269 Updated by Eric Faulhaber about 3 years ago

I was hoping to keep FqlToSqlConverter clean w.r.t. denormalization knowledge, and to avoid additional lookups at this level, when this information has already been detected at conversion and at higher levels in the runtime. Do you have test cases for denormalized extent fields which produce FQL that does not already convert to SQL naturally?

#270 Updated by Ovidiu Maxiniuc about 3 years ago

From Igor:
I need this in the FqlToSqlConverter.toSQL(). There is no information about DMOs at this point (at least explicitly), only names. For the regular tables I use MetadataManager, but I do not see any counterpart for temporary tables. Do we have any?

Please see

private ArrayDeque<HashMap<String, DmoMeta>> aliases = new ArrayDeque<>();

field member of FqlToSqlConverter. Use

private DmoMeta getScopedDmoInfo(String aliasStr)

method to obtain the dmo meta for a specific alias. Then use its

public Property getProperty(int k)

or

public Property getFieldInfo(String propName)

to obtain the Property you need.

#271 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

I was hoping to keep FqlToSqlConverter clean w.r.t. denormalization knowledge, and to avoid additional lookups at this level, when this information has already been detected at conversion and at higher levels in the runtime. Do you have test cases for denormalized extent fields which produce FQL that does not already convert to SQL naturally?

Eric,
At this moment we have problems with such a simple program:

define temp-table ttwi
    field if1 as character
    field ifext as character extent 5
    index wi1 is word-index if1 asc
    index wiext is word-index ifext asc.

// populate ttwi
...

for each ttwi where ifext contains "word1*|word2*":
    message ifext[1].
end.

since ifext does not exist in the temporary table (I understand that it is normalized by default).
Regarding denormalization: it seems that in FqlToSqlConverter we do not really need the information about the denormalization per se, since the containsQuery method only checks whether WordIndexData.extentTableName is not null. This data is maintained by MetadataManager.
At least this is true for regular tables. Regarding the temporary ones, the situation is not clear for me at this moment.

#272 Updated by Igor Skornyakov about 3 years ago

Ovidiu Maxiniuc wrote:

From Igor:
I need this in the FqlToSqlConverter.toSQL(). There is no information about DMOs at this point (at least explicitly), only names. For the regular tables I use MetadataManager, but I do not see any counterpart for temporary tables. Do we have any?

Please see
[...]

field member of FqlToSqlConverter. Use
[...]

method to obtain the dmo meta for a specific alias. Then use its
[...]

or
[...]

to obtain the Property you need.

Thank you!

#273 Updated by Eric Faulhaber about 3 years ago

Although this is excellent information to document, I do want to reiterate that I want to avoid putting denormalization logic in FqlToSqlConverter, if possible. By the time a query which references denormalized fields reaches the SQL generation stage, it seems like the FQL should have been sufficiently processed that it should convert naturally to SQL, without having to figure this out again.

#274 Updated by Eric Faulhaber about 3 years ago

Igor, sorry, I missed your intermediate post. Digesting it now...

#275 Updated by Eric Faulhaber about 3 years ago

Igor Skornyakov wrote:

Eric Faulhaber wrote:

I was hoping to keep FqlToSqlConverter clean w.r.t. denormalization knowledge, and to avoid additional lookups at this level, when this information has already been detected at conversion and at higher levels in the runtime. Do you have test cases for denormalized extent fields which produce FQL that does not already convert to SQL naturally?

Eric,
At this moment we have problems with such a simple program:
[...]
since ifext does not exist in the temporary table (I understand that it is normalized by default).

I recall discussing this in the past, but I don't recall whether we ended up implementing it. If this is what you are seeing (i.e., normalized by default for temp-tables) for this test case, then we must have.

Regarding the denormalization. It seems that in the FqlToSqlConverter we do not really need the information about the denormalization per se since the containsQuery method only checks if the WordIndexData.extentTableName is not null. This data is maintained by MetadataManager.
At least this is true for regular tables. Regarding the temporary ones, the situation is not clear for me at this moment.

What does the converted code look like for this test case?

What specifically are the problems we are having with this test case?

#276 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

What does the converted code look like for this test case?

         forEach("loopLabel1", new Block((Init) () -> 
         {
            query1.initialize(ttwi, "contains(upper(ttwi.if1), 'CAT*|ANT*')", null, "ttwi.if1 asc");
         }, 
         (Body) () -> 
         {
            query1.next();
            message((character) new FieldReference(ttwi, "if1").getValue());
         }));

What specifically are the problems we are having with this test case?

I'm changing the code at this moment and cannot provide the exact error but it was about a missing column in the table which is understandable if the field is normalized.

#277 Updated by Eric Faulhaber about 3 years ago

Igor Skornyakov wrote:

I'm changing the code at this moment and cannot provide the exact error but it was about a missing column in the table which is understandable if the field is normalized.

When you have your local code at a stable point, please post the exact error message, as well as the stack at the time of the error. I just want to understand what is happening. I'm pretty sure I did not make the original CONTAINS UDF implementation denormalization-safe, because I wasn't aware at the time that extent fields could be used as the lvalue of the CONTAINS operator. I expect you inherited this limitation when you updated the UDF.

#278 Updated by Igor Skornyakov about 3 years ago

I've re-worked the SQL generation for CONTAINS for temporary tables with an extent field (see #1587-271). However, I've experienced a problem which I've not seen for regular tables:

[01/25/2021 22:41:56 GMT+03:00] (com.goldencode.p2j.persist.Persistence:WARNING) [0000000E:00000022:bogus-->local/_temp/primary] error executing query [from Ttwi_1_1__Impl__ as ttwi where ((ttwi._multiplex = ?0) and (contains_1(upper(rtrim(ttwi.ifext)), 'WORD1*|WORD2*')))  order by ttwi._multiplex asc, upper(rtrim(ttwi.ifext)) asc, ttwi.recid asc]
[01/25/2021 22:41:56 GMT+03:00] (com.goldencode.p2j.persist.Persistence:SEVERE) [0000000E:00000022:bogus-->local/_temp/primary] error executing query [from Ttwi_1_1__Impl__ as ttwi where ((ttwi._multiplex = ?0) and (contains_1(upper(rtrim(ttwi.ifext)), 'WORD1*|WORD2*')))  order by ttwi._multiplex asc, upper(rtrim(ttwi.ifext)) asc, ttwi.recid asc]
com.goldencode.p2j.persist.PersistenceException: Error while processing the SQL list
Caused by: org.h2.jdbc.JdbcSQLSyntaxErrorException: Column "TTWI_1_1__0_.IFEXT" not found; SQL statement:

select 
        ttwi_1_1__0_.recid as id0_, ttwi_1_1__0_._multiplex as column1_0_, ttwi_1_1__0_._errorFlag as column2_0_, ttwi_1_1__0_._originRowid as column3_0_, ttwi_1_1__0_._errorString as column4_0_, ttwi_1_1__0_._peerRowid as column5_0_, ttwi_1_1__0_._rowState as column6_0_, ttwi_1_1__0_.if1 as if7_0_ 
from
        tt1 ttwi_1_1__0_ 
where
        ttwi_1_1__0_._multiplex = ? and (recid in (select parent__id  from tt1__5 where contains_1(upper(ifext), "WORD1*|WORD2*")))
order by
        ttwi_1_1__0_._multiplex asc, upper(rtrim(ttwi_1_1__0_.ifext)) asc nulls last, ttwi_1_1__0_.recid asc
 limit ? [42122-200]

Looking into how to fix the ORDER BY clause. In fact, I do not completely understand why upper(rtrim(ttwi_1_1__0_.ifext)) was added at all.

#279 Updated by Ovidiu Maxiniuc about 3 years ago

Igor Skornyakov wrote:

Looking how to fix the ORDER BY clause. In fact, I do not completely understand why upper(rtrim(ttwi_1_1__0_.ifext)) was added at all.

Please provide your definition of the table in ABL. If possible, the TempTableHelper.sqlCreateTables and TempTableHelper.sqlCreateIndexes for the ttwi table.

#280 Updated by Igor Skornyakov about 3 years ago

Ovidiu Maxiniuc wrote:

Igor Skornyakov wrote:

Looking how to fix the ORDER BY clause. In fact, I do not completely understand why upper(rtrim(ttwi_1_1__0_.ifext)) was added at all.

Please provide your definition of the table in ABL. If possible, the TempTableHelper.sqlCreateTables and TempTableHelper.sqlCreateIndexes for the ttwi table.

The table definition in the ABL is (see #1587-271):

define temp-table ttwi
    field if1 as character
    field ifext as character extent 5
    index wi1 is word-index if1 asc
    index wiext is word-index ifext asc.

Generated DDL statements are:
create local temporary table tt1 (
   recid bigint not null,
   _multiplex integer not null,
   _errorFlag integer,
   _originRowid bigint,
   _errorString varchar,
   _peerRowid bigint,
   _rowState integer,
   if1 varchar,
   __iif1 varchar as upper(rtrim(if1)),
   primary key (recid)
) transactional;

create local temporary table tt1__5 (
   parent__id bigint not null,
   list__index integer not null,
   ifext varchar,
   __iifext varchar as upper(rtrim(ifext)),
   primary key (parent__id, list__index)
) transactional;

alter table tt1__5
   add constraint FK_TT1__5
   foreign key (parent__id)
   references tt1
   on delete cascade transactional;

create index idx_mpid__tt1__${1} on tt1 (_multiplex, recid ) transactional;

#281 Updated by Ovidiu Maxiniuc about 3 years ago

ifext was added to the order by clause because the wiext index was selected to 'drive' the query. Normally this should not be a problem because extent fields cannot be part of normal indexes, so the column is found in the primary table. If this is permitted with a word-index, then we need to create the join with the secondary table (ifext is stored in the secondary table tt1__5). I have several questions here: which of the elements of the extent is indexed? All of them? How? Split into independent words?

However, there is one more issue I see here: I think __iifext should have been used instead, but this may also be because extent fields are known not to be part of indexes and some rule failed to apply to them. You'll need to fix that also.

#282 Updated by Eric Faulhaber about 3 years ago

It seems this is a problem with index selection. See https://proj.goldencode.com/artifacts/javadoc/latest/api/com/goldencode/p2j/schema/package-summary.html#Determining_Query_Sorting_, in particular, the index selection rules subsection.

My understanding is that a word index can be selected for sorting, but an index containing an extent field cannot. Apparently, a word index can contain an extent field, unlike other types of index. So, which rule has precedence? I think in the denormalized case, we are stopping at rule 3 ("A word index referenced through contains"), because we are not eliminating the index earlier as an extent field (since it appears to have been denormalized). This may be a general index selection problem for denormalized extent fields, not necessarily limited to this word index use case.

My educated guess is that we should not be using the word index in the sort clause, due to the inclusion of the extent field, but I am not sure. We need to determine whether an extent field disqualifies a word index from being used as the sort driver. The only way to do this is through test cases. Once we understand the expected behavior, we can fix the index selection.

#283 Updated by Igor Skornyakov about 3 years ago

Ovidiu Maxiniuc wrote:

ifext was added to the order by clause because the wiext index was selected to 'drive' the query. Normally this should not be a problem because extent fields cannot be part of normal indexes, so the column is found in the primary table. If this is permitted with a word-index, then we need to create the join with the secondary table (ifext is stored in the secondary table tt1__5). I have several questions here: which of the elements of the extent is indexed? All of them? How? Split into independent words?

The CONTAINS operator value for an extent field is the OR of CONTAINS applied to all its components. I do not know how the word index is implemented in 4GL, but for the regular tables the word table contains three fields - parent__id, list__index, and word. See #1587-253 for a sample rewritten SQL. For the temporary tables we still use the UDF; see #1587-278 for a sample re-written SQL in the case of an extent field.
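
For reference, a minimal sketch of a word table of the kind described and of the simplest single-word rewrite; the names are shortened and illustrative, not the exact generated DDL:

-- a word table of the kind described (names shortened for illustration)
create table ttwi__f_ext (
   parent__id  bigint  not null,   -- recid of the owning record
   list__index integer not null,   -- index of the extent element the word came from
   word        varchar not null,   -- a single word, stored uppercased
   primary key (parent__id, list__index, word)
);

-- the simplest case, "f_ext contains 'word1'", then reduces to a semi-join
select t.recid
from ttwi t
where t.recid in (select w.parent__id from ttwi__f_ext w where w.word = 'WORD1');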

However, there is one more issue I see here: I think __iifext should have been used instead, but this may also be caused because the extent fields are known not to be part of indexes and some rule failed to apply to them. You'll need to fix that also.

Sorry. Should I use __iifext in the ORDER BY clause? Does it make sense?
Thank you.

#284 Updated by Eric Faulhaber about 3 years ago

Hm, I did not look carefully enough at the DDL. The presence of the tt1__5 table means that temp-tables are not denormalized by default, as was mentioned earlier.

But, as Ovidiu notes, this problem probably is still due to the conversion conflict between a word index being allowed to drive the sort, while an index with an extent field is disallowed. I think this still needs to be fixed in the index selection conversion, but we need to understand the original behavior through test cases.

#285 Updated by Eric Faulhaber about 3 years ago

To be more explicit regarding the test case requirement, please add a primary index to your test case temp-table, which sorts in a noticeably different way than the word index. Populate the table with values so that you can detect which index was selected, based on the way the results are sorted.

#286 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

It seems this is a problem with index selection. See https://proj.goldencode.com/artifacts/javadoc/latest/api/com/goldencode/p2j/schema/package-summary.html#Determining_Query_Sorting_, in particular, the index selection rules subsection.

My understanding is that a word index can be selected for sorting, but an index containing an extent field cannot. Apparently, a word index can contain an extent field, unlike other types of index. So, which rule has precedence? I think in the denormalized case, we are stopping at rule 3 ("A word index referenced through contains"), because we are not eliminating the index earlier as an extent field (since it appears to have been denormalized). This may be a general index selection problem for denormalized extent fields, not necessarily limited to this word index use case.

My educated guess is that we should not be using the word index in the sort clause, due to the inclusion of the extent field, but I am not sure. We need to determine whether an extent field disqualifies a word index from being used as the sort driver. The only way to do this is through test cases. Once we understand the expected behavior, we can fix the index selection.

My understanding is that a word index can never be used for sorting, just to optimize the processing of CONTAINS (like e.g. a bitmap index in Oracle). Even if it somehow affects the ordering of the result set in the absence of explicit ordering, I do not think that it is feasible to understand the corresponding logic using "black box testing".

#287 Updated by Ovidiu Maxiniuc about 3 years ago

Igor Skornyakov wrote:

Should I use __iifext in the ORDER BY clause? Does it make sense?

Did you add ifext in the ORDER BY clause? I think this was done by existing code. Regardless, __iifext is the ignore-case precomputed value of ifext, to be used as an index component because of the lack of support for expressions in index components in some SQL dialects. PSQL is the only one that accepts expressions in index components, so in that case __iifext is not generated and the index/order-by clauses use upper(trim(ifext)) instead.
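
A side-by-side sketch of the two dialect strategies, based on the DDL shown above for tt1__5; the index names are illustrative and the PostgreSQL variant is only an illustration of the expression-index approach:

-- H2: a precomputed ignore-case column that can be indexed directly
create local temporary table tt1__5 (
   parent__id bigint not null,
   list__index integer not null,
   ifext varchar,
   __iifext varchar as upper(rtrim(ifext)),
   primary key (parent__id, list__index)
) transactional;
create index idx__tt1__5_ifext on tt1__5 (__iifext);

-- PostgreSQL: no extra column; an expression index is used instead
-- create index idx__tt1__5_ifext on tt1__5 (upper(rtrim(ifext)));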

#288 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

Hm, I did not look carefully enough at the DDL. The presence of the tt1__5 table means that temp-tables are not denormalized by default, as was mentioned earlier.

But, as Ovidiu notes, this problem probably is still due to the conversion conflict between a word index being allowed to drive the sort, while an index with an extent field is disallowed. I think this still needs to be fixed in the index selection conversion, but we need to understand the original behavior through test cases.

As I wrote, I do not think that it is possible to understand the "default" sort order in 4GL using "black box testing", but I would be very surprised if the value of the field used in the word index were ever used for ordering. It doesn't make sense, since the order of the words in this field is irrelevant; moreover, word separators are ignored and they can be found at the very beginning of the field value.

#289 Updated by Igor Skornyakov about 3 years ago

Ovidiu Maxiniuc wrote:

Igor Skornyakov wrote:

Should I use __iifext in the ORDER BY clause? Does it make sense?

Did you add ifext in the ORDER BY clause? I think this was done by existing code. Regardless, __iifext is the ignore-case precomputed value of ifext, to be used as an index component because of the lack of support for expressions in index components in some SQL dialects. PSQL is the only one that accepts expressions in index components, so in that case __iifext is not generated and the index/order-by clauses use upper(trim(ifext)) instead.

Yes, it was done by the existing code. My changes do not affect ORDER BY generation.

#290 Updated by Igor Skornyakov about 3 years ago

As is well known, the SQL standard explicitly says that in the absence of the ORDER BY clause the order of the SELECT result set is undefined. I've not found any information about the "default" order of the records retrieved by a query in 4GL. However, the article https://knowledgebase.progress.com/articles/Article/000012195 makes me think that word indexes are very close to an index by word on the word table (we do not have such an index now, but it can be easily added) and that in the case of the simplest expression (just one word), the records should be ordered by word (not by the field value). What happens for a more complicated expression is unclear (to me).

#291 Updated by Igor Skornyakov about 3 years ago

Igor Skornyakov wrote:

As is well known, the SQL standard explicitly says that in the absence of the ORDER BY clause the order of the SELECT result set is undefined. I've not found any information about the "default" order of the records retrieved by a query in 4GL. However, the article https://knowledgebase.progress.com/articles/Article/000012195 makes me think that word indexes are very close to an index by word on the word table (we do not have such an index now, but it can be easily added) and that in the case of the simplest expression (just one word), the records should be ordered by word (not by the field value). What happens for a more complicated expression is unclear (to me).

Indeed, at least for simple queries the records are ordered by matched words, not by the field.
Consider the following program:

define temp-table ttwi
    field if1 as character
    field ifext as character extent 5
    index wi1 is word-index if1 asc
    index wiext is word-index ifext asc.

create ttwi.
if1 = "aw1 wordc".
create ttwi.
if1 = "aw2 wordb".
create ttwi.
if1 = "aw3 worda".
OUTPUT TO words.out.
for each ttwi where if1 contains "word*":
    message if1.
end.
OUTPUT CLOSE.

The output is:
aw3 worda
aw2 wordb
aw1 wordc
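
If the word-table approach sketched earlier were applied here, one way to reproduce this single-word ordering could be to sort on the matched word itself; this is purely a sketch (using a hypothetical word table ttwi__if1 for the if1 field), not what FWD generates today:

select t.if1
from ttwi t
     join ttwi__if1 w on (w.parent__id = t.recid and w.word like 'WORD%')
order by w.word asc;
-- 'WORDA' < 'WORDB' < 'WORDC', matching the output above; queries where several
-- words of the same record match would need deduplication before such a sort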

#292 Updated by Igor Skornyakov about 3 years ago

The order of the records in case of more complicated queries (see #1587-291) is less obvious. Consider the following code:

define temp-table ttwi
    field if1 as character
    field ifext as character extent 5
    index wi1 is word-index if1 asc
    index wiext is word-index ifext asc.

create ttwi.
if1 = "aw1 wordc wrda".
create ttwi.
if1 = "aw2 wordb wrdb".
create ttwi.
if1 = "aw3 worda wrdc".

Then

for each ttwi where if1 contains "word* & wrd*":
    message if1.
end.

returns

aw3 worda wrdc
aw2 wordb wrdb
aw1 wordc wrda

while

for each ttwi where if1 contains "wrd* & word*":
    message if1.
end.

returns

aw1 wordc wrda
aw2 wordb wrdb
aw3 worda wrdc

The same with '&' replaced with '|'.

#293 Updated by Greg Shah about 3 years ago

I guess the leftmost contains word match is the most significant ordering constraint. Then the next word match and so on for sub-sorting.

#294 Updated by Igor Skornyakov about 3 years ago

Greg Shah wrote:

I guess the leftmost contains word match is the most significant ordering constraint. Then the next word match and so on for sub-sorting.

Yes, it looks something like this. However, at this moment I do not see how to enforce this in FWD. In the approach based on word tables (as well as on UDF) the order of terms in the condition is completely irrelevant.
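
Purely as a speculative sketch of how the observed "leftmost term first" ordering might be expressed on top of the per-term join aliases the rewrite already produces (this is not something FWD does today; ttwi__if1 is the same hypothetical word table as above):

select t.if1
from ttwi t
     join ttwi__if1 w1 on (w1.parent__id = t.recid and w1.word like 'WORD%')
     join ttwi__if1 w2 on (w2.parent__id = t.recid and w2.word like 'WRD%')
order by w1.word asc, w2.word asc;   -- the leftmost term drives the sort, the next term sub-sorts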

#295 Updated by Ovidiu Maxiniuc about 3 years ago

Greg Shah wrote:

I guess the leftmost contains word match is the most significant ordering constraint. Then the next word match and so on for sub-sorting.

Yes, this looks like the pattern.
I have another question: what is the sort order of:

aw1 worda
aw3 worda
aw2 worda
? I.e., do the other words in the field matter?
What about
aw3 worda
aw1 worda
worda aw2
? Does the order of words matter?

#296 Updated by Igor Skornyakov about 3 years ago

Ovidiu Maxiniuc wrote:

[...]? I.e., do the other words in the field matter?

As far as I can see, it doesn't matter.

What about
[...]? Does the order of words matter?

As far as I can see, it doesn't matter either.

#297 Updated by Eric Faulhaber about 3 years ago

Igor Skornyakov wrote:

Greg Shah wrote:

I guess the leftmost contains word match is the most significant ordering constraint. Then the next word match and so on for sub-sorting.

Yes, it looks something like this. However, at this moment I do not see how to enforce this in FWD. In the approach based on word tables (as well as on UDF) the order of terms in the condition is completely irrelevant.

Please see if you can come up with any test case which breaks this rule or which further refines it. Once we are reasonably confident, the next step is to come up with a way to express this in SQL and we can work backward from there to determine what needs to be done during conversion of such queries. It seems our usual approach of simply converting the index to an FQL sort clause is insufficient here, and we will need to address this as an exceptional case.

We have found that behavior sometimes varies between persistent table and temp-table, so please mirror the test cases for both types of tables, so that we don't make incorrect assumptions based on only one or the other.

#298 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

Please see if you can come up with any test case which breaks this rule or which further refines it. Once we are reasonably confident, the next step is to come up with a way to express this in SQL and we can work backward from there to determine what needs to be done during conversion of such queries. It seems our usual approach of simply converting the index to an FQL sort clause is insufficient here, and we will need to address this as an exceptional case.

I'm thinking about it right now.

We have found that behavior sometimes varies between persistent table and temp-table, so please mirror the test cases for both types of tables, so that we don't make incorrect assumptions based on only one or the other.

Ah, this is important. Thank you! I will consider this.

#299 Updated by Igor Skornyakov about 3 years ago

There is some strange thing I've noticed regarding the generated sort parameter of the AdaptiveQuery.initialize() call.
For the permanent table

ADD TABLE "ttwiabcdefghijklmnopqrstuvwxyxz" 
  AREA "Schema Area" 
  DUMP-NAME "ttwi" 

ADD FIELD "if1abcdefghijklmnopqrstuvwxyxz" OF "ttwiabcdefghijklmnopqrstuvwxyxz" AS character 
  FORMAT "x(128)" 
  INITIAL "" 
  POSITION 2
  MAX-WIDTH 256
  ORDER 10

ADD FIELD "if2abcdefghijklmnopqrstuvwxyxz" OF "ttwiabcdefghijklmnopqrstuvwxyxz" AS character 
  FORMAT "x(8)" 
  INITIAL "" 
  POSITION 3
  MAX-WIDTH 16
  ORDER 20
  CASE-SENSITIVE

ADD FIELD "f_ext" OF "ttwiabcdefghijklmnopqrstuvwxyxz" AS character 
  FORMAT "x(50)" 
  INITIAL "" 
  POSITION 4
  MAX-WIDTH 510
  EXTENT 5
  ORDER 30

  ADD INDEX "wi1" ON "ttwiabcdefghijklmnopqrstuvwxyxz" 
  AREA "Schema Area" 
  WORD
  INDEX-FIELD "if1abcdefghijklmnopqrstuvwxyxz" ASCENDING 

ADD INDEX "wi2" ON "ttwiabcdefghijklmnopqrstuvwxyxz" 
  AREA "Schema Area" 
  INACTIVE
  WORD
  INDEX-FIELD "if2abcdefghijklmnopqrstuvwxyxz" ASCENDING 

ADD INDEX "wiext" ON "ttwiabcdefghijklmnopqrstuvwxyxz" 
  AREA "Schema Area" 
  INACTIVE
  WORD
  INDEX-FIELD "f_ext" ASCENDING 

the code fragment

def var expre as char init "(word1* | word3*) & (word12c | word33c)".
def var n as int init 0.
for each ttwiabcdefghijklmnopqrstuvwxyxz where (f_ext contains expre)  or  (f_ext contains 'word22a & word22b | word23a & word23b'):
    n = n + 1.
    message n f_ext[1].
end.

is converted to

         forEach("loopLabel1", new Block((Init) () -> 
         {
            RecordBuffer.openScope(ttwiabcdefghijklmnopqrstuvwxyxz);
            query1.initialize(ttwiabcdefghijklmnopqrstuvwxyxz, "contains(upper(ttwiabcdefghijklmnopqrstuvwxyxz.FExt), ?) or contains(upper(ttwiabcdefghijklmnopqrstuvwxyxz.FExt), 'WORD22A & WORD22B | WORD23A & WORD23B')", null, "ttwiabcdefghijklmnopqrstuvwxyxz.if1abcdefghijklmnopqrstuvwxyxz asc", new Object[]
            {
               toUpperCase(expre)
            });
         }, 
         (Body) () -> 
         {
            query1.next();
            n_1.assign(plus(n_1, 1));
            message(new Object[]
            {
               n_1,
               (character) new FieldReference(ttwiabcdefghijklmnopqrstuvwxyxz, "FExt", 0).getValue()
            });
 

(the sort is "ttwiabcdefghijklmnopqrstuvwxyxz.if1abcdefghijklmnopqrstuvwxyxz asc" which is not related to the CONTAINS argument)

For the temporary table

define temp-table ttwi
    field if1 as character
    field ifext as character extent 5
    index wi1 is word-index if1 asc
    index wiext is word-index ifext asc.

the code fragment

for each ttwi where ifext contains "word1*|word2*":
    message ifext[1].
end.

is converted to

         forEach("loopLabel0", new Block((Init) () -> 
         {
            query0.initialize(ttwi, "contains(upper(ttwi.ifext), 'WORD1*|WORD2*')", null, "ttwi.ifext asc");
         }, 
         (Body) () -> 
         {
            query0.next();
            message((character) new FieldReference(ttwi, "ifext", 0).getValue());
         }));

(the sort is "ttwi.ifext asc" - uses CONTAINS argument).
This doesn't look consistent, and in the first case simply not logical.

#300 Updated by Greg Shah about 3 years ago

Is 1587b safe for merge into 3821c? Does it work properly for the large customer application which you tested? We are about to reset our testing environments for that customer application which means a database import will be done anyway. So a merge now may reduce the number of database resets that we need to do.

#301 Updated by Igor Skornyakov about 3 years ago

Greg Shah wrote:

Is 1587b safe for merge into 3821c? Does it work properly for the large customer application which you tested? We are about to reset our testing environments for that customer application which means a database import will be done anyway. So a merge now may reduce the number of database resets that we need to do.

There are two things that can cause a problem:
  1. temp-tables with word index for an extent field (should be fixed today, not sure if such tables are used in the real customer apps)
  2. uppercasing with LATIN1 database (#5082). This is a real problem with at least one large customer app.

Another thing is ordering imposed by CONTAINS. It is unclear now how fast it can be resolved, but the current version of FWD is not compatible with 4GL anyway.

#302 Updated by Greg Shah about 3 years ago

uppercasing with LATIN1 database (#5082). This is a real problem with at least one large customer app.

Does 1587b cause this to be an issue where it is not an issue today?

Another thing is ordering imposed by CONTAINS. It is unclear now how fast it can be resolved, but the current version of FWD is not compatible with 4GL anyway.

If 1587b is not worse than 3821c, then it should be merged.

#303 Updated by Igor Skornyakov about 3 years ago

Greg Shah wrote:

uppercasing with LATIN1 database (#5082). This is a real problem with at least one large customer app.

Does 1587b cause this to be an issue where it is not an issue today?

With 1587b we have a problem with import, since we put words converted to uppercase into the word table. Previously we had no problems with import, but could potentially have one at runtime with some values of the CONTAINS expression. Please note that the latter can happen only in a very rare situation, since there are only a few records that can cause this (~10 records of ~250,000 for two fields in a single table).

#304 Updated by Igor Skornyakov about 3 years ago

Igor Skornyakov wrote:

There are two things that can cause a problem:
  1. temp-tables with word index for an extent field (should be fixed today, not sure if such tables are used in the real customer apps)

BTW: in this regard, 1587b is also at least not worse than 3821c, since we still use the UDF for the temp-tables. It is even better, since it contains a fixed version of the CONTAINS UDF and handles subscripted references to the extent field properly. I've removed reporting of the 1688 warning (see #1587-226) at runtime but have not added a conversion-time warning. This should be a one-liner, but I have not found the right place for it so far.

#305 Updated by Igor Skornyakov about 3 years ago

I understand that the incorrect ORDER BY component is added in the PreselectQuery.assembleHQL() method. The SortCriterion instance which causes the problem is:

    SortCriterion  (id=7908)    
    alias    "ttwi" (id=8079)    
    ascending    true    
    computedColumnPrefix    null    
    dialect    P2JH2Dialect  (id=7762)    
    dmoClass    Class<T> (com.goldencode.testcases.dmo._temp.Ttwi_1_1__Impl__) (id=1186)    
    ignoreCase    true    
    isCharacter    true    
    method    Method  (id=8082)    
    name    "ttwi.ifext" (id=8083)    
    originalName    "ttwi.ifext" (id=8083)    
    propertyName    "ifext" (id=8085)    
    subscript    -1 [0xffffffff]    

At this point it is unclear where this component comes from (it is added at conversion).
Is it OK just to remove it because the property is an extent (the method has an argument) and the subscript == -1? I understand that an array as a sort column doesn't make sense. Please note that this will affect not only queries with the CONTAINS operator.
Thank you.

#306 Updated by Eric Faulhaber about 3 years ago

We should not be adjusting for conversion errors at runtime. If something is incorrectly added by conversion, we should fix the conversion. In this case, we should not be sorting by an extent field, but I think that index selection is selecting the word index, and then downstream conversion is treating that as if it were a regular index on which to sort, regardless of the inclusion of an extent field in the index. This is disallowed for indices, so the sort phrase conversion is written with the assumption that extent fields will not be in the selected index. This is a flaw in the sort phrase conversion. I still don't know definitively what the correct sort behavior should be. Is this something we are able to know deterministically at conversion time, such that we can fashion a sort phrase?

#307 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

We should not be adjusting for conversion errors at runtime. If something is incorrectly added by conversion, we should fix the conversion. In this case, we should not be sorting by an extent field, but I think that index selection is selecting the word index, and then downstream conversion is treating that as if it were a regular index on which to sort, regardless of the inclusion of an extent field in the index. This is disallowed for indices, so the sort phrase conversion is written with the assumption that extent fields will not be in the selected index. This is a flaw in the sort phrase conversion. I still don't know definitively what the correct sort behavior should be. Is this something we are able to know deterministically at conversion time, such that we can fashion a sort phrase?

Eric,
I understand your point. However, the initial idea was not to change the conversion at all. And this particular problem is just about allowing the merge of 1587b into 3821c. I understand that Greg wants to do this before the right solution for sorting is implemented.

#308 Updated by Eric Faulhaber about 3 years ago

Igor Skornyakov wrote:

I understand your point. However, the initial idea was not to change the conversion at all.

Yes, that's how we started, but that idea was based on having less information than we have now, before we knew we had an existing bug in conversion. There is no edict to keep from making conversion changes. That being said, I don't think this is the moment to focus on this fix.

And this particular problem is just about allowing the merge of 1587b into 3821c. I understand that Greg wants to do this before the right solution for sorting is implemented.

I'm not sure how this issue is preventing merging 1587b into 3821c. I thought in #1587-304, you said that it is no worse in 1587b in this regard. AFAIK, this problem already exists in trunk. That is, if a query uses a word index which is based on an extent field, we already will have a broken sort clause, won't we? If you have found that the changes in 1587b actually make this problem somehow worse, and I have misunderstood #1587-304, then we can make a temporary runtime adjustment. Otherwise I would say, let's not let this issue hold back the merge.

That being said, I think the LATIN-1 import issue is a showstopper for the merge. Do you have any further ideas how to resolve that (even temporarily to allow the merge, so 1587b does not make the situation worse than in the current 3821c)?

#309 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

Igor Skornyakov wrote:

I understand your point. However, the initial idea was not to change the conversion at all.

Yes, that's how we started, but that idea was based on having less information than we have now, before we knew we had an existing bug in conversion. There is no edict to keep from making conversion changes. That being said, I don't think this is the moment to focus on this fix.

And this particular problem is just about allowing the merge of 1587b into 3821c. I understand that Greg wants to do this before the right solution for sorting is implemented.

I'm not sure how this issue is preventing merging 1587b into 3821c. I thought in #1587-304, you said that it is no worse in 1587b in this regard. AFAIK, this problem already exists in trunk. That is, if a query uses a word index which is based on an extent field, we already will have a broken sort clause, won't we? If you have found that the changes in 1587b actually make this problem somehow worse, and I have misunderstood #1587-304, then we can make a temporary runtime adjustment. Otherwise I would say, let's not let this issue hold back the merge.

That being said, I think the LATIN-1 import issue is a showstopper for the merge. Do you have any further ideas how to resolve that (even temporarily to allow the merge, so 1587b does not make the situation worse than in the current 3821c)?

Well, if changes in the conversion are allowed, I can implement a workaround for import by converting to uppercase at the database side (see #1587-240). I do not think that it should be difficult.

#310 Updated by Eric Faulhaber about 3 years ago

Igor Skornyakov wrote:

Well, if changes in the conversion are allowed I can implement a workaround for import by converting to uppercase at the database side (see #1587-240). I do not think that it should be difficult.

This would be limited to just the word tables, correct? Currently, we don't manipulate the text data for any other reasons that I recall, other than what may happen naturally reading into and writing out of Java.

Sorry, I'm not clear on how this changes conversion. Does it impact the schema/DDL?

#311 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

Igor Skornyakov wrote:

Well, if changes in the conversion are allowed, I can implement a workaround for import by converting to uppercase at the database side (see #1587-240). I do not think that it should be difficult.

This would be limited to just the word tables, correct? Currently, we don't manipulate the text data for any other reasons that I recall, other than what may happen naturally reading into and writing out of Java.

Sorry, I'm not clear on how this changes conversion. Does it impact the schema/DDL?

This will require a simple change in the import. But to make the word table work correctly, we need to change the conversion of CONTAINS to a UDF call by removing the uppercasing of the expression in the Java code (and the uppercasing of string literals during conversion). That's it. The schema/DDL will be the same, but the Java code will be changed.

#312 Updated by Eric Faulhaber about 3 years ago

Igor Skornyakov wrote:

This will require a simple change in the import. But to make the word table work correctly, we need to change the conversion of CONTAINS to a UDF call by removing the uppercasing of the expression in the Java code (and the uppercasing of string literals during conversion). That's it. The schema/DDL will be the same, but the Java code will be changed.

Is the import change (to uppercase in the database) safe across all dialects we currently support? If so, please proceed. Please leave enough information in the code so that we know we have to come back around and find a more permanent solution. Thanks.

#313 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

Is the import change (to uppercase in the database) safe across all dialects we currently support? If so, please proceed. Please leave enough information in the code so that we know we have to come back around and find a more permanent solution. Thanks.

Yes, I think that it is dialect-neutral since UPPER is a standard SQL function.
Thank you. Hopefully, I will do it tomorrow, or at the weekend at the latest.
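
A minimal illustration of the idea; the statement shape and the ttwi__if1 table name are illustrative, not the actual import code:

-- the word value is uppercased by the database itself, so the result follows
-- the database's own character-set rules rather than Java's
insert into ttwi__if1 (parent__id, list__index, word)
values (?, ?, upper(?));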

#314 Updated by Igor Skornyakov about 3 years ago

1. Removed warning 1688 at runtime
2. Added support for word index on extent field (PostgreSQL and H2, permanent tables)
3. Fixed uppercasing on DDL generation, import, and runtime for CONTAINS support

Committed to 1587b revision 11926.

#315 Updated by Eric Faulhaber about 3 years ago

Igor Skornyakov wrote:

1. Removed warning 1688 at runtime
2. Added support for word index on extent field (PostgreSQL and H2, permanent tables)
3. Fixed uppercasing on DDL generation, import, and runtime for CONTAINS support

Committed to 1587b revision 11926.

The changes in rev 11926 look good to me.

Have you tested the import with the customer application which previously was failing with the encoding issue (at least up through the first instance of failure)? If so, and the failure is worked around, please proceed with the rebase and merge of 1587b to 3821c.

#316 Updated by Igor Skornyakov about 3 years ago

Branch 1587b was rebased to 3821c rev. 11968

Pushed up to revision 11993.

The only thing remaining from the planned ones, for now, is removing the CONTAINS operator expression uppercasing on conversion. For LATIN1 it affects only cases when this expression contains the LATIN1 "micro" character (µ). I understand that with 3821c this will cause a runtime exception, while with 1587b it will result in an incorrect result set (some records may be missed).

#317 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

Have you tested the import with the customer application which previously was failing with the encoding issue (at least up through the first instance of failure)? If so, and the failure is worked around, please proceed with the rebase and merge of 1587b to 3821c.

Yes, I've tested it and it works fine now. Merging...

#318 Updated by Igor Skornyakov about 3 years ago

Branch 1587b was merged into 3821c revision 11969.

Can I continue using 1587b or should I create a new branch from 3821c?
Thank you.

#319 Updated by Eric Faulhaber about 3 years ago

Igor Skornyakov wrote:

Can I continue using 1587b or should I create a new branch from 3821c?

You should not use 1587b. I would prefer you commit the changes to fix the remaining issues directly into 3821c, unless you feel the changes are especially risky or voluminous.

#320 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

Igor Skornyakov wrote:

Can I continue using 1587b or should I create a new branch from 3821c?

You should not use 1587b. I would prefer you commit the changes to fix the remaining issues directly into 3821c, unless you feel the changes are especially risky or voluminous.

I see, thank you. The remaining changes should be pretty small. So I will work with 3821c.

#321 Updated by Eric Faulhaber about 3 years ago

Igor Skornyakov wrote:

The only thing remaining from the planned ones, for now, is removing the CONTAINS operator expression uppercasing on conversion. For LATIN1 it affects only cases when this expression contains the LATIN1 "micro" character (µ). I understand that with 3821c this will cause a runtime exception, while with 1587b it will result in an incorrect result set (some records may be missed).

You are referring to the "search-expression" portion of the CONTAINS operation, correct? Are we doing this uppercasing in TRPL code?

#322 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

You are referring to the "search-expression" portion of the CONTAINS operation, correct? Are we doing this uppercasing in TRPL code?

Yes, I'm talking about "search expression". I've not yet found an exact place where it is done. There are two different cases - uppercasing the string literal and uppercasing a query parameter. I also want to remove the field from the sort argument since it is completely wrong.

#323 Updated by Igor Skornyakov about 3 years ago

I've just realized that the trick with uppercasing at the database side requires additional precautions. A PK/unique index violation can happen when a case-insensitive field with a word index on it contains words that differ only in case.
I've implemented the fix for this situation and am testing it with a large customer database. It will take some time.
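
A minimal illustration of the collision and one possible guard, assuming the word column participates in the unique key as sketched earlier; this is not necessarily the fix that was committed:

-- 'Word' and 'WORD' collapse to the same key once uppercased, so inserting both
-- for the same parent__id/list__index would violate the unique constraint
insert into ttwi__if1 (parent__id, list__index, word)
values (1, 0, upper('Word'))
on conflict do nothing;        -- PostgreSQL syntax; other dialects need their own guard

insert into ttwi__if1 (parent__id, list__index, word)
values (1, 0, upper('WORD'))   -- same key after uppercasing; silently skipped
on conflict do nothing;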

#324 Updated by Igor Skornyakov about 3 years ago

Igor Skornyakov wrote:

I've just realized that the trick with uppercasing at the database side requires additional precautions. A PK/unique index violation can happen when a case-insensitive field with a word index on it contains words that differ only in case.
I've implemented the fix for this situation and am testing it with a large customer database. It will take some time.

Fixed and tested.
Committed to 3821c rev. 11976

#325 Updated by Igor Skornyakov about 3 years ago

The 'Word indices are not supported' warning is suppressed in 3821c/11990.

#326 Updated by Igor Skornyakov about 3 years ago

The word tables' scripts are idempotent now (they can be run multiple times). This allows applying changes in constraints/indices/triggers without a full database re-import.
There is no need to use dialect-specific DDLs, since all databases supported by FWD now support the IF EXISTS clause in the DROP statement.
Committed to 3821c/11992.

#327 Updated by Igor Skornyakov about 3 years ago

Remaining work for the word indices support:
  1. Fix regressions introduced by word tables' support (if found).
  2. Remove uppercasing of the CONTAINS condition on conversion (#1587-316).
  3. Remove CONTAINS field from the generated sort parameter of the AdaptiveQuery.initialize() (#1587-299).
  4. Add 1688 warning on conversion (#1587-226).
  5. Implement performance tests.
  6. Implement 4GL-compatible implicit sorting imposed by CONTAINS.
  7. Implement word tables support for temp-tables.
  8. Implement word tables support for MS SQL Server.
  9. Implement CONTAINS support for denormalized extent fields.

I suggest addressing the subtasks listed above in this order, with the first one at the highest priority. I hope to finish the conversion-related ones today or over the weekend if no serious regressions are found.

#328 Updated by Eric Faulhaber about 3 years ago

  • Estimated time deleted (104.00)

One additional item, which I'll take as my responsibility, is to test how a word index plays into validation/flushing and to implement this behavior accordingly. Currently, word indices are skipped when setting up the BaseRecord bit sets which are used for tracking validation and flushing.

#329 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

One additional item, which I'll take as my responsibility, is to test how a word index plays into validation/flushing and to implement this behavior accordingly. Currently, word indices are skipped when setting up the BaseRecord bit sets which are used for tracking validation and flushing.

But the word tables are opaque to the application since they are updated by AFTER database triggers.

#330 Updated by Eric Faulhaber about 3 years ago

This isn't about the word tables. I have to determine whether a word index comes into play with DMO validation and flushing. If so, I need to ensure that validation (in the case of a unique word index, if such a thing is possible) and flushing of a new or updated record with a word index takes place at the appropriate moment, like we do with any other index.

#331 Updated by Igor Skornyakov about 3 years ago

Added a subtask to #1587-327 which I forgot to mention (Implement CONTAINS support for denormalized extent fields).

#332 Updated by Igor Skornyakov about 3 years ago

Uppercasing of the CONTAINS arguments at conversion time was removed in 3821c/11998.

#333 Updated by Igor Skornyakov about 3 years ago

I have problems with removing the CONTAINS field from the generated 'sort' parameter of the AdaptiveQuery.initialize() (#1587-299).
Can anybody advise me where to start looking?
Thank you.

#334 Updated by Igor Skornyakov about 3 years ago

Added runtime uppercasing of the CONTAINS UDF arguments (for temp-tables).
Committed to 3821c/11999.

#335 Updated by Ovidiu Maxiniuc about 3 years ago

Igor Skornyakov wrote:

I have problems with removing the CONTAINS field from the generated 'sort' parameter of the AdaptiveQuery.initialize() (#1587-299).
Can anybody advise me where to start looking?

The sort parameter is computed in annotations/index_selection.rules. Look for the orderBy variable (then, eventually compositeOrderBy). Be careful, since the queries can be nested, this string variable is scoped in recph dictionary, one for each nesting level.

#336 Updated by Igor Skornyakov about 3 years ago

Ovidiu Maxiniuc wrote:

The sort parameter is computed in annotations/index_selection.rules. Look for the orderBy variable (then, eventually compositeOrderBy). Be careful, since the queries can be nested, this string variable is scoped in recph dictionary, one for each nesting level.

Thank you! I've found these variables. Trying to understand the logic.

#337 Updated by Igor Skornyakov about 3 years ago

Suppressed the use of the word index field for sorting.
Committed to 3821c revision 12004.

#338 Updated by Eric Faulhaber about 3 years ago

Ovidiu Maxiniuc wrote:

Igor Skornyakov wrote:

I have problems with removing the CONTAINS field from the generated 'sort' parameter of the AdaptiveQuery.initialize() (#1587-299).
Can anybody advise me where to start looking?

The sort parameter is computed in annotations/index_selection.rules. Look for the orderBy variable (then, eventually compositeOrderBy). Be careful, since the queries can be nested, this string variable is scoped in recph dictionary, one for each nesting level.

I understand we need to correct the sort behavior for queries which use word indices, but I don't think the way to do this is by ignoring word indices in index_selection.rules, which is what the fix seems to be doing. This will cause a different, incorrect index to be chosen for many tables, which may guarantee the sort is wrong.

Ignoring a word index in index selection has downstream implications beyond sorting. For instance, the INDEX-INFORMATION attribute (to the extent it is implemented correctly today) will report the wrong information for any query which should have chosen a word index.

Furthermore, I don't think the correct fix is to simply disable sorting and leave it to the RDBMS. The 4GL most likely has deterministic behavior in this regard. We need to understand what that is (not just what it is not) and implement that behavior, if at all possible.

What is the implication of temporarily leaving the behavior the way it was before 3821c/12004? I realize the wrong index was being selected for the first example in #1587-299, but it seems that 3821c/12004 would now cause the wrong index to be selected for both examples. I suspect the existing error is more about the extent field than the word index. Is the conversion correct (before rev 12004) for both examples if the word-indexed field is not an extent field?

#339 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

I understand we need to correct the sort behavior for queries which use word indices, but I don't think the way to do this is by ignoring word indices in index_selection.rules, which is what the fix seems to be doing. This will cause a different, incorrect index to be chosen for many tables, which may guarantee the sort is wrong.

Ignoring a word index in index selection has downstream implications beyond sorting. For instance, the INDEX-INFORMATION attribute (to the extent it is implemented correctly today) will report the wrong information for any query which should have chosen a word index.

Furthermore, I don't think the correct fix is to simply disable sorting and leave it to the RDBMS. The 4GL most likely has deterministic behavior in this regard. We need to understand what that is (not just what it is not) and implement that behavior, if at all possible.

What is the implication of temporarily leaving the behavior the way it was before 3821c/12004? I realize the wrong index was being selected for the first example in #1587-299, but it seems that 3821c/12004 would now cause the wrong index to be selected for both examples. I suspect the existing error is more about the extent field than the word index. Is the conversion correct (before rev 12004) for both examples if the word-indexed field is not an extent field?

The main motivation for the changes was to deal with the situation where a word index on an extent field results in the generation of an incorrect SQL statement and an exception. If you believe that this is already fixed, then my change is not absolutely necessary. Please note, however, that as far as I understand:
  1. a word index is used for implicit sorting only in the presence of the CONTAINS operator for the corresponding field
  2. using a word index for implicit sorting does not mean using the corresponding field in the ORDER BY clause of the executed SQL statement
  3. in my tests after the changes the primary index was selected, so the behavior is still deterministic

The correct implicit sorting in the presence of the CONTAINS operator is not implemented yet, but at this moment I think that it should be done at runtime only. In any case, I do not see any reason to retain the word index field in the ORDER BY clause, since it simply doesn't make sense.

#340 Updated by Igor Skornyakov about 3 years ago

Here is the suggested structure of the test table for performance testing of the new CONTAINS support.
The test table definition is:

ADD TABLE "data" 
  AREA "Schema Area" 
  DUMP-NAME "data" 

ADD FIELD "pk" OF "data" AS integer 
  FORMAT "->,>>>,>>9" 
  INITIAL "0" 
  POSITION 2
  MAX-WIDTH 4
  ORDER 10
  MANDATORY

ADD FIELD "words" OF "data" AS character 
  FORMAT "x(1024)" 
  INITIAL "" 
  POSITION 3
  MAX-WIDTH 2048
  ORDER 20
  MANDATORY

ADD INDEX "pk-idx" ON "data" 
  AREA "Schema Area" 
  UNIQUE
  PRIMARY
  INDEX-FIELD "pk" ASCENDING 

ADD INDEX "words-idx" ON "data" 
  AREA "Schema Area" 
  WORD
  INDEX-FIELD "words" ASCENDING 

Consider a number N > 0. The values of the pk field are the numbers 0..2^N-1. The value of the words field for pk == n is defined as follows:
Let n = 2^k1 + 2^k2 + ... + 2^kl be the binary representation of n. Then the value of words is a space-separated list of the "words" String.valueOf(k[i]), i = 1..l.
This means that the word index is created for a field whose values run over all possible subsets of a set of N "words".
This allows creating queries with different (and easily predictable) result sets. The structure of the data is a good model of real data, if we ignore the performance implications of long words and of the duplicated sets of words that can be found in 'real' data. In some well-defined sense, it represents a "maximal" version of a collection of records with a given set of words.

With N = 20 we will have a table with more than a million records, which looks large enough to reason about the performance of the approach.
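
For illustration, here is a minimal Java sketch of a generator for this data set (the output file name and dump format are placeholders; the actual generator and import format may differ):

import java.io.PrintWriter;

/** Sketch: emit (pk, words) pairs where the words of record pk are the bit positions set in pk. */
public class SubsetDataGenerator
{
   public static void main(String[] args) throws Exception
   {
      int n = 20;                                    // N from the description above
      try (PrintWriter out = new PrintWriter("data.d"))
      {
         // start at 1 to skip the record whose words value would be empty
         for (int pk = 1; pk < (1 << n); pk++)
         {
            StringBuilder words = new StringBuilder();
            for (int k = 0; k < n; k++)
            {
               if ((pk & (1 << k)) != 0)             // bit k is set => word "k" is present
               {
                  if (words.length() > 0)
                  {
                     words.append(' ');
                  }
                  words.append(k);
               }
            }
            out.println(pk + " \"" + words + "\"");
         }
      }
   }
}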

#341 Updated by Eric Faulhaber about 3 years ago

There are two aspects of performance we need to understand and measure from this testing:

  • read (upon which the correct sorting behavior will have an impact)
  • insert/update

If I understand correctly, for N = 20, the minimum number of words in any given record using the proposed approach is 1 and the maximum number of words in a record is 20. Is that correct?

It seems like the proposed approach will stress the "raw" read aspect. However, the cost of sorting is unknown at this point, since we need to understand the legacy sorting behavior first before we fully understand how best to test it.

I'm not so sure this approach will stress the insert/update aspect, with no more than 20 words (in this example) to parse and organize into the words table per insert/update. I expect there are applications which will have much larger records than this.

One more thing to consider is that we've already encountered real world use of word indexed extent fields. This should be reflected in the testing as well.

#342 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

There are two aspects of performance we need to understand and measure from this testing:

  • read (upon which the correct sorting behavior will have an impact)
  • insert/update

If I understand correctly, for N = 20, the minimum number of words in any given record using the proposed approach is 1 and the maximum number of words in a record is 20. Is that correct?

Yes, this is correct.

It seems like the proposed approach will stress the "raw" read aspect. However, the cost of sorting is unknown at this point, since we need to understand the legacy sorting behavior first before we fully understand how best to test it.

I agree this requires additional analysis.

I'm not so sure this approach will stress the insert/update aspect, with no more than 20 words (in this example) to parse and organize into the words table per insert/update. I expect there are applications which will have much larger records than this.

I do not mean to say that the suggested approach provides a complete picture. It can be modified. Regarding insert/update: the update is more expensive, but its cost can be estimated just by appending a space to the field containing words for some subset of records.

One more thing to consider is that we've already encountered real world use of word indexed extent fields. This should be reflected in the testing as well.

I understand this. But I suggest starting with generated data because in this case, we will understand the structure of the data much better.

#343 Updated by Igor Skornyakov about 3 years ago

I've noticed that for a word index on a customized extent field, the value of the name attribute of the IndexComponent annotation is the custom name of the first component of the extent. I know how to fix it and have implemented the fix for the extent field of the word index. However, I think that if an index component of a non-word index can be an extent field, the situation will be the same. Is that OK?
Thank you.

#344 Updated by Eric Faulhaber about 3 years ago

Igor Skornyakov wrote:

I've noticed that for a word index on a customized extent field, the value of the name attribute of the IndexComponent annotation is the custom name of the first component of the extent. I know how to fix it and have implemented the fix for the extent field of the word index. However, I think that if an index component of a non-word index can be an extent field, the situation will be the same. Is that OK?
Thank you.

I'm having trouble understanding the question. Please provide an example.

#345 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

I'm having trouble understanding the question. Please provide an example.

I mean the following situation:
Consider the table definition:

ADD TABLE "ttwiabcdefghijklmnopqrstuvwxyxz" 
  AREA "Schema Area" 
  DUMP-NAME "ttwi" 

ADD FIELD "if1abcdefghijklmnopqrstuvwxyxz" OF "ttwiabcdefghijklmnopqrstuvwxyxz" AS character 
  FORMAT "x(128)" 
  INITIAL "" 
  POSITION 2
  MAX-WIDTH 256
  ORDER 10

ADD FIELD "if2abcdefghijklmnopqrstuvwxyxz" OF "ttwiabcdefghijklmnopqrstuvwxyxz" AS character 
  FORMAT "x(8)" 
  INITIAL "" 
  POSITION 3
  MAX-WIDTH 16
  ORDER 20
  CASE-SENSITIVE

ADD FIELD "f-ext" OF "ttwiabcdefghijklmnopqrstuvwxyxz" AS character 
  FORMAT "x(50)" 
  INITIAL "" 
  POSITION 4
  MAX-WIDTH 510
  EXTENT 5
  ORDER 30

  ADD INDEX "wi1" ON "ttwiabcdefghijklmnopqrstuvwxyxz" 
  AREA "Schema Area" 
  WORD
  INDEX-FIELD "if1abcdefghijklmnopqrstuvwxyxz" ASCENDING 

ADD INDEX "wi2" ON "ttwiabcdefghijklmnopqrstuvwxyxz" 
  AREA "Schema Area" 
  INACTIVE
  WORD
  INDEX-FIELD "if2abcdefghijklmnopqrstuvwxyxz" ASCENDING 

ADD INDEX "wiext" ON "ttwiabcdefghijklmnopqrstuvwxyxz" 
  AREA "Schema Area" 
  INACTIVE
  WORD
  INDEX-FIELD "f-ext" ASCENDING 

and the hints:

      <table name="ttwiabcdefghijklmnopqrstuvwxyxz"> 
         <custom-extent> 
            <field name="f-ext"> 
               <extent-field name = "fextFirst" label = "label of fextFirst text"/> 
               <extent-field name = "fextSecond" label = "label of fextSecond text"/> 
               <extent-field name = "fextThird" label = "label of fextThird text"/> 
               <extent-field name = "fextFourth" label = "label of fextFourth text"/> 
               <extent-field name = "fextFifth" label = "label of fextFifth text"/> 
            </field> 
         </custom-extent> 
      </table> 

Then the generated annotation is:

@Table(name = "ttwiabcdefghijklmnopqrstuvwxyxz", legacy = "ttwiabcdefghijklmnopqrstuvwxyxz")
@Indices(
{
   @Index(name = "ttwiabcdefghijklmnopqrstuvwxyxz_wi1", legacy = "wi1", word = true, components = 
   {
      @IndexComponent(name = "if1abcdefghijklmnopqrstuvwxyxz", legacy = "if1abcdefghijklmnopqrstuvwxyxz")
   }, wordtablename = "ttwiabcdefghijklmnopqrstuvwxyxz__if1abcdefghijklmnopqrstuvwxyxz"),
   @Index(name = "ttwiabcdefghijklmnopqrstuvwxyxz_wi2", legacy = "wi2", word = true, components = 
   {
      @IndexComponent(name = "if2abcdefghijklmnopqrstuvwxyxz", legacy = "if2abcdefghijklmnopqrstuvwxyxz")
   }, wordtablename = "ttwiabcdefghijklmnopqrstuvwxyxz__if2abcdefghijklmnopqrstuvwxyxz"),
   @Index(name = "ttwiabcdefghijklmnopqrstuvwxyxz_wiext", legacy = "wiext", word = true, components = 
   {
      @IndexComponent(name = "fextFirst", legacy = "f-ext")
   }, wordtablename = "ttwiabcdefghijklmnopqrstuvwxyxz__f_ext")
})

I'm talking about name = "fextFirst", legacy = "f-ext". The word table name was also based on the name of the first field of the extent, but I've fixed that earlier.
In fact, it seems that an extent field can be a component of a word index only, so my initial question doesn't make sense; sorry for the disturbance.

#346 Updated by Eric Faulhaber about 3 years ago

Igor Skornyakov wrote:

I'm talking about name = "fextFirst", legacy = "f-ext". The word table name was also based on the name of the first field of the extent, but I've fixed that earlier.
In fact, it seems that an extent field can be a component of a word index only, so my initial question doesn't make sense; sorry for the disturbance.

No problem. The annotation for the wiext index looks correct to me as is.

#347 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

No problem. The annotation for the wiext index looks correct to me as is.

Well, formally it looks confusing, but it doesn't matter since the word index is not a regular index.

#348 Updated by Igor Skornyakov about 3 years ago

I've noticed two things regarding custom extents.
  1. If the number of <extent-field> subnodes in the .hints file is less than the extent size, the warning WARNING: Field '<name>' has <N> hinted names, less than <M> in extent is emitted, but afterwards the conversion fails with an IndexOutOfBoundsException on extentHintField = customExtentList.get(denormCounter1 - 1) in p2o.xml.
  2. If there are no <extent-field> subnodes, then there is no map for the table in TableHints.customExtents at all.

Is this expected behavior?
I'm using TableHints.customExtents extensively for the word tables' support, since I need the actual size of the extent and the field names.
Should I start looking for another approach, or is it possible to populate TableHints.customExtents in the absence of the <extent-field> subnodes?
Thank you.

#349 Updated by Eric Faulhaber about 3 years ago

Igor Skornyakov wrote:

I've noticed two things regarding custom extents.
  1. If the number of <extent-field> subnodes in the .hints file is less than the extent size, the warning WARNING: Field '<name>' has <N> hinted names, less than <M> in extent is emitted, but afterwards the conversion fails with an IndexOutOfBoundsException on extentHintField = customExtentList.get(denormCounter1 - 1) in p2o.xml.
  2. If there are no <extent-field> subnodes, then there is no map for the table in TableHints.customExtents at all.

Is this expected behavior?
I'm using TableHints.customExtents extensively for the word tables' support, since I need the actual size of the extent and the field names.
Should I start looking for another approach, or is it possible to populate TableHints.customExtents in the absence of the <extent-field> subnodes?
Thank you.

I'm not sure if that is expected behavior. The supported hint syntax is documented in #2134-86. To the degree the implementation does not match this documentation, or error handling is not implemented, we can consider this a bug. I am not fully familiar with the internals of TableHints as it relates to custom extents, as I did not implement that feature. If some required information is missing, you can add it, but please note that making the syntax foolproof is beyond the scope of this task; please spend the minimum amount of time fixing the hint implementation that is needed to get to the end of this task.

#350 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

Igor Skornyakov wrote:

I've noticed two things regarding custom extents.
  1. If the number of <extent-field> subnodes in the .hints file is less than the extent size, the warning WARNING: Field '<name>' has <N> hinted names, less than <M> in extent is emitted, but afterwards the conversion fails with an IndexOutOfBoundsException on extentHintField = customExtentList.get(denormCounter1 - 1) in p2o.xml.
  2. If there are no <extent-field> subnodes, then there is no map for the table in TableHints.customExtents at all.

Is this expected behavior?
I'm using TableHints.customExtents extensively for the word tables' support, since I need the actual size of the extent and the field names.
Should I start looking for another approach, or is it possible to populate TableHints.customExtents in the absence of the <extent-field> subnodes?
Thank you.

I'm not sure if that is expected behavior. The supported hint syntax is documented in #2134-86. To the degree the implementation does not match this documentation, or error handling is not implemented, we can consider this a bug. I am not fully familiar with the internals of TableHints as it relates to custom extents, as I did not implement that feature. If some required information is missing, you can add it, but please note that making the syntax foolproof is beyond the scope of this task; please spend the minimum amount of time fixing the hint implementation that is needed to get to the end of this task.

I see, thank you. I believe that I'm very close to the finish.

#351 Updated by Igor Skornyakov about 3 years ago

I've resolved the issues described in #1587-349, implemented the generation of the word tables with indices/constraints, and import for the word indices on denormalized extent fields.
What remains is the generation of triggers and SQL rewriting. There should be a separate ON UPDATE trigger for each field in the denormalized extent. I've designed and tested them for the PostgreSQL dialect.
Consider the table

ADD TABLE "test-words" 
  AREA "Schema Area" 
  DUMP-NAME "twords" 

ADD FIELD "f-ext" OF "test-words" AS character 
  FORMAT "x(50)" 
  INITIAL "" 
  POSITION 4
  MAX-WIDTH 510
  EXTENT 5
  ORDER 30

ADD INDEX "wiext" ON "test-words" 
  AREA "Schema Area" 
  INACTIVE
  WORD
  INDEX-FIELD "f-ext" ASCENDING 

with default custom extent:

      <table name="test-words"> 
         <custom-extent/> 
      </table> 

Then the triggers' DDLs are (the ON UPDATE trigger is shown only for the first field of the extent):

create or replace function test_words__f_ext1__trg()
returns trigger
language plpgsql
as
$$
begin
    delete from test_words__f_ext where test_words__f_ext.parent__id = new.recid and test_words__f_ext.list__index = 1;
    insert into test_words__f_ext select * from words(new.recid, 1, new.f_ext1, true);
    return new;
end;
$$;

create trigger test_words__f_ext1__upd after
update of f_ext1
    on
    test_words for each row execute procedure test_words__f_ext1__trg();

create or replace function test_words__f_ext__trg()
returns trigger
language plpgsql
as
$$
begin
    delete from test_words__f_ext where test_words__f_ext.parent__id = new.recid;
    insert into test_words__f_ext select * from words(new.recid, 1, new.f_ext1, true);
    insert into test_words__f_ext select * from words(new.recid, 2, new.f_ext2, true);
    insert into test_words__f_ext select * from words(new.recid, 3, new.f_ext3, true);
    insert into test_words__f_ext select * from words(new.recid, 4, new.f_ext4, true);
    insert into test_words__f_ext select * from words(new.recid, 5, new.f_ext5, true);
    return new;
end;
$$;

create trigger test_words__f_ext__ins after
insert
    on
    test_words for each row execute procedure test_words__f_ext__trg();

For H2 it should be similar. I will add the generation and SQL-rewriting tomorrow.

#352 Updated by Igor Skornyakov about 3 years ago

As far as I can see, FWD currently has a dependency on Apache Velocity. Can I use it for DDL generation? This would make the code more compact, readable, and template-driven.
Thank you.

#353 Updated by Eric Faulhaber about 3 years ago

Please don't. If we were starting from scratch, we could consider it, but the changes you should be making to DDL generation at this point should be incremental. I don't want to add a framework dependency to this area just for word index support.

#354 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

Please don't. If we were starting from scratch, we could consider it, but the changes you should be making to DDL generation at this point should be incremental. I don't want to add a framework dependency to this area just for word index support.

OK. Thank you.

#355 Updated by Igor Skornyakov about 3 years ago

Added word tables' support for custom extents (conversion and database import).
Committed to 3821c/12031

#356 Updated by Igor Skornyakov about 3 years ago

Finished with word tables support for denormalized extent fields. Tested with a large customer app and with a test app for H2/PostgreSQL, both for normalized and denormalized extents.
Committed to 3821c/12040.

#357 Updated by Igor Skornyakov about 3 years ago

For the performance testing, I've used a modified version of the approach described in #1587-340. Instead of generating the "words" as all subsets of 1..n, I used a union of non-empty subsets of 1..k, where k = 1..n, with different one-letter prefixes.
The beginning of the generated table looks like the following:

1, "a0" 
1, "b0" 
2, "b1" 
3, "b0 b1" 
1, "c0" 
2, "c1" 
3, "c0 c1" 
4, "c2" 
5, "c0 c2" 
6, "c1 c2" 
7, "c0 c1 c2" 
1, "d0" 
2, "d1" 
3, "d0 d1" 
4, "d2" 
5, "d0 d2" 
6, "d1 d2" 
7, "d0 d1 d2" 
8, "d3" 
9, "d0 d3" 
10, "d1 d3" 
11, "d0 d1 d3" 
12, "d2 d3" 
13, "d0 d2 d3" 
14, "d1 d2 d3" 
15, "d0 d1 d2 d3" 

The table definition is:

ADD TABLE "words" 
  AREA "Schema Area" 
  DUMP-NAME "words" 

ADD FIELD "recno" OF "words" AS integer 
  FORMAT "->,>>>,>>9" 
  INITIAL "0" 
  POSITION 2
  MAX-WIDTH 4
  ORDER 10
  MANDATORY

ADD FIELD "words" OF "words" AS character 
  FORMAT "x(1000)" 
  INITIAL "" 
  POSITION 3
  MAX-WIDTH 200
  ORDER 20
  MANDATORY

ADD INDEX "pk" ON "words" 
  AREA "Schema Area" 
  PRIMARY
  INDEX-FIELD "recno" ASCENDING 

ADD INDEX "words" ON "words" 
  AREA "Schema Area" 
  WORD
  INDEX-FIELD "words" ASCENDING 

The generated table contains 1,048,555 records with prefixes 'a'..'s'

The test program is:

OUTPUT TO words.txt.

RUN test ("a0").
RUN test ("b0").
RUN test ("c0").
RUN test ("g0").
RUN test ("g0 & g1").
RUN test ("h0").
RUN test ("h0 & h1").
RUN test ("h0 & h1 & h2").
RUN test ("r0").
RUN test ("r0 & r1").
RUN test ("r0 & r1 & r2").
RUN test ("s0").
RUN test ("s0 & s1").
RUN test ("s0 & s1 & s2").
RUN test ("t0").
RUN test ("t0 & t1").
RUN test ("t0 & t1 & t2").

OUTPUT close.

PROCEDURE test:
    DEF INPUT PARAM expr AS CHAR.
    define var dt1 as datetime no-undo.
    define var dt2 as datetime no-undo.
    def var n as int no-undo.

    dt1 = now.
    n = 0.
    for each words where (words contains expr):
        n = n + 1.
    end.
    dt2 = now.
    message "[" + expr + "]:" "records:" n "elapsed time:" interval( dt2, dt1, "milliseconds").

END.

The code was changed to allow switching between UDF-based and word-table-based support of the CONTAINS operator.
The test results are:
expr records elapsed time (word tables) elapsed time (UDF) elapsed time (4GL file database)
'a0' 1 549 5862 0
'b0' 2 496 5641 0
'c0' 4 473 5604 0
'g0' 64 777 5582 0
'g0 & g1' 32 610 5857 0
'h0' 128 773 5738 1
'h0 & h1' 64 620 5772 0
'h0 & h1 & h2' 32 605 5775 1
'r0' 131072 13476 17352 481
'r0 & r1' 65536 8676 12719 554
'r0 & r1 & r2' 32768 5966 11088 501
's0' 262144 24191 27298 994
's0 & s1' 131072 15831 17503 2132
's0 & s1 & s2' 65536 10600 14324 983
't0' 0 363 5624 0
't0 & t1' 0 204 5711 0
't0 & t1 & t2' 0 201 5868 0

As one can see, for small result sets the queries using word tables are ~10 times faster than with the UDF. With large result sets the difference is much less noticeable. It seems that FWD does not fetch large result sets very efficiently.

For completeness let's look at the PostgreSQL execution plan for expr == 's0'.
The corresponding re-written SQL query is:

select  words__imp0_.recid as col_0_0_ 
from
    words words__imp0_ 
where
    (words__imp0_.recid in (
     select distinct recid from words t
        join words__words as w1 on (w1.parent__id = t.recid and (w1.word = UPPER('s0')))
))
order by
    words__imp0_.recno asc, words__imp0_.recid asc

The execution plan:

Node Type    Entity    Cost    Rows    Time    Condition
Sort    [NULL]    259360.58 - 260003.10    262144    1065.260    [NULL]
    Hash Join    [NULL]    197794.13 - 231874.14    262144    986.384    (words__imp0_.recid = t.recid)
        Seq Scan    words    0.00 - 20083.55    1048555    86.163    [NULL]
    Hash    [NULL]    193577.54 - 193577.54    262144    696.488    [NULL]
        Unique    [NULL]    189722.43 - 191007.47    262144    670.711    [NULL]
            Sort    [NULL]    189722.43 - 190364.95    262144    648.202    [NULL]
                Hash Join    [NULL]    131082.47 - 163114.49    262144    572.115    (t.recid = w1.parent__id)
                    Seq Scan    words    0.00 - 20083.55    1048555    79.537    [NULL]
                Hash    [NULL]    126865.89 - 126865.89    262144    284.135    [NULL]
                    Gather    [NULL]    1000.00 - 126865.89    262144    255.671    [NULL]
                        Seq Scan    words__words    0.00 - 100165.19    87381    244.927    (word = 'S0'::text)

Of course with UDF we always have a plain sequential scan.

#358 Updated by Eric Faulhaber about 3 years ago

Igor, while the improvement over the UDF approach is considerable in some cases, the new approach is still not as performant as one would like. The query plan PostgreSQL has chosen seems quite inefficient. Can you think of any change to the query itself or to the indices on the affected tables which could allow a more efficient query plan?

What are the indices currently defined on all the involved tables?

#359 Updated by Eric Faulhaber about 3 years ago

Igor, is this test code (including the code which generates the test data) checked in?

#360 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

Igor, while the improvement over the UDF approach is considerable in some cases, the new approach is still not as performant as one would like. The query plan PostgreSQL has chosen seems quite inefficient. Can you think of any change to the query itself or to the indices on the affected tables which could allow a more efficient query plan?

Yes, this is what I plan to work on today.

What are the indices currently defined on all the involved tables?

alter table words__words
   drop constraint if exists pk__words__words;

alter table words__words
   add constraint pk__words__words
   primary key (parent__id, word);

drop index if exists fkidx__words__words;

create index fkidx__words__words on words__words (parent__id);

alter table words__words
   drop constraint if exists fk__words__words;

alter table words__words
   add constraint fk__words__words
   foreign key (parent__id)
   references words
   on delete cascade
   on update cascade;

I plan to add an index on the word field and see how it will help.

#361 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

Igor, is this test code (including the code which generates the test data) checked in?

Not yet. Will be done.
BTW, I see no place for Java code in the testcases project. Can I add a java folder and put the code which generates the test data into it?
Thank you.

#362 Updated by Igor Skornyakov about 3 years ago

Word tables' tests committed to sftp://xfer.goldencode.com/opt/testcases/ revision 1008.
Tests (including Java project for generating test data) are in the words subfolder. The test database is fwd1.

#363 Updated by Igor Skornyakov about 3 years ago

Adding an index on the 'word' field in the word tables helps. The times are (see #1587-357):

expr records elapsed time
"a0" 1 160
"b0" 2 170
"c0" 4 180
"g0" 64 275
"g0 & g1" 32 10
"h0" 128 274
"h0 & h1" 64 13
"h0 & h1 & h2" 32 10
"r0" 131072 11582
"r0 & r1" 65536 6779
"r0 & r1 & r2" 32768 4266
"s0" 262144 22520
"s0 & s1" 131072 13339
"s0 & s1 & s2" 65536 8997
"t0" 0 229
"t0 & t1" 0 1
"t0 & t1 & t2" 0 2

The execution plan for a simple expression is:

Gather Merge    [NULL]    162210.43 - 187229.01    262144    861.296    [NULL]
    Sort    [NULL]    161210.41 - 161478.44    87381    794.292    [NULL]
        Hash Join    [NULL]    130031.68 - 150418.51    87381    758.375    (words__imp0_.recid = t.recid)
            Seq Scan    words    0.00 - 13965.98    349518    31.578    [NULL]
        Hash    [NULL]    125809.23 - 125809.23    262144    637.009    [NULL]
            Unique    [NULL]    121949.49 - 123236.07    262144    612.167    [NULL]
                Sort    [NULL]    121949.49 - 122592.78    262144    590.802    [NULL]
                    Hash Join    [NULL]    63271.53 - 95304.55    262144    514.111    (t.recid = w1.parent__id)
                        Seq Scan    words    0.00 - 20082.55    1048555    86.334    [NULL]
                    Hash    [NULL]    59049.08 - 59049.08    262144    191.774    [NULL]
                        Bitmap Heap Scan    words__words    4818.63 - 59049.08    262144    156.787    [NULL]
                            Bitmap Index Scan    words__words_word_idx    0.00 - 4754.31    262144    21.142    (word = 'S0'::text)

#364 Updated by Igor Skornyakov about 3 years ago

With a logically equivalent but different re-written SQL statement using WITH clause:

with wr as (
     select distinct recid from words t
        join words__words as w1 on (w1.parent__id = t.recid and (w1.word = UPPER('s0')))
)
select  words__imp0_.recid as col_0_0_ 
from
    words words__imp0_ join wr on words__imp0_.recid = wr.recid 
order by
    words__imp0_.recno asc, words__imp0_.recid asc

The execution plan is:

Sort    [NULL]    202022.77 - 202666.06    262144    925.341    [NULL]
    Unique    [NULL]    121949.49 - 123236.07    262144    514.441    [NULL]
        Sort    [NULL]    121949.49 - 122592.78    262144    492.054    [NULL]
        Hash Join    [NULL]    63271.53 - 95304.55    262144    419.631    (t.recid = w1.parent__id)
            Seq Scan    words    0.00 - 20082.55    1048555    83.046    [NULL]
        Hash    [NULL]    59049.08 - 59049.08    262144    141.667    [NULL]
            Bitmap Heap Scan    words__words    4818.63 - 59049.08    262144    113.924    [NULL]
                Bitmap Index Scan    words__words_word_idx    0.00 - 4754.31    262144    12.942    (word = 'S0'::text)
Hash Join    [NULL]    38309.49 - 51263.26    262144    841.553    (wr.recid = words__imp0_.recid)
    CTE Scan    wr    0.00 - 5146.32    262144    542.592    [NULL]
    Hash    [NULL]    20082.55 - 20082.55    1048555    193.456    [NULL]
        Seq Scan    words    0.00 - 20082.55    1048555    89.333    [NULL]

#365 Updated by Igor Skornyakov about 3 years ago

I've added two things to the word tables support.
  1. The re-writing of SQL queries with CONTAINS can be disabled at runtime using the -DP2JPostgreSQLDialect.useUdf4Contains=true JVM parameter.
  2. An additional index on the 'word' field is generated. According to my tests, this makes queries with CONTAINS significantly faster (at least in some situations).
Committed to 3821c rev. 12063.

At this moment I'm looking at how to change the SQL re-writing to use the WITH clause (see #1587-364).
This is tricky, since it not only adds new text at the beginning of the query but also affects the already generated FROM part produced by the code processing the WHERE part.

#366 Updated by Eric Faulhaber about 3 years ago

Igor Skornyakov wrote:

At this moment I'm looking at how to change the SQL re-writing to use the WITH clause (see #1587-364).
This is tricky, since it not only adds new text at the beginning of the query but also affects the already generated FROM part produced by the code processing the WHERE part.

Do you see an advantage to this form of the SQL? It seemed to produce a less efficient query plan, according to your previous post.

#367 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

Igor Skornyakov wrote:

At this moment I'm looking at how to change the SQL re-writing to use the WITH clause (see #1587-364).
This is tricky, since it not only adds new text at the beginning of the query but also affects the already generated FROM part produced by the code processing the WHERE part.

Do you see an advantage to this form of the SQL? It seemed to produce a less efficient query plan, according to your previous post.

I cannot say that the execution plan is less efficient. Please note also that, apart from the sorting step (which should be changed anyway), it is a little faster.
Another thing is that, based on my experience, using a WITH clause often works as a hint to the query optimizer for more complicated queries.
However, if you think that it doesn't make much sense, I can postpone it until a more correct implicit sorting is implemented and tested.

#368 Updated by Eric Faulhaber about 3 years ago

Yes, I was looking at the sort part of the plan, but you are right that the parts below do look somewhat faster. Please continue.

#369 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

Yes, I was looking at the sort part of the plan, but you are right that the parts below do look somewhat faster. Please continue.

OK. Thank you.

#370 Updated by Eric Faulhaber about 3 years ago

Igor, can't the first form of the SQL be simplified as follows?

From:

select  words__imp0_.recid as col_0_0_ 
from
    words words__imp0_ 
where
    (words__imp0_.recid in (
     select distinct recid from words t
        join words__words as w1 on (w1.parent__id = t.recid and (w1.word = UPPER('s0')))
))
order by
    words__imp0_.recno asc, words__imp0_.recid asc

to:

select  words__imp0_.recid as col_0_0_ 
from
    words words__imp0_ 
where
    (words__imp0_.recid in (
     select distinct parent__id from words__words w1 where (w1.word = UPPER('s0'))
))
order by
    words__imp0_.recno asc, words__imp0_.recid asc

In other words, why is the join within the subquery needed?

(LE: I forgot the from clause at first. Added that in an edit.)

#371 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

Igor, can't the first form of the SQL be simplified as follows?
In other words, why is the join within the subquery needed?

When the CNF of the expression contains only one AND term such a simplification is possible. However, in more complicated cases it is not possible.
For example for expr == "s0 & s1":

select  words__imp0_.recid as col_0_0_ 
from
    words words__imp0_ 
where
    (words__imp0_.recid in (
     select distinct recid from words t
        join words__words as w1 on (w1.parent__id = t.recid and (w1.word = UPPER('s0')))
        join words__words as w2 on (w2.parent__id = t.recid and (w2.word = UPPER('s1')))
))
order by
    words__imp0_.recno asc, words__imp0_.recid asc

It is of course possible to implement the simplified re-writing in the special case you've mentioned, but at the cost of more complicated code.

#372 Updated by Eric Faulhaber about 3 years ago

What is the SQL for the expression "s0 | s1"?

For "(s0 | s1) & s5*"?

(LE: fixed syntax error in original post)

#373 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

What is the SQL for the expression "s0 | s1"?

For "(s0 | s1) & s5*"?

(LE: fixed syntax error in original post)

For the second case (CNF contains two AND terms):

select  words__imp0_.recid as col_0_0_ 
from
    words words__imp0_ 
where
    (words__imp0_.recid in (
     select distinct recid from words t
        join words__words as w1 on (w1.parent__id = t.recid and (w1.word = UPPER('s0') OR w1.word = UPPER('s1')))
        join words__words as w2 on (w2.parent__id = t.recid and (w2.word LIKE UPPER('s5%')))
))
order by
    words__imp0_.recno asc, words__imp0_.recid asc

For the first case, the simplification you've mentioned is applicable.

(Typo in the original post is fixed).

#374 Updated by Eric Faulhaber about 3 years ago

Thank you.

Are these two statements functionally equivalent?

1)

select  words__imp0_.recid as col_0_0_ 
from
    words words__imp0_ 
where
    (words__imp0_.recid in (
     select distinct recid from words t
        join words__words as w1 on (w1.parent__id = t.recid and (w1.word = UPPER('s0') OR w1.word = UPPER('s1')))
        join words__words as w2 on (w2.parent__id = t.recid and (w2.word LIKE UPPER('s5%')))
))
order by
    words__imp0_.recno asc, words__imp0_.recid asc

2)

select  words__imp0_.recid as col_0_0_ 
from
    words words__imp0_ 
where
    (words__imp0_.recid in (
     select distinct w1.recid from words__words w1
     where w1.parent__id = words__imp0_.recid
     and ((w1.word = UPPER('s0') OR w1.word = UPPER('s1')) AND (w1.word LIKE UPPER('s5%)))
))
order by
    words__imp0_.recno asc, words__imp0_.recid asc

If they are functionally equivalent, the second may be a naive implementation. However, I'm curious to know what PostgreSQL does with it...what do the two query plans look like?

#375 Updated by Eric Faulhaber about 3 years ago

As I re-read that last post, I'm not sure my where clause makes sense in that second query...rethinking...

#376 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

Thank you.

Are these two statements functionally equivalent?

1)

[...]

2)

[...]

If they are functionally equivalent, the second may be a naive implementation. However, I'm curious to know what PostgreSQL does with it...what do the two query plans look like?

The second query is syntactically incorrect, but I think I understand what you mean. I think that the second query will return an empty set.

The query plan for the first one is:

Sort    [NULL]    208497.13 - 208556.98    196608    1761.439    [NULL]
    Hash Join    [NULL]    183918.66 - 206755.68    196608    1695.132    (words__imp0_.recid = t.recid)
        Seq Scan    words    0.00 - 20084.55    1048555    65.789    [NULL]
    Hash    [NULL]    183619.38 - 183619.38    196608    1432.966    [NULL]
        Aggregate    [NULL]    183140.54 - 183379.96    196608    1409.729    [NULL]
            Gather    [NULL]    42318.90 - 183080.69    262144    1315.005    [NULL]
                Nested Loop    [NULL]    41318.90 - 179686.49    87381    1282.814    [NULL]
                    Hash Join    [NULL]    41318.46 - 98673.51    174763    438.031    (w1.parent__id = t.recid)
                        Bitmap Heap Scan    words__words    4030.97 - 56371.08    174763    81.685    [NULL]
                            BitmapOr    [NULL]    4030.97 - 4030.97    0    26.126    [NULL]
                                Bitmap Index Scan    idx__words__words    0.00 - 2121.58    262144    14.738    (word = 'S0'::text)
                                Bitmap Index Scan    idx__words__words    0.00 - 1803.82    262144    11.386    (word = 'S1'::text)
                    Hash    [NULL]    20084.55 - 20084.55    1048555    204.799    [NULL]
                        Seq Scan    words    0.00 - 20084.55    1048555    96.351    [NULL]
                    Index Scan    words__words    0.43 - 0.91    0    0.005    (parent__id = t.recid)

#377 Updated by Eric Faulhaber about 3 years ago

I'm trying to figure out whether there is a way to express these search conditions without the hash joins (and sometimes nested hash joins). Perhaps that is the fastest way to execute these queries, but I wonder what PostgreSQL can do with alternate expressions.

A separate thought: what if we remove the distinct keyword? Unless I'm reading the plan wrong, the "Aggregate" part seems to be adding significant expense, and the in is a set operation, so it should eliminate duplicate matches. This may just transfer the cost from "Aggregate" to the next level up (or it may aggregate implicitly anyway, because of the in operator), but I want to investigate anything that might change the plan at this point.

#378 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

A separate thought: what if we remove the distinct keyword? Unless I'm reading the plan wrong, the "Aggregate" part seems to be adding significant expense, and the in is a set operation, so it should eliminate duplicate matches. This may just transfer the cost from "Aggregate" to the next level up (or it may aggregate implicitly anyway, because of the in operator), but I want to investigate anything that might change the plan at this point.

Well, I understand that IN (SELECT ...) and IN (SELECT DISTINCT ...) are equivalent. I expect that the IN (SELECT DISTINCT ...) can be a little more efficient, but I may be wrong.

#379 Updated by Igor Skornyakov about 3 years ago

I've noticed a strange thing.
Consider the following 4GL code:

for each ttwiabcdefghijklmnopqrstuvwxyxz where (if1abcdefghijklmnopqrstuvwxyxz contains expr)  or  (if1abcdefghijklmnopqrstuvwxyxz contains 'big & elephant | small & rabbit'):

It is converted to

            query0.initialize(ttwiabcdefghijklmnopqrstuvwxyxz, "contains(ttwiabcdefghijklmnopqrstuvwxyxz.if1abcdefghijklmnopqrstuvwxyxz, ?) or contains(ttwiabcdefghijklmnopqrstuvwxyxz.if1abcdefghijklmnopqrstuvwxyxz, 'big & elephant | small & rabbit')", null, "ttwiabcdefghijklmnopqrstuvwxyxz.recid asc", new Object[]
            {
               expr
            });

However, the FQAst instance which is an argument of the FqlToSqlConverter.processSelect(root) call is:

select statement [SELECT_STMT]:null @0:0
   select [SELECT]:null @1:1
      ttwiabcdefghijklmnopqrstuvwxyxz [ALIAS]:null @1:8
         recid [PROPERTY]:null @0:0
   from [FROM]:null @1:46
      Ttwiabcdefghijklmnopqrstuvwxyxz__Impl__ [DMO]:null @1:51
      ttwiabcdefghijklmnopqrstuvwxyxz [ALIAS]:null @1:94
   where [WHERE]:null @1:126
      or [OR]:null @1:217
         contains [FUNCTION]:null @1:133
            rtrim [FUNCTION]:null @1:142
               ttwiabcdefghijklmnopqrstuvwxyxz [ALIAS]:null @1:148
                  if1abcdefghijklmnopqrstuvwxyxz [PROPERTY]:null @0:0
            ?0 [POSITIONAL]:null @1:213
                  (index=0)
         contains [FUNCTION]:null @1:220
            rtrim [FUNCTION]:null @1:229
               ttwiabcdefghijklmnopqrstuvwxyxz [ALIAS]:null @1:235
                  if1abcdefghijklmnopqrstuvwxyxz [PROPERTY]:null @0:0
            'big & elephant | small & rabbit' [STRING]:null @1:300
   order [ORDER]:null @1:337
      ttwiabcdefghijklmnopqrstuvwxyxz [ALIAS]:null @1:346
         recid [PROPERTY]:null @0:0
      asc [ASC]:null @1:384

Where does the RTRIM node come from, and for what reason? Can I expect that the contains argument will always be wrapped with rtrim?
Thank you.

#380 Updated by Eric Faulhaber about 3 years ago

The RTRIM is injected for text property references by the HQLPreprocessor. It is needed for normal text property equality and range matches because Progress ignores trailing white space when comparing text values within the WHERE clause. You will notice all the PostgreSQL indices on (non word indexed) text columns are functional (upper(rtrim(<column_name>)) for case-insensitive columns or rtrim(<column_name>) for case-sensitive columns). However, I suppose this is not necessary for the CONTAINS operator, since AFAIK you already trim the whitespace when parsing the text data into the words table.

#381 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

The RTRIM is injected for text property references by the HQLPreprocessor. It is needed for normal text property equality and range matches because Progress ignores trailing white space when comparing text values within the WHERE clause. You will notice all the PostgreSQL indices on (non word indexed) text columns are functional (upper(rtrim(<column_name>)) for case-insensitive columns or rtrim(<column_name>) for case-sensitive columns). However, I suppose this is not necessary for the CONTAINS operator, since AFAIK you already trim the whitespace when parsing the text data into the words table.

I see, thank you. Actually, it was not important in my previous code since I ignored all function calls around the contains argument. I will do the same with the new version, but it has to be done in a slightly different way.

#382 Updated by Eric Faulhaber about 3 years ago

Well, it is extra work to add it in HQLPreprocessor and I imagine there probably is some work to ignore it when generating the SQL, so we probably should not be adding it in the first place, in the CONTAINS case. Please leave a TODO in the code for this.

#383 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

Well, it is extra work to add it in HQLPreprocessor and I imagine there probably is some work to ignore it when generating the SQL, so we probably should not be adding it in the first place, in the CONTAINS case. Please leave a TODO in the code for this.

OK. Thank you.

#384 Updated by Igor Skornyakov about 3 years ago

Added SQL re-writing using common table expressions (WITH clause).
Committed to 3821c/12077.

For PostgreSQL, it is possible to switch between the previous re-writing mode and re-writing with CTE. This can be done using the server JVM argument -DP2JPostgreSQLDialect.useCte4Contains=true|false. For H2 only CTE OFF mode is enabled.
At this moment the default mode is CTE OFF. After additional testing, I will make the CTE ON mode default both for PostgreSQL and H2.

#385 Updated by Igor Skornyakov about 3 years ago

I've noticed that queries with CONTAINS run faster with the CTE when the expression is 'simple' (its CNF contains only one AND term), but with more complicated expressions this is not the case.
Based on this I've implemented a 'mixed' mode in which the CTE is used only for 'simple' expressions, while in other cases an inline sub-query is used.
Here are the results of performance testing (see #1587-357) in different modes. The last four columns are elapsed times in ms.

expression records UDF ON CTE OFF CTE ON/MIX OFF MIX
'a0' 1 5864 229 169 140
'b0' 2 5631 142 80 68
'c0' 4 5587 144 81 71
'g0' 64 5794 266 134 123
'g0 & g1' 32 6116 17 1989 15
'h0' 128 5848 282 131 135
'h0 & h1' 64 5771 22 3732 24
'h0 & h1 & h2' 32 5928 15 16 14
'r0' 131072 16686 11846 12203 11750
'r0 & r1' 65536 12205 6728 6559 6236
'r0 & r1 & r2' 32768 11018 4069 8376 3937
's0' 262144 26363 22042 23237 22969
's0 & s1' 131072 16808 13173 13453 12914
's0 & s1 & s2' 65536 13408 9395 8773 9957
't0' 0 5698 239 252 308
't0 & t1' 0 5782 1 248 2
't0 & t1 & t2' 0 5903 2 2 2

Now the 'mixed' mode is the default one. For PostgreSQL the mode can be changed using server JVM options: P2JPostgreSQLDialect.useUdf4Contains - switch to UDF mode, P2JPostgreSQLDialect.useCte4Contains - switch to 'pure' CTE mode, P2JPostgreSQLDialect.useMixedMode4Contains - switch to 'mixed' CTE mode.

For completeness, see also the results of the performance testing with H2 in the default 'mixed' mode:

expr records elapsed time
'a0' 1 384
'b0' 2 7
'c0' 4 7
'g0' 64 20
'g0 & g1' 32 16
'h0' 128 32
'h0 & h1' 64 18
'h0 & h1 & h2' 32 8
'r0' 131072 5671
'r0 & r1' 65536 3122
'r0 & r1 & r2' 32768 2550
's0' 262144 19994
's0 & s1' 131072 6186
's0 & s1 & s2' 65536 5410
't0' 0 2
't0 & t1' 0 1
't0 & t1 & t2' 0 1

The 'smoke' test with the most recent build of a large customer application passed OK.

Committed to 3821c/12079.
Please review.
Thank you.

#386 Updated by Igor Skornyakov about 3 years ago

I understand that the order of terms in the CONTAINS expression is important for implicit sorting. So I'm going to re-order the terms in the CNF in the following way (a sketch of the re-ordering is given at the end of this note).
  • The order of terms in the original expression imposes an ordering on the set of its terms.
  • We re-order the OR terms in every AND term of the CNF according to this order. This imposes a lexicographic ordering on the set of AND terms.
  • We re-order the AND terms according to the lexicographic ordering described above.

I think that, to get a 4GL-compliant natural ordering, the CTE:

     select distinct recid from words t
        join words__words as w1 on (w1.parent__id = t.recid and (w1.word = UPPER(?) OR w1.word = UPPER(?)))
        join words__words as w2 on (w2.parent__id = t.recid and (w2.word ?'))

It should be re-written like this:

     select UPPER(w1.word), w2.word, recid from words t
        join words__words as w1 on (w1.parent__id = t.recid and (w1.word = UPPER(?) OR w1.word = UPPER(?)))
        join words__words as w2 on (w2.parent__id = t.recid and (w2.word ?'))
     order by 1,2,3

The order of terms in the CTE is supposed to be based on the order of terms in the CNF, re-ordered as described in this note.

More testing is required to ensure that the described implicit order is indeed 4GL compliant.
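
For illustration, here is a minimal Java sketch of the re-ordering described above (the CNF representation and the names are simplified placeholders, not the actual FWD data structures):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch: re-order a CNF (a list of AND terms, each a list of OR-ed word terms)
 *  according to the order in which the words appear in the original expression. */
public class CnfOrderSketch
{
   public static List<List<String>> reorder(List<List<String>> cnf, List<String> originalOrder)
   {
      // rank of every word term, taken from its position in the original expression
      Map<String, Integer> rank = new HashMap<>();
      for (int i = 0; i < originalOrder.size(); i++)
      {
         rank.putIfAbsent(originalOrder.get(i), i);
      }
      Comparator<String> byRank =
         Comparator.comparingInt((String w) -> rank.getOrDefault(w, Integer.MAX_VALUE));

      // 1. re-order the OR terms inside every AND term
      List<List<String>> result = new ArrayList<>();
      for (List<String> andTerm : cnf)
      {
         List<String> sorted = new ArrayList<>(andTerm);
         sorted.sort(byRank);
         result.add(sorted);
      }

      // 2. re-order the AND terms lexicographically, comparing elements by the same rank
      result.sort((a, b) ->
      {
         int len = Math.min(a.size(), b.size());
         for (int i = 0; i < len; i++)
         {
            int cmp = byRank.compare(a.get(i), b.get(i));
            if (cmp != 0)
            {
               return cmp;
            }
         }
         return Integer.compare(a.size(), b.size());
      });
      return result;
   }

   public static void main(String[] args)
   {
      // a CNF (hypothetically) produced for 'e3 & (e1 | e0)' with the AND terms out of order;
      // the original word order is e3, e1, e0
      List<List<String>> cnf = List.of(List.of("e0", "e1"), List.of("e3"));
      System.out.println(reorder(cnf, List.of("e3", "e1", "e0")));    // prints [[e3], [e1, e0]]
   }
}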

#387 Updated by Igor Skornyakov about 3 years ago

Added CNF ordering (see #1587-386).
Committed to 3821c/12089

#388 Updated by Igor Skornyakov about 3 years ago

The problem with

     select UPPER(w1.word), w2.word, recid from words t
        join words__words as w1 on (w1.parent__id = t.recid and (w1.word = UPPER(?) OR w1.word = UPPER(?)))
        join words__words as w2 on (w2.parent__id = t.recid and (w2.word ?'))
     order by 1,2,3

is that recid values in the result set can be duplicated. I'm thinking about how to modify it.

#389 Updated by Igor Skornyakov about 3 years ago

I've created the following test to analyze the implicit sorting imposed by CONTAINS (see testcases/words/words-order.p) in sftp://xfer.goldencode.com/opt/testcases/

OUTPUT TO order.out.
RUN test('(e0 | e2) & (e1 | e3)').
RUN test('(e1 | e3) & (e0 | e2)').
RUN test('(f0 | f2 | f4) & (f1 | f3 | f5)').

OUTPUT CLOSE.

PROCEDURE test:
    DEF INPUT PARAM expr AS CHAR.
    def var n as int no-undo.
    message expr.
    n = 0.
    message '| #  | recno| words |'.
    for each words where (words contains expr):
        n = n + 1.
        message '|' n '|' recno '|' words '|'.
    end.
END.  

The results are:

'(e0 | e2) & (e1 | e3)'
# recno words
1 31 e0 e1 e2 e3 e4
2 3 e0 e1
3 7 e0 e1 e2
4 9 e0 e3
5 11 e0 e1 e3
6 13 e0 e2 e3
7 15 e0 e1 e2 e3
8 19 e0 e1 e4
9 23 e0 e1 e2 e4
10 25 e0 e3 e4
11 27 e0 e1 e3 e4
12 29 e0 e2 e3 e4
13 30 e1 e2 e3 e4
14 6 e1 e2
15 12 e2 e3
16 14 e1 e2 e3
17 22 e1 e2 e4
18 28 e2 e3 e4
'(e1 | e3) & (e0 | e2)'
# recno words
1 30 e1 e2 e3 e4
2 31 e0 e1 e2 e3 e4
3 3 e0 e1
4 6 e1 e2
5 7 e0 e1 e2
6 11 e0 e1 e3
7 14 e1 e2 e3
8 15 e0 e1 e2 e3
9 19 e0 e1 e4
10 22 e1 e2 e4
11 23 e0 e1 e2 e4
12 27 e0 e1 e3 e4
13 9 e0 e3
14 12 e2 e3
15 13 e0 e2 e3
16 25 e0 e3 e4
17 28 e2 e3 e4
18 29 e0 e2 e3 e4
'(f0 | f2 | f4) & (f1 | f3 | f5)'
# recno words
1 31 f0 f1 f2 f3 f4
2 33 f0 f5
3 35 f0 f1 f5
4 37 f0 f2 f5
5 39 f0 f1 f2 f5
6 41 f0 f3 f5
7 43 f0 f1 f3 f5
8 45 f0 f2 f3 f5
9 47 f0 f1 f2 f3 f5
10 49 f0 f4 f5
11 51 f0 f1 f4 f5
12 53 f0 f2 f4 f5
13 55 f0 f1 f2 f4 f5
14 57 f0 f3 f4 f5
15 59 f0 f1 f3 f4 f5
16 61 f0 f2 f3 f4 f5
17 3 f0 f1
18 7 f0 f1 f2
19 9 f0 f3
20 11 f0 f1 f3
21 13 f0 f2 f3
22 15 f0 f1 f2 f3
23 19 f0 f1 f4
24 23 f0 f1 f2 f4
25 25 f0 f3 f4
26 27 f0 f1 f3 f4
27 29 f0 f2 f3 f4
28 63 f0 f1 f2 f3 f4 f5
29 36 f2 f5
30 38 f1 f2 f5
31 44 f2 f3 f5
32 46 f1 f2 f3 f5
33 52 f2 f4 f5
34 54 f1 f2 f4 f5
35 60 f2 f3 f4 f5
36 62 f1 f2 f3 f4 f5
37 6 f1 f2
38 12 f2 f3
39 14 f1 f2 f3
40 22 f1 f2 f4
41 28 f2 f3 f4
42 30 f1 f2 f3 f4
43 48 f4 f5
44 50 f1 f4 f5
45 56 f3 f4 f5
46 58 f1 f3 f4 f5
47 18 f1 f4
48 24 f3 f4
49 26 f1 f3 f4

As we can see, at the beginning of the result set are the records containing the first word of the expression, then the ones containing the second word but not the first, etc.
However, I do not see any logic in the ordering of records inside the subsets corresponding to the words.

#390 Updated by Igor Skornyakov about 3 years ago

I doubt that it is possible to implement an implicit ordering imposed by CONTAINS which is 100% compatible with 4GL (at least in a reasonably efficient way).
However, I can suggest a 'partially compatible' ordering which is fully compatible with 4GL with respect to the order of the subsets of the result set related to the words of the CONTAINS expression. The idea is to add a vector of 'weights' to the CTE based on the ordered set of words in the expression. For example, the SQL for words contains '(e0 | e2) & (e1 | e3)' will be:

with wr as (
        select sum(cast(w1.word = UPPER('e0') as int)) as ww1,  sum(cast(w1.word = UPPER('e2') as int)) as ww2,
               sum(cast(w2.word = UPPER('e1') as int)) as ww3,  sum(cast(w2.word = UPPER('e3') as int)) as ww4,
               recid 
                from words t
                join words__words as w1 on (w1.parent__id = t.recid and (w1.word = UPPER('e0') or w1.word = UPPER('e2')))
                join words__words as w2 on (w2.parent__id = t.recid and (w2.word = UPPER('e1') or w2.word = UPPER('e3')))
        group by recid
)
select  words.* 
from words, wr where words.recid = wr.recid
order by wr.ww1 desc, wr.ww2 desc, wr.ww3 desc, wr.ww4 desc, wr.recid

For PostgreSQL, the aggregate expression sum(cast(? as int)) can be replaced with max(cast(? as int)) or even bool_or(?), which may be cheaper but provides the same level of compatibility with 4GL.
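For illustration, here is a minimal sketch of the bool_or variant of the statement above (same words / words__words schema); since false sorts before true, the DESC direction keeps the same per-word subset ordering:

-- bool_or variant of the weight CTE shown above
with wr as (
        select bool_or(w1.word = UPPER('e0')) as ww1,  bool_or(w1.word = UPPER('e2')) as ww2,
               bool_or(w2.word = UPPER('e1')) as ww3,  bool_or(w2.word = UPPER('e3')) as ww4,
               recid
                from words t
                join words__words as w1 on (w1.parent__id = t.recid and (w1.word = UPPER('e0') or w1.word = UPPER('e2')))
                join words__words as w2 on (w2.parent__id = t.recid and (w2.word = UPPER('e1') or w2.word = UPPER('e3')))
        group by recid
)
select  words.*
from words, wr where words.recid = wr.recid
order by wr.ww1 desc, wr.ww2 desc, wr.ww3 desc, wr.ww4 desc, wr.recid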
What do you think?
Thank you.

#391 Updated by Eric Faulhaber about 3 years ago

This is a very interesting proposal, and I appreciate the creativity in coming up with it.

In the absence of a BY clause, I don't know of any guaranteed ordering of the results of a query for which a word index is selected, such as we have with a regular index. Unless and until we determine this, we need to go with an approach which results in the most deterministic sort we can figure out. It seems like you have done this, so now the trick is to minimize the performance impact of implementing that sort. Do you have an idea of the impact of your current proposal on performance?

#392 Updated by Eric Faulhaber about 3 years ago

Another question/concern: it seems like this approach requires the CTE. It follows that it would disrupt your more performant, "mixed mode" approach. Is this correct? Do we lose the performance benefit of the cases where the CTE was generally slower, and the algorithm would have chosen the non-CTE approach?

#393 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

In the absence of a BY clause, I don't know of any guaranteed ordering of the results of a query for which a word index is selected, such as we have with a regular index. Unless and until we determine this, we need to go with an approach which results in the most deterministic sort we can figure out. It seems like you have done this, so now the trick is to minimize the performance impact of implementing that sort. Do you have an idea of the impact of your current proposal on performance?

Well, I have not noticed any significant performance degradation when running the SQL queries directly. I'm working on the implementation now. It should not take long, so we should be able to see the results of the performance testing tomorrow at the latest.

#394 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

Another question/concern: it seems like this approach requires the CTE. It follows that it would disrupt your more performant, "mixed mode" approach. Is this correct? Do we lose the performance benefit of the cases where the CTE was generally slower, and the algorithm would have chosen the non-CTE approach?

Well, there can be some performance degradation. Please note, however, that the CTE will be used for complicated queries only if the corresponding word index is selected for implicit sorting. I will make the sorting configurable, so it will be possible to compare the performance with and without it and see the real impact.

#395 Updated by Eric Faulhaber about 3 years ago

Code review 3821c, rev 12079, 12089:

The logic changes look good to me. I have not tested them.

W.r.t. the command line options, we don't want to use system properties long term. For any options you intend to survive beyond your development-time testing, please convert them to directory configuration options. Thanks.

#396 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

W.r.t. the command line options, we don't want to use system properties long term. For any options you intend to survive beyond your development-time testing, please convert them to directory configuration options. Thanks.

Sure. I use command-line options now because it is more convenient for testing.

#397 Updated by Eric Faulhaber about 3 years ago

Igor, other than finishing the sort implementation (and gauging its performance) and the 1688 conversion warning, what is left at this point to finish this task? SQL Server support is deferred, but should be left working at least as well as before.

#398 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

Igor, other than finishing the sort implementation (and gauging its performance) and the 1688 conversion warning, what is left at this point to finish this task? SQL Server support is deferred, but should be left working at least as well as before.

According to #1587-327, you also wanted to implement word tables for the _temp db.
SQL Server support is working with the (fixed) UDF.

#399 Updated by Eric Faulhaber about 3 years ago

Igor Skornyakov wrote:

According to #1587-327, you also wanted to implement word tables for the _temp db.

Given that support for H2 already is implemented, what is the incremental effort for this?

#400 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

Igor Skornyakov wrote:
Given that support for H2 already is implemented, what is the incremental effort for this?

I have no experience with temp-table creation, but I hope that it will not take long.

#401 Updated by Eric Faulhaber about 3 years ago

OK, please post any questions you may have about the temp-table creation, DDL, etc. Ovidiu or I will be able to get you answers.

#402 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

OK, please post any questions you may have about the temp-table creation, DDL, etc. Ovidiu or I will be able to get you answers.

I will. Do we have any description to start with?
Thank you.

#403 Updated by Eric Faulhaber about 3 years ago

Not as such. The documentation on FWD internals is the least developed of the various books.

The code which creates and drops temporary tables is in TemporaryBuffer$Context. It is invoked in response to buffer scopes being opened and closed at the application level. I would recommend looking at the call hierarchy for the doCreateTable and doDropTable methods to gain an understanding of this part of the process.

This code uses LocalTempTableHelper to gather the DDL it uses to create and drop tables (and their indices). I thought at one point we were using DDLGeneratorWorker as part of this process, but I'm not sure that is still the case. Ovidiu?

#404 Updated by Ovidiu Maxiniuc about 3 years ago

These two classes are really specialised:
  • the DDLGeneratorWorker is used by static conversion to get DDLs for permanent tables and sequences. It has an interface designed to work with TRPL 'primitives' as they are processed while traversing the p2o tree;
  • the TempTableHelper constructor is the one which creates the DDLs for temp-tables.

#405 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

Not as such. The documentation on FWD internals is the least developed of the various books.

The code which creates and drops temporary tables is in TemporaryBuffer$Context. It is invoked in response to buffer scopes being opened and closed at the application level. I would recommend looking at the call hierarchy for the doCreateTable and doDropTable methods to gain an understanding of this part of the process.

This code uses LocalTempTableHelper to gather the DDL it uses to create and drop tables (and their indices). I thought at one point we were using DDLGeneratorWorker as part of this process, but I'm not sure that is still the case. Ovidiu?

I see. Thank you!

#406 Updated by Igor Skornyakov about 3 years ago

Ovidiu Maxiniuc wrote:

These two classes are really specialised:
  • the DDLGeneratorWorker is used by static conversion to get DDLs for permanent tables and sequences. It has an interface designed to work with TRPL 'primitives' as they are processed while traversing the p2o tree;
  • the TempTableHelper constructor is the one which creates the DDLs for temp-tables.

Thank you, Ovidiu. It seems that some refactoring will be required to use the word table-related code in both DDLGeneratorWorker and TempTableHelper.

#407 Updated by Igor Skornyakov about 3 years ago

The final logic for implicit sorting turned out to be trickier than I initially described. In particular, the situation where there is more than one CONTAINS operator in the query must be addressed.
For example, the query

for each ttwiabcdefghijklmnopqrstuvwxyxz where (if2abcdefghijklmnopqrstuvwxyxz contains 'big & elephant | small & rabbit')  or (if1abcdefghijklmnopqrstuvwxyxz contains '(cat* | ant*)'):

is converted to:
with
wcte1 as (
     select sum(cast(w1.word = ? as int)) as w11, sum(cast(w1.word = ? as int)) as w13, 
    sum(cast(w2.word = ? as int)) as w21, sum(cast(w2.word = ? as int)) as w24, 
    sum(cast(w3.word = ? as int)) as w32, sum(cast(w3.word = ? as int)) as w33, 
    sum(cast(w4.word = ? as int)) as w42, sum(cast(w4.word = ? as int)) as w44, 
    recid
     from ttwiabcdefghijklmnopqrstuvwxyxz t
        join ttwiabcdefghijklmnopqrstuvwxyxz__if2abcdefghijklmnopqrstuvwxyxz as w1 on (w1.parent__id = t.recid and (w1.word = ? or w1.word = ?))
        join ttwiabcdefghijklmnopqrstuvwxyxz__if2abcdefghijklmnopqrstuvwxyxz as w2 on (w2.parent__id = t.recid and (w2.word = ? or w2.word = ?))
        join ttwiabcdefghijklmnopqrstuvwxyxz__if2abcdefghijklmnopqrstuvwxyxz as w3 on (w3.parent__id = t.recid and (w3.word = ? or w3.word = ?))
        join ttwiabcdefghijklmnopqrstuvwxyxz__if2abcdefghijklmnopqrstuvwxyxz as w4 on (w4.parent__id = t.recid and (w4.word = ? or w4.word = ?))
    group by recid
    union all
    select
    -1 as w11, -1 as w13, 
    -1 as w21, -1 as w24, 
    -1 as w32, -1 as w33, 
    -1 as w42, -1 as w44, 
     null as recid)
,
wcte2 as (
     select distinct recid
     from ttwiabcdefghijklmnopqrstuvwxyxz t
        join ttwiabcdefghijklmnopqrstuvwxyxz__if1abcdefghijklmnopqrstuvwxyxz as w1 on (w1.parent__id = t.recid and (w1.word like UPPER(?) or w1.word like UPPER(?)))
)
select 
    ttwiabcdef0_.recid as col_0_0_ 
from
    ttwiabcdefghijklmnopqrstuvwxyxz ttwiabcdef0_, wcte1 
where
    (ttwiabcdef0_.recid = wcte1.recid and wcte1.recid is null) or (ttwiabcdef0_.recid in (select recid from wcte2) and wcte1.recid is null)
order by
    wcte1.w11 desc,wcte1.w13 desc,
    wcte1.w21 desc,wcte1.w24 desc,
    wcte1.w32 desc,wcte1.w33 desc,
    wcte1.w42 desc,wcte1.w44 desc,
    ttwiabcdef0_.recid asc

The query
for each ttwiabcdefghijklmnopqrstuvwxyxz where (f-ext contains '(word1* | word3*) & (word12c | word33c)')  or  (f-ext contains 'word22a & word22b | word23a & word23b'):

where f-ext is an extent field, to:
with
wcte1 as (
     select sum(cast(w1.word like UPPER(?) as int)) as w11, sum(cast(w1.word like UPPER(?) as int)) as w12, 
    sum(cast(w2.word = UPPER(?) as int)) as w23, sum(cast(w2.word = UPPER(?) as int)) as w24, 
    recid
     from ttwiabcdefghijklmnopqrstuvwxyxz t
        join ttwiabcdefghijklmnopqrstuvwxyxz__5 e on (e.parent__id = t.recid)
        join ttwiabcdefghijklmnopqrstuvwxyxz__f_ext as w1 on (w1.parent__id = e.parent__id and w1.list__index  = e.list__index and (w1.word like UPPER(?) or w1.word like UPPER(?)))
        join ttwiabcdefghijklmnopqrstuvwxyxz__f_ext as w2 on (w2.parent__id = e.parent__id and w2.list__index  = e.list__index and (w2.word = UPPER(?) or w2.word = UPPER(?)))
    group by recid
    union all
    select
    -1 as w11, -1 as w12, 
    -1 as w23, -1 as w24, 
     null as recid)

select 
    ttwiabcdef0_.recid as col_0_0_ 
from
    ttwiabcdefghijklmnopqrstuvwxyxz ttwiabcdef0_, wcte1 
where
    (ttwiabcdef0_.recid = wcte1.recid and wcte1.recid is null) or (ttwiabcdef0_.recid in (
     select distinct recid
     from ttwiabcdefghijklmnopqrstuvwxyxz t
        join ttwiabcdefghijklmnopqrstuvwxyxz__5 e on (e.parent__id = t.recid)
        join ttwiabcdefghijklmnopqrstuvwxyxz__f_ext as w1 on (w1.parent__id = e.parent__id and w1.list__index  = e.list__index and (w1.word = UPPER(?) or w1.word = UPPER(?)))
        join ttwiabcdefghijklmnopqrstuvwxyxz__f_ext as w2 on (w2.parent__id = e.parent__id and w2.list__index  = e.list__index and (w2.word = UPPER(?) or w2.word = UPPER(?)))
        join ttwiabcdefghijklmnopqrstuvwxyxz__f_ext as w3 on (w3.parent__id = e.parent__id and w3.list__index  = e.list__index and (w3.word = UPPER(?) or w3.word = UPPER(?)))
        join ttwiabcdefghijklmnopqrstuvwxyxz__f_ext as w4 on (w4.parent__id = e.parent__id and w4.list__index  = e.list__index and (w4.word = UPPER(?) or w4.word = UPPER(?)))
) and wcte1.recid is null)
order by
    wcte1.w11 desc,wcte1.w12 desc,
    wcte1.w23 desc,wcte1.w24 desc,
    ttwiabcdef0_.recid asc

The results of the performance testing with different aggregate functions are shown below (please note that this is a 'cold run'):

expr records bool_or(?) max(cast(? as int)) sum(cast(? as int))
'a0' 1 1049 159 166
'b0' 2 72 75 75
'c0' 4 75 73 72
'g0' 64 128 126 145
'g0 & g1' 32 19 15 18
'h0' 128 152 142 137
'h0 & h1' 64 24 25 25
'h0 & h1 & h2' 32 30 16 18
'r0' 131072 13724 13197 13010
'r0 & r1' 65536 8351 7005 7026
'r0 & r1 & r2' 32768 4161 4182 4226
's0' 262144 27737 24841 25074
's0 & s1' 131072 14068 13650 14245
's0 & s1 & s2' 65536 10702 8978 9153
't0' 0 298 245 243
't0 & t1' 0 1 2 1
't0 & t1 & t2' 0 2 2 2

As we can see, the results are essentially the same as without the sorting logic (#1587-385), and there is no significant difference between the aggregate functions.

These are the results of the performance test for H2 (with the sum aggregator):
expr records elapsed time
'a0' 1 417
'b0' 2 6
'c0' 4 7
'g0' 64 31
'g0 & g1' 32 24
'h0' 128 37
'h0 & h1' 64 15
'h0 & h1 & h2' 32 11
'r0' 131072 6983
'r0 & r1' 65536 3824
'r0 & r1 & r2' 32768 2867
's0' 262144 23613
's0 & s1' 131072 7079
's0 & s1 & s2' 65536 6595
't0' 0 12
't0 & t1' 0 1
't0 & t1 & t2' 0 1

Committed to 3821c/12102.
I'm working now on word tables for the _temp database.

#408 Updated by Eric Faulhaber about 3 years ago

Just saw this in the H2 mailing list, FWIW. Noel is one of the main contributors to the project. The OP was reporting a problem with creating a view of a CTE, so not exactly our use case.


Subject: Re: [h2] RFE: Enable Common Table Expression (CTE with...) in INSERT, UPDATE, DELETE, CREATE TABLE AS, CREATE VIEW AS
Date: Thu, 11 Mar 2021 12:04:59 +0200
From: Noel Grandin <>
Reply-To:
To: H2 Database <>

You can try the HEAD of the main git repo, but CTE's in H2 are a bit of a hack and still have a lot of issues.

#409 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

Just saw this in the H2 mailing list, FWIW. Noel is one of the main contributors to the project. The OP was reporting a problem with creating a view of a CTE, so not exactly our use case.


Subject: Re: [h2] RFE: Enable Common Table Expression (CTE with...) in INSERT, UPDATE, DELETE, CREATE TABLE AS, CREATE VIEW AS
Date: Thu, 11 Mar 2021 12:04:59 +0200
From: Noel Grandin <>
Reply-To:
To: H2 Database <>

You can try the HEAD of the main git repo, but CTE's in H2 are a bit of a hack and still have a lot of issues.

Yes, I've experienced problems with creating views using CTEs in H2 in the past. In fact, H2 creates a separate view for CTEs used in views, but sometimes fails to do this. In my case, I was able to find a workaround by creating these auxiliary views myself. Actually, using a CTE in a view is more or less just a shortcut. In our case, the CTE uses query parameters, so we really need them.

#410 Updated by Igor Skornyakov about 3 years ago

Sorry for my ignorance. What is the 'dirty' database, and should I consider such databases for word table support?
Thank you.

#411 Updated by Igor Skornyakov about 3 years ago

Another question. I need to ensure that data objects created for word tables are unique, even in the scope of a session. I do not see any standard way to validate that the generated name is unique in the table scope. Have I missed something?
Thank you.

#412 Updated by Eric Faulhaber about 3 years ago

Igor Skornyakov wrote:

What is the 'dirty' database, and should I consider such databases for word table support?

The dirty database is an in-memory, H2 database accessible to all contexts, which shares some database update information across sessions, to emulate a quirk where the 4GL "leaks" this information when indices are updated, even though the enclosing transaction has not yet been committed. For certain query types (currently only FINDs), an additional query is executed against the dirty database and the results are compared/merged/integrated with the results of the primary query.

TBH, I don't recall whether we track these updates for word indices. There normally would be a very small number of records in the dirty database, because they are cleared for a given context when that session's current transaction is committed or rolled back. So, even if the dirty database is used for word indices, using the UDF implementation and not maintaining word tables there should be sufficient. If you have a Database object at the time of making decisions about whether a query is operating on the dirty database, you can invoke the isDirty method to determine this.

#413 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

Igor Skornyakov wrote:

What is the 'dirty' database, and should I consider such databases for word table support?

The dirty database is an in-memory, H2 database accessible to all contexts, which shares some database update information across sessions, to emulate a quirk where the 4GL "leaks" this information when indices are updated, even though the enclosing transaction has not yet been committed. For certain query types (currently only FINDs), an additional query is executed against the dirty database and the results are compared/merged/integrated with the results of the primary query.

TBH, I don't recall whether we track these updates for word indices. There normally would be a very small number of records in the dirty database, because they are cleared for a given context when that session's current transaction is committed or rolled back. So, even if the dirty database is used for word indices, using the UDF implementation and not maintaining word tables there should be sufficient. If you have a Database object at the time of making decisions about whether a query is operating on the dirty database, you can invoke the isDirty method to determine this.

I see, thank you. So I have to add a check for isDirty and not rewrite the query for the 'dirty' database? Please note, however, that I've changed the UDF support for extent fields, which was incorrect. This fix currently works for the _temp database. Should I somehow enable it for the 'dirty' database when the word tables are implemented for _temp?
Thank you.

#414 Updated by Eric Faulhaber about 3 years ago

Igor Skornyakov wrote:

So I have to add a check for isDirty and not rewrite the query for the 'dirty' database?

This is not a hard requirement, but my assumption is that this approach will be easier to implement. It should not be necessary to rewrite the query or maintain word tables for the dirty database. If my assumption is incorrect from your point of view, please let me know.

Please note, however, that I've changed the UDF support for extent fields, which was incorrect. This fix currently works for the _temp database. Should I somehow enable it for the 'dirty' database when the word tables are implemented for _temp?

Yes, the dirty database should work properly with extent fields, if we go with the UDF approach.

BTW, I should have mentioned earlier: the dirty database is only used with permanent tables. Temp-table data of course is never shared across contexts.

#415 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

Yes, the dirty database should work properly with extent fields, if we go with the UDF approach.

BTW, I should have mentioned earlier: the dirty database is only used with permanent tables. Temp-table data of course is never shared across contexts.

Oh, this is important! I have to adjust my code accordingly. It should be a one-liner.
Thank you!

#416 Updated by Eric Faulhaber about 3 years ago

Igor Skornyakov wrote:

Another question. I need to ensure that data objects created for word tables are unique, even in the scope of a session. I do not see any standard way to validate that the generated name is unique in the table scope. Have I missed something?

When you say "ensure that data objects created for word tables are unique", do you just mean the table name, or something in addition to this?

For the table name, I am not sure how we can generate what is certain to be a unique name, because even if we know of all the possible statically converted temp-tables which can be created, there can be dynamically prepared temp-tables with names we cannot know in advance. We would need to generate something that static or dynamic conversion would never generate.

Can back-ticks help us here, to enclose a name which wouldn't otherwise be used? It's an admittedly ugly workaround, though readability is not as important as uniqueness in this case, since this table would only ever be accessed programmatically.

#417 Updated by Eric Faulhaber about 3 years ago

Some google searches suggest '@' is a valid character in an SQL table name, but not in a 4GL table name. This will need to be confirmed with testing, but if so, we can use this to our advantage. AFAIR, we don't add or substitute the '@' character in any of our default name conversion. However, if we use this, it will have to be detected and rejected in name conversion, because it is possible to customize the name conversion with custom replacements, and this character would no longer be valid in such a replacement, once we reserve it for word table use.

#418 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

Igor Skornyakov wrote:
When you say "ensure that data objects created for word tables are unique", do you just mean the table name, or something in addition to this?

For the table name, I am not sure how we can generate what is certain to be a unique name, because even if we know of all the possible statically converted temp-tables which can be created, there can be dynamically prepared temp-tables with names we cannot know in advance. We would need to generate something that static or dynamic conversion would never generate.

Can back-ticks help us here, to enclose a name which wouldn't otherwise be used? It's an admittedly ugly workaround, though readability is not as important as uniqueness in this case, since this table would only ever be accessed programmatically.

For every word table, I need to create three indexes, one constraint, one function, and two triggers. For denormalized extent fields, one function and one trigger are created for every component. I understand that H2 has no limit on the length of database object names, so I can just append a GUID to a 'natural' name to ensure uniqueness. Using just GUIDs is not good, since it would make error messages mostly opaque.
What do you think?
Thank you.

#419 Updated by Greg Shah about 3 years ago

Do dynamic temp-tables have names?

#420 Updated by Igor Skornyakov about 3 years ago

Greg Shah wrote:

Do dynamic temp-tables have names?

If they are in the H2 database they should have at least some synthetic names.

#421 Updated by Eric Faulhaber about 3 years ago

Greg Shah wrote:

Do dynamic temp-tables have names?

Yes, they are referenced in subsequent dynamic queries by name. AFAIK, we convert these table names for use with SQL using the normal name conversion rules.

#422 Updated by Eric Faulhaber about 3 years ago

I'm ok with using GUIDs.

#423 Updated by Igor Skornyakov about 3 years ago

BTW, I've just realized that it is possible to add a 'natural' implicit ordering to the queries using the CONTAINS UDF.
Indeed, contains(f, 'a | b') = contains(f, 'a') | contains(f, 'b') and contains(f, 'a & b') = contains(f, 'a') & contains(f, 'b'). So if P(t1, t2, ..., tn) is a logical expression, then contains(f, P(t1, t2, ..., tn)) = P(contains(f, t1), contains(f, t2), ..., contains(f, tn)). This means that we can re-write a query without using a word table like this:
Instead of where ... contains(f, expr) ... use

with cte as (
   select recid, bool_or(contains(f, t1)) as w1, bool_or(contains(f, t2)) as w2, ... bool_or(contains(f, tn)) as wn
   from ...
   where contains(f, expr)
   group by recid
   [union all...]

)
select ... from t, cte
where t.recid = cte.recid
order by cte.w1, ... cte.wn, t.recid

where t1, ..., tn are the terms of the expr.
This can be used for queries against the 'dirty' database.
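As a hypothetical illustration (the table tt and field f are invented names, the CONTAINS UDF is assumed to return a boolean, and the descending direction mirrors the weight ordering of #1587-390), a query for tt.f contains 'big & elephant | small & rabbit' could be rewritten as:

with cte as (
   -- one weight per term of the expression, in the order the terms first appear in it
   select bool_or(contains(t.f, 'big'))      as w1,
          bool_or(contains(t.f, 'elephant')) as w2,
          bool_or(contains(t.f, 'small'))    as w3,
          bool_or(contains(t.f, 'rabbit'))   as w4,
          recid
   from tt t
   where contains(t.f, 'big & elephant | small & rabbit')
   group by recid
)
select tt.*
from tt, cte
where tt.recid = cte.recid
order by cte.w1 desc, cte.w2 desc, cte.w3 desc, cte.w4 desc, tt.recid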

Hopefully this is not just a 'late night idea' ))

#424 Updated by Eric Faulhaber about 3 years ago

Igor Skornyakov wrote:

Hopefully this is not just a 'late night idea' ))

Well, it seems like an interesting one.

One question: for a given record, is there any efficiency/performance difference evaluating the same set of words N times for N simple match expressions, as opposed to once for a complex match expression with N components? I'm not familiar enough with the new algorithm to know for sure. At a conceptual level, I guess we need the same overall number of word comparisons either way, so maybe this is not a valid concern. Even if there is some penalty, it may not matter for the dirty database, which will most often have a small number of records to check, if any at all.

#425 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

One question: for a given record, is there any efficiency/performance difference evaluating the same set of words N times for N simple match expressions, as opposed to once for a complex match expression with N components? I'm not familiar enough with the new algorithm to know for sure. At a conceptual level, I guess we need the same overall number of word comparisons either way, so maybe this is not a valid concern. Even if there is some penalty, it may not matter for the dirty database, which will most often have a small number of records to check, if any at all.

This is a good question. I was also thinking about this, and I plan to make a minor change in the code so that the number of weights will be the same as the number of terms in the original expression. This is also needed to merge the result sets from the primary and 'dirty' databases.

#426 Updated by Igor Skornyakov about 3 years ago

  • Re-worked implicit ordering to be compatible with the one for CONTAINS with the UDF described in #1587-423.
  • Refactored DDLGeneratorWorker to make it possible to use word tables with the _temp database.
  • Added an optional override for in-memory databases (for development) in H2Helper.

Committed to 3821c/12123.

#427 Updated by Eric Faulhaber about 3 years ago

Igor, what is left for this task, besides adding the 1688 error to conversion?

#428 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

Igor, what is left for this task, besides adding the 1688 error to conversion?

Eric,
I understand that MS SQL support is postponed. So apart from the 1688 warning, the only thing left is to finish support for _temp. I'm not sure about 'dirty' databases, but this should be easy.

#429 Updated by Eric Faulhaber about 3 years ago

Please refresh my memory...for the 1688 warning, is adding the conversion-time warning the only thing left to be done, or is the code still being converted incorrectly (i.e., is the extent field dereference still part of the FQL where clause)?

#430 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

Please refresh my memory...for the 1688 warning, is adding the conversion-time warning the only thing left to be done, or is the code still being converted incorrectly (i.e., is the extent field dereference still part of the FQL where clause)?

We only need to add a conversion-time warning. The runtime behavior is fixed.

#431 Updated by Eric Faulhaber about 3 years ago

If we fix the conversion to drop the extent field dereference, will the current runtime be able to handle the FQL? If we do not fix the conversion, the FQL in the converted source will be misleading, since the runtime will ignore the dereference silently and will instead apply the CONTAINS to all elements of the extent field.

Fixing the conversion to drop the dereference should be trivial once we identify the place to apply the 1688 warning. However, I want to know if this will cause a problem for the runtime.

#432 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

If we fix the conversion to drop the extent field dereference, will the current runtime be able to handle the FQL? If we do not fix the conversion, the FQL in the converted source will be misleading, since the runtime will ignore the dereference silently and will instead apply the CONTAINS to all elements of the extent field.

I was thinking about this. As you remember, I initially added a runtime warning. Actually, I do not think that leaving the subscript in the converted code is misleading since the definition of the CONTAINS operator does not allow this. I understand that we're supposed to convert a valid 4GL code and the customer will see the 4GL warning.

Fixing the conversion to drop the dereference should be trivial once we identify the place to apply the 1688 warning. However, I want to know if this will cause a problem for the runtime.

I also think so.

#433 Updated by Greg Shah about 3 years ago

Actually, I do not think that leaving the subscript in the converted code is misleading since the definition of the CONTAINS operator does not allow this.

It was my understanding that the 4GL does not fail when this occurs and it changes the CONTAINS into a version without subscripts with only a non-fatal 1688 compile-time warning. Is that correct?

I understand that we're supposed to convert a valid 4GL code and the customer will see the 4GL warning.

If my understanding is correct, then this subscript case is valid 4GL, since the 4GL will accept it and run it without errors. This means that WE MUST also accept and run it without errors. I don't want to leave the invalid subscript in our converted code because then programmers will think it is OK. In other words, it is OK in the input 4GL code but we should only ever generate correct/clean Java code.

#434 Updated by Eric Faulhaber about 3 years ago

Greg Shah wrote:

In other words, it is OK in the input 4GL code but we should only ever generate correct/clean Java code.

This is a little bit of a grey area, because the converted output we are talking about is a string containing a converted WHERE clause fragment, which will be preprocessed and injected into a more complete FQL statement at runtime, then converted to SQL. The converted Java code around it is the same either way, AFAIK.

The question is whether we drop the subscript when converting the 4GL WHERE clause into this FQL fragment. I think we should, because I think leaving it there misleadingly suggests that only the specific extent field element identified by the subscript is involved in the CONTAINS operation, when in fact all elements of the extent field are involved, according to Igor's findings.

I can make the conversion changes (including adding the 1688 warning), but I wanted to understand the potential impact on the runtime first. That is, when processing this FQL WHERE clause, can the runtime currently handle the extent field reference without the subscript as the lvalue of a CONTAINS operation. If not, is it a simple change to enable the runtime to handle this? If the answer to either of these last two questions is "yes", I propose that I make the TRPL changes to drop the subscript and add the conversion warning, while Igor makes the changes (if any) to the runtime.

#435 Updated by Greg Shah about 3 years ago

This is a little bit of a grey area, because the converted output we are talking about is a string containing a converted WHERE clause fragment, which will be preprocessed and injected into a more complete FQL statement at runtime, then converted to SQL. The converted Java code around it is the same either way, AFAIK.

The same programmer will edit the FQL string literal as will edit the Java source code. I consider those strings as part of the source.

The question is whether we drop the subscript when converting the 4GL WHERE clause into this FQL fragment. I think we should, because I think leaving it there misleadingly suggests that only the specific extent field element identified by the subscript is involved in the CONTAINS operation, when in fact all elements of the extent field are involved,

I agree. We should never emit something that is incorrect into the converted code, whether it is a string or Java statement/expression.

#436 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

I can make the conversion changes (including adding the 1688 warning), but I wanted to understand the potential impact on the runtime first. That is, when processing this FQL WHERE clause, can the runtime currently handle the extent field reference without the subscript as the lvalue of a CONTAINS operation. If not, is it a simple change to enable the runtime to handle this? If the answer to either of these last two questions is "yes", I propose that I make the TRPL changes to drop the subscript and add the conversion warning, while Igor makes the changes (if any) to the runtime.

Actually, I remove the subscript at runtime (now silently) at a very early stage of the processing. This means that if it will be removed at the conversion time this piece of runtime code will just not work.

#437 Updated by Igor Skornyakov about 3 years ago

Is there any reason why we do not append dialect.getDelimiter() to the drop table statement in the TempTableHelper?
Thank you.

#438 Updated by Eric Faulhaber about 3 years ago

Igor Skornyakov wrote:

Is there any reason why we do not append dialect.getDelimiter() to the drop table statement in the TempTableHelper?

None that I'm aware of. I'm not sure the omission is intentional. Is it needed?

#439 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

Is there any reason why we do not append dialect.getDelimiter() to the drop table statement in the TempTableHelper?

None that I'm aware of. I'm not sure the omission is intentional. Is it needed?

I think it is needed. Without it, the statements are not properly terminated and cannot be executed as-is. I've not found where these statements are executed. Maybe the delimiter is added there.

#440 Updated by Eric Faulhaber about 3 years ago

Igor Skornyakov wrote:

Actually, I remove the subscript at runtime (now silently) at a very early stage of the processing. This means that if it will be removed at the conversion time this piece of runtime code will just not work.

Sorry, to clarify "will just not work", do you mean that removing the subscript at conversion time (a) will break this bit of runtime code because the subscript is expected to be there; or (b) will allow the runtime code to work normally, because it will have nothing to do at this point in the logic (but it will behave downstream as if it had removed the subscript itself)?

If (a), then we will need to coordinate our changes to be committed in the same revision. Can I safely assume the runtime change in this case is not a significant bit of work?

#441 Updated by Eric Faulhaber about 3 years ago

Igor Skornyakov wrote:

I think it is needed. Without it, the statements are not properly terminated and cannot be executed as-is. I've not found where these statements are executed. Maybe the delimiter is added there.

They are executed by TemporaryBuffer$Context.doDropTable, in a JDBC batch, along with any drop index statements. We do not add the delimiters there. Are you sure they are needed when using the JDBC batch execute mechanism?

#442 Updated by Eric Faulhaber about 3 years ago

Actually, we do append the delimiter to the end of the DROP INDEX statements, which are added to the JDBC batch first. Perhaps it was assumed at some point that only one DROP TABLE statement would be executed, and it would always be last in the batch after the DROP INDEX statement(s). Secondary (extent field) tables are deleted by cascade, AFAIR. Feel free to add it, if you need it.

#443 Updated by Ovidiu Maxiniuc about 3 years ago

I think the missing dialect.getDelimiter() is just an omission. I am sorry for that. Please add it in the TempTableHelper c'tor where the drop DDLs are built.
Yet, running the script should have raised some errors, in which case we would have spotted the issue earlier. I wonder why it did not. The only case that comes to my mind is that there is a single statement and the delimiter is not needed?

#444 Updated by Eric Faulhaber about 3 years ago

Ovidiu Maxiniuc wrote:

The only case that comes to my mind is that there is a single statement and the delimiter is not needed?

Until now, I think yes. I think the other related tables (normalized extent field tables), if any, are dropped by cascade. I guess the words table could be, too, but Igor is into the details, not me...

It's not a bad practice to have the delimiter there in any case. Ovidiu, you probably inherited the omission from my original code.

#445 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

Sorry, to clarify "will just not work", do you mean that removing the subscript at conversion time (a) will break this bit of runtime code because the subscript is expected to be there; or (b) will allow the runtime code to work normally, because it will have nothing to do at this point in the logic (but it will behave downstream as if it had removed the subscript itself)?

If (a), then we will need to coordinate our changes to be committed in the same revision. Can I safely assume the runtime change in this case is not a significant bit of work?

No runtime changes are required. By "will just not work" I mean that the corresponding piece of code will never be executed. After all, CONTAINS without a subscript is a "normal" situation and the subscript is just "noise".

#446 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

Until now, I think yes. I think the other related tables (normalized extent field tables), if any, are dropped by cascade. I guess the words table could be, too, but Igor is into the details, not me...

It's not a bad practice to have the delimiter there in any case. Ovidiu, you probably inherited the omission from my original code.

I was sure that only records can be dropped by cascade, not tables. Anyway, I will double-check.
Thank you.

#447 Updated by Igor Skornyakov about 3 years ago

Igor Skornyakov wrote:

No runtime changes are required. By "will just not work" I mean that the corresponding piece of code will never be executed. After all, CONTAINS without a subscript is a "normal" situation and the subscript is just "noise".

The code fragment I'm talking about is lines 2414-2420 in the FqlToSQLConverter:

                  Aast contains = insideContains(grandFather); 
                  if (contains != null)
                  {
                     ast.remove();
                  }

#448 Updated by Igor Skornyakov about 3 years ago

I've encountered a strange problem with trigger for _temp:

[03/17/2021 14:13:55 GMT+03:00] (ErrorManager:SEVERE) {0000000E:00000022:bogus} ** org.h2.jdbc.JdbcSQLSyntaxErrorException: Column "recid" not found; SQL statement:
insert into tt1 (_errorFlag, _originRowid, _errorString, _peerRowid, _rowState, if1, _multiplex, recid) values (?, ?, ?, ?, ?, ?, ?, ?) [42122-200]
        at org.h2.message.DbException.getJdbcSQLException(DbException.java:453)
        at org.h2.message.DbException.getJdbcSQLException(DbException.java:429)
        at org.h2.message.DbException.getJdbcSQLException(DbException.java:415)
        at org.h2.tools.SimpleResultSet.findColumn(SimpleResultSet.java:281)
        at org.h2.tools.SimpleResultSet.getLong(SimpleResultSet.java:763)
        at com.goldencode.p2j.persist.h2.OnInsertWords.fire(OnInsertWords.java:112)
        at org.h2.tools.TriggerAdapter.fire(TriggerAdapter.java:144)
        at org.h2.schema.TriggerObject.fireRow(TriggerObject.java:261)
        at org.h2.table.Table.fireRow(Table.java:1090)
        at org.h2.table.Table.fireAfterRow(Table.java:1080)
        at org.h2.command.dml.Insert.insertRows(Insert.java:211)
        at org.h2.command.dml.Insert.update(Insert.java:151)
        at org.h2.command.CommandContainer.update(CommandContainer.java:198)
        at org.h2.command.Command.executeUpdate(Command.java:251)
        at org.h2.server.TcpServerThread.process(TcpServerThread.java:406)
        at org.h2.server.TcpServerThread.run(TcpServerThread.java:183)
        at java.lang.Thread.run(Thread.java:748)

The OnInsertWords.fire is:

   @Override
   public void fire(Connection conn, ResultSet oldRow, ResultSet newRow) 
   throws SQLException
   {
      long pk = newRow.getLong(pkName);
      String field = newRow.getString(fieldName);
      insertWords(conn, pk, field);
   }

The value of the pkName is recid. Looks like an H2 bug. Investigating...

#449 Updated by Igor Skornyakov about 3 years ago

Igor Skornyakov wrote:

I've encountered a strange problem with trigger for _temp:
[...]

The OnInsertWords.fire is:
[...]
The value of the pkName is recid. Looks like an H2 bug. Investigating...

Well, for the temporary tables the SimpleResultSet.columns list is empty, so it is impossible to retrieve the field value by the field name. Looking for a workaround.

#450 Updated by Igor Skornyakov about 3 years ago

It seems that for H2 temporary tables it is impossible to use triggers. H2 trigger is invoked with a separate new connection where such tables are not visible.
Here is the excerpt from TriggerObject.fireRow:

        Connection c2 = session.createConnection(false);
        boolean old = session.getAutoCommit();
        boolean oldDisabled = session.setCommitOrRollbackDisabled(true);
        Value identity = session.getLastScopeIdentity();
        try {
            session.setAutoCommit(false);
            try {
                triggerCallback.fire(c2, oldList, newList);
            } catch (Throwable e) {
                throw getErrorExecutingTrigger(e);
            }
....

Any suggestions?
Thank you.

#451 Updated by Eric Faulhaber about 3 years ago

Igor Skornyakov wrote:

It seems that for H2 temporary tables it is impossible to use triggers. H2 trigger is invoked with a separate new connection where such tables are not visible.
Here is the excerpt from TriggerObject.fireRow:
[...]
Any suggestions?

The current implementation of triggers in H2 seems to be a dealbreaker, in that the trigger must be fired in this case on an existing connection, as only that connection has access to all the private objects/state needed.

I am not familiar with the trigger or session/connection management code in H2 to understand off the top of my head how feasible it would be to code around this limitation.

Can you think of an alternative implementation which (a) does not require triggers; and (b) is feasible to implement in a short period of time? If not, please fall back to the UDF implementation of CONTAINS for now (for temp-tables only). We will have to consider what our options are.

#452 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

The current implementation of triggers in H2 seems to be a dealbreaker, in that the trigger must be fired in this case on an existing connection, as only that connection has access to all the private objects/state needed.

I am not familiar with the trigger or session/connection management code in H2 to understand off the top of my head how feasible it would be to code around this limitation.

Can you think of an alternative implementation which (a) does not require triggers; and (b) is feasible to implement in a short period of time? If not, please fall back to the UDF implementation of CONTAINS for now (for temp-tables only). We will have to consider what our options are.

Without triggers, we have to populate/update the word tables in the Java code on every insert/update of the master table. The code for updating a word table is simple, but I cannot say how many places in the existing code will require modification, nor how difficult it is to extract the data for the update.
I suggest retaining the changes for creating word tables and related objects (w/o triggers) in the code, but keeping them inactive for now, and just adding implicit sorting for the '_temp' and 'dirty' databases as described in #1587-423.

#453 Updated by Eric Faulhaber about 3 years ago

Igor Skornyakov wrote:

Without triggers, we have to populate/update the word tables in the Java code on every insert/update of the master table. The code for updating a word table is simple, but I cannot say how many places in the existing code will require modification, nor how difficult it is to extract the data for the update.
I suggest retaining the changes for creating word tables and related objects (w/o triggers) in the code, but keeping them inactive for now, and just adding implicit sorting for the '_temp' and 'dirty' databases as described in #1587-423.

I'm sure we can adapt the orm.Persister class to allow for some trigger-like feature which could meet this need, but I would want to consider such changes very carefully, and we don't have time for this right now. For now, we need to wrap up this round of work on CONTAINS. Let's do the following as quickly as possible, and defer any further work:

  • (IAS) Clean up the current work on applying word tables to the temp-table implementation. I don't want to lose the work you've done so far, but we can't spend more time on this at the moment. I agree that we should retain the changes for creating word tables, but these should not be active (disabled or commented out and clearly marked with TODOs in the code, so we have the option to come back and finish this implementation).
  • (IAS) Do a final, summary performance run with the existing tests and document the findings in this task.
  • (ECF) Remove the subscript from cases of WHERE extent[n] CONTAINS <expr> ....
  • (ECF) Issue the 1688 conversion-time warning.

#454 Updated by Eric Faulhaber about 3 years ago

In 3821c/12143, I have added the 1688 conversion-time warning and code to remove the subscript from an extent field lvalue of the CONTAINS operator.

#455 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

I'm sure we can adapt the orm.Persister class to allow for some trigger-like feature which could meet this need, but I would want to consider such changes very carefully, and we don't have time for this right now. For now, we need to wrap up this round of work on CONTAINS. Let's do the following as quickly as possible, and defer any further work:

I see. Thank you.

  • (IAS) Clean up the current work on applying word tables to the temp-table implementation. I don't want to lose the work you've done so far, but we can't spend more time on this at the moment. I agree that we should retain the changes for creating word tables, but these should not be active (disabled or commented out and clearly marked with TODOs in the code, so we have the option to come back and finish this implementation).
  • (IAS) Do a final, summary performance run with the existing tests and document the findings in this task.
  • (ECF) Remove the subscript from cases of WHERE extent[n] CONTAINS <expr> ....
  • (ECF) Issue the 1688 conversion-time warning.

What about ordering for _temp and 'dirty'?
Thank you.

#456 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

In 3821c/12143, I have added the 1688 conversion-time warning and code to remove the subscript from an extent field lvalue of the CONTAINS operator.

I see. It works! Thank you!

#457 Updated by Igor Skornyakov about 3 years ago

After adding "warming up" to my words/words-perf.p performance test I've noticed the following. The screen remains just a white rectangle almost up to the end of the test. In particular, the initial message is not visible. With 4GL I see a normal decorated window and the initial message from the very beginning.

#458 Updated by Igor Skornyakov about 3 years ago

The results of the performance test ("warmed" mode). The last 4 columns are times in ms:
expr records 4GL PG UDF PG MIX H2
'a0' 1 0 6101 79 2
'b0' 2 0 6087 67 2
'c0' 4 0 6105 67 2
'g0' 64 1 6005 104 4
'g0 & g1' 32 0 6053 302 2
'h0' 128 1 5852 108 4
'h0 & h1' 64 0 5609 326 2
'h0 & h1 & h2' 32 0 5616 107 3
'r0' 131072 521 16665 12306 5061
'r0 & r1' 65536 572 13507 7058 3202
'r0 & r1 & r2' 32768 533 11555 4276 2887
's0' 262144 1046 28736 24809 20268
's0 & s1' 131072 1228 17465 13888 6795
's0 & s1 & s2' 65536 1067 13528 8670 6266
't0' 0 0 5273 238 1
't0 & t1' 0 0 5534 5 1
't0 & t1 & t2' 0 0 6177 6 1

#459 Updated by Igor Skornyakov about 3 years ago

A refactored version of the word table support and the creation of word tables for the _temp database (currently disabled) were committed to 3821c/12146.

Please note that the implicit sorting imposed by CONTAINS for the _temp and 'dirty' databases (#1587-423) is not implemented yet.

#460 Updated by Eric Faulhaber about 3 years ago

Igor Skornyakov wrote:

What about ordering for _temp and 'dirty'?

If you can get this done today or tomorrow, please proceed.

#461 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

Igor Skornyakov wrote:

What about ordering for _temp and 'dirty'?

If you can get this done today or tomorrow, please proceed.

OK. Thank you. It should not be difficult.

#462 Updated by Igor Skornyakov about 3 years ago

I've realized that the implicit ordering described in #1587-407 is not correct and can result in repeated records in the result set in some situations.

Consider the following SELECT query:

SELECT <flist>
FROM <from>
WHERE <where>
ORDER BY <order>

Where <flist> is the field list, <from> is the FROM expression, <where> is the WHERE predicate, and <order> is a list of fields used for ordering.
Hereafter expressions in curly brackets are tables/fields names/aliases.
Let's assume that <where> contains a predicate
C = CONTAINS({T}.{f}, <expr>)

And that the corresponding word index was selected for implicit ordering (at this moment it means that in the original statement <order> is {T}.recid and C is the first CONTAINS in <where> for {T}.{f}).

Using DNF, we can re-write <where> as C[AND W1] [OR (NOT C)[AND W2]] [OR W3], where the predicates W1, W2, W3 do not contain C.
The ordering described in #1587-407 is correct if W3 is absent or depends only on other CONTAINS operators. Otherwise, it can result in duplicated records in the result set, since we add {cte} to <from>, which means a CROSS JOIN.

It is possible to add the analysis and re-writing of the <where> predicate, but it can take time. So I suggest the following algorithm.

Let <expr> = P(t1, t2, ..., tn), where the terms ti are ordered according to the first position where they are found in <expr>.

If the field {f} is not an extent field, the re-written query is:

WITH 
{cte} AS (
SELECT <we1> AS {w1}, <we2> AS {w2}, ..., <wen>  AS {wn}, recid
FROM {T} t
    JOIN {wtable} AS {walias1} ON ({walias1}.parent__id = t.recid AND (<or group1>)) 
    JOIN {wtable} AS {walias2} ON ({walias2}.parent__id = t.recid AND (<or group2>)) 
    ....
    JOIN {wtable} AS {waliask} ON ({waliask}.parent__id = t.recid AND (<or groupk>))
GROUP BY recid
),
{mcte} AS (
SELECT <flist>, recid
FROM <from>
WHERE <where'>
)
SELECT <flist'> FROM {mcte}
               LEFT JOIN {cte} ON ({mcte}.recid = {cte}.recid)
ORDER BY coalesce({cte}.{w1}, -1) desc, coalesce({cte}.{w2}, -1) desc, ..., coalesce({cte}.{wn}, -1) desc,
         {mcte}.recid

Here <where'> is the <where> predicate with C replaced by C' = {mcte}.recid IN (SELECT recid FROM {cte}), and <flist'> is <flist> with the table aliases replaced by {mcte}.

<or groupi> and <wei> are defined as follows:
Let P(t1, t2, ...) = (t11 | t12 | ...) & (t21 | t22 | ...) & ... & (tk1 | tk2 | ...) be the CNF of P. Each term tij is one of the ti. The sub-terms tij in every AND term are ordered according to the ordering of the ti, and the AND terms are ordered lexicographically.
<or groupi> is ti1' OR ti2' OR ..., where tij' is either {waliasi}.word = tij or {waliasi}.word LIKE tij.
The weight expression <wei> = SUM(CAST(tpq' AS int)), where tpq' is the first term in the CNF which corresponds to ti.
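For example, a hypothetical instantiation of this template for the words contains '(e0 | e2) & (e1 | e3)' query of #1587-389 (the alias names wcte and mcte, and the projected columns, are chosen purely for illustration) would be:

with wcte as (
   -- one weight column per term of the expression
   select sum(cast(w1.word = UPPER('e0') as int)) as w1, sum(cast(w1.word = UPPER('e2') as int)) as w2,
          sum(cast(w2.word = UPPER('e1') as int)) as w3, sum(cast(w2.word = UPPER('e3') as int)) as w4,
          recid
   from words t
        join words__words as w1 on (w1.parent__id = t.recid and (w1.word = UPPER('e0') or w1.word = UPPER('e2')))
        join words__words as w2 on (w2.parent__id = t.recid and (w2.word = UPPER('e1') or w2.word = UPPER('e3')))
   group by recid
),
mcte as (
   -- the original WHERE with the CONTAINS predicate C replaced by C' (membership in wcte)
   select words.recid, words.words
   from words
   where words.recid in (select recid from wcte)
)
select mcte.words
from mcte
     left join wcte on (mcte.recid = wcte.recid)
order by coalesce(wcte.w1, -1) desc, coalesce(wcte.w2, -1) desc,
         coalesce(wcte.w3, -1) desc, coalesce(wcte.w4, -1) desc,
         mcte.recid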

If the field {f} is an extent field, the re-written query is:

WITH 
{cte} AS (
SELECT <we1> AS {w1}, <we2> AS {w2}, ..., <wen>  AS {wn}, recid
FROM {T} t
    JOIN {ET} e ON (e.parent__id = t.recid)
    JOIN {wtable} AS {walias1} ON ({walias1}.parent__id = t.recid AND {walias1}.list__index  = e.list__index AND (<or group1>)) 
    JOIN {wtable} AS {walias2} ON ({walias2}.parent__id = t.recid AND {walias2}.list__index  = e.list__index AND (<or group2>)) 
    ....
    JOIN {wtable} AS {waliask} ON ({waliask}.parent__id = t.recid AND {waliask}.list__index  = e.list__index  AND (<or groupk>))
GROUP BY recid
),
{mcte} AS (
SELECT <flist>, recid
FROM <from>
WHERE <where'>
)
SELECT <flist'> FROM {mcte}
               LEFT JOIN {cte} ON ({mcte}.recid = {cte}.recid)
ORDER BY coalesce({cte}.{w1}, -1) desc, coalesce({cte}.{w2}, -1) desc, ..., coalesce({cte}.{wn}, -1) desc,
         {mcte}.recid

Here {ET} is the extent table.

For the _temp and 'dirty' databases, where the CONTAINS UDF is used, the re-written query looks the same, but with a different {cte}. For a non-extent field we use:

{cte} AS (
SELECT <we1> AS {w1}, <we2> AS {w2}, ..., <wen>  AS {wn}, recid
FROM {T} t
    WHERE CONTAINS(t.{f}, expr) 
)

Here <wei> = CONTAINS(t.{f}, ti).

For an extent field:

{cte} AS (
SELECT <we1> AS {w1}, <we2> AS {w2}, ..., <wen>  AS {wn}, recid
FROM {T} t
    JOIN {ET} e ON (e.parent__id = t.recid AND CONTAINS(e.{f}, expr))
GROUP BY recid
)

Here <wei> = SUM(CAST(CONTAINS(e.{f}, ti) AS int)).

#463 Updated by Igor Skornyakov about 3 years ago

Re-worked implicit ordering for word tables. See #1587-462.
Committed to 3821c/12157.

#464 Updated by Igor Skornyakov about 3 years ago

Implicit sorting for _temp and 'dirty' databases is implemented. Will be committed tomorrow after additional testing (functional and performance). Standard test words/words-tmp.p passed OK.

#465 Updated by Igor Skornyakov about 3 years ago

The re-worked ordering looks a little slower than before. However, it is still much faster than with the UDF.

expr records PG MIX PG UDF H2
'a0' 1 106 10408 4
'b0' 2 96 10424 4
'c0' 4 88 10322 4
'g0' 64 148 15605 7
'g0 & g1' 32 167 15742 6
'h0' 128 146 15517 6
'h0 & h1' 64 161 15434 7
'h0 & h1 & h2' 32 151 16096 10
'r0' 131072 12917 41347 7795
'r0 & r1' 65536 7392 35869 7132
'r0 & r1 & r2' 32768 4783 31987 6300
's0' 262144 25351 57671 26770
's0 & s1' 131072 15152 45597 14839
's0 & s1 & s2' 65536 9408 38245 12998
't0' 0 6 5237 2
't0 & t1' 0 5 5228 2
't0 & t1 & t2' 0 6 5330 3

All standard tests passed OK as well as a "smoke" test of the large customer app.
Committed to 3821c/12161
Please review.
Thank you.

#466 Updated by Igor Skornyakov about 3 years ago

I've just found a stupid bug in the schema_word_tables_<db>_<dialect>.sql script generation. It is not a problem for the initial database import, but the script is not idempotent (it cannot be run multiple times).
The problem is that it contains the line create index <indexName> on <tableName> (word); twice. The first of these lines should be drop index if exists <indexName>;

For Postgres 10+ there is a 'quick and dirty' workaround that avoids re-generating the DDL: just replace all 'create index ' strings with 'create index if not exists ' using any editor or sed.

Fixed in 3821c/12200.

Sorry about this.

#467 Updated by Igor Skornyakov about 3 years ago

Found a regression in the denormalized extent support.
Working on this.

#468 Updated by Eric Faulhaber about 3 years ago

What is the problem and do you have an idea about the level of effort required to fix it?

#469 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

What is the problem and do you have an idea about the level of effort required to fix it?

There was an NPE when a denormalized extent field was selected for implicit ordering.

Fixed in 3821c/12203.

#470 Updated by Igor Skornyakov about 3 years ago

Merged the word tables' test data into the fwd test database. Performed a cleanup of the tests.

Committed to xfer.goldencode.com/opt/testcases/ rev. 1032.

Started working on documentation.

BTW: maybe it makes sense to add documentation to the Wiki? If so, then to what section?
Thank you.

#471 Updated by Igor Skornyakov about 3 years ago

Word tables and related database objects

For the table {T} and field {f} with a word index on it, the word table is defined as follows (hereafter we provide SQL statements for the PostgreSQL dialect):

create table {T}__{f} (
   parent__id int8 not null,
   word text not null
);

for a non-extent {f}.
For an extent {f}:
create table {T}__{f} (
   parent__id int8 not null,
   list__index int4 not null,
   word text not null
);

The update of the word tables on inserts/updates of {T} is maintained by triggers that use the following auxiliary functions:

create or replace function words (
      recid int8, txt text, toUpperCase boolean
   )
   returns table ( parent int8, word text ) language 'plpgsql' as
   $$
   declare arr text[];
   declare w text;
   begin
       arr = words(txt, false, toUpperCase);
       foreach w in array arr loop
           parent = recid;
           word = case when touppercase then UPPER(w) else w end;
           return next;
       end loop;
   end
   $$;

create or replace function words (
      recid int8, listidx int4, txt text, toUpperCase boolean
   )
   returns table ( parent int8, idx int4, word text ) language 'plpgsql' as
   $$
   declare arr text[];
   declare w text;
   begin
       arr = words(txt, false, toUpperCase);
       foreach w in array arr loop
           parent = recid;
           idx = listidx;
           word = case when touppercase then UPPER(w) else w end;
           return next;
       end loop;
   end
   $$;


Here words(String data, boolean toUpperCase, boolean forCaseInsensitive) is a PL/Java UDF for splitting text into words.
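
For illustration (the recid value and the sample text are made up, and the exact splitting depends on the word-break rules implemented by the UDF), a call of the first auxiliary function returns one row per word:

select * from words(42::int8, 'quick brown fox', true);
-- parent | word
-- -------+-------
--     42 | QUICK
--     42 | BROWN
--     42 | FOX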

The trigger function for a table {T} and field {f} is:

create or replace function {T}__{f}__trg()
returns trigger
language plpgsql
as
$$
begin
    delete from {T}__{f} where {T}__{f}.parent__id = new.recid;
    insert into {T}__{f} select * from words(new.recid, new.{f}, true);
    return new;
end;
$$;

This function is used in the following triggers:

create trigger {T}__{f}__upd after
update of {f}
    on
    {T} for each row execute procedure {T}__{f}__trg();

create trigger {T}__{f}__ins after
insert
    on
    {T} for each row execute procedure {T}__{f}__trg();
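
As an illustration (book and title are made-up names for {T} and {f}, and the splitting assumes default word-break rules), an update of the indexed field makes the triggers above rebuild the word-table rows of the affected record:

update book set title = 'War and Peace' where recid = 7;
-- after the update trigger fires, book__title holds for this record roughly:
--   (7, 'WAR'), (7, 'AND'), (7, 'PEACE')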

If {f} is a normalized extent field, the trigger function is:

create or replace function {T}__{f}__trg()
returns trigger
language plpgsql
as
$$
begin
    delete from {T}__{f} where {T}__{f}.parent__id = new.parent__id and {T}__{f}.list__index = new.list__index;
    insert into {T}__{f} select * from words(new.parent__id, new.list__index, new.{f}, true);
    return new;
end;
$$;

The triggers are defined for the corresponding extent table {TE} = {T}__<extent size>:

create trigger {T}__{f}__upd after
update of {f}
    on
    {TE} for each row execute procedure {T}__{f}__trg();

create trigger {T}__{f}__ins after
insert
    on
    {TE} for each row execute procedure {T}__{f}__trg();

For the denormalized extent field {f} the situation is a little more complicated.

Let {i} be the position of the extent component and {fi} the corresponding field name. The trigger function for this field is:

create or replace function {T}__{fi}__trg()
returns trigger
language plpgsql
as
$$
begin
    delete from {T}__{f} where {T}__{f}.parent__id = new.recid and {T}__{f}.list__index = {i};
    insert into {T}__{f} select * from words(new.recid, {i}, new.{T}__{fi}, true);
    return new;
end;
$$;

The UPDATE trigger is defined for every {fi} field:

create trigger {T}__{fi}__upd after
update of {fi}
    on
    {T} for each row execute procedure {T}__{fi}__trg();

There is a single INSERT trigger:

create trigger {T}__{f}__ins after
insert
    on
    {T} for each row execute procedure {T}__{f}__trg();

It uses trigger function:

create or replace function {T}__{f}__trg()
returns trigger
language plpgsql
as
$$
begin
    delete from {T}__{f} where {T}__{f}.parent__id = new.recid;
    insert into {T}__{f} select * from words(new.recid, 1, new.{f1}, true);
    insert into {T}__{f} select * from words(new.recid, 2, new.{f2}, true);
    ....
    insert into {T}__{f} select * from words(new.recid, {n}, new.{fn}, true);
    return new;
end;
$$;

where {n} is the size of the extent.

We define the following indices/constraints for the word table {T}__{f}.
For non-extent {f}:

alter table {T}__{f}
   add constraint pk__{T}__{f}
   primary key (parent__id, word);

create index fkidx__{T}__{f} on {T}__{f} (parent__id);

create index idx__{T}__{f} on {T}__{f} (word);

alter table {T}__{f}
   add constraint fk__{T}__{f}
   foreign key (parent__id)
   references {T}
   on delete cascade;

For extent {f} (both normalized and denormalized):

alter table {T}__{f}
   add constraint pk__{T}__{f}
   primary key (parent__id, list__index, word);

create index fkidx__{T}__{f} on {T}__{f} (parent__id);

create index idx__{T}__{f} on {T}__{f} (word);

alter table {T}__{f}
   add constraint fk__{T}__{f}
   foreign key (parent__id)
   references {T}
   on delete cascade;

The definitions of the triggers, indices, and constraints for all word tables in the database are located in a separate SQL script, schema_word_tables_{db}_{dialect}.sql.
On database import, this script is applied after the tables are populated. The script is idempotent and can be applied multiple times.

Sample re-written queries.

Consider the following 4GL table:
ADD TABLE "ttwi"
AREA "Schema Area"
DUMP-NAME "ttwi"

ADD FIELD "if1" OF "ttwi" AS character
FORMAT "x(128)"
INITIAL ""
POSITION 2
MAX-WIDTH 256
ORDER 10

ADD FIELD "if2" OF "ttwi" AS character
FORMAT "x(8)"
INITIAL ""
POSITION 3
MAX-WIDTH 16
ORDER 20
CASE-SENSITIVE

ADD FIELD "f-ext" OF "ttwi" AS character
FORMAT "x(50)"
INITIAL ""
POSITION 4
MAX-WIDTH 510
EXTENT 5
ORDER 30

ADD INDEX "wi1" ON "ttwi"
AREA "Schema Area"
WORD
INDEX-FIELD "if1" ASCENDING

ADD INDEX "wi2" ON "ttwi"
AREA "Schema Area"
INACTIVE
WORD
INDEX-FIELD "if2" ASCENDING

ADD INDEX "wiext" ON "ttwi"
AREA "Schema Area"
INACTIVE
WORD
INDEX-FIELD "f-ext" ASCENDING

Consider also the table ttwid with the same structure, but converted using a denormalized extent field f-ext.
Below we provide sample 4GL queries and the corresponding re-written SQL statements:

for each ttwi where (if2 contains 'big & elephant | small & rabbit') or (if1 contains '(cat* | ant*)'):

with
wcte1 as (
     select sum(cast(w1.word = 'big' as int)) as w11, sum(cast(w3.word = 'elephant' as int)) as w32, sum(cast(w1.word = 'small' as int)) as w13, sum(cast(w2.word = 'rabbit' as int)) as w24, recid
    from ttwi t
        join ttwi__if2 as w1 on (w1.parent__id = t.recid and (w1.word = 'big' or w1.word = 'small'))
        join ttwi__if2 as w2 on (w2.parent__id = t.recid and (w2.word = 'big' or w2.word = 'rabbit'))
        join ttwi__if2 as w3 on (w3.parent__id = t.recid and (w3.word = 'elephant' or w3.word = 'small'))
        join ttwi__if2 as w4 on (w4.parent__id = t.recid and (w4.word = 'elephant' or w4.word = 'rabbit'))
    group by recid
)
,
wcte2 as (
     select distinct recid
    from ttwi t
        join ttwi__if1 as w1 on (w1.parent__id = t.recid and (w1.word like UPPER('cat%') or w1.word like UPPER('ant%')))
)
,
mcte3 as (
select 
    ttwi__impl0_.recid as col_0_0_, ttwi__impl0_.recid as col_recid_ 
from
    ttwi ttwi__impl0_ 
where
    (ttwi__impl0_.recid in (select recid from wcte1)) or (ttwi__impl0_.recid in (select recid from wcte2))
)
select mcte3.col_0_0_
from mcte3
    left join wcte1 on (mcte3.col_recid_ = wcte1.recid)

order by coalesce(wcte1.w11, -1)  desc,coalesce(wcte1.w32, -1)  desc,coalesce(wcte1.w13, -1)  desc,coalesce(wcte1.w24, -1)  desc,mcte3.col_recid_

for each ttwi where (f-ext contains '(word1* | word3*) & (word12c | word33c)') or (f-ext contains 'word22a & word22b | word23a & word23b'):

with
wcte1 as (
     select sum(cast(w1.word like UPPER('word1%') as int)) as w11, sum(cast(w1.word like UPPER('word3%') as int)) as w12, sum(cast(w2.word = UPPER('word12c') as int)) as w23, sum(cast(w2.word = UPPER('word33c') as int)) as w24, recid
    from ttwi t
    join ttwi__5 e on (e.parent__id = t.recid)
        join ttwi__f_ext as w1 on (w1.parent__id = e.parent__id and w1.list__index  = e.list__index and (w1.word like UPPER('word1%') or w1.word like UPPER('word3%')))
        join ttwi__f_ext as w2 on (w2.parent__id = e.parent__id and w2.list__index  = e.list__index and (w2.word = UPPER('word12c') or w2.word = UPPER('word33c')))
    group by recid
)
,
mcte3 as (
select 
    ttwi__impl0_.recid as col_0_0_, ttwi__impl0_.recid as col_recid_ 
from
    ttwi ttwi__impl0_ 
where
    (ttwi__impl0_.recid in (select recid from wcte1)) or (ttwi__impl0_.recid in (
     select distinct recid
    from ttwi t
    join ttwi__5 e on (e.parent__id = t.recid)
        join ttwi__f_ext as w1 on (w1.parent__id = e.parent__id and w1.list__index  = e.list__index and (w1.word = UPPER('word22a') or w1.word = UPPER('word23a')))
        join ttwi__f_ext as w2 on (w2.parent__id = e.parent__id and w2.list__index  = e.list__index and (w2.word = UPPER('word22a') or w2.word = UPPER('word23b')))
        join ttwi__f_ext as w3 on (w3.parent__id = e.parent__id and w3.list__index  = e.list__index and (w3.word = UPPER('word22b') or w3.word = UPPER('word23a')))
        join ttwi__f_ext as w4 on (w4.parent__id = e.parent__id and w4.list__index  = e.list__index and (w4.word = UPPER('word22b') or w4.word = UPPER('word23b')))
))
)
select mcte3.col_0_0_
from mcte3
    left join wcte1 on (mcte3.col_recid_ = wcte1.recid)
order by coalesce(wcte1.w11, -1)  desc,coalesce(wcte1.w12, -1)  desc,coalesce(wcte1.w23, -1)  desc,coalesce(wcte1.w24, -1)  desc,mcte3.col_recid_

for each ttwid where (f-ext[1] contains '(word1* | word3*) & (word12c | word33c)') or (f-ext contains 'word22a & word22b | word23a & word23b'):

with
wcte1 as (
     select sum(cast(w1.word like UPPER('word1%') as int)) as w11, sum(cast(w1.word like UPPER('word3%') as int)) as w12, sum(cast(w2.word = UPPER('word12c') as int)) as w23, sum(cast(w2.word = UPPER('word33c') as int)) as w24, recid
    from ttwid t
    join (select distinct parent__id, list__index from ttwid__f_ext) e on (e.parent__id = t.recid)
        join ttwid__f_ext as w1 on (w1.parent__id = t.recid and (w1.word like UPPER('word1%') or w1.word like UPPER('word3%')))
        join ttwid__f_ext as w2 on (w2.parent__id = t.recid and (w2.word = UPPER('word12c') or w2.word = UPPER('word33c')))
    group by recid
)
,
mcte3 as (
select 
    ttwid__imp0_.recid as col_0_0_, ttwid__imp0_.recid as col_recid_ 
from
    ttwid ttwid__imp0_ 
where
    (ttwid__imp0_.recid in (select recid from wcte1)) or (ttwid__imp0_.recid in (
     select distinct recid
    from ttwid t
    join (select distinct parent__id, list__index from ttwid__f_ext) e on (e.parent__id = t.recid)
        join ttwid__f_ext as w1 on (w1.parent__id = t.recid and (w1.word = UPPER('word22a') or w1.word = UPPER('word23a')))
        join ttwid__f_ext as w2 on (w2.parent__id = t.recid and (w2.word = UPPER('word22a') or w2.word = UPPER('word23b')))
        join ttwid__f_ext as w3 on (w3.parent__id = t.recid and (w3.word = UPPER('word22b') or w3.word = UPPER('word23a')))
        join ttwid__f_ext as w4 on (w4.parent__id = t.recid and (w4.word = UPPER('word22b') or w4.word = UPPER('word23b')))
))
)
select mcte3.col_0_0_
from mcte3
    left join wcte1 on (mcte3.col_recid_ = wcte1.recid)
order by coalesce(wcte1.w11, -1)  desc,coalesce(wcte1.w12, -1)  desc,coalesce(wcte1.w23, -1)  desc,coalesce(wcte1.w24, -1)  desc,mcte3.col_recid_

#472 Updated by Eric Faulhaber about 3 years ago

Igor Skornyakov wrote:

BTW: maybe it makes sense to add documentation to the Wiki? If so, then to what section?

For the "how it works" documentation you just posted, you could add a page to the Internals wiki page in the FWD documentation, which eventually will be organized into a book.

For the performance testing setup and use documentation, please post it in this issue for now.

#473 Updated by Igor Skornyakov about 3 years ago

Eric Faulhaber wrote:

For the "how it works" documentation you just posted, you could add a page to the Internals wiki page in the FWD documentation, which eventually will be organized into a book.

For the performance testing setup and use documentation, please post it in this issue for now.

OK. I will do it on the weekend.
Thank you.

#474 Updated by Igor Skornyakov about 3 years ago

Added documentation for the new CONTAINS support to the 'Internals' section of the Wiki. See Word Index and CONTAINS Operator.

#475 Updated by Igor Skornyakov about 3 years ago

For the performance testing (words/words-perf.p) I use the following data:

The test table definition is:

ADD TABLE "words" 
  AREA "Schema Area" 
  DUMP-NAME "words" 

ADD FIELD "recno" OF "words" AS integer 
  FORMAT "->,>>>,>>9" 
  INITIAL "0" 
  POSITION 2
  MAX-WIDTH 4
  ORDER 10
  MANDATORY

ADD FIELD "words" OF "words" AS character 
  FORMAT "x(1000)" 
  INITIAL "" 
  POSITION 3
  MAX-WIDTH 200
  ORDER 20
  MANDATORY

ADD INDEX "pk" ON "words" 
  AREA "Schema Area" 
  PRIMARY
  INDEX-FIELD "recno" ASCENDING 

ADD INDEX "words" ON "words" 
  AREA "Schema Area" 
  WORD
  INDEX-FIELD "words" ASCENDING 

Consider a number N > 0. The values of the pk field are numbers in 0..2^N - 1 (note that 2^N - 1 = 2^0 + 2^1 + 2^2 + ... + 2^(N-1)). The value of the words field is built as follows: for every m = 0..(N-1) we generate 'words' records representing all non-empty subsets of 0..2^m - 1, using words with the one-letter prefix " abcdefghijklmnopqrstuvwxyz".charAt(m).

The words for L = 1..2^m - 1 are generated like this:
Let L = 2^k1 + 2^k2 + ... + 2^kl be the binary representation of L. Then the value of words is the space-separated list of the prefixed "words" <prefix> + String.valueOf(k[i]), i = 1..l.
This means that the word index is created for a field containing all possible subsets of a given set of "words".
This allows creating queries with different (and easily predictable) result sets. The structure of the data is a good model of any real data if we ignore the performance implications of the long words and the duplicated set of words that can be found in 'real' data. In some well-defined sense, it represents a "maximal" version of a collection of records with a given set of words.

With N = 20 we get a table with more than a million records, which is large enough to reason about the performance of the approach.

The data is generated by a Java program. The corresponding Eclipse project is in the words/java/data-gen/tests subfolder. It accepts two arguments: N and <output file name>. The first argument is mandatory; if the second argument is not provided, stdout is used for output.

The data used for the performance testing was generated with N = 20. The total number of records in this case is 1048555.
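
The following is only a sketch (PostgreSQL) of an equivalent generation, under the reading that for every prefix position m one row is produced for each non-empty subset encoded by L = 1..2^m - 1; this matches both the total of 1048555 rows for N = 20 and the per-expression record counts in #1587-465. The actual Java generator may assign recno values differently.

select row_number() over (order by m, l) - 1 as recno,
       (select string_agg(substr(' abcdefghijklmnopqrstuvwxyz', m + 1, 1) || k, ' ' order by k)
          from generate_series(0, m - 1) k
         where (l >> k) & 1 = 1) as words
  from generate_series(0, 19) m                                -- prefix positions for N = 20
       cross join lateral generate_series(1, (1 << m) - 1) l;  -- non-empty subsets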

#476 Updated by Eric Faulhaber about 3 years ago

  • Status changed from WIP to Test
  • % Done changed from 0 to 100

#477 Updated by Greg Shah about 3 years ago

  • Status changed from Test to Closed

#478 Updated by Eric Faulhaber over 2 years ago

Igor, have you tested the performance of the database-native words UDF vs. that of the PL/Java implementation of the same function? This was an unplanned (at least by me) dependency on the UDF implementation. I'd like to make the native words UDF the default for word index support, but first I want to have a better understanding of the relative update performance of word indexed fields with this newer implementation.

#479 Updated by Igor Skornyakov over 2 years ago

Eric Faulhaber wrote:

Igor, have you tested the performance of the database-native words UDF vs. that of the PL/Java implementation of the same function? This was an unplanned (at least by me) dependency on the UDF implementation. I'd like to make the native words UDF the default for word index support, but first I want to have a better understanding of the relative update performance of word indexed fields with this newer implementation.

Eric,
Yes, I did. See e.g. #1587-465

#480 Updated by Igor Skornyakov over 2 years ago

Igor Skornyakov wrote:

Eric Faulhaber wrote:

Igor, have you tested the performance of the database-native words UDF vs. that of the PL/Java implementation of the same function? This was an unplanned (at least by me) dependency on the UDF implementation. I'd like to make the native words UDF the default for word index support, but first I want to have a better understanding of the relative update performance of word indexed fields with this newer implementation.

Eric,
Yes, I did. See e.g. #1587-465

Sorry. If you mean just the words UDFs, not the CONTAINS support, then they were not included in the performance tests.
However, the native implementation is a one-liner and I do not expect a big difference in performance. Please note also that these UDFs are used only in triggers (on insert/update of the corresponding field), and will rarely be applied to many records at a time.

I believe that updating the word table index on such operations will be much more expensive than the words UDF call.
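
Purely as an illustration of what a simple database-native splitter could look like (the name words_simple and the whitespace-based splitting below are made up; the real native words UDF must follow the 4GL word-break rules):

create or replace function words_simple(txt text, toUpperCase boolean)
returns text[] language sql immutable as
$$
    select array_agg(case when toUpperCase then UPPER(w) else w end)
      from regexp_split_to_table(txt, '\s+') as w
     where w <> '';
$$;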

#481 Updated by Ovidiu Maxiniuc almost 2 years ago

I encountered the following situation in a customer's data dump:

The .df file looks like this:

    ADD TABLE "abc-table" 
    [...]
    ADD FIELD "keyindex" OF "abc-table" AS character
    [...]
    ADD INDEX "k-keyindex" ON "abc-table" 
      AREA "wordindexarea_idx" 
      WORD
      INDEX-FIELD "keyindex" ASCENDING

On one line, abc-table.d has the following content:

"94166c9c-dbf7-c7b9-1314-d3f31c187f6d" "abc_16 abc_30 30 abcd_0 ababa_5110 5110 abcab_0 abde_PO_Add_on_Abcf PO Add on Abc ababab_PO_Add_on_abc PO Add on AbaAb Add 16_15_2029 17_47_10_881_00_00"

Notice that there are 3 occurrences of the word Add.

The problem is that the import fails with the following stack trace:

     [java] abc-table.d: Failed to populate word table abc_table__keyindex: pk = 20507.
     [java] org.postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint "pk__abc_table__keyindex" 
     [java]   Detail: Key (parent__id, word)=(20507, ADD) already exists.
     [java]     at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2440)
     [java]     at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2183)
     [java]     at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:308)
     [java]     at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:441)
     [java]     at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:365)
     [java]     at org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:143)
     [java]     at org.postgresql.jdbc.PgPreparedStatement.executeUpdate(PgPreparedStatement.java:120)
     [java]     at com.mchange.v2.c3p0.impl.NewProxyPreparedStatement.executeUpdate(NewProxyPreparedStatement.java:384)
     [java]     at com.goldencode.p2j.schema.ImportWorker$Library.populateWordTable(ImportWorker.java:1708)
     [java]     at com.goldencode.p2j.schema.ImportWorker$Library.populateWordTables(ImportWorker.java:1643)
     [java]     at com.goldencode.p2j.schema.ImportWorker$Library.importTable(ImportWorker.java:1444)
     [java]     at com.goldencode.p2j.schema.ImportWorker$Library.lambda$importAsync$3(ImportWorker.java:1937)
     [java]     at java.lang.Thread.run(Thread.java:748)

My questions are:
  • what should FWD import do in this case?
  • does it work correctly at runtime, when a similar value is saved to database?
