Bug #6837

prevent unnecessary upper/rtrim injection in queries if the data type doesn't require it

Added by Eric Faulhaber over 1 year ago. Updated about 1 year ago.

Status:
WIP
Priority:
Normal
Assignee:
-
Target version:
-
Start date:
Due date:
% Done:

0%

billable:
No
vendor_id:
GCD
case_num:

Related issues

Related to FWD - Feature #2949: Alternative approach to handling case insensitivity New
Related to Database - Bug #7108: Simulate upper/rtrim directly in H2 using case specific columns Closed

History

#1 Updated by Eric Faulhaber over 1 year ago

In some cases, we are using upper and/or rtrim in WHERE and ORDER BY clauses to adjust textual columns that don't need it, due to the nature of the data type. Currently, this occurs in "lenient" mode in the MariaDB dialect for varchar columns, but it would also occur if we switched from text to citext in PostgreSQL, which is something I'd like to do.

The insertion of upper occurs at least during conversion, possibly at runtime. The insertion of rtrim occurs at runtime in the HQLPreprocessor.

Using these functions improperly can result in bad query plans, since indices will not match the column references in the WHERE or ORDER BY clauses.
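For illustration only (the table, column, and index names here are hypothetical), the difference in the generated SQL looks roughly like this; the wrapped form can only be index-assisted by a matching expression index:

-- injected form: a plain index on customer.last_name does not match the expression
SELECT cust_id FROM customer
 WHERE upper(rtrim(last_name)) = 'SMITH'
 ORDER BY upper(rtrim(last_name));

-- desired form when the column type already handles case and trailing spaces
SELECT cust_id FROM customer
 WHERE last_name = 'Smith'
 ORDER BY last_name;

-- the wrapped form would need something like this to use an index at all
-- (PostgreSQL syntax; MariaDB would need a generated column instead)
CREATE INDEX idx_customer_last_name_ci ON customer (upper(rtrim(last_name)));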

I think the appropriate place to determine whether these functions should be injected is during FQL -> SQL conversion, with consultation to the active Dialect subclass.

#2 Updated by Ovidiu Maxiniuc over 1 year ago

Eric Faulhaber wrote:

The insertion of upper occurs at least during conversion, possibly at runtime. The insertion of rtrim occurs at runtime in the HQLPreprocessor.

In some cases, the conversion may be aware of additional information, like when comparing a char string with a case-sensitive variable. If we do not do this at conversion time, when the information on both operands is available, at runtime this might be impossible (because the variable -- in this case -- will be stored in a list of substitution elements, possibly lazily evaluated).
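A hypothetical illustration of the point (names and the exact 4GL comparison rules aside), showing why the decision has to be made while both operands are still known:

-- case-insensitive comparison: conversion wraps both sides
SELECT cust_id FROM customer
 WHERE upper(rtrim(last_name)) = upper(rtrim('Smith'));

-- comparison involving a case-sensitive operand: upper() must not be injected,
-- and only the conversion knows that; at runtime the right-hand side may just be
-- an unevaluated substitution parameter
SELECT cust_id FROM customer
 WHERE rtrim(last_name) = rtrim('Smith');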

I think the appropriate place to determine whether these functions should be injected is during FQL -> SQL conversion, with consultation to the active Dialect subclass.

There is more. We already have cases where the same field type (character in ABL) is converted to different SQL types. For example, a field is usually converted to unrestricted text in PSQL, but in MariaDB it may be varchar(n), or text if the conversion decides varchar would be too big. This information is NOT available in the DMO annotations; they only contain metadata common to all dialects.

#4 Updated by Ovidiu Maxiniuc over 1 year ago

The dropping of the upper() / rtrim() functions injected during conversion must be done anyway. Initially, it will be done for all character fields when the new MariaDbLenientDialect is detected, which will also propagate to MariaDbDialect. Later, after adding some kind of registry/collection in which we store the SQL column types for each 4GL field, the operation will be more accurate, affecting only the fields which require it. Even if this would be theoretically possible, we will not store this kind of information in the DMO annotations (for at least 2 reasons: it is dialect dependent; the DDL is generated at a later/unrelated time during conversion).

Evidently, this will work incorrectly/slowly for text fields, but their count is far smaller than that of varchar() columns, and this is only a transition period.

For the "right way" implementation, we eventually need to talk how to store the missing type information: DMO annotation, some XML, or a local h2 database. For a similar issue in H2, we come up with __i and __s prefixes. But this is not really applicable here.

#5 Updated by Eric Faulhaber over 1 year ago

I agree that we don't want dialect-specific metadata in DMO annotations. OTOH, I think it makes sense to store that metadata in DmoMeta instances at runtime, so it is available when needed to process a query.

The source of that information could be JDBC metadata, collected at server startup. If that data collection is slow, we could keep server startup relatively fast (important for development/testing, if not production) by launching a separate thread per database to do this work.

In the event the metadata is not yet available the first time it is needed to process a query, it could be collected from the query processing code as well, lazily.

#6 Updated by Ovidiu Maxiniuc over 1 year ago

  • Status changed from New to WIP

Eric Faulhaber wrote:

I agree that we don't want dialect-specific metadata in DMO annotations. OTOH, I think it makes sense to store that metadata in DmoMeta instances at runtime, so it is available when needed to process a query.

DmoMeta could be a solution, even if it was also designed to be dialect independent. This class is used not only with the primary database but also with the dirty one. However, each dialect will be interested in specific properties and ultimately, they should live together.

The source of that information could be JDBC metadata, collected at server startup. If that data collection is slow, we could keep server startup relatively fast (important for development/testing, if not production) by launching a separate thread per database to do this work.
In the event the metadata is not yet available the first time it is needed to process a query, it could be collected from the query processing code as well, lazily.

If we go this way, the request for metadata from the SQL server could be dialect dependent (I am thinking of MariaDB's describe <table> / show [full] columns from <table>) if they prove to be faster than JDBC's DatabaseMetaData APIs. More than that, this background thread will be started only for specific databases. And we need to implement proper synchronization.
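For example (the table name is hypothetical), the MariaDB-specific statement already exposes what we need; the Type and Collation columns show whether a column is a varchar with a case-insensitive collation or an unrestricted text type:

SHOW FULL COLUMNS FROM customer;

-- roughly the same data through information_schema, if a plain query is preferred
SELECT column_name, data_type, character_maximum_length, collation_name
  FROM information_schema.columns
 WHERE table_schema = DATABASE()
   AND table_name = 'customer';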

#7 Updated by Ovidiu Maxiniuc over 1 year ago

I committed an intermediary solution. The occurrences of properties are no longer wrapped in upper() and/or rtrim() if the dialect declares that it supports string data types similar to 4GL. Also, no wrapping occurs when the sorting criteria are composed in the order-by clause.

This is an intermediary commit because the final version must be aware of the actual types of the columns, as not all string types in MariaDB behave the same.

Committed revision 14317.

#8 Updated by Greg Shah over 1 year ago

  • Related to Feature #2949: Alternative approach to handling case insensitivity added

#9 Updated by Greg Shah over 1 year ago

Please ensure that the solution is built such that all dialects (not just MariaDB) can leverage this same approach. We have other requests for this (e.g. see #2949 for the request from a customer that uses both PostgreSQL and SQLServer). We are potentially considering these changes very soon (see #6898).

#10 Updated by Eric Faulhaber over 1 year ago

What is left to do on the MariaDB side for:

  • lenient mode?
  • strict mode?

#11 Updated by Ovidiu Maxiniuc about 1 year ago

  • Subject changed from prevent unnecessary upper/rtim injection in queries if the data type doesn't require it to prevent unnecessary upper/rtrim injection in queries if the data type doesn't require it

The current implementation injects the upper/rtrim into the FQLs of the queries at conversion time because it assumes that one of the dialects the application will be run on will not have automatic handling of CI/trimming like 4GL.

Later, at runtime, based on the value returned by isAutoRtrimAndCi of the current dialect, the upper/rtrim functions are dropped while the FQL is preprocessed. From this POV, neither MariaDB dialect needs the functions to be injected, because they both use the case-insensitive varchar(N) which ignores right-padding spaces.

If a similar data type is used for the same reason in other dialects, it is enough to override/implement isAutoRtrimAndCi and return true. Of course, the DDL generator should also be modified so that the indices are composed of the pure column, not an expression or a computed column.
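A minimal sketch of what that means on the MariaDB side (table and index names are hypothetical; it assumes a case-insensitive, space-padding collation such as utf8mb4_general_ci):

CREATE TABLE customer (
   cust_id   INT PRIMARY KEY,
   last_name VARCHAR(100) COLLATE utf8mb4_general_ci
);

-- index over the pure column, not an expression or computed column
CREATE INDEX idx_customer_last_name ON customer (last_name);

INSERT INTO customer VALUES (1, 'Smith');

-- both queries match the row: the collation ignores case and right-padding
-- spaces, so no upper()/rtrim() wrapping is needed and the plain index is usable
SELECT cust_id FROM customer WHERE last_name = 'SMITH';
SELECT cust_id FROM customer WHERE last_name = 'smith ';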

#12 Updated by Eric Faulhaber about 1 year ago

It sounds like this task can be closed, then, correct?

#13 Updated by Greg Shah about 1 year ago

It seems like note 11 is suggesting some changes to encode the logic in the dialects and fix the DDL generation.

#14 Updated by Ovidiu Maxiniuc about 1 year ago

The changes are necessary only if/when we decide to switch from case-sensitive character data types to case-insensitive. The changes are localized within the respective dialect implementation (override isAutoRtrimAndCi and generate proper column types in tables). There is nothing to do ATM.

Note:
I wanted to manually test the behaviour in PSQL and I tried creating a table having a citext column. All I get is:

ERROR:  type "citext" does not exist
I understand this is a new data type (PSQL 11 and newer). I am using psql (PostgreSQL) 14.2 (Ubuntu 14.2-1.pgdg20.04+1). Do I need to enable it somehow?

#15 Updated by Eric Faulhaber about 1 year ago

I'm pretty sure citext existed as an extension before PostgreSQL 11. I thought it was one of the core data types after that, though. Did you try something like this?

CREATE EXTENSION IF NOT EXISTS citext WITH SCHEMA public;

If this is necessary in the most recent versions, I wonder what that means for cloud implementations like Aurora, in terms of what they do and do not allow.
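One way to check what a given server offers (installed_version stays NULL until CREATE EXTENSION has been run in the database):

SELECT name, default_version, installed_version
  FROM pg_available_extensions
 WHERE name = 'citext';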

#16 Updated by Ovidiu Maxiniuc about 1 year ago

Thank you. I tried it now and it worked. I was able to create a table and execute a quick select statement over a couple of inserted records. So it seems that we will not have to inject the upper() function.

However, unlike MariaDB's strings, the right-padding spaces are taken into account, so the rtrim() still needs to be injected. From my tests: 'XYz' = 'xYz' but 'ABc' <> 'aBc ' if they are stored in citext fields.

And the not-so-good news does not stop here. Wrapping the case-insensitive citext in rtrim() will bring us back to square 1. That is because the result of rtrim(citext) is text. With the above example, rtrim('ABc') = 'ABc' and rtrim('aBc ') = 'aBc'. But 'ABc' <> 'aBc' since they are now text, not citext :(.
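A short psql session reproducing this, using literals so it can be run as-is:

SELECT 'XYz'::citext = 'xYz'::citext;                 -- true: case is ignored
SELECT 'ABc'::citext = 'aBc '::citext;                -- false: the trailing space counts
SELECT pg_typeof(rtrim('aBc '::citext));              -- text, not citext
SELECT rtrim('ABc'::citext) = rtrim('aBc '::citext);  -- false: the comparison is case-sensitive again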

Unless this can be configured somewhere else. Or is there another data type for this?

#17 Updated by Eric Faulhaber about 1 year ago

Ovidiu Maxiniuc wrote:

Unless this can be configured somewhere else. Or is there another data type for this?

I don't think so. It sounds like citext may not meet our needs, then.

Another option we could look into is combining the use of citext with custom operators, which would perform comparisons of their operands after applying a right-trim operation (presumably "manually", since, as you note, the rtrim builtin function will return a text type in cases where we want citext).

PostgreSQL allows custom operators to be defined, but...

  • I don't know the details of the implementation; I have not tried this before;
  • it is unclear whether this is an allowable customization for cloud implementations, like Aurora.
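To make the idea concrete, here is a rough, untested sketch of what such an operator could look like (the operator name ==~ and the helper function are made up for illustration; this targets a recent PostgreSQL and says nothing about whether managed services allow it):

-- helper: right-trim both operands, then compare them as citext again
CREATE FUNCTION citext_rtrim_eq(a citext, b citext) RETURNS boolean AS $$
    SELECT rtrim(a)::citext = rtrim(b)::citext;
$$ LANGUAGE SQL IMMUTABLE;

CREATE OPERATOR ==~ (
    LEFTARG    = citext,
    RIGHTARG   = citext,
    FUNCTION   = citext_rtrim_eq,
    COMMUTATOR = ==~
);

SELECT 'ABc'::citext ==~ 'aBc '::citext;   -- true

Even if this works functionally, the operator would not be indexable by itself; an operator class or an expression index would still be needed for the query plans to benefit.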

#18 Updated by Alexandru Lungu about 1 year ago

  • Related to Bug #7108: Simulate upper/rtrim directly in H2 using case specific columns added
