Bug #8593

H2 performance improvement for table 'truncate by multiplex'

Added by Constantin Asofiei 3 months ago. Updated about 7 hours ago.

Status: Internal Test
Priority: Normal
Target version: -
Start date:
Due date:
% Done: 100%
billable: No
vendor_id: GCD
case_num:
version_reported:
version_resolved:

Related issues

Related to Database - Bug #8703: NO-UNDO option at DMO must be removed New

History

#2 Updated by Constantin Asofiei 3 months ago

  • Status changed from New to Review
  • Assignee set to Constantin Asofiei
  • % Done changed from 0 to 100
Created task branch 8593a_h2 which improves the DELETE FROM tt1 WHERE tt1._multiplex = ?. The reasoning is this:
  • normal DELETE operations in H2 will first do a SELECT to gather all the rows. This includes DELETE FROM tt1 WHERE tt1._multiplex = ?
  • with normal DELETE, each row is deleted individually from the index - this means that all the index columns are used to find the target node to be removed
  • the implementation first checks that the DELETE is for a _MULTIPLEX = ? condition (see the sketch after this list), and if so, then:
    • the MultiplexedScanIndex is cleaned first, and gathers all PKs for the records being deleted with this multiplex ID
    • all indexes which have as first index column the _MULTIPLEX column will remove all these records, by matching only the multiplex ID
    • all remaining indexes assume the PK is the first column (I need confirmation that this will always be true), and just find that record in the index and remove it.
    • this avoids using all the index columns to find the record
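
To make the intended difference concrete, here is a minimal sketch (the class and key layout are made up for illustration; this is not the actual FWD-H2 MultiplexedScanIndex code). It contrasts the normal per-row removal, which needs the full index key, with pruning a whole multiplex as a single range operation on an index whose leading column is _MULTIPLEX:

import java.util.TreeMap;

// Hypothetical sketch only - not the real FWD-H2 index implementation.
public class MultiplexPruneSketch
{
   /** Composite index key: _MULTIPLEX is the leading column, recid breaks ties. */
   record Key(int multiplex, int f1, long recid) implements Comparable<Key>
   {
      public int compareTo(Key o)
      {
         int c = Integer.compare(multiplex, o.multiplex);
         if (c == 0) c = Integer.compare(f1, o.f1);
         if (c == 0) c = Long.compare(recid, o.recid);
         return c;
      }
   }

   private final TreeMap<Key, Object> index = new TreeMap<>();

   /** Normal DELETE path: each row removal must supply all the index columns. */
   public void removeRow(Key fullKey)
   {
      index.remove(fullKey);
   }

   /** Optimized path: drop everything for one multiplex ID in a single range prune. */
   public void removeMultiplex(int multiplex)
   {
      index.subMap(new Key(multiplex, Integer.MIN_VALUE, Long.MIN_VALUE), true,
                   new Key(multiplex, Integer.MAX_VALUE, Long.MAX_VALUE), true)
           .clear();
   }
}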

Alexandru: please take a look at 8593a_h2.

#3 Updated by Alexandru Lungu 3 months ago

Created task branch 8593a_h2 which improves the DELETE FROM tt1 WHERE tt1._multiplex = ?. The reasoning is this:
  • all remaining indexes assume the PK is the first column (I need confirmation that this will always be true), and just find that record in the index and remove it.

If you meant PK as recid, this is always the last column in an index, but the first column in the row. If you meant PK as H2 internal row key, this is kept outside the data array and I think it is not ever indexed.
This being said, you can't remove a row only based on the recid/PK from a tree index. You can however prune the tree index based on multiplex (as it is the first field in the index).
Please let me know if I misunderstood your intent. Without a full-scan of the index, you can't find a row solely on its PK.

One last important thing is that you still need to log the deletes, so you will most certainly need to "register" the row as being deleted in case of a roll-back (unless you do this only for NO-UNDO tables).

I didn't look into the changes yet.

#4 Updated by Constantin Asofiei 3 months ago

You mean the Row.key is not the same as the internal PK?

Also, there is 'recid' index (with just this column) in H2 created for FWD temp-tables.

#5 Updated by Constantin Asofiei 3 months ago

Constantin Asofiei wrote:

You mean the Row.key is not the same as the internal PK?

I mean Row.key to be the same as the 'recid' PK we set on position zero in the row's data array.

#6 Updated by Constantin Asofiei 3 months ago

Alexandru, the assumption was that the MultiplexedBatchIndex$Batch holds data only for a certain multiplex. This is not correct (maybe because of re-used keys?).

Currently a batch can have rows for more than one multiplex ID.

#7 Updated by Alexandru Lungu 3 months ago

Constantin Asofiei wrote:

Alexandru, the assumption was that the MultiplexedBatchIndex$Batch holds data only for a certain multiplex. This is not correct (maybe because of re-used keys?).

Currently a batch can have rows for more than one multiplex ID.

This is absolutely no good. The invariant shouldn't be (ever) broken. Do you have a scenario that triggers such an anomaly?

#8 Updated by Constantin Asofiei 3 months ago

The recreate is this:
  • mpid1.p
    def temp-table tt1 field f1 as int index ix1 f1.
    def var k as int.
    
    procedure proc0.
       def var i as int.
       run mpid2.p(input-output table tt1).
       buffer tt1:empty-temp-table().
       do i = 1 to 10:
          create tt1.
          tt1.f1 = i.
          release tt1.
       end.
    end.
    
    do transaction:
       run proc0.
    end.
    
    
  • mpid2.p
    def temp-table tt1 field f1 as int index ix1 f1.
    
    def var i as int.
    def input-output param table for tt1.
    
    do i = 1 to 10000:
       create tt1.
       tt1.f1 = i.
       release tt1.
    end.
    
    for each tt1: delete tt1. end.
    
    

For an undoable temp-table, an EMPTY-TEMP-TABLE results in an UPDATE tt1 SET _multiplex = <delete-multiplex> WHERE _multiplex = <old-multiplex>. The assumption with the batches was that the multiplex ID is read-only - this is not true for undoable temp-tables, where a delete is emulated via changing the multiplex (and thus hiding the record from the legacy temp-table).
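
For clarity, a rough JDBC-style sketch of this emulated delete (illustrative only; the table/column names follow the statement above, the method names are invented, and none of this is the actual TemporaryBuffer code). The undo side simply swaps the multiplex IDs back:

import java.sql.*;

// Hypothetical sketch of the multiplex-swap emulation for undoable temp-tables.
public final class EmulatedDeleteSketch
{
   /** Subtransaction-level "delete": hide the rows by moving them to a delete multiplex. */
   static void emptyTempTable(Connection c, int oldMultiplex, int deleteMultiplex) throws SQLException
   {
      swapMultiplex(c, oldMultiplex, deleteMultiplex);
   }

   /** UNDO of that subtransaction: restore the rows by swapping the multiplex back. */
   static void undoEmpty(Connection c, int oldMultiplex, int deleteMultiplex) throws SQLException
   {
      swapMultiplex(c, deleteMultiplex, oldMultiplex);
   }

   private static void swapMultiplex(Connection c, int from, int to) throws SQLException
   {
      try (PreparedStatement ps = c.prepareStatement(
              "UPDATE tt1 SET _multiplex = ? WHERE _multiplex = ?"))
      {
         ps.setInt(1, to);
         ps.setInt(2, from);
         ps.executeUpdate();
      }
   }
}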

#9 Updated by Alexandru Lungu 3 months ago

Constantin Asofiei wrote:

For an undoable temp-table, an EMPTY-TEMP-TABLE results in an UPDATE tt1 SET _multiplex = <delete-multiplex> WHERE _multiplex = <old-multiplex>. The assumption with the batches was that the multiplex ID is read-only - this is not true for undoable temp-tables, where a delete is emulated via changing the multiplex (and thus hiding the record from the legacy temp-table).

You mean the INPUT-OUTPUT TABLE functionality, right? I see your point, but I thought this wasn't implemented yet (it seems it was implemented in the meantime).

Anyway, the UPDATE in H2 is a DELETE + INSERT. All records will be first removed (doing something like a DELETE ... WHERE multiplex = <old-multiplex>) and then added (doing an insert for each row). The only issue I see is the recid. It is still the old recid from the source table. I am fairly sure this is wrong. 4GL is reindexing the records, so we should do the same (reserve the id before insert).

1. How urgent is this? Anything breaking right now?
2. We can have a direct-access API for this (changeMultiplex). It will basically do the UPDATE + reindexing in the updating.
3. Otherwise, we can have a special SQL keyword (REINDEX) for the UPDATE, that is reindexing the records after update.

#10 Updated by Constantin Asofiei 3 months ago

No, the empty-temp-table is the problem, while executing in a transaction, for an undoable temp-table - this does not remove the records, but changes the multiplex ID.

For MultiplexedScanIndex, the target batch in setRow(key, row) is calculated from the record's key (which is not changing!) - so a remove/add will end up in the same batch.

Urgency: the performance for #8363 is urgent, the bug appears only when introducing 8593a_h2 in ETF.

#11 Updated by Alexandru Lungu 3 months ago

For MultiplexedScanIndex, the target batch in setRow(key, row) is calculated from the record's key (which is not changing!) - so a remove/add will end up in the same batch.

Hmm, I see. Drop my statement then - the INPUT-OUTPUT tables are not yet implemented with UPDATE.

Still, why are we updating with a <delete-multiplex> instead of actually deleting them? If they are undoable, we can rollback (with a savepoint or not). Why do we need to have such functionality of marking them as deleted?

#12 Updated by Eric Faulhaber 3 months ago

Alexandru Lungu wrote:

Still, why are we updating with a <delete-multiplex> instead of actually deleting them? If they are undoable, we can rollback (with a savepoint or not). Why do we need to have such functionality of marking them as deleted?

This technique of updating the multiplex ID to emulate an undoable delete predates (by almost two decades) our use of savepoints and any editing of H2 source code.

#13 Updated by Greg Shah 3 months ago

Seems like a long-standing leak of records. If that built up to a large number, it might be both a memory leak and a drag on performance.

#14 Updated by Eric Faulhaber 3 months ago

Greg Shah wrote:

Seems like a long-standing leak of records. If that built up to a large number, it might be both a memory leak and a drag on performance.

No, this "emulated delete" only happens at the subtransaction level, so that an UNDO is able to restore the records by switching the multiplex ID back. They are actually deleted at transaction commit, or restored at transaction rollback. At least, that was the original implementation; I can't speak to this new issue that Constantin detected.

#15 Updated by Alexandru Lungu 3 months ago

Eric, even for a sub-transaction. We can simply delete and rollback to savepoint, right?

#16 Updated by Eric Faulhaber 3 months ago

Alexandru Lungu wrote:

Eric, even for a sub-transaction. We can simply delete and rollback to savepoint, right?

Yes, we should be able to now. We just weren't using savepoints back then, and this technique has survived all these years. If you make this change, we should test thoroughly, since this has been a core design point from the beginning. I don't recall if/where we have dependent assumptions on this approach in the runtime.

#17 Updated by Alexandru Lungu 3 months ago

Just a thing that was on my mind and wanted to have it here as well:

For NO-UNDO cases, we used to rollback (transaction or savepoint) and redo the state of the NO-UNDO tables (i.e. rolling back the rollback). This is no longer the case as FWD-H2 has the NO-UNDO keyword in place, so rollbacks are only for UNDO tables.

Yes, we should be able to now. We just weren't using savepoints back then, and this technique has survived all these years. If you make this change, we should test thoroughly, since this has been a core design point from the beginning. I don't recall if/where we have dependent assumptions on this approach in the runtime.

Ok

#18 Updated by Greg Shah 2 months ago

  • % Done changed from 100 to 50
  • Status changed from Review to WIP

#19 Updated by Constantin Asofiei 2 months ago

Previously we used the UPDATE ... SET _MULTIPLEX = ? paired with a ReversibleBulkDelete, which was meant to restore the multiplex ID (via a reverse UPDATE), in case of rollback of an undoable temp-table.

ReversibleBulkDelete was removed I think when we introduced no-undo temp-tables directly in FWD-H2 (or the savepoints?).

I do not think we actually need the UPDATE ... SET _MULTIPLEX = ? in TemporaryBuffer.removeRecords anymore; the forceDelete flag should always be true. Currently it is set to:

forceDelete = forceDelete || !isUndoable() || !bufferManager.isTransaction()

where:
  • !isUndoable() will be true for NO-UNDO temp-tables
  • !bufferManager.isTransaction() means we are not in a transaction
So, for a bulk delete, this means that forceDelete also needs to be true in the remaining case (restated in code below), i.e. when both of these hold:
  • we are inside a transaction
  • the temp-table is undoable
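
Restated as a minimal sketch (the parameters are hypothetical stand-ins for the calls quoted above; this is not the actual TemporaryBuffer source):

// Sketch only: restates the forceDelete condition quoted above.
final class ForceDeleteSketch
{
   /** Current rule: forceDelete stays false only for an undoable temp-table inside a transaction. */
   static boolean currentForceDelete(boolean requested, boolean undoable, boolean inTransaction)
   {
      return requested || !undoable || !inTransaction;
   }

   /**
    * Proposal: since savepoints now handle the undo of undoable temp-tables, the remaining
    * case no longer needs the multiplex-swap emulation, so the flag can simply always be true.
    */
   static boolean proposedForceDelete()
   {
      return true;
   }
}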

I'm testing this approach now. Please let me know if you see anything wrong.

#20 Updated by Constantin Asofiei 2 months ago

  • % Done changed from 50 to 100
  • Status changed from WIP to Review
Created task branch 8593a from trunk rev 15157, with:
  • For a NEW temp record which was not flushed and is being deleted or rolled back in the transaction, the allocated primary key row (an empty template used to reserve this primary key) needs to be explicitly removed.
  • Savepoints are used to roll back undoable temp-tables, and no-undo support is built into H2 itself; so TemporaryBuffer.removeRecords will always perform an actual DELETE (see the sketch below).
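
A minimal JDBC-style sketch of the savepoint approach this relies on (illustrative only; the flow and SQL text are assumptions, not the FWD transaction machinery):

import java.sql.*;

// Hypothetical sketch: undoing a real DELETE with a savepoint instead of the multiplex-swap emulation.
public final class SavepointDeleteSketch
{
   static void deleteWithUndo(Connection c, int multiplex, boolean undoRequested) throws SQLException
   {
      c.setAutoCommit(false);
      Savepoint sp = c.setSavepoint();                   // marks the sub-transaction boundary

      try (PreparedStatement ps =
              c.prepareStatement("DELETE FROM tt1 WHERE tt1._multiplex = ?"))
      {
         ps.setInt(1, multiplex);
         ps.executeUpdate();                             // an actual delete, no hidden rows left behind
      }

      if (undoRequested)
      {
         c.rollback(sp);                                 // UNDO: the deleted rows come back
      }
      else
      {
         c.releaseSavepoint(sp);                         // keep the delete; the commit happens later
      }
   }
}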

8593a_h2 rev 48 contains the FWD-H2 changes related to this.

The test for this is:

def temp-table tt1 field f1 as int index ix1 f1.
def var c as int.

do transaction:

   do transaction:
      create tt1.
      assign tt1.f1 = 1.
      release tt1.
      create tt1.
      assign tt1.f1 = 2.
      // do a delete or an undo, not both
      // delete tt1.  
      // undo, leave.
   end.

   undo, leave.
end.

c = 0.
for each tt1: c = c + 1. end.
message c.

#21 Updated by Alexandru Lungu 2 months ago

The FWD direct-access changes are OK!
Also, getting rid of the forceDelete is good, but requires extensive testing!

For H2:
  • rowCount -= removed.cardinality; is something that is theoretically irrelevant. The cardinality should always be 0 when killing the multiplex. The killMultiplex API always assumes that this is a batch clear-up attempt, not an actual removal of rows. If you encountered this as being "required", it means some kind of invariant was broken.
  • return i == SearchRow.ROWID_INDEX ? ValueLong.get(key) : data == null ? null : data[i]; why is data allowed to be null in the first place?
  • of course, this also requires extensive testing.

#22 Updated by Constantin Asofiei 2 months ago

Alexandru Lungu wrote:

The FWD direct-access changes are OK!
Also, getting rid of the forceDelete is good, but requires extensive testing!

ETF testing passed with the current changes, and also was the reason why these bugs surfaced.

  • rowCount -= removed.cardinality; is something that is theoretically irrelevant. The cardinality should always be 0 when killing the multiplex. The killMultiplex API always assumes that this is a batch clear-up attempt, not an actual removal of rows. If you encountered this as being "required", it means some kind of invariant was broken.
Well, the invariant was/is already broken:
  • rowCount at the MultiplexedScanIndex tracks only the 'flushed' records - it does not include the 'rows allocated for primary key reservation'
  • cardinality at the batch tracks all rows added to the batch.

So that was the reason I had to add 'clearPrimaryKey' for unflushed records which are deleted/rolled back, as otherwise the DELETE ... WHERE _MULTIPLEX = ? statement in TemporaryBuffer.removeRecords failed the invariant.

  • return i == SearchRow.ROWID_INDEX ? ValueLong.get(key) : data == null ? null : data[i]; why is data allowed to be null in the first place?

I think this was because of the 'template row' with null data used to allocate a row to reserve a primary key in a batch.
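
For illustration, a sketch of such a reservation row (hypothetical types; the real code is H2's Row/SearchRow machinery with the expression quoted above):

// Hypothetical sketch of a PK-reservation "template" row whose data array is null.
final class ReservedRowSketch
{
   static final int ROWID_INDEX = -1;   // stand-in for SearchRow.ROWID_INDEX

   final long key;                      // the reserved primary key (recid)
   final Object[] data;                 // stays null until the record is actually flushed

   ReservedRowSketch(long key)
   {
      this.key = key;
      this.data = null;                 // reservation only: no column data yet
   }

   Object getValue(int i)
   {
      // same shape as the quoted expression: the rowid is always answerable,
      // any other column is null while the row is only a reservation
      return i == ROWID_INDEX ? key : data == null ? null : data[i];
   }
}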

#23 Updated by Alexandru Lungu 2 months ago

Constantin Asofiei wrote:

  • rowCount -= removed.cardinality; is something that is theoretically irrelevant. The cardinality should always be 0 when killing the multiplex. The killMultiplex API always assumes that this is a batch clear-up attempt, not an actual removal of rows. If you encountered this as being "required", it means some kind of invariant was broken.
Well, the invariant was/is already broken:
  • rowCount at the MultiplexedScanIndex tracks only the 'flushed' records - it does not include the 'rows allocated for primary key reservation'
  • cardinality at the batch tracks all rows added to the batch.

So that was the reason I had to add 'clearPrimaryKey' for unflushed records which are deleted/rolled back, as otherwise the DELETE ... WHERE _MULTIPLEX = ? statement in TemporaryBuffer.removeRecords failed the invariant.

Do you mean that the invariant was broken and now it was fixed by clearPrimaryKey? I understood the fact that there were "leaking reservations" and these were removed by the rollback changes.

The original code always presumed that the records were deleted before the multiplex was killed, so we always presumed that the cardinality is 0 when killing the batches. I agree this was broken before your changes, but now it should be fixed, right? So rowCount -= removed.cardinality; is no longer needed.

I think this was because of the 'template row' with null data used to allocate a row to reserve a primary key in a batch.

Oh yeah, right. The code is reserving slots with a null data for the sake of performance/memory.

#24 Updated by Constantin Asofiei 2 months ago

Alexandru Lungu wrote:

Do you mean that the invariant was broken and now it was fixed by clearPrimaryKey? I understood the fact that there were "leaking reservations" and these were removed by the rollback changes.

It is still somewhat broken: a template row for the allocated primary key is not added to the other indexes. So the row count at the MultiplexedScanIndex, the table and the other indexes will always have the 'flushed row count', while if you sum the batches' cardinalities, this will be different.

The original code always presumed that the records were deleted before the multiplex was killed, so we always presumed that the cardinality is 0 when killing the batches. I agree this was broken before your changes, but now it should be fixed, right? So rowCount -= removed.cardinality; is no longer needed.

I want to keep it, as at the moment the multiplex is deleted, we want to ensure all is stable.

Also, I think the actual DirectAccessHelper.killMultiplex is no longer needed - it tries to remove rows which were previously removed via the DELETE from TemporaryBuffer.removeRecords, plus it clears only the MultiplexedScanIndex - what about the other indexes?

#25 Updated by Alexandru Lungu 2 months ago

Constantin Asofiei wrote:

Alexandru Lungu wrote:

Do you mean that the invariant was broken and now it was fixed by clearPrimaryKey? I understood the fact that there were "leaking reservations" and these were removed by the rollback changes.

It is still somewhat broken: a template row for the allocated primary key is not added to the other indexes. So the row count at the MultiplexedScanIndex, the table and the other indexes will always have the 'flushed row count', while if you sum the batches' cardinalities, this will be different.

Oh, the invariant I was talking about is removed.cardinality == 0 when killing a multiplex. I agree that the sum of cardinalities is different from the rowCount.

Also, I think the actual DirectAccessHelper.killMultiplex is no longer needed - it tries to remove rows which were previously removed via the DELETE from TemporaryBuffer.removeRecords, plus it clears only the MultiplexedScanIndex - what about the other indexes?

The need for killMultiplex was due to the empty batches that were left behind. For UNDO tables, empty batches are left behind in case an UNDO happens. For instance, if you remove a record, the batch (even empty) will stay alive for UNDOABLE tables, waiting for an UNDO. If the UNDO never comes, the batch will leak. killMultiplex was intentionally designed to only drop the empty batches of UNDOABLE tables. That is why the removed.cardinality == 0 invariant should hold when killing the multiplex and that is why we always need to clear the table before killing the multiplex.

#26 Updated by Constantin Asofiei 2 months ago

Alexandru Lungu wrote:

Oh, the invariant I was talking about is removed.cardinality == 0 when killing a multiplex. I agree that the sum of cardinalities is different from the rowCount.

When the actual killMultiplex(multiplexId) is being called from FWD, no more 'primary key allocation rows' should exist; also, all batches should be already removed because of the removeAll(session, multiplexID) call.

The need for killMultiplex was due to the empty batches that were left behind. For UNDO tables, empty batches are left behind in case an UNDO happens. For instance, if you remove a record, the batch (even empty) will stay alive for UNDOABLE tables, waiting for an UNDO. If the UNDO never comes, the batch will leak. killMultiplex was intentionally designed to only drop the empty batches of UNDOABLE tables. That is why the removed.cardinality == 0 invariant should hold when killing the multiplex and that is why we always need to clear the table before killing the multiplex.

I need to double-check for undoable temp-tables. The case I'm thinking of is an empty-temp-table call followed by undo, leave. This would trigger a DELETE ... FROM ... _MULTIPLEX = ?, which will call removeAll and also remove all the batches. The rollback operation should create the batches again.

#27 Updated by Alexandru Lungu 2 months ago

Constantin Asofiei wrote:

I need to double-check for undoable temp-tables. The case I'm thinking of is an empty-temp-table call followed by undo, leave. This would trigger a DELETE ... FROM ... _MULTIPLEX = ?, which will call removeAll and also remove all the batches. The rollback operation should create the batches again.

This is a huge thing. Removing a batch will make the MultiplexScanIndex quite unstable. If you want to recreate it, you will need to do some kind of linked list traversal to find its spot and "revive" it. This is the exact reason why I preferred to leak them and let them be cleared entirely.

I think the removeAll caused by the truncate should only clear the batches (not actually remove them). In fact, it can remove the batches only for NO-UNDO. Also, mind that the "last" batch is a bit more special as it can reclaim keys. Therefore, never remove a batch if it is the "last" for a multiplex.

#28 Updated by Constantin Asofiei 2 months ago

Alexandru Lungu wrote:

This is a huge thing. Removing a batch will make the MultiplexScanIndex quite unstable. If you want to recreate it, you will need to do some kind of linked list traversal to find its spot and "revive" it. This is the exact reason why I preferred to leak them and let them be cleared entirely.

I think the removeAll caused by the truncate should only clear the batches (not actually remove them). In fact, it can remove the batches only for NO-UNDO. Also, mind that the "last" batch is a bit more special as it can reclaim keys. Therefore, never remove a batch if it is the "last" for a multiplex.

Understood, thanks.

#29 Updated by Constantin Asofiei 2 months ago

Alexandru, please review 8593a_h2 rev 49/50 and 8593a rev 15158. I'll need to re-test these changes.

#30 Updated by Constantin Asofiei 2 months ago

Constantin Asofiei wrote:

Alexandru, please review 8593a_h2 rev 49/50 and 8593a rev 15158. I'll need to re-test these changes.

ETF testing passed. Alexandru, if you can run a ChUI regression testing with this and 8363g, please let me know.

#31 Updated by Alexandru Lungu 2 months ago

ETF testing passed. Alexandru, if you can run a ChUI regression testing with this and 8363g, please let me know.

ChUI takes ~2h:30mins. Do you want to test them both or separately? I would prefer to have them together as I have limited time on the bare-metal due to the intensive performance testing we are doing today.

#32 Updated by Constantin Asofiei 2 months ago

You can test them both (8593a + 8636g). I've uploaded fwd-h2 1.46 and 8593a rev 15161 has the change in build.gradle, so you can build directly. Just patch 8593a over 8363g.

#33 Updated by Constantin Asofiei 2 months ago

Alexandru, please let me know if you managed to run ChUI - I'd like to merge these tomorrow.

#34 Updated by Alexandru Lungu 2 months ago

  • Status changed from Review to Internal Test

Yes. I had to upgrade to the latest 6667i (after the recent rebase), reconvert and set a new baseline. After that, I patched (8593a + 8636g), ensured only FWD-H2 1.46 is in deploy/lib and re-ran the tests. I left them overnight.
I am looking at them now - everything PASSED.

The changes are good to go!

#35 Updated by Constantin Asofiei 2 months ago

  • Status changed from Internal Test to Merge Pending

#36 Updated by Greg Shah 2 months ago

You can go ahead with the merge of 8593a and 8636g.

#37 Updated by Constantin Asofiei 2 months ago

  • Status changed from Merge Pending to Test

8593a was merged to trunk rev 15169 and archived.
8593a_h2 was merged to fwd-h2 rev 46 and archived.

#38 Updated by Constantin Asofiei 2 months ago

  • Related to Bug #8703: NO-UNDO option at DMO must be removed added

#40 Updated by Constantin Asofiei 2 months ago

  • Status changed from Test to Internal Test
Created task branch 8593b from trunk rev 15186. Contains regression fixes (mixed):
  • revno: 15190 - Fixed regression in trunk rev 15169; refs #8593:
    • A NEW and INVALID record has already been removed from the temp-table, so no need to clear its primary key.
    • updated to fwd-h2 rev 1.47 via 8593b_h2 - a bulk delete which removes all rows from the multiplex must clean the multiplex and leave it in a state to reclaim the very first key.
  • revno: 15189 - Fixed bulk delete where the FQL was prepared with a faulty alias. Refs #8593 #8715
  • revno: 15188 - When a BEFORE-TABLE is being built too, for a dynamic temp-table, the associated DMOs for both the before and after table must have the after/beforeTable annotations set. Refs #8545
  • revno: 15187 - Fixed regression in trunk rev 15179: when waiting for data on the socket, the 'hasMoreData' worker must not assume there is no more data if the waiting has finished because of a timeout. Also, a SocketTimeoutException must be thrown only if SO_TIMEOUT is not zero and the timeout has actually expired. Refs #8486

I'm placing this in ETF testing. There are also changes in 8593b_h2.

Eric/Ovidiu/Greg - please review.

#41 Updated by Greg Shah 2 months ago

Code Review Task Branch 8593b Revision 15179

The LowLevelSocketImpl changes look good.

Eric/Ovidiu: Please look at the persistence changes ASAP.

#42 Updated by Ovidiu Maxiniuc 2 months ago

Code Review Task Branch 8593b.

I do not see anything wrong with the code (r15187-).

#43 Updated by Constantin Asofiei 2 months ago

8593b passed ETF testing.

#44 Updated by Greg Shah 2 months ago

  • Status changed from Internal Test to Merge Pending

You can merge to trunk.

#45 Updated by Constantin Asofiei 2 months ago

  • Status changed from Merge Pending to Test

Branch 8593b was merged to trunk rev 15189 and archived.

#46 Updated by Alexandru Lungu about 2 months ago

  • % Done changed from 100 to 80
  • Status changed from Test to WIP

Timeline:

1. I was thinking of replacing the sequence based PK generator used by FastCopyHelper with a more customized syntax to retrieve the PK directly from the table (MultiplexedScanIndex). However, I started off by doing some tests and I couldn't confirm your finding that led to 1.47. I used:

def temp-table tt no-undo field f1 as int.

def var i as int.

do transaction:
    repeat i = 1 to 300:
        create tt.
        tt.f1 = i.
    end.
end.

output to test.txt.

for each tt:
    message "B" recid(tt).
end.

run b.p(input-output table tt).

for each tt:
    message "A" recid(tt).
end.

and

def temp-table tt field f1 as int.

define input-output parameter table for tt.

do transaction:
    for each tt:
        tt.f1 = tt.f1 * 10.
    end.
end.

for each tt:
    message "I" recid(tt).
end.
In this case, I had:
  • B 2304-2475 and 2560-2686 (2 batches)
  • I 4352-4524 and 4608-4734 (2 batches)
  • A 6400-6567 and 6656-6787 (2 batches)

Disregard the huge offsets (besides the batches, 4GL uses some kind of pages, so the offsets are sometimes bigger across multiplexes - this is not implemented in FWD). Also, the batches have some padding, so they don't have exactly 256 records.
I also attempted with UNDO or APPEND, but I have similar results (however, the A(fter) results are using the same page as B(efore), but it is not reclaiming).

PS: Using BY-REFERENCE results in having the same ids all over the place.

2. But finally I found some extra weirdness that drove me to another idea. If I am doing the same test as above, but only with 100 records, I have:
  • B 2304-2403 (1 batch)
  • I 4352-4451 (1 batch)
  • A 2304-2403 (1 batch)

It reclaims! I tried edge cases (with 173/174 records - the break point between batch sizes) and confirmed that reclaiming happens only when there is one single batch.

Overall, this means that the changes in FWD-H2 1.47 should hold only when there is only one single batch. When there are multiple batches, the old implementation (of assigning a new fresh batch) is correct.

I am planning to fine-tune getNextPrimaryKey bulk. If there is only one batch - use 1.47 implementation. If there are at least 2 batches, use the old implementation.
Anything else to take into account?

#47 Updated by Constantin Asofiei about 2 months ago

This is a test where it gets into an infinite loop:

def temp-table tt1 no-undo field f1 as int index ix1 f1.
def temp-table tt2 like tt1.

do transaction:
  create tt1.
  tt1.f1 = 1.
  release tt1.
end.

for each tt1 where tt1.f1 > 0:
   message tt1.f1.
   run proc0 (input-output table tt1).
end.

procedure proc0.
   def input-output parameter table for tt2.
end.

This can be expanded to have more than one batch. So I don't think the 'old implementation' is OK.

#48 Updated by Alexandru Lungu about 2 months ago

Constantin Asofiei wrote:

This is a test where it gets into an infinite loop:
[...]

This can be expanded to have more than one batch. So I don't think the 'old implementation' is OK.

With your test:
  • If I have only one batch (1 - ~150 records), I have the following output (including the recid):
    2304 1
    2305 2
    2306 3
    ...
    2402 99
    2403 100
  • If I have more than one batch (I tested with 300 records), I have the following output:
    2304 1
    6400 1
    4353 2
    2306 3
    6402 3
    4355 4
    2308 5
    6404 5
    4357 6
    ....
    2667 299
    6763 299
    4716 300

So as said, there is a difference between copying 1 batch and copying more than 1 batch. Apparently it doesn't enter an infinite loop.

The answer relates to the pages (that are not implemented in FWD). Note that only odd records are iterated twice:
  • The original rows are in page 1.
  • When going into proc0, the rows are moved to page 2.
  • When returning, if there is more than one batch, the records go on page 3.
  • This way record 1 is iterated again
  • proc0 is called and rows are moved to page 1. (see how multiplex 2 is reusing page 1 that was used by multiplex 1 at first). This is happening just because the table is NO-UNDO and we know that multiplex 1 can't reuse items from page 1.
  • proc0 returns and rows are moved back to page 2. At this point, the for each goes to element 2.

This confirms my idea from #8593-46. It is OK to move records to a new page (higher PK) if there is more than 1 batch.
The only missing puzzle piece is the page implementation in FWD. Because there are no pages, we always assign higher batches and don't recycle the empty pages.

Pages are always 2048 in size and can contain (apparently) only 8 batches. If all batches are empty, the page can be reused cross-multiplex.
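
As a rough plausibility check of this model (pure inference from the recids listed above, not a description of the real 4GL or FWD-H2 allocator), the observed values fit a layout of pages of 2048 keys split into batches of 256, which also matches 2048 / 256 = 8 batches per page:

// Hypothetical decoding of the observed recids; page/batch/slot are assumptions.
public class RecidLayoutCheck
{
   public static void main(String[] args)
   {
      long[] observed = { 2304, 2560, 4352, 4608, 6400, 6656 };
      for (long recid : observed)
      {
         long page  = recid / 2048;          // 2304 -> page 1, 4352 -> page 2, 6400 -> page 3
         long batch = (recid % 2048) / 256;  // the observed runs start at batch 1 of each page
         long slot  = recid % 256;
         System.out.printf("recid %d -> page %d, batch %d, slot %d%n", recid, page, batch, slot);
      }
   }
}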

Lastly: For UNDO, the situation is different and requires way more investigation overall (it increases the PK several times until the end of the pages; at that point it seems it forces reclaiming to avoid leaking out the page).

#49 Updated by Alexandru Lungu about 2 months ago

  • Status changed from WIP to Review

Committed 8593c_h2/rev. 48 and 8593c/rev. 15203. Contains the following:

  • H2 support for NEXT PK FOR (<table-name>, <multiplex>) construct to work similarly as NEXT VALUE FOR <sequence-name>, but using the internal multiplexed scan index PK reservation mechanism.
  • This includes a bit of refactoring of MultiplexScanIndex.getNextPrimaryKey. There is no way to distinguish between append and no-append, so it simply works as "always reclaim the lowest PK possible" (sketched after this list). If there is no PK to reclaim, assign a new batch. This is an aggressive implementation that reclaims whenever possible to avoid infinite loops. This is not 100% accurate to how 4GL works (page based indexing), but it is better than what we have now. I left a TODO on this which can be tackled in a future iteration.
  • The FWD changes include a refactoring of FastCopyHelper to switch from sequence based indexing to table based indexing (the NEXT PK FOR syntax). This way, I purged the getNextPrimaryKey bulk functionality entirely.
    • Constantin, there was some kind of functionality that was relying on consecutive numbering before, but it was more like an optimization. If the copy was not in append mode, we always presumed consecutive numbering so we could more easily index normalized extents and before-tables. However, there was an in-DB PK mapping functionality that was used for append mode and I widened its usage in 8593c. This may imply some performance drawback. TL;DR: If a non-consecutive mapping was required, it was done in-DB. If not, a consecutive order was presumed and a sequence was used. With my changes, all non-trivial cases that require a PK mapping will use an in-DB one.
    • Also, this means that even "simple" fast copy operations will do in-DB PK mappings, so there are new execution paths enabled that should be carefully tested.

- by in-DB I mean "a database table that holds the PK mapping and is joined when doing the INSERT INTO SELECT FROM"
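
A minimal sketch of the reclaim policy mentioned above (the data structures and the BATCH_SIZE constant are hypothetical; the real MultiplexScanIndex.getNextPrimaryKey works on its own batch list):

import java.util.TreeSet;

// Hypothetical sketch of "always reclaim the lowest PK possible, otherwise assign a new batch".
public class PkReclaimSketch
{
   private static final int BATCH_SIZE = 256;

   private final TreeSet<Long> freeKeys = new TreeSet<>(); // keys freed inside already allocated batches
   private long nextBatchStart = 0;                        // start of the next unallocated batch

   public long nextPrimaryKey()
   {
      Long reclaimed = freeKeys.pollFirst();
      if (reclaimed != null)
      {
         return reclaimed;                                 // aggressive reclaim of the lowest free key
      }
      long start = nextBatchStart;                         // nothing to reclaim: open a fresh batch
      nextBatchStart += BATCH_SIZE;
      for (long k = start + 1; k < start + BATCH_SIZE; k++)
      {
         freeKeys.add(k);                                  // the rest of the new batch is immediately reclaimable
      }
      return start;
   }

   public void release(long key)
   {
      freeKeys.add(key);                                   // a deleted or rolled-back row frees its key
   }
}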

With this, the example from #8593-47 no longer runs infinitely (with 1, 150 or 300 records) and the leak is gone. However, these are the only tests I have done so far. Please review this first iteration of changes.

#50 Updated by Greg Shah about 2 months ago

Ovidiu/Eric: Please review 8593c.

This is quite urgent since it is to resolve a regression in a customer application that has been present for over a week.

#51 Updated by Alexandru Lungu about 2 months ago

I committed now 8593c/rev. 15204. This includes a requirement to use in-DB PK mapping if the table has normalized extent fields. This was found during regression testing.
With this final version, the full unit test suite of a large customer doesn't regress.

#52 Updated by Ovidiu Maxiniuc about 2 months ago

Eric, I am starting the review process in the next minutes. Please let me know if you are already in the process.

#53 Updated by Alexandru Lungu about 2 months ago

I committed 8593c_h2/rev. 50: allowing forced reclaiming only for NO-UNDO tables. Otherwise, on a rollback, we may have conflicting PKs.
This means that COPY-TEMP-TABLE will never reclaim keys for UNDO tables. This matches the findings in 4GL.

#54 Updated by Alexandru Lungu about 2 months ago

I am also having a hard time tracking down an issue with the truncate now (not related to my changes in 8593c/8593c_h2). It seems like, at the moment of truncate, there are two records in the table. The first one has data, but the second one is just a reserved PK.

The root issue is that dataset fill may sometimes reach BufferImpl.FillWorker.fillRow with the following code:

if (!validated)
{
   if (isMerge)
   {
      // merge mode: ignore it, just skip to next record
      TransactionManager.disableLoopProtection(null);
      hasMore.assign(fillQuery._getNext());
      // we need to release the newly created record as if undone. We cannot call
      // BlockManager.undoNext(lbl) here as it will affect the buffer(s) from the query
      // driving the fill operation.
      recBuf.release(true, false); // I.e. drop() - this form avoids a buffer() lookup
      return false;

Before reaching this, it always does selfBuf.create and it eventually releases the buffer via recBuf.release. However, this is quite a special release as it doesn't flush, because the first parameter, true, stands for undo. By avoiding the flush, this DMO leaks its reserved PK.

Constantin: this is part of your effort to support clearPrimaryKey. Currently, it takes effect in ChangeSet.rollback and Session.delete, but this case is a bit more special. It may be registered so that ChangeSet.rollback will clear its PK, but it would be too late. If we hurry the process, we may end up clearing the same PK twice, which is again bad. I am not quite sure how to table the PK clear on release with undo.

I wonder if we can't be more lenient with the reserved PKs and simply ignore them when truncating. On the other hand, I identified this exact scenario (hopefully the last one) so I can work around it.

#55 Updated by Constantin Asofiei about 2 months ago

Alexandru, I don't like leaving 'allocated records' with no meaning in H2. Can we explicitly clear the primary key for this fill case, too?

#56 Updated by Alexandru Lungu about 2 months ago

These are the changes that unblocked me:

=== modified file 'src/com/goldencode/p2j/persist/RecordBuffer.java'
--- old/src/com/goldencode/p2j/persist/RecordBuffer.java    2024-04-23 07:57:16 +0000
+++ new/src/com/goldencode/p2j/persist/RecordBuffer.java    2024-05-13 18:47:14 +0000
@@ -11862,6 +11862,19 @@
                                                                dbTrigManager);
             }

+            if (isTemporary && oldRecord.checkState(DmoState.NEW) && undo)
+            {
+               try
+               {
+                  // this is important to dealloc the reserved PK
+                  persistence.getSession().delete(oldRecord);
+               }
+               catch (PersistenceException e)
+               {
+                  LOG.log(Level.SEVERE, "Failed deleting the transient record on undo", e);
+               }
+            }
+            
             updateNoUndoState();

This is basically removing the DMO entirely from the session if isTemporary, NEW and undo. The delete from the session is also removing the reservation. The undo release is used only by this fill and BufferImpl.drop (in case exceptions occur). The changes are mostly on the safe side as they are hit in very specific cases.

But this requires testing.

#57 Updated by Ovidiu Maxiniuc about 2 months ago

Greg Shah wrote:

Ovidiu/Eric: Please review 8593c.

I am OK with the code. A few notes (cleanup/low priority):
  • TemporaryBuffer.java
    • nextPrimaryKey() is now public and the @Override annotation was dropped. It is never used in a way that requires public visibility, and it still overrides the method in RecordBuffer;
    • getSequenceName() doesn't seem to be used any more. More than that, Context.sequenceName is only used to create the sequence and delete it, with no other uses, since FastCopyHelper relies on H2 now. I think these occurrences can be dropped as well.
  • FastCopyHelper.java:
    • you marked FastCopyBundlePKElement as @Deprecated, but I think it can be dropped completely since it is not in use any more.

#58 Updated by Alexandru Lungu about 2 months ago

  • % Done changed from 80 to 100

Constantin, please review 8593c_h2 changes. If they are OK, please upload it to the GCD repository.

#59 Updated by Alexandru Lungu about 2 months ago

  • % Done changed from 100 to 80

Committed 8593c/rev. 15205 addressing the review from Ovidiu.

#60 Updated by Ovidiu Maxiniuc about 2 months ago

I have reviewed the changes from 8593c/rev. 15205. Good job!

Is there a reason for downgrading the percentage done to 80?

#61 Updated by Constantin Asofiei about 2 months ago

I've uploaded 1.48 to projsrv01 and upgraded the version in rev 15206.

I'll run ETF and test with another customer app tomorrow.

#62 Updated by Alexandru Lungu about 2 months ago

  • % Done changed from 80 to 100

Reset %Done.

#63 Updated by Constantin Asofiei about 2 months ago

ETF testing passed, the original #7143-903 is fixed, and another app works OK with this.

#64 Updated by Greg Shah about 2 months ago

You can merge to trunk now.

#65 Updated by Constantin Asofiei about 2 months ago

Greg Shah wrote:

You can merge to trunk now.

Starting.

#66 Updated by Constantin Asofiei about 2 months ago

Branch 8593c was merged to trunk revision 15212 and archived.

Branch 8593c_h2 was merged to fwd-h2 trunk revision 48 and archived.

#67 Updated by Greg Shah about 2 months ago

This resolves the regression documented in #7143-900?

#68 Updated by Constantin Asofiei about 2 months ago

Greg Shah wrote:

This resolves the regression documented in #7143-900?

Yes

#69 Updated by Alexandru Lungu about 2 months ago

Committed 8593d/rev. 15206 to fix #8593-54. It contains the patch from #8593-56.
Recap: there was an issue on dataset fill where a transient record is undo-ed using a "special" release that doesn't flush. In this case, the DMO should be deleted to drop the reserved PK.

#70 Updated by Greg Shah about 2 months ago

Is there any further review needed for this or do we just need testing?

#71 Updated by Alexandru Lungu about 1 month ago

8593d needs a review first.

#72 Updated by Greg Shah about 1 month ago

Ovidiu: Please review.

#73 Updated by Ovidiu Maxiniuc about 1 month ago

  • Status changed from Review to Internal Test

The changes seem logical. I think the code is OK.

#74 Updated by Alexandru Lungu about 1 month ago

I am picking this up for a customer unit tests.
As #8593 is broadly about performance in FWD-H2, Constantin, can you pick this up for another customer application that drove the effort in #8593 and ETF?

#75 Updated by Alexandru Lungu about 1 month ago

I am talking about 8593d.

#76 Updated by Constantin Asofiei about 1 month ago

Alexandru Lungu wrote:

Constantin, can you pick this up for another customer application that drove the effort in #8593 and ETF?

Testing is OK.

#77 Updated by Constantin Asofiei about 1 month ago

Actually I get this in the server log:

Failed deleting the transient record on undo com.goldencode.p2j.persist.PersistenceException: Attempting DML which cannot be committed!

#78 Updated by Alexandru Lungu about 1 month ago

  • Status changed from Internal Test to Review

Hmmm, I think I see the problem. Indeed, it is not right to delete the DMO as it is not even inserted in the database in the first place.
I liked having a solution where clearPrimaryKey is only in ChangeSet and Session, but apparently I will need to extend its usage to this case too. Committed 8593d/rev. 15207, please review.
I am restarting the testing of the changes.

#79 Updated by Constantin Asofiei about 1 month ago

  • Status changed from Review to Internal Test

ETF testing passed.

#80 Updated by Alexandru Lungu about 1 month ago

Rebased 8593d/rev. 15263.

Tested this with a customer's unit tests and everything went fine. However, I think the testing is quite slim right now.
The project requires the latest H2 + heavy work with datasets to properly test - this blocks me from testing other suites which are more light-weight.

Thus, Constantin, can you do some smoke tests on another app with 8593d? This is the best way to confirm the changes won't regress right now.
I intend to have this merged today by EOD as it is the last missing piece of 7156c. After, I will rebase 7156c to start a conversion with it this weekend.

#81 Updated by Constantin Asofiei about 1 month ago

The change is not OK - I get this:

Caused by: java.lang.NullPointerException: row not found
        at org.h2.pagestore.db.MultiplexedScanIndex.clearPrimaryKey(MultiplexedScanIndex.java:415)
        at org.h2.embedded.FWDDirectAccessDriver.clearPrimaryKey(FWDDirectAccessDriver.java:132)
        at com.goldencode.p2j.persist.orm.DirectAccessHelper.clearPrimaryKey(DirectAccessHelper.java:171)
        at com.goldencode.p2j.persist.RecordBuffer.setCurrentRecord(RecordBuffer.java:11919)
        at com.goldencode.p2j.persist.RecordBuffer.release(RecordBuffer.java:9345)
        at com.goldencode.p2j.persist.RecordBuffer.release(RecordBuffer.java:9307)
        at com.goldencode.p2j.persist.BufferImpl$FillWorker.fillRow(BufferImpl.java:11987)
        at com.goldencode.p2j.persist.BufferImpl$FillWorker.lambda$processDataSource$1(BufferImpl.java:11791)
        at com.goldencode.p2j.persist.RecordBuffer.batch(RecordBuffer.java:3276)
        at com.goldencode.p2j.persist.BufferImpl$FillWorker.lambda$processDataSource$2(BufferImpl.java:11796)
        at com.goldencode.p2j.util.ErrorManager.silentWorker(ErrorManager.java:4134)

#82 Updated by Constantin Asofiei 6 days ago

Alexandru, I've rebased 8593d from trunk rev 15311 . In 8593d/15314 there are several changes:
  • during FILL will always register the record as 'batch dirty', as direct access will not go through the RecordBuffer handler
  • a bulk delete will delete records from the buffers first (as a NEW record will have the PK in the database, and this needs to be removed before the temp-table truncate in H2).
  • changed 'clearPrimaryKey' to return false instead of throwing a NPE - see also 8593d_h2 rev 49.

I'll test this with ETF.

#84 Updated by Constantin Asofiei 6 days ago

8593d passed ETF testing.

#85 Updated by Constantin Asofiei 5 days ago

Alexandru: please review.

#86 Updated by Alexandru Lungu 5 days ago

Review of 8593d/rev. 15314:

  • BufferImpl: The addDirtyBatchBuffer addition is a nice catch, but I would like to have some kind of guarantee this will not leak. I've seen other places that do a startBatchAssignMode / endBatchAssignMode pair before/after fully filling the row.
  • good change in RecordBuffer
  • for TemporaryBuffer;
    • I don't get clearBuffers. The record of this buffer will be released before deleting, so it will be flushed to the database. The transient record is persisted and then deleted afterwards using SQL. How is the explicit call to RecordBuffer.delete helping the process here? It may be a good change; I just don't understand it. Do you have a case that is fixed by this?
    • you clear the buffers only if there is no WHERE clause. What if the WHERE is not null, but still covers the transient record in the buffer?
  • DirectAccessHelper changes require FWD_H2 changes. Are they in a FWD-H2 branch?

#87 Updated by Constantin Asofiei 5 days ago

Alexandru Lungu wrote:

Review of 8593d/rev. 15314:

  • BufferImpl: The addDirtyBatchBuffer addition is a nice catch, but I would like to have some kind of guarantee this will not leak. I've seen other places that do a startBatchAssignMode / endBatchAssignMode pair before/after fully filling the row.

This is always bracketed in a batch:

            Runnable update = () -> RecordBuffer.batch(batchFill);

  • for TemporaryBuffer;
    • I don't get clearBuffers. The record of this buffer will be released before deleting, so it will be flushed to the database. The transient record is persisted and then deleted afterwards using SQL. How is the explicit call to RecordBuffer.delete helping the process here? It may be a good change; I just don't understand it. Do you have a case that is fixed by this?

See this test:

def temp-table tt1 field f1 as int index ix1 is unique f1.
def dataset ds1 for tt1.

do transaction:
   create tt1.
   tt1.f1 = 1.
   release tt1.
   def buffer btt1 for tt1.
   create btt1.
   btt1.f1 = 2.

   dataset ds1:empty-dataset().
end.

The point is that the bulk delete triggers the H2 table truncate while the record is loaded in another buffer but not flushed - thus only the reserved PK exists in the MultiplexScanIndex.

  • you clear the buffers only if there is no WHERE clause. What if the WHERE is not null, but still covers the transient record in the buffer?

Then the H2 table truncate is not being called.

  • DirectAccessHelper changes require FWD_H2 changes. Are they in a FWD-H2 branch?

See 8593d_h2 branch.

#88 Updated by Alexandru Lungu 5 days ago

Got it!

For the example provided: I checked your test in 4GL and indeed, EMPTY TEMP-TABLE (similar to empty dataset) is flushing records from other buffers before deleting. So the change is reasonable.
However, this means that this was not working before truncate implementation (I think). Was this basically a latent bug that is now causing more problems with bulk delete?

H2 changes are OK.

Other testing you suggest? I am planning to take these to other application unit tests that I do with 7156c/6667g.

#89 Updated by Constantin Asofiei 5 days ago

Alexandru Lungu wrote:

For the example provided: I checked your test in 4GL and indeed, EMPTY TEMP-TABLE (similar to empty dataset) is flushing records from other buffers before deleting. So the change is reasonable.
However, this means that this was not working before truncate implementation (I think). Was this basically a latent bug that is now causing more problems with bulk delete?

Yes, this is a latent bug, but I think it may apply to a bulk delete (via DELETE FROM) when a WHERE clause exists, too. We may want to move to a loop delete if there is a WHERE clause and records are loaded in any buffer.

Other testing you suggest? I am planning to take these to other application unit tests that I do with 7156c/6667g.

ChUI regression testing.

#90 Updated by Alexandru Lungu 5 days ago

ChUI regression testing.

I have a stable system with 6667i done just recently. I suppose I can move with 6667g so I have the FWD-H2 upgrade. I am making a baseline with 6667g now.

Please merge and upload the FWD-H2 changes to ease the testing of 8593d. After that, I will proceed with testing 6667g with the unit tests of another customer.

#91 Updated by Constantin Asofiei 5 days ago

Done, 8593d_h2 was merged to rev 49 (and uploaded to artifacts) and 8593d rev 15315 updates build.gradle to 1.49

#92 Updated by Alexandru Lungu 4 days ago

I have some bad news for 8593d:
  • After 3 unit test modules that went smooth, the server entered (from the looks of it) an infinite loop that caused OutOfMemoryException
  • After that, the OOME caused a bad exception unwinding that unbalanced the reserved primary keys. At that point, the H2 connection went rogue.
  • All unit tests in that module failed due to closed connection exception.
  • Moving on, some modules went ok, but some failed again. This time, not with OOME, but random rowCount expected 2 got 1. However, this seems to always be caused by INSERT INTO SELECT FROM statements.
  • The server got stuck for several hours on a module. I think it met an infinite loop again (but one that wasn't leaking memory, so OOME was not thrown).
For troubleshooting:
  • I will attempt to find a small test-case by debugging the unit test.

#93 Updated by Constantin Asofiei 2 days ago

I found another issue with this test:

def temp-table tt1 field f1 as int index ix1 is unique f1.
def dataset ds1 for tt1.

do transaction:
   create tt1.
   tt1.f1 = 1.
   release tt1.
   def buffer btt1 for tt1.
   create btt1.
   assign btt1.f1 = 1 no-error.

   dataset ds1:empty-dataset(). // this fails the truncate
end.

The ASSIGN ... NO-ERROR marks the record as invalid, but the empty-dataset sees this record as invalid and does not proceed with 'clearPrimaryKey'. The fix is this, in orm.Session.delete:

Index: src/com/goldencode/p2j/persist/orm/Session.java
===================================================================
--- workspace/trunk/src/com/goldencode/p2j/persist/orm/Session.java
+++ workspace/trunk/src/com/goldencode/p2j/persist/orm/Session.java
@@ -900,7 +900,7 @@
       {  
          success = persister.delete(dmo);
       }
-      else if (dmo instanceof TempRecord && !dmo.checkState(INVALID))
+      else if (dmo instanceof TempRecord)
       {
          try
          {

But this may not be what you've seen in those unit tests.

#94 Updated by Alexandru Lungu 1 day ago

Constantin, the exception triggered is:
You may not delete BEFORE-TABLE <name> record. Use ACCEPT-ROW-CHANGES instead.

When calling TemporaryBuffer.removeRecord, one of the slave buffers is a before buffer. In this case, you can't delete it on demand. See TemporaryBuffer.delete(Supplier<logical> valexp, Supplier<character> valmsg):

      if (buf.isBeforeBuffer() && buf.hasBeforeBufferConstraints())
      {
         ErrorManager.recordOrThrowError(12374, buf.doGetName(), "");
         // You may not delete BEFORE-TABLE <name> record.  Use ACCEPT-ROW-CHANGES instead.
         return false;
      }

#95 Updated by Constantin Asofiei 1 day ago

Do you have a FWD stacktrace when this is happening?

#96 Updated by Alexandru Lungu 1 day ago

doCloseMultiplexScope:7689, TemporaryBuffer (com.goldencode.p2j.persist)
closeMultiplexScope:6848, TemporaryBuffer (com.goldencode.p2j.persist)
resourceDeleted:6494, TemporaryBuffer (com.goldencode.p2j.persist)
resourceDeleteImpl:9914, BufferImpl (com.goldencode.p2j.persist)
lambda$resourceDelete$7:9871, BufferImpl (com.goldencode.p2j.persist)
exec:-1, BufferImpl$$Lambda$939/0x00007f1d3ccb2288 (com.goldencode.p2j.persist)
timer:132, NanoTimer (com.goldencode.p2j.jmx)
resourceDelete:9871, BufferImpl (com.goldencode.p2j.persist)
delete:475, HandleChain (com.goldencode.p2j.util)
delete:5225, BufferImpl (com.goldencode.p2j.persist)
deleteResource:3495, ProcedureManager$ProcedureHelper (com.goldencode.p2j.util)
lambda$deleteAllBuffers$2:2726, AbstractTempTable (com.goldencode.p2j.persist)
accept:-1, AbstractTempTable$$Lambda$1381/0x00007f1d3d0fbfb0 (com.goldencode.p2j.persist)
deleteAllBuffers:2740, AbstractTempTable (com.goldencode.p2j.persist)
forceClear:3980, TempTableBuilder (com.goldencode.p2j.persist)
forceDelete:3227, TempTableBuilder (com.goldencode.p2j.persist)
resourceDelete:3216, TempTableBuilder (com.goldencode.p2j.persist)
delete:475, HandleChain (com.goldencode.p2j.util)
forceDelete:3232, TempTableBuilder (com.goldencode.p2j.persist)
resourceDelete:3216, TempTableBuilder (com.goldencode.p2j.persist)
delete:475, HandleChain (com.goldencode.p2j.util)
clearImpl:1311, DataSet (com.goldencode.p2j.persist)
resourceDelete:5653, DataSet (com.goldencode.p2j.persist)
delete:475, HandleChain (com.goldencode.p2j.util)
delete:491, HandleOps (com.goldencode.p2j.util)
deleteResources:1360, WidgetPool$WidgetPoolData (com.goldencode.p2j.util)
deletePool:796, WidgetPool (com.goldencode.p2j.util)
delete:361, WidgetPool (com.goldencode.p2j.util)

At that point, TemporaryBuffer.removeRecords is called. The WHERE clause is null, so clearBuffers is called. A slave buffer is found and deleted. The code in #8593-94 is reached.

#97 Updated by Alexandru Lungu 1 day ago

TBD: I also did ChUI regression testing and there are 14 regressions, while only 1 was expected. Can't tell the reason yet.

#98 Updated by Alexandru Lungu 1 day ago

Also, to clarify things, due to this error regarding before buffers, DirectAccessHelper.killMultiplex is called to clear the multiplexed index (in a finally block). Thus, the table has rowCount 1 (because the delete failed), but the multiplexed scan index is empty, as the multiplex was killed anyway. The fatal error occurs later when the table is used again, noticing the invariant break.

#99 Updated by Constantin Asofiei 1 day ago

Thanks, the recreate is this:

def temp-table tt1 before-table brt1 field f1 as int index ix1 is unique f1.
def dataset ds1 for tt1.

def var hds as handle.
def var htt as handle.
def var hbb as handle.

create dataset hds.
hds:create-like(dataset ds1:handle).
hbb = hds::tt1.
htt = hbb:table-handle.

do transaction:
   hbb:buffer-create().
   hbb::f1 = 1.
   htt:Tracking-changes = true.
   hbb:buffer-delete().

   delete object hds. 

   undo, leave.
end.

The fix for this and the one mentioned in #8593-93 are in 8593d rev 15316.

#100 Updated by Alexandru Lungu 1 day ago

Review of 8593d/rev. 15316.

  • I am OK with the changes in Session
  • For TemporaryBuffer: isn't the flag you are changing a guard to avoid spoiling the after-before table relationships? I was thinking of something similar, but I thought that this may have an impact on the peer relationship. What if you delete the before buffer (quite forcefully now) and the after buffer is still referencing it at that time? I think the after buffer will have a dangling reference to a deleted record.

#101 Updated by Constantin Asofiei 1 day ago

Alexandru Lungu wrote:

  • For TemporaryBuffer: isn't the flag you are changing a guard to avoid spoiling the after-before table relationships?

Yes, but that is used only in cases when we are not trying to empty the entire temp-table (which includes deleting both the after and before tables).

What if you delete the before buffer (quite forcefully now) and the after buffer is still referencing it at that time? I think the after buffer will have a dangling reference to a deleted record.

The 'after-table' buffers have their records deleted first; if there are referenced records in these after buffers, then any equivalent before-table records will get deleted.

After that, you go through the 'before-table' buffers: in this case, you can have 'ROW-DELETED' records loaded as NEW in the before-buffer, and these don't have any equivalent in the after-table.

#102 Updated by Alexandru Lungu 1 day ago

The 'after-table' buffers have their records deleted first; if there are referenced records in these after buffers, then any equivalent before-table records will get deleted.

This makes sense only in "bulk delete" kind of scenarios. So I agree with you if the following point is clarified.

Yes, but that is used only in cases when we are not trying to empty the entire temp-table (which includes deleting both the after and before tables).

Is this always true? Can't the 4GL code simply clear the before table like EMPTY TEMP-TABLE btt? Or will this also clear the after table implicitly? Or is this forbidden?

#103 Updated by Constantin Asofiei 1 day ago

Alexandru Lungu wrote:

Is this always true? Can't the 4GL code simply clear the before table like EMPTY TEMP-TABLE btt? Or will this also clear the after table implicitly? Or is this forbidden?

4GL does not allow EMPTY-TEMP-TABLE on a before-buffer.

#104 Updated by Alexandru Lungu 1 day ago

Constantin Asofiei wrote:

Alexandru Lungu wrote:

Is this always true? Can't the 4GL code simply clear the before table like EMPTY TEMP-TABLE btt? Or will this also clear the after table implicitly? Or is this forbidden?

4GL does not allow EMPTY-TEMP-TABLE on a before-buffer.

Got it! Picking up the changes and restart testing.

#105 Updated by Alexandru Lungu about 20 hours ago

ChUI regression tests passed completely.
The unit-tests are way better (in the sense that the found problem was fixed), but there are still regressions:
  • some modules have 1-5 unit tests failing (these may be false negatives, but I will retest)
  • one module is regressed hard (<10% success rate) - this is what I am investigating now
  • one module is stuck in an infinite loop

#106 Updated by Alexandru Lungu about 15 hours ago

I don't have the exact scenario that is causing this, but I am trying to find one. To post some insights:

Session.delete is throwing Requested to delete DMO of type %s with id %d. A different DMO instance of type %s is already bound to this session with the same. This is triggered by TemporaryBuffer.clearBuffers. In fact, there are 17 slave buffers in total, but only 2 have a current record.
  • The first buffer contains a record: NOUNDO DELETING DELETED with recid 2560 and some data and a datasource-rowid
  • The second buffer contains the cached record: NOUNDO CACHED with recid 2560 and some different data and no datasource-rowid

The most suspicious fact is that the first buffer is not CACHED, so it shouldn't be searched in the cache anyway.

My absolute best guess here is that the code in TemporaryBuffer.clearBuffers should skip the buffers that have a DELETED record anyway. Otherwise, we attempt a double delete that may (incorrectly) hit the session cache sanity check.
My lack of understanding: why is a deleted record still referenced by a buffer? Is this a valid state anyway? I see that RecordBuffer.stateChanged is releasing the buffer if its current record was deleted.

#107 Updated by Alexandru Lungu about 13 hours ago

Progress:

The culprit buffer has its record deleted by an emptyDataset - clearBuffers. However, the buffer is not registered with the ChangeBroker (why? this is still to be investigated). Therefore, the record remains in the buffer (not CACHED, but DELETED).
As H2 is reclaiming ids, that id will be given to another record.

When attempting to do a second emptyDataset - clearBuffers on the same table, the deleted record (although not cached) will find an "alter ego" (i.e. with the same id) inside the session cache and fail.

The obvious fix still remains: skip the delete for buffers that contain a deleted record, but I am afraid this will patch a logic problem we already have in place.

#108 Updated by Constantin Asofiei about 13 hours ago

Alexandru Lungu wrote:

However, the buffer is not registered with the ChangeBroker (why? this is still to be investigated). Therefore, the record remains in the buffer (not CACHED, but DELETED).

Hm... I wonder if this is not the reason for the A different DMO instance of type is already bound errors we see in some other unit tests.

#109 Updated by Eric Faulhaber about 7 hours ago

Alexandru Lungu wrote:

Progress:

The culprit buffer has its record deleted by an emptyDataset - clearBuffers. However, the buffer is not registered with the ChangeBroker (why? this is still to be investigated). Therefore, the record remains in the buffer (not CACHED, but DELETED).

A buffer (let's say the culprit buffer is Buffer A) should never contain a deleted record.

  • Scenario 1: if the record is deleted from Buffer A, it should be removed as part of the delete operation. The ChangeBroker should notify all other registered buffers, so they can remove the deleted record, if they currently hold it.
  • Scenario 2: (the flip-side of Scenario 1) if the record is deleted from another buffer (Buffer B) while it is simultaneously contained in Buffer A, the ChangeBroker should notify Buffer A of the delete, and Buffer A's response to the delete event should be to remove the deleted record. If this is the case, it is key to understand why Buffer A is not registered with the ChangeBroker. Was it never registered, or was the registration lost too early? The listener algorithm in ChangeBroker is pretty complex, as I recall.
  • Scenario 3: if the record is not contained in Buffer A at the time it is deleted from some other buffer, it should never be allowed to be loaded later into Buffer A thereafter. The DELETED flag should prevent any use of the record, including loading it into a buffer or caching it.

Does emptyDataset - clearBuffers fall under any of these scenarios, or is there another scenario I have missed?

In any case, a DMO should be removed from the session cache and marked with the DELETED flag at the time it is deleted. It seems the latter is happening in this case. Is it also removed from the session cache?

As H2 is reclaiming ids, that id will be given to another record.

When attempting to do a second emptyDataset - clearBuffers on the same table, the deleted record (although not cached) will find an "alter ego" (i.e. with the same id) inside the session cache and fail.

The obvious fix still remains: skip the delete for buffers that contain a deleted record, but I am afraid this will patch a logic problem we already have in place.

Yes, this is not really a fix because, as noted above, if things are working as expected, there should not be any buffers which contain a deleted record.
