Bug #2305: context-local vars need to be reset in a predetermined order - Runtime Infrastructure - Golden Code Redmine

Bug #2305

context-local vars need to be reset in a predetermined order

Added by Constantin Asofiei about 10 years ago. Updated over 7 years ago.

Status:

Closed

Priority:

Normal

Assignee:

Constantin Asofiei

Target version:

FWD - Cleanup and Stablization for Server Features

Start date:

Due date:

% Done:

100%

billable:

vendor_id:

GCD

case_num:

version_reported:

version_resolved:

ca_upd20140520c.zip (343 KB) Constantin Asofiei, 05/20/2014 01:12 PM

dependencies.zip - .graphml files with the context-local var dependencies (18.3 KB) Constantin Asofiei, 05/20/2014 01:12 PM

context-cleanup-exceptions.log (58.5 KB) Constantin Asofiei, 05/21/2014 05:37 AM

ca_upd20140601b.zip (389 KB) Constantin Asofiei, 06/01/2014 02:42 PM

ca_upd20140605a.zip (390 KB) Constantin Asofiei, 06/05/2014 02:54 AM

History

#1 Updated by Constantin Asofiei about 10 years ago

File ca_upd20140520c.zip added
File dependencies.zip added

This task is split from #2266 notes 33,34,35,36,37,38 (updates ca_upd20140512b.zip and ca_upd20140513a.zip from #2266 are related to the mentioned notes).
A resume of what it was discussed:
1. cleaning the context-local vars in an undetermined way results in spurious abends, as the context-local vars during reset may access other context-local vars, which were already cleaned.
2. 0513a.zip update used integer weights for ordering the context-local vars. Greg suggested to use an enum to define weight factors. The initial weights were these:

 10: DirtyShareFactory
 20: HQLHelperCache
 30: P2JPostgreSQLDialect
 40: DatabaseTriggerManager
 50: InMemoryLockManager
 60: LockTableUpdater
 70: BufferFlushQueue
 80: Persistence
 90: BufferManager
100: TemporaryBuffer
110: FieldAssigner
120: ConnectionManager
130: ChangeBroker
140: TableMapper
145: DatabaseManager$2
150: DatabaseManager$1
160: TempTableConnectionProvider

10000: LogicalTerminal
10010: ErrorManager
10020: TransactionManager
10030: ProcedureManager
10040: HandleChain
10050: handle
10060: DirectoryService
10070: RemoteObject

Context-local var dependencies can be seen in these cases (V1 and V2 are both context-local vars):

V1 instantiates V2 (ContextLocal c'tor is called from another ContextLocal API) - attached context-local-ctors.graphml
V1 accesses V2's payload (ContextLocal.get is called from another ContextLocal API) - attached context-local-get.graphml
V1 sets V2's payload (ContextLocal.set is called from another ContextLocal API) - attached context-local-set.graphml
during context-reset, clear the SecurityContext.tokenMap before cleaning up, and let the order be undetermined (attached t4.graphml file)
during context-reset, let the SecurityContext.tokenMap as is and let the order be undetermined (attached t1.graphml, t2.graphml, t3.graphml files)

From these, the runtime set/get cases can't be used during cleanup, because cleanup cares only about what happens during the cleanup call. To open the .graphml files, install yEd from here: http://www.yworks.com/en/products_yed_download.html Also,

The resulted weight factors were these:

level 1:
Persistence

level 2:
TemporaryBuffer
TempTableConnectionProvider
TableMapper
BufferManager

level 3:
LogicalTerminal

level 4:
TransactionManager

level 5:
ProcedureManager
FieldAssigner
SessionManager
ConnectionManager
ChangeBroker
HandleChain

last level:
handle
ErrorManager
DirectoryService
RemoteObject

A WeightFactor.LAST level is needed to ensure that these vars are cleaned last.

#2 Updated by Greg Shah about 10 years ago

Code Review 0520c

The changes look good.

#3 Updated by Constantin Asofiei about 10 years ago

File context-cleanup-exceptions.log added

After runtime testing 0520c.zip changes, the following need to be addressed:

these are reported as "surviving" after context-cleanup (i.e. they are re-created by someone else)

com.goldencode.p2j.persist.DatabaseManager$3
com.goldencode.p2j.persist.dirty.DirtyShareFactory$1
com.goldencode.p2j.net.Control$1
com.goldencode.p2j.persist.ChangeBroker$1

the Cannot get or set context local value during context cleanup exceptions in the attached log need to be checked.

#4 Updated by Eric Faulhaber about 10 years ago

Most of the ContextLocalCleanupException instances (all but one) follow this pattern:

...
at com.goldencode.p2j.persist.Persistence$1.initialValue(Persistence.java:627)
at com.goldencode.p2j.security.ContextLocal.get(ContextLocal.java:361)
at com.goldencode.p2j.persist.Persistence.cleanup(Persistence.java:3367)
at com.goldencode.p2j.persist.DatabaseManager.deregisterDatabase(DatabaseManager.java:1871)
at com.goldencode.p2j.persist.ConnectionManager.disconnectImmediately(ConnectionManager.java:1965)
...
at com.goldencode.p2j.persist.ConnectionManager$1.cleanup(ConnectionManager.java:161)
at com.goldencode.p2j.security.ContextLocal$Wrapper.cleanup(ContextLocal.java:580)
...

Stanislav added the call from DatabaseManager.deregisterDatabase to Persistence.cleanup (a new method he also added) in bzr rev. 10350, as part of the upgrade to Hibernate 4. The problem is, the implementation of Persistence.cleanup is:

context.get().cleanup();

where context is the context local variable which manages instances of an inner class Persistence$Context, each of which tracks state for the current context. In the scenario you are testing, there is no instance of Persistence$Context available in the context variable for the current session, so it calls initialValue.

The ConnectionManager mechanism was designed with the assumption that at the point of disconnect of a remote database, persistence services would have been used in the current context, and there would be some Persistence$Context state to clean up. This is not the case here; I'm not sure why. So, it is creating a new instance of Persistence$Context needlessly, in order to clean it up. In the process of creating the new instance, it is initializing other context local services, including the same ConnectionManager variable which started off the cleanup effort. Hence, the exception.

We have to either remove the implementation of initialValue for the Persistence.context anonymous inner class, or avoid invoking it in the case of Persistence.cleanup. I don't want to remove the implementation, because there are 32 other places in Persistence which call context.get(), where the creation of a new instance of Persistence$Context is appropriate if there is none for the current session. I guess we could replace these with a call to a private method that checked the return of context.get() for null, then set it if so.

However, I think it might make more sense to add a method to ContextLocal as follows:

public final T get(boolean create)

which does exactly what the existing get() method does, but only invokes initialValue if create is true. The existing get() method would simply invoke this method and pass true as the parameter, and the new method would be the worker.

Persistence.cleanup() would use this new method, passing false, and would only invoke Persistence$Context.cleanup() if a non-null value came back.

The remaining ContextLocalCleanupException instance looks like this:

...
at com.goldencode.p2j.persist.ConnectionManager.get(ConnectionManager.java:1577)
at com.goldencode.p2j.persist.ConnectionManager.getSession(ConnectionManager.java:1392)
at com.goldencode.p2j.persist.remote.RemotePersistence.<init>(RemotePersistence.java:82)
at com.goldencode.p2j.persist.PersistenceFactory.getInstance(PersistenceFactory.java:106)
at com.goldencode.p2j.persist.DatabaseManager.deregisterDatabase(DatabaseManager.java:1871)
at com.goldencode.p2j.persist.ConnectionManager.disconnectImmediately(ConnectionManager.java:1965)
...

It's a similar issue, in that the code assumes the call to PersistenceFactory.getInstance will return an existing instance of RemotePersistence (i.e., the one that was used for the remote connection being cleaned up). Instead, it is creating a new one. We can get around this by refactoring PersistenceFactory in a similar way as described above for ContextLocal.get(boolean), such that a new variant of the PersistenceFactory.getInstance method would return null instead of instantiating a new RemotePersistence instance.

However, before we implement either workaround, I have to understand: why are we seeing these cases where the thing that is supposed to exist and which we are supposed to be cleaning up at end of life does not exist?

#5 Updated by Eric Faulhaber about 10 years ago

Eric Faulhaber wrote:

However, before we implement either workaround, I have to understand: why are we seeing these cases where the thing that is supposed to exist and which we are supposed to be cleaning up at end of life does not exist?

Actually (regarding the first case I analyzed above), I guess it makes sense that the Persistence$Context instance is null at the point DatabaseManager calls Persistence.cleanup(). We are cleaning up the ConnectionManager$1 context-local variable, which has a weight factor of 5, while Persistence$1 has a weight factor of 1, meaning it already has been cleaned up by then. So, I think my first suggested workaround is OK to implement.

I still don't understand (regarding the second case), why PersistenceFactory.getInstance would be creating a new RemotePersistence instance. PersistenceFactory.remove is only called in DatabaseManager.deregisterDatabase after the call to PersistenceFactory.getInstance. It is not called from anywhere else in the project.

#6 Updated by Constantin Asofiei about 10 years ago

Eric Faulhaber wrote:

I still don't understand (regarding the second case), why PersistenceFactory.getInstance would be creating a new RemotePersistence instance. PersistenceFactory.remove is only called in DatabaseManager.deregisterDatabase after the call to PersistenceFactory.getInstance. It is not called from anywhere else in the project.

Consider this scenario:

connect to RFQ DB (via V/C)
press CTRL-C to go back to main menu
exit the client

In this case, no RemotePersistence is being created, as the remote DB was not interrogated at all: for this reason, when the cleanup occurs, the PersistenceFactory will initialize it, as it is the first access for the RemotePersistence.

Thus, the fix would be either make the PersistenceFactory.getInstance be conditional (as the ContextLocal.get is) or either call PersistenceFactory.getInstance when the ConnectionManager establises a remote connection.

On a side note, I think I have a recreated for the An error has occurred attempting to connect to the selected database system. we are seeing in the CTRL-C tests:

V/C to connect to RFQ and press CTRL-C then exit the client (as in the scenario above)
V/C to connect to RFQ fails with the error message.
subsequent V/C to connect to RFQ is OK

Before pushing for a fix for the ~~remoteFactory~~ RemotePersitence problem, I want to debug the MAJIC servers and see why the connect attempt fails, when it should have not.

#7 Updated by Constantin Asofiei about 10 years ago

Constantin Asofiei wrote:

Before pushing for a fix for the ~~remoteFactory~~ RemotePersitence problem, I want to debug the MAJIC servers and see why the connect attempt fails, when it should have not.

I think I found the reason why the remote connection fails some time: if client 1 has established a connection to a remote server and does not disconnects from the remote server before terminating his connection, the Control associated for client 1 has its Control.interrupted flag set; when client 2 connects, the Reader/Writer threads for the remote server connection are still there, so it will just want to authenticate; but, the Queue associated with the remote server connection has the Queue.ctrl field set to the Control for client 1, thus it sees the Control.interrupted flag set to true!

Looking at how the Queue.ctrl field is used, this is just a cache for the local control, associated with the Queue, in normal client-server connections (on the server side). But, when the Queue is used in a multiplexed scenario (like the server-to-server connections), the Queue.ctrl will be set by the first client establishing this connection, and each time Queue.getControl is called for contextID = 0, it will return this instance - which is not OK.

The solution I think is to not use the Queue.ctrl field in multiplexed queues.

#8 Updated by Constantin Asofiei about 10 years ago

Status changed from WIP to Review
File ca_upd20140601b.zip added

This contains the fix for RemotePersistence and note 7 (remote connection). Is going into testing now.

#9 Updated by Greg Shah about 10 years ago

That is really great work! The remote connection problem is a subtle one.

#10 Updated by Constantin Asofiei about 10 years ago

0601b.zip passed runtime testing - I've ran it twice, and no ContextLocalCleanupException or surviving context-local vars are reported.

#11 Updated by Eric Faulhaber about 10 years ago

Update 0601b looks good to me, nice work. Please note that header entry number 098 is duplicated in TemporaryBuffer.

#12 Updated by Constantin Asofiei about 10 years ago

File ca_upd20140605a.zip added
% Done changed from 0 to 100

Attached was committed to bzr rev 10544.

#13 Updated by Greg Shah about 10 years ago

Status changed from Review to Closed

#14 Updated by Greg Shah over 7 years ago

Target version changed from Milestone 11 to Cleanup and Stablization for Server Features

Also available in: Atom PDF

Project

General

Profile

FWD » Core Development » Runtime Infrastructure

Issues

Custom queries