Bug #2305
context-local vars need to be reset in a predetermined order
100%
History
#1 Updated by Constantin Asofiei almost 10 years ago
- File ca_upd20140520c.zip added
- File dependencies.zip added
This task is split from #2266 notes 33,34,35,36,37,38 (updates ca_upd20140512b.zip and ca_upd20140513a.zip from #2266 are related to the mentioned notes).
A resume of what it was discussed:
1. cleaning the context-local vars in an undetermined way results in spurious abends, as the context-local vars during reset may access other context-local vars, which were already cleaned.
2. 0513a.zip update used integer weights for ordering the context-local vars. Greg suggested to use an enum to define weight factors. The initial weights were these:
10: DirtyShareFactory 20: HQLHelperCache 30: P2JPostgreSQLDialect 40: DatabaseTriggerManager 50: InMemoryLockManager 60: LockTableUpdater 70: BufferFlushQueue 80: Persistence 90: BufferManager 100: TemporaryBuffer 110: FieldAssigner 120: ConnectionManager 130: ChangeBroker 140: TableMapper 145: DatabaseManager$2 150: DatabaseManager$1 160: TempTableConnectionProvider 10000: LogicalTerminal 10010: ErrorManager 10020: TransactionManager 10030: ProcedureManager 10040: HandleChain 10050: handle 10060: DirectoryService 10070: RemoteObjectContext-local var dependencies can be seen in these cases (V1 and V2 are both context-local vars):
- V1 instantiates V2 (
ContextLocal
c'tor is called from anotherContextLocal
API) - attachedcontext-local-ctors.graphml
- V1 accesses V2's payload (
ContextLocal.get
is called from anotherContextLocal
API) - attachedcontext-local-get.graphml
- V1 sets V2's payload (
ContextLocal.set
is called from anotherContextLocal
API) - attachedcontext-local-set.graphml
- during context-reset, clear the SecurityContext.tokenMap before cleaning up, and let the order be undetermined (attached
t4.graphml
file) - during context-reset, let the SecurityContext.tokenMap as is and let the order be undetermined (attached
t1.graphml
,t2.graphml
,t3.graphml
files)
From these, the runtime set/get cases can't be used during cleanup, because cleanup cares only about what happens during the cleanup call. To open the .graphml
files, install yEd from here: http://www.yworks.com/en/products_yed_download.html Also,
The resulted weight factors were these:
level 1: Persistence level 2: TemporaryBuffer TempTableConnectionProvider TableMapper BufferManager level 3: LogicalTerminal level 4: TransactionManager level 5: ProcedureManager FieldAssigner SessionManager ConnectionManager ChangeBroker HandleChain last level: handle ErrorManager DirectoryService RemoteObject
A WeightFactor.LAST
level is needed to ensure that these vars are cleaned last.
#2 Updated by Greg Shah almost 10 years ago
Code Review 0520c
The changes look good.
#3 Updated by Constantin Asofiei almost 10 years ago
- File context-cleanup-exceptions.log added
- these are reported as "surviving" after context-cleanup (i.e. they are re-created by someone else)
com.goldencode.p2j.persist.DatabaseManager$3 com.goldencode.p2j.persist.dirty.DirtyShareFactory$1 com.goldencode.p2j.net.Control$1 com.goldencode.p2j.persist.ChangeBroker$1
- the
Cannot get or set context local value during context cleanup
exceptions in the attached log need to be checked.
#4 Updated by Eric Faulhaber almost 10 years ago
Most of the ContextLocalCleanupException
instances (all but one) follow this pattern:
... at com.goldencode.p2j.persist.Persistence$1.initialValue(Persistence.java:627) at com.goldencode.p2j.security.ContextLocal.get(ContextLocal.java:361) at com.goldencode.p2j.persist.Persistence.cleanup(Persistence.java:3367) at com.goldencode.p2j.persist.DatabaseManager.deregisterDatabase(DatabaseManager.java:1871) at com.goldencode.p2j.persist.ConnectionManager.disconnectImmediately(ConnectionManager.java:1965) ... at com.goldencode.p2j.persist.ConnectionManager$1.cleanup(ConnectionManager.java:161) at com.goldencode.p2j.security.ContextLocal$Wrapper.cleanup(ContextLocal.java:580) ...
Stanislav added the call from DatabaseManager.deregisterDatabase
to Persistence.cleanup
(a new method he also added) in bzr rev. 10350, as part of the upgrade to Hibernate 4. The problem is, the implementation of Persistence.cleanup
is:
context.get().cleanup();
where
context
is the context local variable which manages instances of an inner class Persistence$Context
, each of which tracks state for the current context. In the scenario you are testing, there is no instance of Persistence$Context
available in the context
variable for the current session, so it calls initialValue
.
The ConnectionManager
mechanism was designed with the assumption that at the point of disconnect of a remote database, persistence services would have been used in the current context, and there would be some Persistence$Context
state to clean up. This is not the case here; I'm not sure why. So, it is creating a new instance of Persistence$Context
needlessly, in order to clean it up. In the process of creating the new instance, it is initializing other context local services, including the same ConnectionManager
variable which started off the cleanup effort. Hence, the exception.
We have to either remove the implementation of initialValue
for the Persistence.context
anonymous inner class, or avoid invoking it in the case of Persistence.cleanup
. I don't want to remove the implementation, because there are 32 other places in Persistence
which call context.get()
, where the creation of a new instance of Persistence$Context
is appropriate if there is none for the current session. I guess we could replace these with a call to a private method that checked the return of context.get()
for null, then set it if so.
However, I think it might make more sense to add a method to ContextLocal
as follows:
public final T get(boolean create)
which does exactly what the existing
get()
method does, but only invokes initialValue
if create
is true. The existing get()
method would simply invoke this method and pass true as the parameter, and the new method would be the worker.
Persistence.cleanup()
would use this new method, passing false, and would only invoke Persistence$Context.cleanup()
if a non-null value came back.
The remaining ContextLocalCleanupException
instance looks like this:
... at com.goldencode.p2j.persist.ConnectionManager.get(ConnectionManager.java:1577) at com.goldencode.p2j.persist.ConnectionManager.getSession(ConnectionManager.java:1392) at com.goldencode.p2j.persist.remote.RemotePersistence.<init>(RemotePersistence.java:82) at com.goldencode.p2j.persist.PersistenceFactory.getInstance(PersistenceFactory.java:106) at com.goldencode.p2j.persist.DatabaseManager.deregisterDatabase(DatabaseManager.java:1871) at com.goldencode.p2j.persist.ConnectionManager.disconnectImmediately(ConnectionManager.java:1965) ...
It's a similar issue, in that the code assumes the call to PersistenceFactory.getInstance
will return an existing instance of RemotePersistence
(i.e., the one that was used for the remote connection being cleaned up). Instead, it is creating a new one. We can get around this by refactoring PersistenceFactory
in a similar way as described above for ContextLocal.get(boolean)
, such that a new variant of the PersistenceFactory.getInstance
method would return null instead of instantiating a new RemotePersistence
instance.
However, before we implement either workaround, I have to understand: why are we seeing these cases where the thing that is supposed to exist and which we are supposed to be cleaning up at end of life does not exist?
#5 Updated by Eric Faulhaber almost 10 years ago
Eric Faulhaber wrote:
However, before we implement either workaround, I have to understand: why are we seeing these cases where the thing that is supposed to exist and which we are supposed to be cleaning up at end of life does not exist?
Actually (regarding the first case I analyzed above), I guess it makes sense that the Persistence$Context
instance is null at the point DatabaseManager
calls Persistence.cleanup()
. We are cleaning up the ConnectionManager$1
context-local variable, which has a weight factor of 5, while Persistence$1
has a weight factor of 1, meaning it already has been cleaned up by then. So, I think my first suggested workaround is OK to implement.
I still don't understand (regarding the second case), why PersistenceFactory.getInstance
would be creating a new RemotePersistence
instance. PersistenceFactory.remove
is only called in DatabaseManager.deregisterDatabase
after the call to PersistenceFactory.getInstance
. It is not called from anywhere else in the project.
#6 Updated by Constantin Asofiei almost 10 years ago
Eric Faulhaber wrote:
Consider this scenario:I still don't understand (regarding the second case), why
PersistenceFactory.getInstance
would be creating a newRemotePersistence
instance.PersistenceFactory.remove
is only called inDatabaseManager.deregisterDatabase
after the call toPersistenceFactory.getInstance
. It is not called from anywhere else in the project.
- connect to RFQ DB (via V/C)
- press CTRL-C to go back to main menu
- exit the client
In this case, no RemotePersistence is being created, as the remote DB was not interrogated at all: for this reason, when the cleanup occurs, the PersistenceFactory will initialize it, as it is the first access for the RemotePersistence.
Thus, the fix would be either make the PersistenceFactory.getInstance
be conditional (as the ContextLocal.get is) or either call PersistenceFactory.getInstance
when the ConnectionManager
establises a remote connection.
An error has occurred attempting to connect to the selected database system.
we are seeing in the CTRL-C tests:
- V/C to connect to RFQ and press CTRL-C then exit the client (as in the scenario above)
- V/C to connect to RFQ fails with the error message.
- subsequent V/C to connect to RFQ is OK
Before pushing for a fix for the remoteFactory RemotePersitence problem, I want to debug the MAJIC servers and see why the connect attempt fails, when it should have not.
#7 Updated by Constantin Asofiei almost 10 years ago
Constantin Asofiei wrote:
Before pushing for a fix for the
remoteFactoryRemotePersitence problem, I want to debug the MAJIC servers and see why the connect attempt fails, when it should have not.
I think I found the reason why the remote connection fails some time: if client 1 has established a connection to a remote server and does not disconnects from the remote server before terminating his connection, the Control associated for client 1 has its Control.interrupted
flag set; when client 2 connects, the Reader/Writer threads for the remote server connection are still there, so it will just want to authenticate; but, the Queue associated with the remote server connection has the Queue.ctrl
field set to the Control for client 1, thus it sees the Control.interrupted
flag set to true!
Looking at how the Queue.ctrl
field is used, this is just a cache for the local control, associated with the Queue, in normal client-server connections (on the server side). But, when the Queue is used in a multiplexed scenario (like the server-to-server connections), the Queue.ctrl
will be set by the first client establishing this connection, and each time Queue.getControl
is called for contextID = 0
, it will return this instance - which is not OK.
The solution I think is to not use the Queue.ctrl
field in multiplexed queues.
#8 Updated by Constantin Asofiei almost 10 years ago
- Status changed from WIP to Review
- File ca_upd20140601b.zip added
This contains the fix for RemotePersistence and note 7 (remote connection). Is going into testing now.
#9 Updated by Greg Shah almost 10 years ago
That is really great work! The remote connection problem is a subtle one.
#10 Updated by Constantin Asofiei almost 10 years ago
0601b.zip passed runtime testing - I've ran it twice, and no ContextLocalCleanupException
or surviving context-local vars are reported.
#11 Updated by Eric Faulhaber almost 10 years ago
Update 0601b looks good to me, nice work. Please note that header entry number 098 is duplicated in TemporaryBuffer
.
#12 Updated by Constantin Asofiei almost 10 years ago
- File ca_upd20140605a.zip added
- % Done changed from 0 to 100
Attached was committed to bzr rev 10544.
#13 Updated by Greg Shah almost 10 years ago
- Status changed from Review to Closed
#14 Updated by Greg Shah over 7 years ago
- Target version changed from Milestone 11 to Cleanup and Stablization for Server Features