Project

General

Profile

Bug #3345

deadlock in virtual session usage

Added by Constantin Asofiei over 6 years ago. Updated over 6 years ago.

Status:
Closed
Priority:
Normal
Target version:
-
Start date:
Due date:
% Done:

100%

billable:
No
vendor_id:
GCD
case_num:
version_reported:
version_resolved:

deadlock_3260a.log Magnifier (67.6 KB) Constantin Asofiei, 10/06/2017 04:58 PM

gso_server_deadlock.log Magnifier (102 KB) Constantin Asofiei, 10/06/2017 04:58 PM

rfq_server_deadlock.log Magnifier (49.4 KB) Constantin Asofiei, 10/06/2017 04:58 PM

History

#1 Updated by Constantin Asofiei over 6 years ago

#3260-18 found a deadlock which had an attempted fix in 3260a 11166 (which was backed out as it was incorrect).

The problem is a deadlock between the Queue and RouterSessionManager monitors, as can be seen in the attached logs.

#2 Updated by Constantin Asofiei over 6 years ago

The problem is concurrent SecurityContext.cleanup() calls:

"Reader [000000A4:appuser]":
    at com.goldencode.p2j.net.SessionManager.terminateVirtual(SessionManager.java:504)
    - waiting to lock <0x00000006c07a0548> (a com.goldencode.p2j.net.RouterSessionManager)
    at com.goldencode.p2j.net.RequesterSession.terminate(RequesterSession.java:141)
    at com.goldencode.p2j.persist.dirty.DirtyShareFactory.unregisterManagerForDatabase(DirtyShareFactory.java:269)
    at com.goldencode.p2j.persist.DatabaseManager.deregisterDatabase(DatabaseManager.java:2082)
    at com.goldencode.p2j.persist.ConnectionManager.disconnectImmediately(ConnectionManager.java:2326)
    at com.goldencode.p2j.persist.ConnectionManager.disconnectImmediately(ConnectionManager.java:2298)
    at com.goldencode.p2j.persist.ConnectionManager.sessionClosed(ConnectionManager.java:2277)
    at com.goldencode.p2j.persist.ConnectionManager.access$100(ConnectionManager.java:233)
    at com.goldencode.p2j.persist.ConnectionManager$1.cleanup(ConnectionManager.java:255)
    at com.goldencode.p2j.persist.ConnectionManager$1.cleanup(ConnectionManager.java:241)
    at com.goldencode.p2j.security.ContextLocal$Wrapper.cleanup(ContextLocal.java:697)
    at com.goldencode.p2j.security.SecurityContext.reset(SecurityContext.java:623)
    at com.goldencode.p2j.security.SecurityContext.cleanupWorker(SecurityContext.java:602)
    at com.goldencode.p2j.security.SecurityContext.cleaxnupWorker(SecurityContext.java:527)
    at com.goldencode.p2j.security.SecurityContext.cleanup(SecurityContext.java:505)
    - locked <0x00000007771f95e0> (a java.lang.Object)
    at com.goldencode.p2j.security.SecurityManager.terminateSession(SecurityManager.java:3884)
    at com.goldencode.p2j.net.BaseSession.cleanupContext(BaseSession.java:415)
    at com.goldencode.p2j.net.BaseSession.end(BaseSession.java:463)
    at com.goldencode.p2j.net.SessionManager.endSession(SessionManager.java:1256)
    at com.goldencode.p2j.net.SessionManager.queueStopped(SessionManager.java:1695)
    - locked <0x00000006c07a0838> (a java.lang.Object)
    - locked <0x00000007771f9e60> (a com.goldencode.p2j.net.Queue)
    at com.goldencode.p2j.net.Queue.stop(Queue.java:465)
    - locked <0x00000007771f9e60> (a com.goldencode.p2j.net.Queue)
    at com.goldencode.p2j.net.Protocol.stopQueue(Protocol.java:382)
    at com.goldencode.p2j.net.Protocol.access$600(Protocol.java:154)
    at com.goldencode.p2j.net.Protocol$Reader.run(Protocol.java:487)
    - locked <0x00000007771fa020> (a java.lang.Object)
    at java.lang.Thread.run(Thread.java:745)
"Conversation [000000A5:appuser]":
    at com.goldencode.p2j.net.SessionManager.deregisterSession(SessionManager.java:1355)
    - waiting to lock <0x00000006c07a0838> (a java.lang.Object)
    - locked <0x000000077926ba30> (a com.goldencode.p2j.net.Queue)
    at com.goldencode.p2j.net.RouterSessionManager.deregisterSession(RouterSessionManager.java:987)
    at com.goldencode.p2j.net.SessionManager.endSession(SessionManager.java:1218)
    at com.goldencode.p2j.net.SessionManager.terminateVirtual(SessionManager.java:524)
    - locked <0x000000077926bb98> (a java.lang.Object)
    - locked <0x00000006c07a0548> (a com.goldencode.p2j.net.RouterSessionManager)
    at com.goldencode.p2j.net.RequesterSession.terminate(RequesterSession.java:141)
    at com.goldencode.p2j.persist.ConnectionManager.disconnectImmediately(ConnectionManager.java:2336)
    at com.goldencode.p2j.persist.ConnectionManager.disconnectImmediately(ConnectionManager.java:2298)
    at com.goldencode.p2j.persist.ConnectionManager.sessionClosed(ConnectionManager.java:2277)
    at com.goldencode.p2j.persist.ConnectionManager.access$100(ConnectionManager.java:233)
    at com.goldencode.p2j.persist.ConnectionManager$1.cleanup(ConnectionManager.java:255)
    at com.goldencode.p2j.persist.ConnectionManager$1.cleanup(ConnectionManager.java:241)
    at com.goldencode.p2j.security.ContextLocal$Wrapper.cleanup(ContextLocal.java:697)
    at com.goldencode.p2j.security.SecurityContext.reset(SecurityContext.java:623)
    at com.goldencode.p2j.security.SecurityContext.cleanupWorker(SecurityContext.java:602)
    at com.goldencode.p2j.security.SecurityContext.cleanupWorker(SecurityContext.java:527)
    at com.goldencode.p2j.security.SecurityContext.cleanup(SecurityContext.java:505)
    - locked <0x00000007a0b81bd8> (a java.lang.Object)
    at com.goldencode.p2j.security.SecurityManager.endContext(SecurityManager.java:7398)
    at com.goldencode.p2j.security.SecurityManager.popContextWorker(SecurityManager.java:7336)
    at com.goldencode.p2j.security.SecurityManager.popAndRestoreSecurityContext(SecurityManager.java:4493)
    at com.goldencode.p2j.net.RouterSessionManager.restoreContext(RouterSessionManager.java:1163)
    at com.goldencode.p2j.net.Conversation.run(Conversation.java:263)
    at java.lang.Thread.run(Thread.java:745)

This should never happen; what was missing was protection in SecurityManager.terminateSessionWorker, to not continue if cleanupLocks.get(ctx) already contains a flag (which indicates another thread is cleaning up).

The fix was committed to 3430a rev 11222.

Note that the gso/rfq logs are with the #3260-18 fix attempt, which was incorrect; the main log is the deadlock_3260a.log file.

#3 Updated by Constantin Asofiei over 6 years ago

  • % Done changed from 0 to 90
  • Status changed from New to WIP
  • Assignee set to Constantin Asofiei

#4 Updated by Greg Shah over 6 years ago

Code Review Task Branch 3430a Revision 11222

I'm good with the changes.

#5 Updated by Constantin Asofiei over 6 years ago

3430a rev 11222 passed runtime testing.

#6 Updated by Greg Shah over 6 years ago

CTRL-C testing passed too?

#7 Updated by Constantin Asofiei over 6 years ago

Yes, I have a clean run using runtime-regression.

#8 Updated by Greg Shah over 6 years ago

Unless you see a reason to wait, go ahead with a rebase and then merge to trunk.

#9 Updated by Constantin Asofiei over 6 years ago

3340a was merged to trunk rev 11221 and archived.

#10 Updated by Constantin Asofiei over 6 years ago

There is still a deadlock, even with my fix:

Found one Java-level deadlock:
=============================
"Reader":
  waiting to lock monitor 0x00007ff0a4004468 (object 0x0000000783506c70, a com.goldencode.p2j.net.Queue),
  which is held by "Router 2" 
"Router 2":
  waiting to lock monitor 0x00007ff0e0005858 (object 0x00000006c1366508, a java.lang.Object),
  which is held by "Reader [0000009A:syman]" 
"Reader [0000009A:syman]":
  waiting to lock monitor 0x00007ff0d80072f8 (object 0x00000006c1366218, a com.goldencode.p2j.net.RouterSessionManager),
  which is held by "Conversation [0000009B:syman]" 
"Conversation [0000009B:syman]":
  waiting to lock monitor 0x00007ff0e0005858 (object 0x00000006c1366508, a java.lang.Object),
  which is held by "Reader [0000009A:syman]" 

Java stack information for the threads listed above:
===================================================
"Reader":
        at com.goldencode.p2j.net.Queue.checkState(Queue.java:1304)
        - waiting to lock <0x0000000783506c70> (a com.goldencode.p2j.net.Queue)
        at com.goldencode.p2j.net.Queue.enqueueInbound(Queue.java:717)
        at com.goldencode.p2j.net.Protocol$Reader.run(Protocol.java:424)
        at java.lang.Thread.run(Thread.java:748)
"Router 2":
        at com.goldencode.p2j.net.RouterSessionManager.createDirectSession(RouterSessionManager.java:894)
        - waiting to lock <0x00000006c1366508> (a java.lang.Object)
        - locked <0x0000000783506c70> (a com.goldencode.p2j.net.Queue)
        at com.goldencode.p2j.net.Router.route(Router.java:358)
        at com.goldencode.p2j.net.Router.access$100(Router.java:104)
        at com.goldencode.p2j.net.Router$RouterStub.run(Router.java:688)
        at java.lang.Thread.run(Thread.java:748)
"Reader [0000009A:syman]":
        at com.goldencode.p2j.net.SessionManager.terminateVirtual(SessionManager.java:504)
        - waiting to lock <0x00000006c1366218> (a com.goldencode.p2j.net.RouterSessionManager)
        at com.goldencode.p2j.net.RequesterSession.terminate(RequesterSession.java:141)
        at com.goldencode.p2j.persist.ConnectionManager.disconnectImmediately(ConnectionManager.java:2369)
        at com.goldencode.p2j.persist.ConnectionManager.disconnectImmediately(ConnectionManager.java:2331)
        at com.goldencode.p2j.persist.ConnectionManager.sessionClosed(ConnectionManager.java:2310)
        at com.goldencode.p2j.persist.ConnectionManager.access$100(ConnectionManager.java:246)
        at com.goldencode.p2j.persist.ConnectionManager$1.cleanup(ConnectionManager.java:268)
        at com.goldencode.p2j.persist.ConnectionManager$1.cleanup(ConnectionManager.java:254)
        at com.goldencode.p2j.security.ContextLocal$Wrapper.cleanup(ContextLocal.java:697)
        at com.goldencode.p2j.security.SecurityContext.reset(SecurityContext.java:623)
        at com.goldencode.p2j.security.SecurityContext.cleanupWorker(SecurityContext.java:602)
        at com.goldencode.p2j.security.SecurityContext.cleanupWorker(SecurityContext.java:527)
        at com.goldencode.p2j.security.SecurityContext.cleanup(SecurityContext.java:505)
        - locked <0x00000007978ebfe8> (a java.lang.Object)
        at com.goldencode.p2j.security.SecurityManager.terminateSessionWorker(SecurityManager.java:3988)
        at com.goldencode.p2j.security.SecurityManager.terminateSession(SecurityManager.java:3892)
        at com.goldencode.p2j.net.BaseSession.cleanupContext(BaseSession.java:415)
        at com.goldencode.p2j.net.BaseSession.end(BaseSession.java:463)
        at com.goldencode.p2j.net.SessionManager.endSession(SessionManager.java:1256)
        at com.goldencode.p2j.net.SessionManager.queueStopped(SessionManager.java:1695)
        - locked <0x00000006c1366508> (a java.lang.Object)
        - locked <0x00000007971b0958> (a com.goldencode.p2j.net.Queue)
        at com.goldencode.p2j.net.Queue.stop(Queue.java:465)
        - locked <0x00000007971b0958> (a com.goldencode.p2j.net.Queue)
        at com.goldencode.p2j.net.Protocol.stopQueue(Protocol.java:382)
        at com.goldencode.p2j.net.Protocol.access$600(Protocol.java:154)
        at com.goldencode.p2j.net.Protocol$Reader.run(Protocol.java:487)
        - locked <0x00000007978ec060> (a java.lang.Object)
        at java.lang.Thread.run(Thread.java:748)
"Conversation [0000009B:syman]":
        at com.goldencode.p2j.net.SessionManager.deregisterSession(SessionManager.java:1355)
        - waiting to lock <0x00000006c1366508> (a java.lang.Object)
        - locked <0x00000007971889b8> (a com.goldencode.p2j.net.Queue)
        at com.goldencode.p2j.net.RouterSessionManager.deregisterSession(RouterSessionManager.java:987)
        at com.goldencode.p2j.net.SessionManager.endSession(SessionManager.java:1218)
        at com.goldencode.p2j.net.SessionManager.terminateVirtual(SessionManager.java:524)
        - locked <0x00000007972d4440> (a java.lang.Object)
        - locked <0x00000006c1366218> (a com.goldencode.p2j.net.RouterSessionManager)
        at com.goldencode.p2j.net.RequesterSession.terminate(RequesterSession.java:141)
        at com.goldencode.p2j.persist.ConnectionManager.disconnectImmediately(ConnectionManager.java:2369)
        at com.goldencode.p2j.persist.ConnectionManager.disconnectImmediately(ConnectionManager.java:2331)
        at com.goldencode.p2j.persist.ConnectionManager.sessionClosed(ConnectionManager.java:2310)
        at com.goldencode.p2j.persist.ConnectionManager.access$100(ConnectionManager.java:246)
        at com.goldencode.p2j.persist.ConnectionManager$1.cleanup(ConnectionManager.java:268)
        at com.goldencode.p2j.persist.ConnectionManager$1.cleanup(ConnectionManager.java:254)
        at com.goldencode.p2j.security.ContextLocal$Wrapper.cleanup(ContextLocal.java:697)
        at com.goldencode.p2j.security.SecurityContext.reset(SecurityContext.java:623)
        at com.goldencode.p2j.security.SecurityContext.cleanupWorker(SecurityContext.java:602)
        at com.goldencode.p2j.security.SecurityContext.cleanupWorker(SecurityContext.java:527)
        at com.goldencode.p2j.security.SecurityContext.cleanup(SecurityContext.java:505)
        - locked <0x000000077fd6d038> (a java.lang.Object)
        at com.goldencode.p2j.security.SecurityManager.endContext(SecurityManager.java:7603)
        at com.goldencode.p2j.security.SecurityManager.popContextWorker(SecurityManager.java:7541)
        at com.goldencode.p2j.security.SecurityManager.popAndRestoreSecurityContext(SecurityManager.java:4663)
        at com.goldencode.p2j.net.RouterSessionManager.restoreContext(RouterSessionManager.java:1163)
        at com.goldencode.p2j.net.Conversation.run(Conversation.java:263)
        at java.lang.Thread.run(Thread.java:748)

I think the SecurityManager.cleanupLocks map should have as keys the queue, not the context - this is because for multiplexed queues, there could be two different contexts cleaning up, as seen above: 0000009A:syman vs 0000009B:syman.

#11 Updated by Constantin Asofiei over 6 years ago

3383a rev 11224 is the second attempt at fixing this deadlock - it sets a common lock (Derived from queue) for all sessions (including virtual sessions), so that it will be used by context cleanup.

#12 Updated by Constantin Asofiei over 6 years ago

I've let it running 10 times in a loop over the weekend, only ctrl-c - it managed to run without deadlocks, but only one run was fully passed.

#13 Updated by Greg Shah over 6 years ago

but only one run was fully passed.

Were these failures the commonly seen issues in the 3-way tests?

#14 Updated by Constantin Asofiei over 6 years ago

Greg Shah wrote:

but only one run was fully passed.

Were these failures the commonly seen issues in the 3-way tests?

Yes

#15 Updated by Greg Shah over 6 years ago

Code Review Task Branch 3383a Revision 11224

I'm good with the deadlock fix.

#16 Updated by Greg Shah over 6 years ago

  • % Done changed from 90 to 100
  • Status changed from WIP to Closed

This can be closed, right?

#17 Updated by Constantin Asofiei over 6 years ago

Greg Shah wrote:

This can be closed, right?

Yes, final fix was in 3383a which was merged to trunk as revision 11227.

Also available in: Atom PDF