Bug #3345
deadlock in virtual session usage
100%
History
#1 Updated by Constantin Asofiei over 6 years ago
- File rfq_server_deadlock.log
added
- File deadlock_3260a.log
added
- File gso_server_deadlock.log
added
#3260-18 found a deadlock which had an attempted fix in 3260a 11166 (which was backed out as it was incorrect).
The problem is a deadlock between the Queue and RouterSessionManager monitors, as can be seen in the attached logs.
#2 Updated by Constantin Asofiei over 6 years ago
The problem is concurrent SecurityContext.cleanup()
calls:
"Reader [000000A4:appuser]": at com.goldencode.p2j.net.SessionManager.terminateVirtual(SessionManager.java:504) - waiting to lock <0x00000006c07a0548> (a com.goldencode.p2j.net.RouterSessionManager) at com.goldencode.p2j.net.RequesterSession.terminate(RequesterSession.java:141) at com.goldencode.p2j.persist.dirty.DirtyShareFactory.unregisterManagerForDatabase(DirtyShareFactory.java:269) at com.goldencode.p2j.persist.DatabaseManager.deregisterDatabase(DatabaseManager.java:2082) at com.goldencode.p2j.persist.ConnectionManager.disconnectImmediately(ConnectionManager.java:2326) at com.goldencode.p2j.persist.ConnectionManager.disconnectImmediately(ConnectionManager.java:2298) at com.goldencode.p2j.persist.ConnectionManager.sessionClosed(ConnectionManager.java:2277) at com.goldencode.p2j.persist.ConnectionManager.access$100(ConnectionManager.java:233) at com.goldencode.p2j.persist.ConnectionManager$1.cleanup(ConnectionManager.java:255) at com.goldencode.p2j.persist.ConnectionManager$1.cleanup(ConnectionManager.java:241) at com.goldencode.p2j.security.ContextLocal$Wrapper.cleanup(ContextLocal.java:697) at com.goldencode.p2j.security.SecurityContext.reset(SecurityContext.java:623) at com.goldencode.p2j.security.SecurityContext.cleanupWorker(SecurityContext.java:602) at com.goldencode.p2j.security.SecurityContext.cleaxnupWorker(SecurityContext.java:527) at com.goldencode.p2j.security.SecurityContext.cleanup(SecurityContext.java:505) - locked <0x00000007771f95e0> (a java.lang.Object) at com.goldencode.p2j.security.SecurityManager.terminateSession(SecurityManager.java:3884) at com.goldencode.p2j.net.BaseSession.cleanupContext(BaseSession.java:415) at com.goldencode.p2j.net.BaseSession.end(BaseSession.java:463) at com.goldencode.p2j.net.SessionManager.endSession(SessionManager.java:1256) at com.goldencode.p2j.net.SessionManager.queueStopped(SessionManager.java:1695) - locked <0x00000006c07a0838> (a java.lang.Object) - locked <0x00000007771f9e60> (a com.goldencode.p2j.net.Queue) at com.goldencode.p2j.net.Queue.stop(Queue.java:465) - locked <0x00000007771f9e60> (a com.goldencode.p2j.net.Queue) at com.goldencode.p2j.net.Protocol.stopQueue(Protocol.java:382) at com.goldencode.p2j.net.Protocol.access$600(Protocol.java:154) at com.goldencode.p2j.net.Protocol$Reader.run(Protocol.java:487) - locked <0x00000007771fa020> (a java.lang.Object) at java.lang.Thread.run(Thread.java:745) "Conversation [000000A5:appuser]": at com.goldencode.p2j.net.SessionManager.deregisterSession(SessionManager.java:1355) - waiting to lock <0x00000006c07a0838> (a java.lang.Object) - locked <0x000000077926ba30> (a com.goldencode.p2j.net.Queue) at com.goldencode.p2j.net.RouterSessionManager.deregisterSession(RouterSessionManager.java:987) at com.goldencode.p2j.net.SessionManager.endSession(SessionManager.java:1218) at com.goldencode.p2j.net.SessionManager.terminateVirtual(SessionManager.java:524) - locked <0x000000077926bb98> (a java.lang.Object) - locked <0x00000006c07a0548> (a com.goldencode.p2j.net.RouterSessionManager) at com.goldencode.p2j.net.RequesterSession.terminate(RequesterSession.java:141) at com.goldencode.p2j.persist.ConnectionManager.disconnectImmediately(ConnectionManager.java:2336) at com.goldencode.p2j.persist.ConnectionManager.disconnectImmediately(ConnectionManager.java:2298) at com.goldencode.p2j.persist.ConnectionManager.sessionClosed(ConnectionManager.java:2277) at com.goldencode.p2j.persist.ConnectionManager.access$100(ConnectionManager.java:233) at com.goldencode.p2j.persist.ConnectionManager$1.cleanup(ConnectionManager.java:255) at com.goldencode.p2j.persist.ConnectionManager$1.cleanup(ConnectionManager.java:241) at com.goldencode.p2j.security.ContextLocal$Wrapper.cleanup(ContextLocal.java:697) at com.goldencode.p2j.security.SecurityContext.reset(SecurityContext.java:623) at com.goldencode.p2j.security.SecurityContext.cleanupWorker(SecurityContext.java:602) at com.goldencode.p2j.security.SecurityContext.cleanupWorker(SecurityContext.java:527) at com.goldencode.p2j.security.SecurityContext.cleanup(SecurityContext.java:505) - locked <0x00000007a0b81bd8> (a java.lang.Object) at com.goldencode.p2j.security.SecurityManager.endContext(SecurityManager.java:7398) at com.goldencode.p2j.security.SecurityManager.popContextWorker(SecurityManager.java:7336) at com.goldencode.p2j.security.SecurityManager.popAndRestoreSecurityContext(SecurityManager.java:4493) at com.goldencode.p2j.net.RouterSessionManager.restoreContext(RouterSessionManager.java:1163) at com.goldencode.p2j.net.Conversation.run(Conversation.java:263) at java.lang.Thread.run(Thread.java:745)
This should never happen; what was missing was protection in SecurityManager.terminateSessionWorker
, to not continue if cleanupLocks.get(ctx)
already contains a flag (which indicates another thread is cleaning up).
The fix was committed to 3430a rev 11222.
Note that the gso/rfq logs are with the #3260-18 fix attempt, which was incorrect; the main log is the deadlock_3260a.log file.
#3 Updated by Constantin Asofiei over 6 years ago
- % Done changed from 0 to 90
- Status changed from New to WIP
- Assignee set to Constantin Asofiei
#4 Updated by Greg Shah over 6 years ago
Code Review Task Branch 3430a Revision 11222
I'm good with the changes.
#5 Updated by Constantin Asofiei over 6 years ago
3430a rev 11222 passed runtime testing.
#6 Updated by Greg Shah over 6 years ago
CTRL-C testing passed too?
#7 Updated by Constantin Asofiei over 6 years ago
Yes, I have a clean run using runtime-regression
.
#8 Updated by Greg Shah over 6 years ago
Unless you see a reason to wait, go ahead with a rebase and then merge to trunk.
#9 Updated by Constantin Asofiei over 6 years ago
3340a was merged to trunk rev 11221 and archived.
#10 Updated by Constantin Asofiei over 6 years ago
There is still a deadlock, even with my fix:
Found one Java-level deadlock: ============================= "Reader": waiting to lock monitor 0x00007ff0a4004468 (object 0x0000000783506c70, a com.goldencode.p2j.net.Queue), which is held by "Router 2" "Router 2": waiting to lock monitor 0x00007ff0e0005858 (object 0x00000006c1366508, a java.lang.Object), which is held by "Reader [0000009A:syman]" "Reader [0000009A:syman]": waiting to lock monitor 0x00007ff0d80072f8 (object 0x00000006c1366218, a com.goldencode.p2j.net.RouterSessionManager), which is held by "Conversation [0000009B:syman]" "Conversation [0000009B:syman]": waiting to lock monitor 0x00007ff0e0005858 (object 0x00000006c1366508, a java.lang.Object), which is held by "Reader [0000009A:syman]" Java stack information for the threads listed above: =================================================== "Reader": at com.goldencode.p2j.net.Queue.checkState(Queue.java:1304) - waiting to lock <0x0000000783506c70> (a com.goldencode.p2j.net.Queue) at com.goldencode.p2j.net.Queue.enqueueInbound(Queue.java:717) at com.goldencode.p2j.net.Protocol$Reader.run(Protocol.java:424) at java.lang.Thread.run(Thread.java:748) "Router 2": at com.goldencode.p2j.net.RouterSessionManager.createDirectSession(RouterSessionManager.java:894) - waiting to lock <0x00000006c1366508> (a java.lang.Object) - locked <0x0000000783506c70> (a com.goldencode.p2j.net.Queue) at com.goldencode.p2j.net.Router.route(Router.java:358) at com.goldencode.p2j.net.Router.access$100(Router.java:104) at com.goldencode.p2j.net.Router$RouterStub.run(Router.java:688) at java.lang.Thread.run(Thread.java:748) "Reader [0000009A:syman]": at com.goldencode.p2j.net.SessionManager.terminateVirtual(SessionManager.java:504) - waiting to lock <0x00000006c1366218> (a com.goldencode.p2j.net.RouterSessionManager) at com.goldencode.p2j.net.RequesterSession.terminate(RequesterSession.java:141) at com.goldencode.p2j.persist.ConnectionManager.disconnectImmediately(ConnectionManager.java:2369) at com.goldencode.p2j.persist.ConnectionManager.disconnectImmediately(ConnectionManager.java:2331) at com.goldencode.p2j.persist.ConnectionManager.sessionClosed(ConnectionManager.java:2310) at com.goldencode.p2j.persist.ConnectionManager.access$100(ConnectionManager.java:246) at com.goldencode.p2j.persist.ConnectionManager$1.cleanup(ConnectionManager.java:268) at com.goldencode.p2j.persist.ConnectionManager$1.cleanup(ConnectionManager.java:254) at com.goldencode.p2j.security.ContextLocal$Wrapper.cleanup(ContextLocal.java:697) at com.goldencode.p2j.security.SecurityContext.reset(SecurityContext.java:623) at com.goldencode.p2j.security.SecurityContext.cleanupWorker(SecurityContext.java:602) at com.goldencode.p2j.security.SecurityContext.cleanupWorker(SecurityContext.java:527) at com.goldencode.p2j.security.SecurityContext.cleanup(SecurityContext.java:505) - locked <0x00000007978ebfe8> (a java.lang.Object) at com.goldencode.p2j.security.SecurityManager.terminateSessionWorker(SecurityManager.java:3988) at com.goldencode.p2j.security.SecurityManager.terminateSession(SecurityManager.java:3892) at com.goldencode.p2j.net.BaseSession.cleanupContext(BaseSession.java:415) at com.goldencode.p2j.net.BaseSession.end(BaseSession.java:463) at com.goldencode.p2j.net.SessionManager.endSession(SessionManager.java:1256) at com.goldencode.p2j.net.SessionManager.queueStopped(SessionManager.java:1695) - locked <0x00000006c1366508> (a java.lang.Object) - locked <0x00000007971b0958> (a com.goldencode.p2j.net.Queue) at com.goldencode.p2j.net.Queue.stop(Queue.java:465) - locked <0x00000007971b0958> (a com.goldencode.p2j.net.Queue) at com.goldencode.p2j.net.Protocol.stopQueue(Protocol.java:382) at com.goldencode.p2j.net.Protocol.access$600(Protocol.java:154) at com.goldencode.p2j.net.Protocol$Reader.run(Protocol.java:487) - locked <0x00000007978ec060> (a java.lang.Object) at java.lang.Thread.run(Thread.java:748) "Conversation [0000009B:syman]": at com.goldencode.p2j.net.SessionManager.deregisterSession(SessionManager.java:1355) - waiting to lock <0x00000006c1366508> (a java.lang.Object) - locked <0x00000007971889b8> (a com.goldencode.p2j.net.Queue) at com.goldencode.p2j.net.RouterSessionManager.deregisterSession(RouterSessionManager.java:987) at com.goldencode.p2j.net.SessionManager.endSession(SessionManager.java:1218) at com.goldencode.p2j.net.SessionManager.terminateVirtual(SessionManager.java:524) - locked <0x00000007972d4440> (a java.lang.Object) - locked <0x00000006c1366218> (a com.goldencode.p2j.net.RouterSessionManager) at com.goldencode.p2j.net.RequesterSession.terminate(RequesterSession.java:141) at com.goldencode.p2j.persist.ConnectionManager.disconnectImmediately(ConnectionManager.java:2369) at com.goldencode.p2j.persist.ConnectionManager.disconnectImmediately(ConnectionManager.java:2331) at com.goldencode.p2j.persist.ConnectionManager.sessionClosed(ConnectionManager.java:2310) at com.goldencode.p2j.persist.ConnectionManager.access$100(ConnectionManager.java:246) at com.goldencode.p2j.persist.ConnectionManager$1.cleanup(ConnectionManager.java:268) at com.goldencode.p2j.persist.ConnectionManager$1.cleanup(ConnectionManager.java:254) at com.goldencode.p2j.security.ContextLocal$Wrapper.cleanup(ContextLocal.java:697) at com.goldencode.p2j.security.SecurityContext.reset(SecurityContext.java:623) at com.goldencode.p2j.security.SecurityContext.cleanupWorker(SecurityContext.java:602) at com.goldencode.p2j.security.SecurityContext.cleanupWorker(SecurityContext.java:527) at com.goldencode.p2j.security.SecurityContext.cleanup(SecurityContext.java:505) - locked <0x000000077fd6d038> (a java.lang.Object) at com.goldencode.p2j.security.SecurityManager.endContext(SecurityManager.java:7603) at com.goldencode.p2j.security.SecurityManager.popContextWorker(SecurityManager.java:7541) at com.goldencode.p2j.security.SecurityManager.popAndRestoreSecurityContext(SecurityManager.java:4663) at com.goldencode.p2j.net.RouterSessionManager.restoreContext(RouterSessionManager.java:1163) at com.goldencode.p2j.net.Conversation.run(Conversation.java:263) at java.lang.Thread.run(Thread.java:748)
I think the SecurityManager.cleanupLocks
map should have as keys the queue, not the context - this is because for multiplexed queues, there could be two different contexts cleaning up, as seen above: 0000009A:syman
vs 0000009B:syman
.
#11 Updated by Constantin Asofiei over 6 years ago
3383a rev 11224 is the second attempt at fixing this deadlock - it sets a common lock (Derived from queue) for all sessions (including virtual sessions), so that it will be used by context cleanup.
#12 Updated by Constantin Asofiei over 6 years ago
I've let it running 10 times in a loop over the weekend, only ctrl-c - it managed to run without deadlocks, but only one run was fully passed.
#13 Updated by Greg Shah over 6 years ago
but only one run was fully passed.
Were these failures the commonly seen issues in the 3-way tests?
#14 Updated by Constantin Asofiei over 6 years ago
Greg Shah wrote:
but only one run was fully passed.
Were these failures the commonly seen issues in the 3-way tests?
Yes
#15 Updated by Greg Shah over 6 years ago
Code Review Task Branch 3383a Revision 11224
I'm good with the deadlock fix.
#16 Updated by Greg Shah over 6 years ago
- % Done changed from 90 to 100
- Status changed from WIP to Closed
This can be closed, right?
#17 Updated by Constantin Asofiei over 6 years ago
Greg Shah wrote:
This can be closed, right?
Yes, final fix was in 3383a which was merged to trunk as revision 11227.