
Bug #4071

FWD server stops allowing new web client logins after abend

Added by Eric Faulhaber about 5 years ago. Updated about 1 year ago.

Status:
New
Priority:
Normal
Assignee:
-
Target version:
-
Start date:
Due date:
% Done:

0%

billable:
No
vendor_id:
GCD
case_num:
version:

Related issues

Related to Database - Feature #4012: replacing TCP/IP (AF_INET) sockets with Unix Domain Sockets (AF_UNIX) Review
Related to Base Language - Bug #7193: READ-JSON is not supported for the "LONGCHAR" source type Closed

History

#1 Updated by Eric Faulhaber about 5 years ago

I've recently noticed that after some abnormal ends, I can no longer log into the web client without restarting the FWD server. Attempts time out and result in the "Cannot start the embedded server." message.

It does not occur consistently. I took a thread dump on the most recent occurrence and found this:

"Reader [00000003:bogus]" #78 daemon prio=5 os_prio=0 tid=0x00007f2c0000c800 nid=0x21fc runnable [0x00007f2c58d25000]
   java.lang.Thread.State: RUNNABLE
        at org.newsclub.net.unix.NativeUnixSocket.read(Native Method)
        at org.newsclub.net.unix.AFUNIXSocketImpl$AFUNIXInputStream.read(AFUNIXSocketImpl.java:184)
        at org.postgresql.core.VisibleBufferedInputStream.readMore(VisibleBufferedInputStream.java:140)
        at org.postgresql.core.VisibleBufferedInputStream.ensureBytes(VisibleBufferedInputStream.java:109)
        at org.postgresql.core.VisibleBufferedInputStream.read(VisibleBufferedInputStream.java:67)
        at org.postgresql.core.PGStream.receiveChar(PGStream.java:306)
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1952)
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:308)
        - locked <0x00000006dc314288> (a org.postgresql.core.v3.QueryExecutorImpl)
        at org.postgresql.jdbc.PgConnection.executeTransactionCommand(PgConnection.java:755)
        at org.postgresql.jdbc.PgConnection.rollback(PgConnection.java:799)
        at com.mchange.v2.c3p0.impl.NewProxyConnection.rollback(NewProxyConnection.java:1033)
        - locked <0x000000076b6abc58> (a com.mchange.v2.c3p0.impl.NewProxyConnection)
        at org.hibernate.engine.transaction.internal.jdbc.JdbcTransaction.doRollback(JdbcTransaction.java:163)
        at org.hibernate.engine.transaction.spi.AbstractTransactionImpl.rollback(AbstractTransactionImpl.java:209)
        at com.goldencode.p2j.persist.Persistence$Context.rollback(Persistence.java:5343)
        at com.goldencode.p2j.persist.Persistence$Context.cleanup(Persistence.java:6178)
        at com.goldencode.p2j.persist.Persistence$1.cleanup(Persistence.java:735)
        at com.goldencode.p2j.persist.Persistence$1.cleanup(Persistence.java:731)
        at com.goldencode.p2j.security.ContextLocal$Wrapper.cleanup(ContextLocal.java:779)
        at com.goldencode.p2j.security.SecurityContext.reset(SecurityContext.java:651)
        at com.goldencode.p2j.security.SecurityContext.cleanupWorker(SecurityContext.java:630)
        at com.goldencode.p2j.security.SecurityContext.cleanupWorker(SecurityContext.java:555)
        at com.goldencode.p2j.security.SecurityContext.cleanup(SecurityContext.java:533)
        - locked <0x00000006dc3eea20> (a java.lang.Object)
        at com.goldencode.p2j.security.SecurityManager.terminateSessionWorker(SecurityManager.java:4263)
        at com.goldencode.p2j.security.SecurityManager.terminateSession(SecurityManager.java:4166)
        at com.goldencode.p2j.net.BaseSession.cleanupContext(BaseSession.java:415)
        at com.goldencode.p2j.net.BaseSession.end(BaseSession.java:463)
        at com.goldencode.p2j.net.SessionManager.endSession(SessionManager.java:1274)
        at com.goldencode.p2j.net.SessionManager.queueStopped(SessionManager.java:1718)
        - locked <0x0000000697209470> (a java.lang.Object)
        - locked <0x00000006dc305418> (a com.goldencode.p2j.net.Queue)
        at com.goldencode.p2j.net.Queue.stop(Queue.java:496)
        - locked <0x00000006dc305418> (a com.goldencode.p2j.net.Queue)
        at com.goldencode.p2j.net.Protocol.stopQueue(Protocol.java:397)
        at com.goldencode.p2j.net.Protocol.access$600(Protocol.java:155)
        at com.goldencode.p2j.net.Protocol$Reader.run(Protocol.java:502)
        - locked <0x00000006dc334988> (a java.lang.Object)
        at java.lang.Thread.run(Thread.java:748)

The abnormal end in this case was not caused by a persistence error. It was caused by an "invalid frame ID" error. However, it seems that in the process of cleaning up the user's context (on the Reader thread), we attempted a rollback of the current transaction, and we got stuck in the native method org.newsclub.net.unix.NativeUnixSocket.read, listening for a response from the PostgreSQL server. It does seem that this behavior started around the time we implemented the use of AF_UNIX sockets to connect to the database. I don't recall this happening with TCP/IP socket connections to the database.

Possibly just a bug in the org.newsclub.net socket code?

In the server log, I see this:

[05/06/2019 00:26:39 EDT] (com.goldencode.p2j.main.WebClientsManager:SEVERE) The port '7449' couldn't be freed because [it] is still open at OS level and the client uuid='b114ad06-0682-4648-8f44-4adc81f476bc' is not re-using it.

Is the hung Reader thread preventing port 7449 from being released, because context cleanup is never finishing?

An easy workaround seems to be to go back to using TCP/IP sockets with the database. I hate to lose the performance improvement from using Unix sockets here, but clearly we can't have this occur in production. Thoughts?
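
An intermediate option short of reverting to TCP/IP might be to bound the blocking read itself. The PostgreSQL JDBC driver has a socketTimeout connection property (in seconds) that puts a ceiling on any blocking socket read; if the junixsocket factory honors setSoTimeout, a hung rollback would then fail with an exception instead of blocking forever. This is only a sketch: the socketFactory/socketFactoryArg values are illustrative junixsocket names, and whether socketTimeout is respected on the AF_UNIX path would need to be verified.

```java
import java.util.Properties;

public class PgTimeoutConfig
{
   // Builds PostgreSQL JDBC connection properties with a bounded socket read
   // timeout.  "socketTimeout" is a standard PostgreSQL JDBC property
   // (seconds); the socketFactory values are illustrative junixsocket names.
   public static Properties boundedReadProps(String socketPath, int timeoutSecs)
   {
      Properties props = new Properties();
      props.setProperty("socketFactory",
                        "org.newsclub.net.unix.AFUNIXSocketFactory$FactoryArg");
      props.setProperty("socketFactoryArg", socketPath);

      // Bound any blocking read on the wire; 0 means wait forever, which is
      // the behavior that pins the Reader thread in NativeUnixSocket.read.
      props.setProperty("socketTimeout", Integer.toString(timeoutSecs));
      return props;
   }

   public static void main(String[] args)
   {
      Properties p = boundedReadProps("/var/run/postgresql/.s.PGSQL.5432", 30);
      System.out.println(p.getProperty("socketTimeout")); // prints 30
   }
}
```

This would not fix the threading problem, but it might keep one stuck read from pinning context cleanup indefinitely.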

#2 Updated by Constantin Asofiei about 5 years ago

Eric Faulhaber wrote:

Is the hung Reader thread preventing port 7449 from being released, because context cleanup is never finishing?

Yes, if you are using a single port (7449), a new web client can't be spawned unless this port is released - and if the thread is hung, then no other clients will be able to login.

#3 Updated by Greg Shah about 5 years ago

Possibly just a bug in the org.newsclub.net socket code?

Yes, it is possible.

However, this database processing is occurring on the Protocol$Reader thread which seems wrong. I don't think that thread should ever be connecting to the database. If there is no open connection with the database then it seems that any usage would result in undefined (probably bad) behavior. I wonder if that is why com.mchange.v2.c3p0.impl.NewProxyConnection is on the stack.

#4 Updated by Eric Faulhaber about 5 years ago

This is persistence context cleanup code. At the persistence level, we don't have any control as to the thread on which it runs. Part of the cleanup is to rollback any transaction which is open at the time the context ends. Hence the rollback and the call into the database. I guess something in the code which drives the context cleanup would have to change in order to guarantee this runs on the conversation thread.

#5 Updated by Eric Faulhaber about 5 years ago

The reason I assumed it was something in the new socket code is that we've had many abnormal ends in the past, where this cleanup code probably has run on a non-conversation thread. But it is only recently (starting probably around the time we switched to the AF_UNIX sockets) that this started being a problem with locking out new logins.

#6 Updated by Greg Shah about 5 years ago

I'm not saying you are incorrect. It may be a new "feature" of the UNIX socket code. But I think it is an invalid approach to roll back a transaction owned by one thread from another thread. If I'm missing something here, let me know.

Otherwise we probably need to ensure any rollback is executed on the conversation thread.

#7 Updated by Eric Faulhaber about 5 years ago

Greg Shah wrote:

Otherwise we probably need to ensure any rollback is executed on the conversation thread.

Agreed. I didn't actually realize this was happening on the reader thread until I took the thread dump. Do we have a helper/API to ensure something runs on the current context's conversation thread?

#8 Updated by Greg Shah about 5 years ago

Otherwise we probably need to ensure any rollback is executed on the conversation thread.

Agreed. I didn't actually realize this was happening on the reader thread until I took the thread dump. Do we have a helper/API to ensure something runs on the current context's conversation thread?

Not exactly. We do have Control.interrupt() which is a static method that will interrupt the conversation thread. This generates a STOP condition.

#9 Updated by Constantin Asofiei about 5 years ago

Greg Shah wrote:

Otherwise we probably need to ensure any rollback is executed on the conversation thread.

Agreed. I didn't actually realize this was happening on the reader thread until I took the thread dump. Do we have a helper/API to ensure something runs on the current context's conversation thread?

Not exactly. We do have Control.interrupt() which is a static method that will interrupt the conversation thread. This generates a STOP condition.

And to add something: we have 'critical sections' (like rollback/commit) which can't be interrupted by this STOP condition; the STOP will be postponed until the critical section finishes.
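
A minimal model of this deferral, assuming a simple flag-based scheme (a sketch; FWD's actual critical-section mechanism is not shown here):

```java
public class CriticalSection
{
   private boolean inCritical = false;
   private boolean stopPending = false;

   // Called on behalf of Control.interrupt(): deliver the STOP now,
   // or defer it if a critical section (e.g. rollback/commit) is active.
   public synchronized boolean requestStop()
   {
      if (inCritical)
      {
         stopPending = true;   // postpone until the critical section ends
         return false;         // not delivered yet
      }
      return true;             // deliver immediately
   }

   public synchronized void enter()
   {
      inCritical = true;
   }

   // Returns true if a deferred STOP must now be raised.
   public synchronized boolean exit()
   {
      inCritical = false;
      boolean raise = stopPending;
      stopPending = false;
      return raise;
   }
}
```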

#10 Updated by Constantin Asofiei almost 5 years ago

Eric, let's 'switch' the question a little: why are we not already rolling back the tx on the Conversation thread in the case of an abend?

I'm looking at Persistence.cleanup and it has this code:

         if (session != null && session.isOpen())
         {
            if (isTransactionOpen(false))

Or, if we are already rolling back properly in the case of an abend, who opens the tx (again) during the cleanup process, and why?

#11 Updated by Greg Shah almost 5 years ago

Another question: do we ever need to run session cleanup on the Reader or Writer threads? If it always runs on the Conversation thread, that is enough, right?

#12 Updated by Eric Faulhaber almost 5 years ago

Greg Shah wrote:

Another question: do we ever need to run session cleanup on the Reader or Writer threads? If it always runs on the Conversation thread, that is enough, right?

From my (persistence) perspective, Conversation thread is enough, and in fact any cleanup code I've written for persistence classes has had the implicit assumption that it was running on the Conversation thread.

I'm not sure about the bigger picture question as to whether any cleanup needs to run on non-Conversation threads.

#13 Updated by Eric Faulhaber almost 5 years ago

Constantin Asofiei wrote:

Eric, let's 'switch' the question a little: why are we not already rolling back the tx on the Conversation thread in the case of an abend?

I'm looking at Persistence.cleanup and it has this code:
[...]

Or, if we are already rolling back properly in the case of an abend, who opens the tx (again) during the cleanup process, and why?

I don't know the answers. I wrote this code thinking we were rolling back on the Conversation thread. I don't know how this code comes to run on the Reader thread.

Note the comment:

               // Under normal circumstances, there should not be any tx open
               // at the context's end of life, but just in case...

So, this was meant to be a last-ditch effort to clean up the session. I don't actually remember the genesis of this cleanup code. Perhaps I found at one point that connections were being released back to the connection pool with transactions still open?

Anyway, we've been dealing with abends for years, but they never locked up the server before. This is a recent phenomenon. Did something change in the last month or two with the session cleanup mechanics, such that cleanup now runs on the Reader thread where it previously ran on the Conversation thread? It also is possible that something changed with the transaction logic, such that we end up in this rollback code whereas before we didn't. Or it could be that there is something about the AF_UNIX socket code which is problematic in this scenario, such that we don't return from the NativeUnixSocket.read call.

Based solely on the timing of when this problem started, I suspect the introduction of the AF_UNIX code.

Eric Faulhaber wrote:

Is the hung Reader thread preventing port 7449 from being released, because context cleanup is never finishing?

Yes, if you are using a single port (7449), a new web client can't be spawned unless this port is released - and if the thread is hung, then no other clients will be able to login.

I'm using the configuration that originally was set up for a particular app; I did not set it up myself. Is there a way to use more than one port? Even so, if we did this, it seems we would just be working around the problem, and we would eventually have a resource problem with a bunch of Reader threads stuck in NativeUnixSocket.read.

#14 Updated by Eric Faulhaber almost 5 years ago

4045a/11327 has an attempted workaround for this issue, but I'm not satisfied that it is a good solution. In Persistence$Context.cleanup, we check whether the current thread's name begins with "Conversation" and abort the cleanup if it does not. I have no reliable recreate for the hang, so I have not been able to test this workaround. It will log a warning and a stack trace if it gets into this situation.
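
A sketch of what such a guard might look like (hypothetical class and logger names; the actual 4045a/11327 change is not reproduced here):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class CleanupGuard
{
   private static final Logger LOG = Logger.getLogger(CleanupGuard.class.getName());

   // Only the Conversation thread is allowed to run persistence context
   // cleanup (rollback, session close); anything else aborts with a warning.
   public static boolean allowCleanup()
   {
      return allowCleanup(Thread.currentThread().getName());
   }

   static boolean allowCleanup(String threadName)
   {
      if (!threadName.startsWith("Conversation"))
      {
         // Log a warning plus a stack trace so stray cleanups are visible.
         LOG.log(Level.WARNING,
                 "Aborting persistence context cleanup on non-Conversation thread: "
                 + threadName,
                 new Throwable("cleanup call site"));
         return false;
      }
      return true;
   }
}
```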

#15 Updated by Constantin Asofiei almost 5 years ago

Eric Faulhaber wrote:

4045a/11327 has an attempted workaround for this issue, but I'm not satisfied that it is a good solution. In Persistence$Context.cleanup, we check whether the current thread's name begins with "Conversation" and abort the cleanup if it does not.

You can use Queue.isConversationThread instead of the thread name. Also, can you add similar logging to record when a session/tx is opened while not on the Conversation thread?

#16 Updated by Eric Faulhaber over 4 years ago

  • Related to Feature #4012: replacing TCP/IP (AF_INET) sockets with Unix Domain Sockets (AF_UNIX) added

#17 Updated by Greg Shah almost 4 years ago

I think we need to consider if there is ever a scenario in which context cleanup is needed for reader/writer threads. Perhaps this is needed for virtual sessions? For a normal interactive user session, I think cleanup should not be called on reader/writer threads.

Constantin: What are your thoughts?

#18 Updated by Greg Shah almost 4 years ago

I think we also need to exclude context-cleanup from happening on RelatedThread instances. Such threads share the context with the main thread.

#19 Updated by Constantin Asofiei almost 4 years ago

I agree that all context cleanup for interactive sessions must be on the main Conversation thread.

For the virtual session case, things are a little more complicated. Explicit virtual session termination is done on the Dispatcher thread, when the requester calls the SessionManager.terminateRemote method; the context cleanup will be performed there.

But, for virtual or non-virtual session termination caused by an 'unplanned' socket disconnect, the Reader (or Writer) thread will initiate the session termination:
  • for interactive sessions, it should interrupt and wait for the Conversation thread to finish and cleanup the context, too.
  • for the virtual session case, there is no other thread which would want to clean up the context, so I think it is OK to let whichever thread (Reader/Writer) picked up the cleanup complete it.

#20 Updated by Constantin Asofiei almost 4 years ago

And there's one more case to consider for the 'unplanned' socket termination, for virtual sessions: if there is an active Dispatcher request doing work for that session. I can't tell how this will work, without some debugging.

#21 Updated by Greg Shah almost 4 years ago

for interactive sessions, it should interrupt and wait for the Conversation thread to finish and cleanup the context, too.

By "and cleanup the context, too", do you mean to say that the reader/writer thread should also wait for the cleanup to be finished?

Rules

  • Reader/Writer
    • On "unplanned" socket disconnect:
      • If it is an interactive session, then interrupt the Conversation thread and wait for that thread to complete context cleanup. Then exit without running context cleanup.
      • If it is a virtual session:
        • If there is no active request being processed, run context cleanup.
      • If there is an active request being processed, it is unclear; I suspect we need to interrupt it and block until context cleanup is executed on the Dispatcher thread.
    • On "planned" session exit:
      • Block until the Conversation thread or Dispatcher thread finishes context cleanup.
      • Then terminate without running any context cleanup.
  • Conversation - always expects to run context cleanup.
  • RelatedThread
    • We may need to run cleanup on things that are only used on this thread. I have to think about this a bit more.
    • Normal context cleanup definitely should be avoided.
  • AssociatedThread - always expects to run context cleanup.
  • Dispatcher - always expects to run context cleanup.

Are there any cases I have forgotten or gotten wrong?
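
Condensed into a single decision function, the rules above might look like this (the thread-kind and action names are illustrative, not FWD's actual types):

```java
public class CleanupPolicy
{
   enum ThreadKind { READER_WRITER, CONVERSATION, RELATED, ASSOCIATED, DISPATCHER }

   enum Action { RUN_CLEANUP, WAIT_THEN_EXIT, SKIP_CLEANUP }

   // Decide whether the exiting thread should run context cleanup itself,
   // wait for another thread to run it, or skip it entirely.
   static Action decide(ThreadKind kind, boolean virtualSession, boolean activeRequest)
   {
      switch (kind)
      {
         case CONVERSATION:
         case ASSOCIATED:
         case DISPATCHER:
            // These always expect to run context cleanup.
            return Action.RUN_CLEANUP;
         case RELATED:
            // Shares the main thread's context; must not run normal cleanup.
            return Action.SKIP_CLEANUP;
         case READER_WRITER:
            if (!virtualSession)
            {
               // Interactive: interrupt the Conversation thread and wait
               // for it to complete cleanup, then exit without cleaning up.
               return Action.WAIT_THEN_EXIT;
            }
            // Virtual session: no other thread will clean up unless a
            // Dispatcher request is active (the open question in the rules).
            return activeRequest ? Action.WAIT_THEN_EXIT : Action.RUN_CLEANUP;
         default:
            throw new IllegalStateException("unknown thread kind");
      }
   }
}
```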

#22 Updated by Constantin Asofiei almost 4 years ago

Greg Shah wrote:

for interactive sessions, it should interrupt and wait for the Conversation thread to finish and cleanup the context, too.

By "and cleanup the context, too", do you mean to say that the reader/writer thread should also wait for the cleanup to be finished?

Correct.

...
Are there any cases I have forgotten or gotten wrong?

I don't think so.

#24 Updated by Eric Faulhaber over 2 years ago

I was closing the web client of a large application in virtual desktop mode, debugging the update of the meta_connect and related metadata tables during database disconnection. I set a breakpoint in ConnectTableUpdater.databaseDisconnected and when I hit it, I was surprised to find that the Conversation thread was gone and I was on the associated Reader thread. Here's the stack trace from eclipse:

Daemon Thread [Reader [00000016:bogus]] (Suspended)    
    owns: Object  (id=1301)    
    owns: Queue  (id=1302)    
    owns: Object  (id=1303)    
    owns: Object  (id=1304)    
    ConnectTableUpdater.databaseDisconnected(int64) line: 403    
    ConnectionManager$ConnectionDispatcher.fireDisconnected(ConnectionManager$ConnectInfo) line: 4181    
    ConnectionManager$ConnectionDispatcher.cleanup() line: 4171    
    ConnectionManager$2.cleanup(ConnectionManager$ConnectionDispatcher) line: 326    
    ConnectionManager$2.cleanup(Object) line: 316    
    ContextLocal$Wrapper.cleanup() line: 779    
    SecurityContext.reset(Object) line: 688    
    SecurityContext.cleanupWorker(boolean, Map<ContextKey,Object>) line: 667    
    SecurityContext.cleanupWorker(boolean) line: 592    
    SecurityContext.cleanup() line: 570    
    SecurityManager.terminateSessionWorker(Session) line: 4314    
    SecurityManager.terminateSession(Object) line: 4218    
    DirectSession(BaseSession).cleanupContext() line: 415    
    DirectSession(BaseSession).end() line: 463    
    RouterSessionManager(SessionManager).endSession(BaseSession, boolean, boolean) line: 1354    
    RouterSessionManager(SessionManager).queueStopped(Queue) line: 1798    
    Queue.stop(Exception, boolean) line: 497    
    Protocol.stopQueue(Exception) line: 423    
    Protocol.access$700(Protocol, Exception) line: 165    
    Protocol$Reader.run() line: 532    
    Thread.run() line: 748    

Just wanted to preserve this for posterity, as I cannot reproduce it regularly. Mostly, the cleanup happens on the Conversation thread.

#25 Updated by Greg Shah over 2 years ago

I think we need to consider if there is ever a scenario in which context cleanup is needed for reader/writer threads. Perhaps this is needed for virtual sessions?

I've been thinking about this. For interactive sessions, we do need the reader and writer threads to have a context. It is needed to implement state synchronization, which calls into the LogicalTerminal to get changes on the writer thread and to apply changes on the reader thread. That must run on a thread with a security context.

I'm not as sure that it is needed for virtual sessions. We don't use state synchronization there. Any actual logic executed by a virtual session will be run on a dispatcher thread.

For a normal interactive user session, I think cleanup should not be called on reader/writer threads.

I still agree with this.

Looking more closely at the context cleanup processing, the reason for this seems clear. As noted above, reader/writer threads need a context. So they must call SecurityManager.terminateSession() when they exit. It is required to maintain proper session state for the SecurityManager. This is correctly called from BaseSession.end().

From SecurityManager.terminateSession(), we properly call SecurityManager.terminateSessionWorker() and it is this code that calculates whether or not cleanup will be executed:

      // unassign context and do cleanup if it's time
      boolean finalUse = unassignContext(ctx);
      if (finalUse)
      {
         // context cleanup happens in here ...

unassignContext() calls SecurityContext.unassign(), which does a return use.decrementAndGet() == 0. In other words, it will calculate finalUse as true only in the scenario where the Reader or Writer thread is the last running thread which is using the context.

We have a race condition here. The Conversation thread must be exiting before the Reader thread, and when the Conversation thread exits it is not the last thread using that context, so it doesn't run cleanup. The Reader thread comes along later, sees that it is the last thread, and so it does run context cleanup. From the perspective of the security subsystem, it is working as designed.
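
Reduced to its essentials, the finalUse calculation is a shared use count, which makes the race easy to see in isolation (a sketch; SecurityContext's real bookkeeping is more involved):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ContextRefCount
{
   private final AtomicInteger use = new AtomicInteger();

   public void assign()
   {
      use.incrementAndGet();
   }

   // Mirrors SecurityContext.unassign(): true only for the last user out,
   // and only that caller runs context cleanup.
   public boolean unassign()
   {
      return use.decrementAndGet() == 0;
   }

   public static void main(String[] args)
   {
      ContextRefCount ctx = new ContextRefCount();
      ctx.assign(); // Conversation
      ctx.assign(); // Reader
      ctx.assign(); // Writer

      // Whichever thread calls unassign() last sees finalUse == true and
      // runs cleanup, regardless of what kind of thread it is.
      System.out.println(ctx.unassign()); // Conversation exits first: false
      System.out.println(ctx.unassign()); // Writer exits: false
      System.out.println(ctx.unassign()); // Reader exits last: true
   }
}
```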

My ideas in #4071-21 are wrong. They are backwards. We really need the Conversation thread to wait until the Reader and Writer threads have exited. Then it should terminate its session, which will cause cleanup to occur. I think the Conversation thread should also wait for the RelatedThreads to exit.

For virtual sessions, we need to check if the reader/writer threads have context in those cases. As far as I know, those threads should not have context in that case. If that is correct, then there is no issue, because cleanup will never happen there and the normal processing will occur for Dispatcher threads. If the reader/writer threads do have context for virtual sessions, then we need to consider why, and whether we can eliminate it.
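
The corrected ordering, where the Conversation thread waits for the Reader and Writer threads before releasing its own use of the context, can be sketched like this (hypothetical names; the "main" thread plays the Conversation role):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class OrderedShutdown
{
   // One use each for the conversation, reader and writer threads.
   static final AtomicInteger use = new AtomicInteger(3);
   static volatile String cleanupThread = null;

   static void exitThread()
   {
      if (use.decrementAndGet() == 0)
      {
         // Stand-in for context cleanup; record who ran it.
         cleanupThread = Thread.currentThread().getName();
      }
   }

   public static void main(String[] args) throws InterruptedException
   {
      Thread reader = new Thread(OrderedShutdown::exitThread, "Reader");
      Thread writer = new Thread(OrderedShutdown::exitThread, "Writer");
      reader.start();
      writer.start();

      // Conversation thread: join the reader/writer BEFORE unassigning its
      // own context use, so it is guaranteed to be the last user and cleanup
      // always lands on this thread.
      reader.join();
      writer.join();
      exitThread();

      System.out.println(cleanupThread); // always "main"
   }
}
```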

#27 Updated by Greg Shah over 1 year ago

Eric: Do you recall the details on how this problem can be recreated? Tijs is asking for guidance in #6787-70.

If the race condition I identified in #4071-25 is the root cause, then the problem is going to be somewhat random. The timing of certain systems may make it hard to recreate.

#28 Updated by Eric Faulhaber over 1 year ago

Unfortunately, I never found a reliable way to recreate this condition, else it probably would be fixed already.

#29 Updated by Tijs Wickardt over 1 year ago

Okay. I'll try to hammer the system (testcases and/or <app>) concurrently, trying to overlap client shutdowns.

#30 Updated by Tijs Wickardt over 1 year ago

I'll also try to force the race condition Greg detected using a small sleep.

#31 Updated by Greg Shah about 1 year ago

  • Related to Bug #7193: READ-JSON is not supported for the "LONGCHAR" source type added

#32 Updated by Greg Shah about 1 year ago

I think the testcase in #7193-62 is another example of this same issue. Context cleanup should never run on the reader thread.
