Project

General

Profile

Bug #3102

P2J client consumes all CPU in pseudoTerminalWait across multiple threads

Added by Eric Faulhaber almost 8 years ago. Updated over 7 years ago.

Status:
Closed
Priority:
Normal
Start date:
Due date:
% Done:

100%

billable:
No
vendor_id:
GCD
case_num:

client_activity.png (133 KB) Eric Faulhaber, 05/06/2016 07:38 PM

pseudoTerminalWait.png (110 KB) Eric Faulhaber, 05/06/2016 07:38 PM

threaddump-1462576146528.tdump (20.9 KB) Eric Faulhaber, 05/06/2016 07:38 PM

History

#1 Updated by Eric Faulhaber almost 8 years ago

I just ran through a batch test group in appserver mode. After the test group was finished, the server was completely idle, but one of the clients was still running flat out, using nearly 100% of my machine's CPU across all cores.

The P2J client had no child processes. I let it run for a while. It showed no signs of slowing after 10 minutes. I took a thread dump (attached) and did some CPU sampling to see where it was spending its time. It was in the native LaunchManager.pseudoTerminalWait method:

The client stopped when I killed the server, but was running continually until then.

#2 Updated by Eugenie Lyzenko almost 8 years ago

Eric,

1. Can you tell what command/program is launching inside child session?
2. Can the issue be reproduced permanently or only from time to time?

#3 Updated by Greg Shah almost 8 years ago

The problem is most likely with the newly added looping around the waitpid() call. I suspect we are missing an exit condition.

Eric: it is very important to know the recreate details ASAP.
Eugenie: this is your top priority as soon as you know how to recreate it.

#4 Updated by Eric Faulhaber almost 8 years ago

Eugenie Lyzenko wrote:

1. Can you tell what command/program is launching inside child session?

No, the Java client had no child processes. Based on the evidence above, I think it was looping infinitely in the native implementation of pseudoTerminalWait.

2. Can the issue be reproduced permanently or only from time to time?

I've only run the batch test group which triggered this once since the #2738 fix went in. It consists of 149 batch tests (some of them long-running) and took ~2 hours to complete. I haven't run it since, so I don't know if it's repeatable or not. Unfortunately, I don't know of any other way to recreate this.

I essentially did this in #2951-18, but with the equivalent (i.e., same code, but before I merged to trunk) of P2J trunk rev. 11024.

#5 Updated by Greg Shah almost 8 years ago

No, the Java client had no child processes.

It had one at one time otherwise we wouldn't be looping in the pseudoTerminalWait().

What is the tdump binary format? We need to see the thread dumps and know which child process was launched. Whatever it was, triggers some unexpected result where the child process exited but we don't detect it.

#6 Updated by Eric Faulhaber almost 8 years ago

Greg Shah wrote:

No, the Java client had no child processes.

It had one at one time otherwise we wouldn't be looping in the pseudoTerminalWait().

Of course, but I meant it no longer had any child processes after the batch tests had run, at the time I noticed the P2J client consuming the CPU.

What is the tdump binary format?

It's not binary, just a text file. tdump is just the default extension VisualVM assigned when I saved it.

#7 Updated by Eugenie Lyzenko almost 8 years ago

I think we getting into loop when res == -1(no child process exists) because we suppressed the stop/continue signals by

...
   signal(SIGCHLD, SIG_IGN);
...

in pseudoTerminalLaunch(). And in this can we could not have the proper status variable set, so loop exit conditions are not matching. I would suggest to change the loop this way adding result variable to check such a cases:
...
JNIEXPORT void JNICALL pseudoTerminalWait(JNIEnv* env, jclass cls, int pid, jboolean silent)
{
   pid_t child = (pid_t) pid;
   int   status;
   int rcWaitpid;

   do
   {
      rcWaitpid = waitpid(child, &status, WUNTRACED | WCONTINUED);
   }
   while (!WIFEXITED(status) && !WIFSIGNALED(status) && rcWaitpid != -1);
...

This will provide exit if no more child session.

#8 Updated by Greg Shah almost 8 years ago

It's not binary, just a text file. tdump is just the default extension VisualVM assigned when I saved it.

In the future, if you would please post it in Redmine as .txt, then we can view it in the browser itself.

The thread dump is weird. There are 8 threads in pseudoTerminalWait() and none that are in any clinet-side code waiting for the isAlive() notification.

"Thread-182" #197 prio=5 os_prio=0 tid=0x00007f5fe8891800 nid=0x257a runnable [0x00007f5fa4a40000]
   java.lang.Thread.State: RUNNABLE
        at com.goldencode.p2j.util.LaunchManager.pseudoTerminalWait(Native Method)
        at com.goldencode.p2j.util.LaunchManager.pseudoTerminalWait(LaunchManager.java:157)
        at com.goldencode.p2j.util.ProcessStream$Waiter.run(ProcessStream.java:1319)
        at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
        - None

"Thread-181" #196 prio=5 os_prio=0 tid=0x00007f5fe87fe800 nid=0x2564 runnable [0x00007f5fa4b41000]
   java.lang.Thread.State: RUNNABLE
        at com.goldencode.p2j.util.LaunchManager.pseudoTerminalWait(Native Method)
        at com.goldencode.p2j.util.LaunchManager.pseudoTerminalWait(LaunchManager.java:157)
        at com.goldencode.p2j.util.ProcessStream$Waiter.run(ProcessStream.java:1319)
        at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
        - None

"Thread-180" #195 prio=5 os_prio=0 tid=0x00007f5fe891a800 nid=0x2550 runnable [0x00007f5fa4c42000]
   java.lang.Thread.State: RUNNABLE
        at com.goldencode.p2j.util.LaunchManager.pseudoTerminalWait(Native Method)
        at com.goldencode.p2j.util.LaunchManager.pseudoTerminalWait(LaunchManager.java:157)
        at com.goldencode.p2j.util.ProcessStream$Waiter.run(ProcessStream.java:1319)
        at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
        - None

"Thread-179" #194 prio=5 os_prio=0 tid=0x00007f5fe8834000 nid=0x250c runnable [0x00007f5fa4d43000]
   java.lang.Thread.State: RUNNABLE
        at com.goldencode.p2j.util.LaunchManager.pseudoTerminalWait(Native Method)
        at com.goldencode.p2j.util.LaunchManager.pseudoTerminalWait(LaunchManager.java:157)
        at com.goldencode.p2j.util.ProcessStream$Waiter.run(ProcessStream.java:1319)
        at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
        - None

"Thread-178" #193 prio=5 os_prio=0 tid=0x00007f5fe80d0800 nid=0x250a runnable [0x00007f5fa4e44000]
   java.lang.Thread.State: RUNNABLE
        at com.goldencode.p2j.util.LaunchManager.pseudoTerminalWait(Native Method)
        at com.goldencode.p2j.util.LaunchManager.pseudoTerminalWait(LaunchManager.java:157)
        at com.goldencode.p2j.util.ProcessStream$Waiter.run(ProcessStream.java:1319)
        at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
        - None

"Thread-177" #192 prio=5 os_prio=0 tid=0x00007f5fe80c6000 nid=0x2508 runnable [0x00007f5fa4f45000]
   java.lang.Thread.State: RUNNABLE
        at com.goldencode.p2j.util.LaunchManager.pseudoTerminalWait(Native Method)
        at com.goldencode.p2j.util.LaunchManager.pseudoTerminalWait(LaunchManager.java:157)
        at com.goldencode.p2j.util.ProcessStream$Waiter.run(ProcessStream.java:1319)
        at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
        - None

"Thread-176" #191 prio=5 os_prio=0 tid=0x00007f5fe89f1800 nid=0x2505 runnable [0x00007f5fa5046000]
   java.lang.Thread.State: RUNNABLE
        at com.goldencode.p2j.util.LaunchManager.pseudoTerminalWait(Native Method)
        at com.goldencode.p2j.util.LaunchManager.pseudoTerminalWait(LaunchManager.java:157)
        at com.goldencode.p2j.util.ProcessStream$Waiter.run(ProcessStream.java:1319)
        at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
        - None

"Thread-175" #190 prio=5 os_prio=0 tid=0x00007f5fe8802800 nid=0x2503 runnable [0x00007f5fa5147000]
   java.lang.Thread.State: RUNNABLE
        at com.goldencode.p2j.util.LaunchManager.pseudoTerminalWait(Native Method)
        at com.goldencode.p2j.util.LaunchManager.pseudoTerminalWait(LaunchManager.java:157)
        at com.goldencode.p2j.util.ProcessStream$Waiter.run(ProcessStream.java:1319)
        at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
        - None

"Thread-174" #189 prio=5 os_prio=0 tid=0x00007f5fe8149800 nid=0x2501 runnable [0x00007f5fa5248000]
   java.lang.Thread.State: RUNNABLE
        at com.goldencode.p2j.util.LaunchManager.pseudoTerminalWait(Native Method)
        at com.goldencode.p2j.util.LaunchManager.pseudoTerminalWait(LaunchManager.java:157)
        at com.goldencode.p2j.util.ProcessStream$Waiter.run(ProcessStream.java:1319)
        at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
        - None

"Thread-173" #188 prio=5 os_prio=0 tid=0x00007f5fe8b37800 nid=0x24fe runnable [0x00007f5fa5349000]
   java.lang.Thread.State: RUNNABLE
        at com.goldencode.p2j.util.LaunchManager.pseudoTerminalWait(Native Method)
        at com.goldencode.p2j.util.LaunchManager.pseudoTerminalWait(LaunchManager.java:157)
        at com.goldencode.p2j.util.ProcessStream$Waiter.run(ProcessStream.java:1319)
        at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
        - None

"Thread-172" #187 prio=5 os_prio=0 tid=0x00007f5fe89ec000 nid=0x24fc runnable [0x00007f5fa585c000]
   java.lang.Thread.State: RUNNABLE
        at com.goldencode.p2j.util.LaunchManager.pseudoTerminalWait(Native Method)
        at com.goldencode.p2j.util.LaunchManager.pseudoTerminalWait(LaunchManager.java:157)
        at com.goldencode.p2j.util.ProcessStream$Waiter.run(ProcessStream.java:1319)
        at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
        - None

A common idiom is to launch a child process that outputs a single line (e.g. pwd). I guess what may be happening is that the server-side does a read of a single line (e.g. one import or set) and then moves on to other code, never trying to read again. The old approach would have exited the waiting thread and no one would ever reference the isAlive(), but that is OK. In the new failure mode, the waiting thread doesn't exit (in some cases) and then these threads accumulate.

And in this can we could not have the proper status variable set, so loop exit conditions are not matching. I would suggest to change the loop this way adding result variable to check such a cases

Eugenie: this seems reasonable, please go forward with this change.

#9 Updated by Eugenie Lyzenko almost 8 years ago

I know this is not good idea to mix several task into one. But due to urgency of this failure I can add the change into my current 3051b which has several steps completed and commit into the branch. So it can easily be tested if it works or not, then we can test 3051b and commit and then resume working on 3051 task with new 3051c branch.

Or I have to create separate 3102a branch for fixing this issue.

Let me know what do you think is better.

#10 Updated by Greg Shah almost 8 years ago

Please put it in 3102a.

#11 Updated by Eugenie Lyzenko almost 8 years ago

Created task branch 3102a from trunk revision 11025.

#12 Updated by Eugenie Lyzenko almost 8 years ago

Task branch 3102a for review updated to revision 11026.

The change is from note #7. Eric, is it possible to test it against the issue?

#13 Updated by Eric Faulhaber almost 8 years ago

Eugenie Lyzenko wrote:

The change is from note #7. Eric, is it possible to test it against the issue?

I don't know whether I can get to this today, but I will test it. Thanks.

#14 Updated by Greg Shah almost 8 years ago

Code Review Task Branch 3102a Revision 11026

I'm fine with the changes.

While Eric is testing this on the ETF, please go ahead and test this in MAJIC runtime regression testing.

#15 Updated by Eugenie Lyzenko almost 8 years ago

Greg Shah wrote:

While Eric is testing this on the ETF, please go ahead and test this in MAJIC runtime regression testing.

OK.

#16 Updated by Eugenie Lyzenko almost 8 years ago

The main part completed without regressions. Running the CTRL-C part.

#17 Updated by Eugenie Lyzenko almost 8 years ago

Testing completed, no regressions. The results - 3102a_11026_6c4b2ca_20160512_evl.zip.

#18 Updated by Greg Shah almost 8 years ago

Eric reports that the ETF batch tests (where this problem occurred) have successfully executed and the problem is gone.

Please merge 3102a to trunk.

#19 Updated by Eugenie Lyzenko almost 8 years ago

Branch 3102a was merged to trunk as revno 11026 then it was archived.

#20 Updated by Greg Shah almost 8 years ago

  • % Done changed from 0 to 100
  • Status changed from New to Closed
  • Assignee set to Eugenie Lyzenko

#21 Updated by Greg Shah over 7 years ago

  • Target version changed from Milestone 11 to Cleanup and Stablization for Server Features

Also available in: Atom PDF