nginx -s reload fails / workers in process-state H

Igor Sysoev igor at sysoev.ru
Wed Jun 9 18:25:58 MSD 2010


On Wed, Jun 09, 2010 at 03:46:41PM +0200, John Feuerstein wrote:

> Hello,
> 
> using nginx-0.8.40 on Linux 2.6.32 I've encountered a problem with
> nginx reloading, which leaves [nginx] processes in process state "H"
> behind. As soon as the first of these processes is left behind, nginx
> is unable to reload the workers and is unable to stop gracefully.
> (happens with 0.8.36 and 0.8.37 too)
> 
> Sending signals to the H-state processes won't do anything, leaving no
> other choice but to SIGKILL them.
> 
> This behavior is very seldom and I've tried to track it down while
> having nginx logging with log level debug. It finally happened after
> reloading nginx the 68th time.
> 
> The system was completely idle during the tests.
> 
> Note that I've first encountered this while using cron to reload the
> nginx configuration every *5 minutes*. I'm not entirely sure if this is
> triggered by some kind of race-condition, where the last nginx reload
> was not completely finished and the next one is issued on top.
> I doubt it though: The debug log I've created shows the full process
> list after a reload. AFAIK nginx changes some proctitles while in
> reload-state, which I could not find anymore just seconds after issuing
> the reload signal. In addition, the system was idle (no requests...), so
> the old workers should fade in no time.
> 
> The command used by cron to reload nginx is:
> 
> 	nginx -t && nginx -s reload
> 
> 
> Steps to reproduce it (after stopping Cron and nginx):
> 
> # Step 1: Start up nginx as usual, eg. /etc/init.d/nginx start
> # Step 2: Run the following script to issue the reload signal
> #         every 20 seconds:
> 
> 
> ------------------------------- snip ---------------------------------
> #!/bin/bash
> 
> log=/var/log/nginx/error_log
> 
> echo "XXXXXXXXXX ------- SHOWING INITIAL NGINX PROCESSES -------
> XXXXXXXXXX" >> "$log"
> ps auxf 2>&1 | grep nginx >> "$log"
> echo "XXXXXXXXXX ------- END OF INITIAL NGINX PROCESSES -------
> XXXXXXXXXX" >> "$log"
> 
> i=1
> while sleep 10
> do
>     echo "XXXXXXXXXX ------- RELOAD $i -------    XXXXXXXXXX" >> "$log"
> 
>     nginx -t && nginx -s reload
>     sleep 10
> 
>     echo "XXXXXXXXXX ------- RELOAD $i complete - SHOWING NGINX
> PROCESSES ------- XXXXXXXXXX" >> "$log"
>     ps auxf 2>&1 | grep nginx >> "$log"
>     echo "XXXXXXXXXX ------- END OF NGINX PROCESSES ------- XXXXXXXXXX" 
> >> "$log"
> 
>    ((i++))
> done
> ------------------------------- snap --------------------------------
> 
> As far as I understand, the reload should be complete when the process
> list is appended to the error_log 10 seconds after reloading nginx
> *and* it does not contain any proctitles like "nginx: reloading
> workers..".
> 
> # Step 3: watch -n1 ps auxf     ... and wait until a nginx process in
> #                                   state "H" appears.
> 
> # Step 4: CTRL-C the debug script, stop nginx and SIGKILL all "H"-state
> processes.
> 
> # Step 5: Examine the error_log.
> 
> This debug log is about 21M in size, gzipped down to 1M. I've uploaded
> it here to not pollute the mailing list:
> 
> 	http://biz.baze.de/files/nginx-debug.log.gz
> 
> The first appearance of an "H"-state process is after reload 68 on line
> number 300630. From there on, the nginx reload is broken and more of
> these processes accumulate.
> 
> To find the next marker, just search for "XXXXXXXXXX". Search for
> " H " (leading and trailing space) to find the next process list with
> such a process.
> 
> For the record, here is the process list after reload 68, with broken
> process (pid 11488).
> 
> XXXXXXXXXX ------- RELOAD 68 complete - SHOWING NGINX PROCESSES -------
> XXXXXXXXXX
> root     30864  0.0  0.0  12780  1344 pts/9    S+   13:27   0:00
> \_ /bin/bash ./nginx-debug.sh
> root     11894  0.0  0.0   9800   860 pts/9    S+   13:50
> 0:00          \_ grep nginx
> root     11895  0.0  0.0  12780   496 pts/9    S+   13:50
> 0:00          \_ /bin/bash ./nginx-debug.sh
> root     30776  0.1  0.0 101668 11332 ?        Ss   13:27   0:02 nginx:
> master process /usr/sbin/nginx -c /etc/nginx/n…
> nginx    11488  0.0  0.0      0     0 ?        H    13:49   0:00  \_
> [nginx]
> nginx    11835  0.0  0.0 101668 11876 ?        S    13:50   0:00  \_
> nginx: worker process
> nginx    11836  0.0  0.0 101668 12096 ?        S    13:50   0:00  \_
> nginx: worker process
> nginx    11837  0.0  0.0 101668 11008 ?        S    13:50   0:00  \_
> nginx: cache manager process
> nginx    11838  0.0  0.0 101668 10988 ?        S    13:50   0:00  \_
> nginx: cache loader process
> XXXXXXXXXX ------- END OF NGINX PROCESSES ------- XXXXXXXXXX

Thank you for the detailed report.
It seems that "H" statted is a rebranded "Z" state.
I see the following in the log:

Two workers exit:

2010/06/09 13:50:03 [notice] 11487#0: exit
2010/06/09 13:50:03 [notice] 11488#0: exit

A master received the only SIGCHLD signal:

2010/06/09 13:50:03 [notice] 30776#0: signal 17 (SIGCHLD) received
2010/06/09 13:50:03 [notice] 30776#0: worker process 11487 exited with code 0
2010/06/09 13:50:03 [debug] 30776#0: wake up, sigio 0
2010/06/09 13:50:03 [debug] 30776#0: reap children
2010/06/09 13:50:03 [debug] 30776#0: child: 0 11835 e:0 t:0 d:0 r:1 j:0
2010/06/09 13:50:03 [debug] 30776#0: child: 1 11836 e:0 t:0 d:0 r:1 j:0
2010/06/09 13:50:03 [debug] 30776#0: child: 2 11837 e:0 t:0 d:0 r:1 j:0
2010/06/09 13:50:03 [debug] 30776#0: child: 3 11838 e:0 t:0 d:0 r:0 j:0
2010/06/09 13:50:03 [debug] 30776#0: child: 4 11487 e:1 t:1 d:0 r:1 j:0
2010/06/09 13:50:03 [debug] 30776#0: pass close channel s:4 pid:11487 to:11835
2010/06/09 13:50:03 [debug] 30776#0: pass close channel s:4 pid:11487 to:11836
2010/06/09 13:50:03 [debug] 30776#0: pass close channel s:4 pid:11487 to:11837
2010/06/09 13:50:03 [debug] 30776#0: pass close channel s:4 pid:11487 to:11838

The master is even able to send a message without error to already
finished 11488:

2010/06/09 13:50:03 [debug] 30776#0: pass close channel s:4 pid:11487 to:11488
2010/06/09 13:50:03 [debug] 30776#0: child: 5 11488 e:1 t:0 d:0 r:1 j:0
2010/06/09 13:50:03 [debug] 30776#0: child: 6 -1 e:1 t:1 d:0 r:1 j:0

Then the master calls sigsuspend():

2010/06/09 13:50:03 [debug] 30776#0: sigsuspend

and waits forever: it did not receive 11488's SIGCHLD to call waitpid()
and to escape 11488 from zombie state, it did not receive any SIGHUP
further.

I believe something is broken in Linux 2.6.32 signal delivery.


-- 
Igor Sysoev
http://sysoev.ru/en/



More information about the nginx mailing list