nginx -s reload fails / workers in process-state H
Igor Sysoev
igor at sysoev.ru
Wed Jun 9 18:25:58 MSD 2010
On Wed, Jun 09, 2010 at 03:46:41PM +0200, John Feuerstein wrote:
> Hello,
>
> using nginx-0.8.40 on Linux 2.6.32 I've encountered a problem with
> nginx reloading, which leaves [nginx] processes in process state "H"
> behind. As soon as the first of these processes is left behind, nginx
> is unable to reload the workers and is unable to stop gracefully.
> (happens with 0.8.36 and 0.8.37 too)
>
> Sending signals to the H-state processes won't do anything, leaving no
> other choice but to SIGKILL them.
>
> This behavior is very seldom and I've tried to track it down while
> having nginx logging with log level debug. It finally happened after
> reloading nginx the 68th time.
>
> The system was completely idle during the tests.
>
> Note that I've first encountered this while using cron to reload the
> nginx configuration every *5 minutes*. I'm not entirely sure if this is
> triggered by some kind of race-condition, where the last nginx reload
> was not completely finished and the next one is issued on top.
> I doubt it though: The debug log I've created shows the full process
> list after a reload. AFAIK nginx changes some proctitles while in
> reload-state, which I could not find anymore just seconds after issuing
> the reload signal. In addition, the system was idle (no requests...), so
> the old workers should fade in no time.
>
> The command used by cron to reload nginx is:
>
> nginx -t && nginx -s reload
>
>
> Steps to reproduce it (after stopping Cron and nginx):
>
> # Step 1: Start up nginx as usual, eg. /etc/init.d/nginx start
> # Step 2: Run the following script to issue the reload signal
> # every 20 seconds:
>
>
> ------------------------------- snip ---------------------------------
> #!/bin/bash
>
> log=/var/log/nginx/error_log
>
> echo "XXXXXXXXXX ------- SHOWING INITIAL NGINX PROCESSES -------
> XXXXXXXXXX" >> "$log"
> ps auxf 2>&1 | grep nginx >> "$log"
> echo "XXXXXXXXXX ------- END OF INITIAL NGINX PROCESSES -------
> XXXXXXXXXX" >> "$log"
>
> i=1
> while sleep 10
> do
> echo "XXXXXXXXXX ------- RELOAD $i ------- XXXXXXXXXX" >> "$log"
>
> nginx -t && nginx -s reload
> sleep 10
>
> echo "XXXXXXXXXX ------- RELOAD $i complete - SHOWING NGINX
> PROCESSES ------- XXXXXXXXXX" >> "$log"
> ps auxf 2>&1 | grep nginx >> "$log"
> echo "XXXXXXXXXX ------- END OF NGINX PROCESSES ------- XXXXXXXXXX"
> >> "$log"
>
> ((i++))
> done
> ------------------------------- snap --------------------------------
>
> As far as I understand, the reload should be complete when the process
> list is appended to the error_log 10 seconds after reloading nginx
> *and* it does not contain any proctitles like "nginx: reloading
> workers..".
>
> # Step 3: watch -n1 ps auxf ... and wait until a nginx process in
> # state "H" appears.
>
> # Step 4: CTRL-C the debug script, stop nginx and SIGKILL all "H"-state
> processes.
>
> # Step 5: Examine the error_log.
>
> This debug log is about 21M in size, gzipped down to 1M. I've uploaded
> it here to not pollute the mailing list:
>
> http://biz.baze.de/files/nginx-debug.log.gz
>
> The first appearance of an "H"-state process is after reload 68 on line
> number 300630. From there on, the nginx reload is broken and more of
> these processes accumulate.
>
> To find the next marker, just search for "XXXXXXXXXX". Search for
> " H " (leading and trailing space) to find the next process list with
> such a process.
>
> For the record, here is the process list after reload 68, with broken
> process (pid 11488).
>
> XXXXXXXXXX ------- RELOAD 68 complete - SHOWING NGINX PROCESSES -------
> XXXXXXXXXX
> root 30864 0.0 0.0 12780 1344 pts/9 S+ 13:27 0:00
> \_ /bin/bash ./nginx-debug.sh
> root 11894 0.0 0.0 9800 860 pts/9 S+ 13:50
> 0:00 \_ grep nginx
> root 11895 0.0 0.0 12780 496 pts/9 S+ 13:50
> 0:00 \_ /bin/bash ./nginx-debug.sh
> root 30776 0.1 0.0 101668 11332 ? Ss 13:27 0:02 nginx:
> master process /usr/sbin/nginx -c /etc/nginx/n…
> nginx 11488 0.0 0.0 0 0 ? H 13:49 0:00 \_
> [nginx]
> nginx 11835 0.0 0.0 101668 11876 ? S 13:50 0:00 \_
> nginx: worker process
> nginx 11836 0.0 0.0 101668 12096 ? S 13:50 0:00 \_
> nginx: worker process
> nginx 11837 0.0 0.0 101668 11008 ? S 13:50 0:00 \_
> nginx: cache manager process
> nginx 11838 0.0 0.0 101668 10988 ? S 13:50 0:00 \_
> nginx: cache loader process
> XXXXXXXXXX ------- END OF NGINX PROCESSES ------- XXXXXXXXXX
Thank you for the detailed report.
It seems that "H" statted is a rebranded "Z" state.
I see the following in the log:
Two workers exit:
2010/06/09 13:50:03 [notice] 11487#0: exit
2010/06/09 13:50:03 [notice] 11488#0: exit
A master received the only SIGCHLD signal:
2010/06/09 13:50:03 [notice] 30776#0: signal 17 (SIGCHLD) received
2010/06/09 13:50:03 [notice] 30776#0: worker process 11487 exited with code 0
2010/06/09 13:50:03 [debug] 30776#0: wake up, sigio 0
2010/06/09 13:50:03 [debug] 30776#0: reap children
2010/06/09 13:50:03 [debug] 30776#0: child: 0 11835 e:0 t:0 d:0 r:1 j:0
2010/06/09 13:50:03 [debug] 30776#0: child: 1 11836 e:0 t:0 d:0 r:1 j:0
2010/06/09 13:50:03 [debug] 30776#0: child: 2 11837 e:0 t:0 d:0 r:1 j:0
2010/06/09 13:50:03 [debug] 30776#0: child: 3 11838 e:0 t:0 d:0 r:0 j:0
2010/06/09 13:50:03 [debug] 30776#0: child: 4 11487 e:1 t:1 d:0 r:1 j:0
2010/06/09 13:50:03 [debug] 30776#0: pass close channel s:4 pid:11487 to:11835
2010/06/09 13:50:03 [debug] 30776#0: pass close channel s:4 pid:11487 to:11836
2010/06/09 13:50:03 [debug] 30776#0: pass close channel s:4 pid:11487 to:11837
2010/06/09 13:50:03 [debug] 30776#0: pass close channel s:4 pid:11487 to:11838
The master is even able to send a message without error to already
finished 11488:
2010/06/09 13:50:03 [debug] 30776#0: pass close channel s:4 pid:11487 to:11488
2010/06/09 13:50:03 [debug] 30776#0: child: 5 11488 e:1 t:0 d:0 r:1 j:0
2010/06/09 13:50:03 [debug] 30776#0: child: 6 -1 e:1 t:1 d:0 r:1 j:0
Then the master calls sigsuspend():
2010/06/09 13:50:03 [debug] 30776#0: sigsuspend
and waits forever: it did not receive 11488's SIGCHLD to call waitpid()
and to escape 11488 from zombie state, it did not receive any SIGHUP
further.
I believe something is broken in Linux 2.6.32 signal delivery.
--
Igor Sysoev
http://sysoev.ru/en/
More information about the nginx
mailing list