Stop handling SIGTERM and zombie processes after reconfigure

Florian S. f_los_ch at yahoo.com
Thu Jul 4 11:00:27 UTC 2013


Hi again,

On 03.07.2013 17:38, Maxim Dounin wrote:
> Hello!
>
> On Wed, Jul 03, 2013 at 04:48:29PM +0200, Florian S. wrote:
>
>> Hi together!
>>
>> I'm occasionally having trouble with worker processes left <defunct>
>> and nginx no longer handling signals (HUP and even TERM) at all.
>>
>> Upon a reconfiguration signal, the log shows four new worker processes
>> being spawned while the old four are shutting down:
>>
>>> [notice] 5159#0: using the "epoll" event method
>>> [notice] 5159#0: nginx/1.4.1
>>> [notice] 5159#0: built by gcc 4.4.3 (Ubuntu 4.4.3-4ubuntu5.1)
>>> [notice] 5159#0: OS: Linux 3.9.7-147-x86
>>> [notice] 5159#0: getrlimit(RLIMIT_NOFILE): 100000:100000
>>> [notice] 5159#0: start worker processes
>>> [notice] 5159#0: start worker process 5330
>>> [notice] 5159#0: start worker process 5331
>>> [notice] 5159#0: start worker process 5332
>>> [notice] 5159#0: start worker process 5333
>>> [notice] 5159#0: signal 1 (SIGHUP) received, reconfiguring
>>> [notice] 5159#0: reconfiguring
>>> [notice] 5159#0: using the "epoll" event method
>>> [notice] 5159#0: start worker processes
>>> [notice] 5159#0: start worker process 12457
>>> [notice] 5159#0: start worker process 12458
>>> [notice] 5159#0: start worker process 12459
>>> [notice] 5159#0: start worker process 12460
>>> [notice] 5159#0: start cache manager process 12461
>>> [notice] 5159#0: start cache loader process 12462
>>> [notice] 5331#0: gracefully shutting down
>>> [notice] 5330#0: gracefully shutting down
>>> [notice] 5331#0: exiting
>>> [notice] 5330#0: exiting
>>> [notice] 5331#0: exit
>>> [notice] 5330#0: exit
>>> [notice] 5332#0: gracefully shutting down
>>> [notice] 5159#0: signal 17 (SIGCHLD) received
>>> [notice] 5159#0: worker process 5331 exited with code 0
>>> [notice] 5332#0: exiting
>>> [notice] 5332#0: exit
>>> [notice] 5333#0: gracefully shutting down
>>> [notice] 5333#0: exiting
>>> [notice] 5333#0: exit
>>
>> After that, nginx is fully operational and serving requests --
>> however, ps yields:
>>
>>> root    5159 0.0 0.0 6248 1696 ?     Ss 10:43 0:00 nginx: master
>> process /chroots/nginx/nginx -c /chroots/nginx/conf/nginx.conf
>>> nobody  5330 0.0 0.0    0    0 ?     Z  10:43 0:00 [nginx] <defunct>
>>> nobody  5332 0.0 0.0    0    0 ?     Z  10:43 0:00 [nginx] <defunct>
>>> nobody  5333 0.0 0.0    0    0 ?     Z  10:43 0:00 [nginx] <defunct>
>>> nobody 12457 0.0 0.0 8332 2940 ?     S  10:44 0:00 nginx: worker process
>>> nobody 12458 0.0 0.0 8332 2940 ?     S  10:44 0:00 nginx: worker process
>>> nobody 12459 0.0 0.0 8332 3544 ?     S  10:44 0:00 nginx: worker process
>>> nobody 12460 0.0 0.0 8332 2940 ?     S  10:44 0:00 nginx: worker process
>>> nobody 12461 0.0 0.0 6296 1068 ?     S  10:44 0:00 nginx: cache
>> manager process
>>> nobody 12462 0.0 0.0    0    0 ?     Z  10:44 0:00 [nginx] <defunct>
>>
>> In the log one can see that SIGCHLD is only received once, for 5331,
>> which does not show up as a zombie -- in contrast to the workers 5330,
>> 5332, 5333, and the cache loader 12462.
>> Much more serious is that neither
>>
>>> /chroots/nginx/nginx -c /chroots/nginx/conf/nginx.conf -s(stop|reload)
>>
>> nor
>>
>>> kill 5159
>>
>> seems to be handled by nginx anymore (nothing in the log and no
>> effect). Maybe the master process is stuck waiting for some mutex?:
>>
>>> strace -p 5159
>>> Process 5159 attached - interrupt to quit
>>> futex(0xb7658e6c, FUTEX_WAIT_PRIVATE, 2, NULL
>>
>> Unfortunately, I failed to get a core dump of the master process
>> while it was still running. Additionally, there is no debug log
>> available, sorry. As I was not able to reliably reproduce this issue,
>> I'll most probably have to wait...
>
> It indeed looks like the master process is blocked somewhere.  It
> would be interesting to see stack trace of a master process when
> this happens.
>
> (It's also a good idea to make sure there are no 3rd-party
> modules/patches, just in case.)
>

Thanks for your quick reply.
I finally managed to get a core dump (I killed the master process with
signal 11 to force the dump, which is why gdb reports a segfault):

> Program terminated with signal 11, Segmentation fault.
> #0  0xb772c430 in dl_main (phdr=0x5, phnum=1, user_entry=0x80a97f9, auxv=0xbfd0956c) at rtld.c:1751
> 1751	rtld.c: No such file or directory.
> (gdb) bt
> #0  0xb772c430 in dl_main (phdr=0x5, phnum=1, user_entry=0x80a97f9, auxv=0xbfd0956c) at rtld.c:1751
> #1  0xb7523bc6 in ?? ()
> #2  0x00000005 in ?? ()
> #3  0x00000001 in ?? ()
> #4  0x080a97f9 in ?? ()
> #5  0x0804c370 in syslog (__fmt=0x80a97f9 "%.*s", __pri=<optimized out>) at /usr/include/bits/syslog.h:32
> #6  ngx_log_error_core (level=6, log=0x967f084, fn=0x80adba2 "ngx_signal_handler", file=0x80ad731 "src/os/unix/ngx_process.c", line=430, err=0, fmt=0x80ad74b "signal %d (%s) received%s") at src/core/ngx_log.c:249
> #7  0x0806b890 in ngx_signal_handler (signo=17) at src/os/unix/ngx_process.c:429
> #8  0xb772c400 in dl_main (phdr=0x5, phnum=1, user_entry=0x80a97f9, auxv=0xbfd0a1ec) at rtld.c:1735
> #9  0xb7523bc6 in ?? ()
> #10 0x00000005 in ?? ()
> #11 0x00000001 in ?? ()
> #12 0x080a97f9 in ?? ()
> #13 0x0804c370 in syslog (__fmt=0x80a97f9 "%.*s", __pri=<optimized out>) at /usr/include/bits/syslog.h:32
> #14 ngx_log_error_core (level=6, log=0x967f084, fn=0x80adba2 "ngx_signal_handler", file=0x80ad731 "src/os/unix/ngx_process.c", line=430, err=0, fmt=0x80ad74b "signal %d (%s) received%s") at src/core/ngx_log.c:249
> #15 0x0806b890 in ngx_signal_handler (signo=29) at src/os/unix/ngx_process.c:429
> #16 0xb772c400 in dl_main (phdr=0xbfd0b0f0, phnum=3218125184, user_entry=0x10, auxv=0x967f084) at rtld.c:1735
> #17 0x0806f0da in ngx_master_process_cycle (cycle=0x967f078) at src/os/unix/ngx_process_cycle.c:169
> #18 0x0804b95c in main (argc=3, argv=0xbfd0b394) at src/core/nginx.c:417
> (gdb)

Maybe the concurrently running handlers for SIGCHLD and SIGIO lead to
some blocking in dl_main? However, I am not aware of the side effects
and exact purpose of the dynamic linking at this point.
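
To make that suspicion concrete: syslog() is not async-signal-safe, so if
a signal arrives while the process is already inside syslog() -- i.e. while
glibc's internal log lock is held -- and the handler calls syslog() again,
the handler blocks forever on that lock, which would match the
futex(FUTEX_WAIT_PRIVATE, ...) seen in the strace output above. A purely
hypothetical minimal reproducer (my own sketch, not nginx code) would look
roughly like this:

#include <signal.h>
#include <syslog.h>
#include <sys/wait.h>
#include <unistd.h>

static void handler(int signo)
{
    /* Unsafe: syslog() takes a libc-internal lock; if the interrupted
     * code already holds it, this call never returns. */
    syslog(LOG_NOTICE, "signal %d received", signo);

    while (waitpid(-1, NULL, WNOHANG) > 0) { /* reap children */ }
}

int main(void)
{
    struct sigaction sa;

    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGCHLD, &sa, NULL);

    openlog("deadlock-demo", LOG_PID, LOG_DAEMON);

    for (;;) {
        if (fork() == 0) {
            _exit(0);            /* child exits -> SIGCHLD to parent */
        }

        /* The parent spends most of its time inside syslog(); sooner or
         * later SIGCHLD interrupts it mid-call and the handler above
         * blocks on the lock the interrupted call still holds. */
        syslog(LOG_NOTICE, "busy logging");
    }
}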

And, as the backtrace reveals, I failed to mention that I have the
(semi-official?) syslog patch applied. Since syslog() is not
async-signal-safe, calling it from the signal handler might indeed cause
the problem. As you already pointed out, it seems to be a good idea to
remove this patch and check whether the error persists.
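
For reference, a rough sketch of an async-signal-safe pattern (again my own
illustration, not the actual patch or nginx code): the handler only records
the signal in a flag, and syslog() is called later from normal context. As
far as I can tell, the stock ngx_signal_handler() gets away with logging
because ngx_log_error_core() ends up in a plain write() to the log file,
which is async-signal-safe, whereas syslog() is not.

#include <signal.h>
#include <syslog.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile sig_atomic_t got_sigchld;

static void handler(int signo)
{
    (void) signo;
    got_sigchld = 1;                 /* async-signal-safe: only set a flag */
}

int main(void)
{
    struct sigaction sa;

    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGCHLD, &sa, NULL);

    openlog("safe-demo", LOG_PID, LOG_DAEMON);

    for (;;) {
        pause();                     /* a real loop would use sigsuspend(),
                                      * as the nginx master cycle does, to
                                      * avoid the check-then-wait race */

        if (got_sigchld) {
            got_sigchld = 0;

            /* syslog() runs here, in normal context, never inside
             * the signal handler itself. */
            syslog(LOG_NOTICE, "SIGCHLD received, reaping children");

            while (waitpid(-1, NULL, WNOHANG) > 0) { /* reap */ }
        }
    }
}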

Kind regards,
Florian


