Weird 0.8.11.1 connections spike

Mon Aug 31 16:14:12 MSD 2009

Igor Sysoev wrote:
> On Sun, Aug 30, 2009 at 10:55:57PM -0400, Jim Ohlstein wrote:
>
>> Igor Sysoev wrote:
>>> On Sun, Aug 30, 2009 at 11:52:51AM -0400, Jim Ohlstein wrote:
>>>
>>>>>> 2009/08/30 10:29:00 [alert] 2042#0: open socket #1023 left in connection 1015
>>>>>> 2009/08/30 10:29:00 [alert] 2042#0: aborting
>>>>>>
>>>>>> Other servers seem to be running fine including ones with busy sites. 
>>>>>> For the moment I have reverted that VPS to 0.8.10.
>>>>>
>>>>> Could you do the following:
>>>>>
>>>>> 1) enable coredumps
>>>>> 2) set in nginx.conf:
>>>>>  debug_points  abort;
>>>>> 3) reconfigure nginx (send it -HUP); if there are open connections
>>>>>  left, nginx creates a coredump on exit
>>>>>
>>>> Do you want nginx reconfigured "--with-debug" or is there another option 
>>>> you need?
>>>>
>>> No. The coredump is enough; it just needs debug info (the gcc -g
>>> option).
>>>
>>>>> 4) look in log for alerts: open socket #... left in connection NN
>>>>> 5) run "gdb /path/to/nginx /path/to/core", then
>>>>>
>>>>>  p ((ngx_connection_t *) ngx_cycle->connections[NN]->data)->uri
>>>>>  p ((ngx_connection_t *) ngx_cycle->connections[NN]->data)->main->count
>>>>>
>>>>>  where NN is NN from log message.
>>
>> Unfortunately I don't think it gave too much information.
>>
>> I watched connections gradually rise. I have ulimit -n set to 1024, two 
>> workers, 1024 connections/worker. As connections neared 2048 the site 
>> became unresponsive and load went up dramatically.
>>
>> I began to see the same errors in the log. Nginx did not abort on its 
>> own so I killed it after a few minutes. I then saw the same entries in 
>> the error log like:
>>
>> 2009/08/30 22:22:40 [alert] 6118#0: open socket #980 left in connection 993
>
> nginx aborts only when you send it -HUP and it finds leaked connections.
>
>> I ran gdb on the core but this was the output from three connections:
>>
>> [root@mars proc]# gdb /vz/private/101/fs/root/usr/local/sbin/nginx ./kcore
>> GNU gdb Red Hat Linux (6.5-37.el5_2.2rh)
>> Copyright (C) 2006 Free Software Foundation, Inc.
>> GDB is free software, covered by the GNU General Public License, and you are
>> welcome to change it and/or distribute copies of it under certain conditions.
>> Type "show copying" to see the conditions.
>> There is absolutely no warranty for GDB.  Type "show warranty" for details.
>> This GDB was configured as "x86_64-redhat-linux-gnu"...Using host libthread_db library "/lib64/libthread_db.so.1".
>>
>> warning: core file may not match specified executable file.
>> Core was generated by `ro root=LABEL=/ console=tty0 console=ttyS1,19200n8 debug'.
>> #0  0x0000000000000000 in ?? ()
>> (gdb) p ((ngx_connection_t *) ngx_cycle->connections[1014]->data)->uri
>> Cannot access memory at address 0x130
>> (gdb) p ((ngx_connection_t *) ngx_cycle->connections[1014]->data)->uri
>> Cannot access memory at address 0x130
>> (gdb) p ((ngx_connection_t *) ngx_cycle->connections[1010]->data)->uri
>> Cannot access memory at address 0x130
>> (gdb) p ((ngx_connection_t *) ngx_cycle->connections[1014]->data)->main->count
>> Cannot access memory at address 0x130
>> (gdb) p ((ngx_connection_t *) ngx_cycle->connections[1010]->data)->main->count
>> Cannot access memory at address 0x130
>> (gdb) p ((ngx_connection_t *) ngx_cycle->connections[993]->data)->uri
>> Cannot access memory at address 0x130
>> (gdb) p ((ngx_connection_t *) ngx_cycle->connections[993]->data)->main->count
>> Cannot access memory at address 0x130
>> (gdb) quit
>> [root@mars proc]#
>>
>> During this time there were hundreds of connections in "CLOSE_WAIT" 
>> state. They gradually increased to just over 1000 when it crashed.
>
> Sorry, I made a mistake. The commands should be:
>
> p ((ngx_http_request_t *) ngx_cycle->connections[1014].data)->uri
> p ((ngx_http_request_t *) ngx_cycle->connections[1014].data)->main->count
>
It looks as though you got the data that you needed overnight in my time 
zone. That server does use a try_files directive:

location /forums/ {
    try_files  $uri  $uri/  /forums/vbseo.php;
    ...
}

Previously we used a rewrite:

#if (!-e $request_filename) {
#rewrite ^/forums/(.*)$ /forums/vbseo.php last;
#}

which ironically would probably not have caused this difficulty.

I'll try 0.8.12 and report any difficulties, unless you want me to 
generate another coredump with 0.8.11.
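
If another coredump would help, the nginx.conf settings for it would look 
roughly like the sketch below. The core size limit and the dump directory 
are placeholder assumptions, not values from this server; debug_points is 
the directive suggested earlier in the thread.

```nginx
# Sketch only: the size limit and the directory are assumed values.
worker_rlimit_core  500m;                  # let workers write core files up to this size
working_directory   /var/tmp/nginx-cores;  # assumed path; must be writable by the worker user
debug_points        abort;                 # abort and dump core when a debug point is hit
```

All three directives belong at the top level of nginx.conf, outside the 
http block. The core file to load in gdb would then be the one a worker 
writes into that directory, not a kernel memory image such as /proc/kcore.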

Jim