[PATCH 21 of 31] Fix cpu hog with all upstream servers marked "down"

Mon Aug 15 15:59:39 UTC 2011

Hello!

On Mon, Aug 15, 2011 at 02:59:36PM -0000, Oded Arbel wrote:

> Regarding the above-mentioned patch (also quoted below), I 
> wanted to provide some feedback:
> 
> On my system, we have several reverse proxy servers running 
> Nginx and forwarding requests to upstream. Our configuration 
> looks like this:
> upstream trc {
> 	server prod2-f1:10213 max_fails=500 fail_timeout=30s;
> 	server prod2-f2:10213 max_fails=500 fail_timeout=30s;
>         ...
> 	server 127.0.0.1:10213 backup;
> 	ip_hash;

The ip_hash balancer doesn't support "backup" servers (and it will 
complain loudly if you place "ip_hash" before the servers).  Could 
you please check whether you still see the problem after removing 
the backup server?
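
Something like this should do for the test (just a sketch based on 
your snippet, with the backup line dropped and everything else, 
including the servers you elided, left as is):

```nginx
upstream trc {
	server prod2-f1:10213 max_fails=500 fail_timeout=30s;
	server prod2-f2:10213 max_fails=500 fail_timeout=30s;
	# ... remaining servers unchanged, no "backup" entry
	ip_hash;
}
```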

> }
> 
> We've noticed that every once in a while (about 5-10 times a 
> week) one of the servers gets into a state where an Nginx worker 
> starts eating 100% CPU and timing out on requests. I've applied 
> the aforementioned patch to our Nginx installation (release 
> 1.0.0 with the Nginx_Upstream_Hash patch) and deployed to our 

You mean the one from Evan Miller's upstream hash module, as 
available at http://wiki.nginx.org/HttpUpstreamRequestHashModule?  

> production servers. After a few hours, we started having the 
> Nginx workers on all the servers eat 100% CPU.
> 
> Connecting with gdb to one of the problematic workers, I got this 
> backtrace:
> #0  0x000000000044a650 in ngx_http_upstream_get_round_robin_peer ()
> #1  0x00000000004253dc in ngx_event_connect_peer ()
> #2  0x0000000000448618 in ngx_http_upstream_connect ()
> #3  0x0000000000448e10 in ngx_http_upstream_process_header ()
> #4  0x00000000004471fb in ngx_http_upstream_handler ()
> #5  0x00000000004247fa in ngx_event_expire_timers ()
> #6  0x00000000004246ed in ngx_process_events_and_timers ()
> #7  0x000000000042a048 in ngx_worker_process_cycle ()
> #8  0x00000000004287e0 in ngx_spawn_process ()
> #9  0x000000000042963c in ngx_start_worker_processes ()
> #10 0x000000000042a5d5 in ngx_master_process_cycle ()
> #11 0x0000000000410adf in main ()
> 
> I then tried tracing through the running worker using the GDB 
> command "next", which said:
> Single stepping until exit from function 
> ngx_http_upstream_get_round_robin_peer
> 
> And never returned until I got fed up and broke it.
> 
> I finally reverted the patch and restarted the service, and 
> continue to get this behavior. So my conclusion is that for my 
> specific problem, this patch does not solve it.

Your problem is different from the one the patch is intended to 
solve.  The patch solves one (and only one) problem: the case where 
all servers are marked "down" in the config, which is clearly not 
the case you have.
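
For illustration only, the kind of config the patch is about looks 
roughly like this (the upstream name and addresses here are made 
up, not taken from any real setup):

```nginx
upstream backend {
	# every server is explicitly marked "down"
	server 192.0.2.1:8080 down;
	server 192.0.2.2:8080 down;
}
```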

Maxim Dounin