[PATCH 21 of 31] Fix cpu hog with all upstream servers marked "down"

Mon Aug 15 15:09:38 UTC 2011

----- Original Message -----
> Regarding the above mentioned patch (also quoted below), I wanted to
> provide feedback on this:
> 
> On my system, we have several reverse proxy servers running Nginx and
> forwarding requests to upstream. Our configuration looks like this:
> upstream trc {
> 	server prod2-f1:10213 max_fails=500 fail_timeout=30s;
> 	server prod2-f2:10213 max_fails=500 fail_timeout=30s;
>         ...
> 	server 127.0.0.1:10213 backup;
> 	ip_hash;
> }
> 
> We've noticed that every once in a while (about 5-10 times a week)
> one of the servers gets into a state where an Nginx worker starts
> eating 100% CPU and timing out on requests. I've applied the
> aforementioned patch to our Nginx installation (release 1.0.0 with
> the Nginx_Upstream_Hash patch) and deployed to our production
> servers. After a few hours, we started having the Nginx workers on
> all the servers eat 100% CPU.
> 
> Connecting with gdb to one of the problematic workers, I got this
> backtrace:
> #0  0x000000000044a650 in ngx_http_upstream_get_round_robin_peer ()
> #1  0x00000000004253dc in ngx_event_connect_peer ()
> #2  0x0000000000448618 in ngx_http_upstream_connect ()
> #3  0x0000000000448e10 in ngx_http_upstream_process_header ()
> #4  0x00000000004471fb in ngx_http_upstream_handler ()
> #5  0x00000000004247fa in ngx_event_expire_timers ()
> #6  0x00000000004246ed in ngx_process_events_and_timers ()
> #7  0x000000000042a048 in ngx_worker_process_cycle ()
> #8  0x00000000004287e0 in ngx_spawn_process ()
> #9  0x000000000042963c in ngx_start_worker_processes ()
> #10 0x000000000042a5d5 in ngx_master_process_cycle ()
> #11 0x0000000000410adf in main ()
> 
> I then tried tracing through the running worker using the GDB command
> "next", which said:
> Single stepping until exit from function
> ngx_http_upstream_get_round_robin_peer
> 
> And never returned until I got fed up and broke it.
> 
> I finally reverted the patch and restarted the service, and continue
> to get this behavior. So my conclusion is that for my specific
> problem, this patch does not solve it.

Additionally:

1) I believe my problem is related to the fact that I have 25% of the upstream servers configured in the "down" state (due to some unrelated work on those servers). I've just removed the "down" servers and restarted, and I will see whether that prevents the problem from happening.
2) The trigger for the problem is continuous load on the servers over a length of time. With minimal load or with occasional spikes, the servers perform fine. The reason is likely that under more than moderate load, the upstream application servers have a relatively high request failure rate (something like 2-3%), which causes them to go in and out of the "down" state automatically all the time, so the list of "up" servers is always in flux.
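
To put rough numbers on point 2: with max_fails=500 and fail_timeout=30s, a server should only be marked unavailable after about 500 failed attempts inside one 30 second window. The sketch below estimates how much traffic a single upstream server would need to receive for that to happen at a 2-3% failure rate. It assumes failures are spread evenly over time and that the counter behaves as the documentation describes; the function name is only for illustration and is not taken from the Nginx source.

```python
def required_request_rate(max_fails, fail_timeout_s, failure_rate):
    """Requests per second one upstream server must receive so that,
    on average, max_fails failures accumulate in one fail_timeout window.

    Assumption: failures occur at a constant, evenly spread rate.
    """
    if not 0 < failure_rate <= 1:
        raise ValueError("failure_rate must be in (0, 1]")
    # Total requests in one window needed to produce max_fails failures.
    requests_per_window = max_fails / failure_rate
    return requests_per_window / fail_timeout_s


# Values from the upstream block above: max_fails=500 fail_timeout=30s.
for rate in (0.02, 0.03):
    rps = required_request_rate(500, 30, rate)
    print(f"failure rate {rate:.0%}: about {rps:.0f} requests/s per upstream server")
```

Under these assumptions that comes out at several hundred requests per second per upstream server, which would fit the observation that the problem only shows up under sustained load and not during occasional spikes.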

-- 
Oded <oded at geek.co.il>