[PATCH 21 of 31] Fix cpu hog with all upstream servers marked "down"

Oded Arbel oded at geek.co.il
Mon Aug 15 14:59:36 UTC 2011


I would like to provide feedback on the above-mentioned patch (also quoted below):

In our setup, we have several reverse proxy servers running Nginx and forwarding requests to upstream servers. Our configuration looks like this:
upstream trc {
	server prod2-f1:10213 max_fails=500 fail_timeout=30s;
	server prod2-f2:10213 max_fails=500 fail_timeout=30s;
	...
	server 127.0.0.1:10213 backup;
	ip_hash;
}

We've noticed that every once in a while (about 5-10 times a week) one of the servers gets into a state where an Nginx worker consumes 100% CPU and requests time out. I applied the aforementioned patch to our Nginx installation (release 1.0.0 with the Nginx_Upstream_Hash patch) and deployed it to our production servers. After a few hours, the Nginx workers on all the servers were consuming 100% CPU.

Attaching gdb to one of the problematic workers, I got this backtrace:
#0  0x000000000044a650 in ngx_http_upstream_get_round_robin_peer ()
#1  0x00000000004253dc in ngx_event_connect_peer ()
#2  0x0000000000448618 in ngx_http_upstream_connect ()
#3  0x0000000000448e10 in ngx_http_upstream_process_header ()
#4  0x00000000004471fb in ngx_http_upstream_handler ()
#5  0x00000000004247fa in ngx_event_expire_timers ()
#6  0x00000000004246ed in ngx_process_events_and_timers ()
#7  0x000000000042a048 in ngx_worker_process_cycle ()
#8  0x00000000004287e0 in ngx_spawn_process ()
#9  0x000000000042963c in ngx_start_worker_processes ()
#10 0x000000000042a5d5 in ngx_master_process_cycle ()
#11 0x0000000000410adf in main ()

I then tried stepping through the running worker using the GDB command "next", which printed:
Single stepping until exit from function ngx_http_upstream_get_round_robin_peer

It never returned, and I eventually gave up and interrupted it.

I finally reverted the patch and restarted the service, and we continue to see this behavior. So my conclusion is that this patch does not solve my specific problem.

-- 
Oded <oded at geek.co.il>


diff --git a/src/http/ngx_http_upstream_round_robin.c b/src/http/ngx_http_upstream_round_robin.c
--- a/src/http/ngx_http_upstream_round_robin.c
+++ b/src/http/ngx_http_upstream_round_robin.c
@@ -583,7 +583,7 @@ failed:
 static ngx_uint_t
 ngx_http_upstream_get_peer(ngx_http_upstream_rr_peers_t *peers)
 {
-    ngx_uint_t                    i, n;
+    ngx_uint_t                    i, n, reset = 0;
     ngx_http_upstream_rr_peer_t  *peer;
 
     peer = &peers->peer[0];
@@ -622,6 +622,10 @@ ngx_http_upstream_get_peer(ngx_http_upst
             return n;
         }
 
+        if (reset++) {
+            return 0;
+        }
+
         for (i = 0; i < peers->number; i++) {
             peer[i].current_weight = peer[i].weight;
         }
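
For readers who want to see the effect of the "reset" guard in isolation, here is a minimal sketch in C. It is a simplified model and not the actual nginx code: the names model_peer_t and model_get_peer are hypothetical, the real function's weight-ratio comparison is replaced by "take the first peer with remaining weight", and it assumes that a server marked "down" is represented by a configured weight of 0. Under those assumptions, an unguarded loop would reset the weights to 0 forever, while the guarded loop gives up on the second pass.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for ngx_http_upstream_rr_peer_t. */
typedef struct {
    int weight;          /* configured weight; 0 models a server marked "down" */
    int current_weight;  /* weight remaining in the current round */
} model_peer_t;

/*
 * Simplified model of the selection loop: return the index of the first
 * peer with remaining weight. If none is found, reset the weights once
 * and rescan; if the rescan finds nothing either, give up and return 0.
 */
static size_t
model_get_peer(model_peer_t *peer, size_t number)
{
    size_t  i;
    int     reset = 0;

    for ( ;; ) {
        for (i = 0; i < number; i++) {
            if (peer[i].current_weight > 0) {
                return i;
            }
        }

        /* Guard from the patch: reaching this point twice means the
         * reset below did not make any peer selectable. */
        if (reset++) {
            return 0;
        }

        for (i = 0; i < number; i++) {
            peer[i].current_weight = peer[i].weight;
        }
    }
}
```

With every peer at weight 0 the call returns index 0 after one reset instead of spinning. With at least one peer of nonzero weight, the reset makes that peer selectable and its index is returned. Note that the guard only bounds this particular loop; it says nothing about other loops in the same call path.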


