Do you use the upstream hash module in any of your active upstreams?<div><br></div><div>Can you provide the full upstream configuration?<br><div><br><div class="gmail_quote">2011/8/15 Maxim Dounin <span dir="ltr">&lt;<a href="mailto:mdounin@mdounin.ru">mdounin@mdounin.ru</a>&gt;</span><br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">Hello!<br>

<div class="im"><br>

On Mon, Aug 15, 2011 at 02:59:36PM -0000, Oded Arbel wrote:<br>

<br>

> Regarding the above mentioned patch (also quoted below), I<br>

> wanted to provide feedback on this:<br>

><br>

> On my system, we have several reverse proxy servers running<br>

> Nginx and forwarding requests to upstream. Our configuration<br>

> looks like this:<br>

> upstream trc {<br>

>       server prod2-f1:10213 max_fails=500 fail_timeout=30s;<br>

>       server prod2-f2:10213 max_fails=500 fail_timeout=30s;<br>

>         ...<br>

>       server <a href="http://127.0.0.1:10213" target="_blank">127.0.0.1:10213</a> backup;<br>

>       ip_hash;<br>

<br>

</div>Ip hash balancer doesn't support "backup" servers (and it will<br>

complain loudly if you place "ip_hash" before servers).  Could you<br>

please check if you still see the problem after removing backup<br>

server?<br>

<div class="im"><br>

> }<br>

><br>

> We've noticed that every once in a while (about 5-10 times a<br>

> week) one of the servers gets into a state where an Nginx worker<br>

> starts eating 100% CPU and timing out on requests. I've applied<br>

> the aforementioned patch to our Nginx installation (release<br>

> 1.0.0 with the Nginx_Upstream_Hash patch) and deployed to our<br>

<br>

</div>You mean the one from Evan Miller's upstream hash module, as<br>

available at <a href="http://wiki.nginx.org/HttpUpstreamRequestHashModule" target="_blank">http://wiki.nginx.org/HttpUpstreamRequestHashModule</a>?<br>

<div class="im"><br>

> production servers. After a few hours, we started having the<br>

> Nginx workers on all the servers eat 100% CPU.<br>

><br>

> Connecting with gdb to one of the problematic workers, I got this<br>

> backtrace:<br>

> #0  0x000000000044a650 in ngx_http_upstream_get_round_robin_peer ()<br>

> #1  0x00000000004253dc in ngx_event_connect_peer ()<br>

> #2  0x0000000000448618 in ngx_http_upstream_connect ()<br>

> #3  0x0000000000448e10 in ngx_http_upstream_process_header ()<br>

> #4  0x00000000004471fb in ngx_http_upstream_handler ()<br>

> #5  0x00000000004247fa in ngx_event_expire_timers ()<br>

> #6  0x00000000004246ed in ngx_process_events_and_timers ()<br>

> #7  0x000000000042a048 in ngx_worker_process_cycle ()<br>

> #8  0x00000000004287e0 in ngx_spawn_process ()<br>

> #9  0x000000000042963c in ngx_start_worker_processes ()<br>

> #10 0x000000000042a5d5 in ngx_master_process_cycle ()<br>

> #11 0x0000000000410adf in main ()<br>

><br>

> I then tried tracing through the running worker using the GDB<br>

> command "next", which said:<br>

> Single stepping until exit from function<br>

> ngx_http_upstream_get_round_robin_peer<br>

><br>

> And never returned until I got fed up and broke it.<br>

><br>

> I finally reverted the patch and restarted the service, and<br>

> continue to get this behavior. So my conclusion is that for my<br>

> specific problem, this patch does not solve it.<br>

<br>

</div>Your problem is different from the one the patch is intended to solve.<br>

The patch solves one (and only one) problem where all servers are<br>

marked "down" in config, clearly not the case you have.<br>

<font color="#888888"><br>

Maxim Dounin<br>

</font><div><div></div><div class="h5"><br>

_______________________________________________<br>

nginx-devel mailing list<br>

<a href="mailto:nginx-devel@nginx.org">nginx-devel@nginx.org</a><br>

<a href="http://mailman.nginx.org/mailman/listinfo/nginx-devel" target="_blank">http://mailman.nginx.org/mailman/listinfo/nginx-devel</a><br>

</div></div></blockquote></div><br></div></div>