[PATCH 21 of 31] Fix cpu hog with all upstream servers marked "down"

lanshun zhou zls.sogou at gmail.com
Mon Aug 15 17:51:14 UTC 2011


Do you use the upstream hash module in any of your active upstreams?

Can you provide the full upstream configuration?
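
For context, this is the kind of block I mean. A hypothetical upstream using
the third-party upstream hash module; the directive name and syntax below are
assumptions based on that module's documentation, not taken from your setup:

    upstream backend {
        server app1:8080;
        server app2:8080;
        # "hash" is provided by the third-party upstream hash module,
        # not by stock nginx
        hash $request_uri;
    }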

2011/8/15 Maxim Dounin <mdounin at mdounin.ru>

> Hello!
>
> On Mon, Aug 15, 2011 at 02:59:36PM -0000, Oded Arbel wrote:
>
> > Regarding the above mentioned patch (also quoted below), I
> > wanted to provide feedback on this:
> >
> > On my system, we have several reverse proxy servers running
> > Nginx and forwarding requests to upstream servers. Our configuration
> > looks like this:
> > upstream trc {
> >       server prod2-f1:10213 max_fails=500 fail_timeout=30s;
> >       server prod2-f2:10213 max_fails=500 fail_timeout=30s;
> >         ...
> >       server 127.0.0.1:10213 backup;
> >       ip_hash;
>
> The ip_hash balancer doesn't support "backup" servers (and it will
> complain loudly if you place "ip_hash" before the servers).  Could you
> please check whether you still see the problem after removing the
> backup server?
>
> > }
> >
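(For reference, a minimal sketch of the configuration Maxim suggests testing:
the same upstream block with the backup server removed, and with ip_hash
declared first so that a leftover "backup" flag would be rejected when the
configuration is loaded. Hostnames and parameters are copied from the quoted
config; treat this as an illustration, not a drop-in fix.)

    upstream trc {
        ip_hash;
        server prod2-f1:10213 max_fails=500 fail_timeout=30s;
        server prod2-f2:10213 max_fails=500 fail_timeout=30s;
        ...
        # no "backup" server here: ip_hash does not support backup servers
    }
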
> > We've noticed that every once in a while (about 5-10 times a
> > week) one of the servers gets into a state where an Nginx worker
> > starts eating 100% CPU and timing out on requests. I've applied
> > the aforementioned patch to our Nginx installation (release
> > 1.0.0 with the Nginx_Upstream_Hash patch) and deployed to our
>
> You mean the one from Evan Miller's upstream hash module, as
> available at http://wiki.nginx.org/HttpUpstreamRequestHashModule?
>
> > production servers. After a few hours, we started having the
> > Nginx workers on all the servers eat 100% CPU.
> >
> > Connecting with gdb to one of the problematic workers, I got this
> > backtrace:
> > #0  0x000000000044a650 in ngx_http_upstream_get_round_robin_peer ()
> > #1  0x00000000004253dc in ngx_event_connect_peer ()
> > #2  0x0000000000448618 in ngx_http_upstream_connect ()
> > #3  0x0000000000448e10 in ngx_http_upstream_process_header ()
> > #4  0x00000000004471fb in ngx_http_upstream_handler ()
> > #5  0x00000000004247fa in ngx_event_expire_timers ()
> > #6  0x00000000004246ed in ngx_process_events_and_timers ()
> > #7  0x000000000042a048 in ngx_worker_process_cycle ()
> > #8  0x00000000004287e0 in ngx_spawn_process ()
> > #9  0x000000000042963c in ngx_start_worker_processes ()
> > #10 0x000000000042a5d5 in ngx_master_process_cycle ()
> > #11 0x0000000000410adf in main ()
> >
> > I then tried tracing through the running worker using the GDB
> > command "next", which said:
> > Single stepping until exit from function
> > ngx_http_upstream_get_round_robin_peer
> >
> > And never returned until I got fed up and broke it.
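
(A side note on the backtrace: being unable to single-step out of
ngx_http_upstream_get_round_robin_peer is what a peer-selection loop that
never terminates looks like. The sketch below is a simplified, self-contained
illustration of that failure shape, not nginx source code: a round-robin pass
over peers that are all unusable, where only an explicit "tried them all"
guard keeps the loop from spinning forever.)

    /* Simplified illustration, not nginx source code: a peer-selection
     * loop of this shape spins forever (100% CPU, gdb "next" never
     * returns) when every peer is unusable and nothing stops the scan
     * after one full pass. */
    #include <stdio.h>

    typedef struct {
        const char *name;
        int         usable;   /* 0 = down/failed, 1 = may be selected */
    } peer_t;

    static peer_t peers[] = {
        { "prod2-f1:10213", 0 },
        { "prod2-f2:10213", 0 },
    };
    #define NPEERS (sizeof(peers) / sizeof(peers[0]))

    /* Returns the index of a usable peer, or -1 if there is none. */
    static int select_peer(unsigned *rr)
    {
        unsigned tried = 0;

        for ( ;; ) {
            unsigned i = (*rr)++ % NPEERS;

            if (peers[i].usable) {
                return (int) i;
            }

            /* The guard: after examining every peer once, give up.
             * Remove this check and the for(;;) never exits, which is
             * the worker-pinned-at-100%-CPU symptom described above. */
            if (++tried == NPEERS) {
                return -1;
            }
        }
    }

    int main(void)
    {
        unsigned rr = 0;
        int idx = select_peer(&rr);

        if (idx < 0) {
            printf("no usable peer\n");
        } else {
            printf("selected %s\n", peers[idx].name);
        }

        return 0;
    }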
> >
> > I finally reverted the patch and restarted the service, and
> > continued to get this behavior. So my conclusion is that this
> > patch does not solve my specific problem.
>
> Your problem is different from the one the patch is intended to solve.
> The patch solves one (and only one) problem: the case where all servers
> are marked "down" in the config, which is clearly not the case you have.
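
(To illustrate the case the patch actually targets, here is a minimal
hypothetical configuration in which every upstream server is explicitly
marked "down"; the addresses are made up:)

    upstream backend {
        server 10.0.0.1:8080 down;
        server 10.0.0.2:8080 down;
        # every server is "down": this is the configuration that could
        # previously drive a worker into the CPU hog, and is the only
        # case the patch addresses
    }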
>
> Maxim Dounin
>