Nginx health checks

Tue Jun 10 21:36:25 MSD 2008

On Tue, Jun 10, 2008 at 10:21 AM, Barry Abrahamson <barry at automattic.com> wrote:
> Howdy,
>
> First let me say how happy we are with Nginx :)  Yesterday was a pretty big
> traffic day for us, about 36 million dynamic pageviews peaking at about 15k
> requests/sec.  Couldn't have done it without Nginx.
>
> http://wordpress.com/stats/traffic/
>
> We have about 350 web servers behind Nginx so it is a semi-regular
> occurrence that one of them fails for some reason (usually hardware).  Pound
> has a dedicated health check thread, that would perform the health checks
> and then mark servers up/down as appropriate.

> Nginx, however, seems to use
> the response (or lack thereof) from the user-initiated request as the health
> check.

Can you elaborate on what you mean here?  Servers aren't getting marked
as down by pound until a user makes a request?

> The problem with this is that a % (max_fails/fail_timeout) of user
> responses will be slowed down by this.  To minimize the impact we could set
> our timeouts lower and increase fail_timeout, but this could be dangerous --
> in the event of a problem with a backend service (database, etc) which
> results in slow responses from all web servers, nginx could mark every
> server as down for an extended period of time.

It seems like this should be tuned at the pound level.  You should
definitely have sane thresholds for timeouts, but it sounds like your
load balancer software is letting you down here.
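
To make the failure mode in the quoted paragraph concrete, here is a
rough Python model of passive checking.  It is only a sketch, not
nginx's actual code: the PassiveChecker class and its method names are
made up for illustration, and it assumes nothing beyond the documented
meaning of max_fails (failed attempts within the window) and
fail_timeout (both the window and how long the server is then skipped).
Real nginx has extra behaviour that this model ignores.

```python
class PassiveChecker:
    """Toy model of passive health checking (hypothetical, not nginx code)."""

    def __init__(self, max_fails=1, fail_timeout=10.0):
        self.max_fails = max_fails
        self.fail_timeout = fail_timeout
        self.failures = {}    # server -> timestamps of recent failed requests
        self.down_until = {}  # server -> time at which it may be tried again

    def record_failure(self, server, now):
        # Keep only the failures that fall inside the fail_timeout window.
        recent = [t for t in self.failures.get(server, [])
                  if now - t < self.fail_timeout]
        recent.append(now)
        self.failures[server] = recent
        if len(recent) >= self.max_fails:
            # Too many failures in the window: skip the server for fail_timeout.
            self.down_until[server] = now + self.fail_timeout

    def is_up(self, server, now):
        return now >= self.down_until.get(server, 0.0)


# A slow database makes every web server time out once at t=0.
checker = PassiveChecker(max_fails=1, fail_timeout=300.0)
servers = ["web1", "web2", "web3"]
for s in servers:
    checker.record_failure(s, now=0.0)

up_after_one_minute = [s for s in servers if checker.is_up(s, now=60.0)]
up_after_six_minutes = [s for s in servers if checker.is_up(s, now=360.0)]
```

With a long fail_timeout, one shared slowdown leaves
up_after_one_minute empty, and the servers only come back once the
timeout has passed.  That is the trade-off being described: a longer
fail_timeout means fewer user requests hit a dead server, but a
backend-wide hiccup takes everything out for longer.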

> I was wondering if there has
> been any thought about adding a dedicated health check thread to nginx to
> avoid affecting user requests.  This could also in theory allow for more
> advanced and customizable health checks.
>
> Another option would be to use another program such as keepalived to do the
> health checks and then modify the nginx config on the fly, but it seems less
> than ideal.
>

We use keepalived TCP checks and it works pretty well.  The catch is
that a TCP check only proves the port is accepting connections; the
appservers we reverse proxy to might still be down, so it's not a
spectacular health check.  We experimented with HTTP checks going to a
dynamically generated (but really light) appserver page and actually
found them to be less reliable than the TCP checks.
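
For the "modify the nginx config on the fly" idea, something like the
following Python sketch is what I would picture.  Treat it as an
illustration under assumptions rather than a tested tool: the function
names, the upstream name "backend" and the 10.0.0.x addresses are all
made up.  The only nginx feature it relies on is the "down" parameter
of the upstream "server" directive.

```python
import socket


def tcp_check(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def render_upstream(name, servers, healthy):
    """Build an nginx upstream block, marking unhealthy servers as 'down'."""
    lines = ["upstream %s {" % name]
    for host, port in servers:
        # Servers with no recorded result are treated as unhealthy.
        suffix = "" if healthy.get((host, port), False) else " down"
        lines.append("    server %s:%d%s;" % (host, port, suffix))
    lines.append("}")
    return "\n".join(lines) + "\n"


# Example with hard-coded results; a real checker would fill `healthy`
# by calling tcp_check() for each server from a background loop.
servers = [("10.0.0.1", 80), ("10.0.0.2", 80)]
healthy = {("10.0.0.1", 80): True, ("10.0.0.2", 80): False}
config = render_upstream("backend", servers, healthy)
```

The generated text would be written to an included file and nginx told
to reload its configuration (for example by sending SIGHUP to the
master process).  The checking happens outside the request path, so no
user request has to wait on a dead server to discover it, but it does
mean one more moving part to run and monitor.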

In your pound config are you using TCP checks, dynamic HTTP checks, or
a static page pull?  Are you using FastCGI or some other kind of
appserver for PHP?

> We are a company of PHP developers, so hacking on Nginx's beautiful C code
> is not our forte, but I would be open to sponsoring some development in this
> area if someone is interested and the community thinks it would be useful.

Very cool. :)

>
> Thoughts?
> --
> Barry Abrahamson | Systems Wrangler | Automattic
> Blog: http://barry.wordpress.com
>
>

-- 
Corey Donohoe
http://www.atmos.org/
http://www.engineyard.com/