Nginx health checks

Tue Jun 10 20:21:21 MSD 2008

Howdy,

First let me say how happy we are with Nginx :)  Yesterday was a  
pretty big traffic day for us, about 36 million dynamic pageviews  
peaking at about 15k requests/sec.  Couldn't have done it without Nginx.

http://wordpress.com/stats/traffic/

We have about 350 web servers behind Nginx so it is a semi-regular  
occurrence that one of them fails for some reason (usually hardware).   
Pound has a dedicated health check thread that performs the health
checks and marks servers up/down as appropriate.  Nginx, however,
seems to use the response (or lack thereof) from the user-initiated
request as the health check.  The problem with this is that some user
requests (roughly max_fails per fail_timeout window for each failed
server) will be slowed down while they wait on a dead server.  To
minimize the impact we could set our timeouts lower and
increase fail_timeout, but this could be dangerous -- in the event of  
a problem with a backend service (database, etc) which results in slow  
responses from all web servers, nginx could mark every server as down  
for an extended period of time.  I was wondering if there has been any  
thought about adding a dedicated health check thread to nginx to avoid  
affecting user requests.  This could also in theory allow for more  
advanced and customizable health checks.
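
To make the idea concrete, here is a rough sketch in Python (not nginx
code) of what such a checker might look like: a background thread
probes each backend on an interval and flips an up/down flag only
after several consecutive results disagree with the current state, so
user requests never act as the probe.  The names HealthChecker, probe,
rise and fall are made up for illustration; none of this is an
existing nginx feature.

```python
import threading


class HealthChecker:
    """Probe backends out of band and track which ones are up.

    Hypothetical sketch: `probe` is any callable that returns True when
    a backend answers correctly. `fall` and `rise` are the number of
    consecutive contrary results needed to mark a backend down or up.
    """

    def __init__(self, backends, probe, interval=2.0, fall=3, rise=2):
        self.probe = probe
        self.interval = interval
        self.fall = fall
        self.rise = rise
        self.lock = threading.Lock()
        # Every backend starts out up, with no streak of contrary results.
        self.state = {b: {"up": True, "streak": 0} for b in backends}
        self._stop = threading.Event()

    def check_once(self):
        """Run one round of probes and update the up/down flags."""
        for backend, st in self.state.items():
            ok = bool(self.probe(backend))
            with self.lock:
                if ok == st["up"]:
                    st["streak"] = 0  # result agrees with current state
                    continue
                st["streak"] += 1
                threshold = self.rise if ok else self.fall
                if st["streak"] >= threshold:
                    st["up"] = ok
                    st["streak"] = 0

    def healthy(self):
        """Return the backends currently considered up."""
        with self.lock:
            return [b for b, st in self.state.items() if st["up"]]

    def run_forever(self):
        """Loop until stop() is called; meant to run in its own thread."""
        while not self._stop.is_set():
            self.check_once()
            self._stop.wait(self.interval)

    def start(self):
        thread = threading.Thread(target=self.run_forever, daemon=True)
        thread.start()
        return thread

    def stop(self):
        self._stop.set()
```

Because the thresholds count probe results rather than user requests,
a slow database that makes every probe fail would still take all
servers down, so a real implementation would probably want a floor on
how many servers can be marked down at once.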

Another option would be to use another program such as keepalived to  
do the health checks and then modify the nginx config on the fly, but  
it seems less than ideal.
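
For reference, the external approach could be sketched roughly like
this in Python: regenerate an upstream block with failed servers
flagged "down", swap the file in atomically, and send nginx a HUP so
it re-reads its config.  The function names, the upstream name and the
file paths passed in are placeholders, and the list of healthy servers
is assumed to come from whatever tool runs the checks.

```python
import os
import signal
import tempfile


def render_upstream(name, servers, healthy):
    """Return an nginx upstream block, marking unhealthy servers `down`."""
    lines = ["upstream %s {" % name]
    for server in servers:
        suffix = "" if server in healthy else " down"
        lines.append("    server %s%s;" % (server, suffix))
    lines.append("}")
    return "\n".join(lines) + "\n"


def write_and_reload(conf_path, pid_path, text):
    """Atomically replace the include file, then ask nginx to reload.

    conf_path and pid_path are assumptions about the local install.
    """
    directory = os.path.dirname(conf_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write(text)
    os.replace(tmp_path, conf_path)  # atomic rename on POSIX
    with open(pid_path) as f:
        pid = int(f.read().strip())
    os.kill(pid, signal.SIGHUP)  # nginx re-reads its config on SIGHUP
```

With 350 servers this means a full config reload every time one box
flaps, which is part of why it feels less than ideal.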

We are a company of PHP developers, so hacking on Nginx's beautiful C  
code is not our forte, but I would be open to sponsoring some  
development in this area if someone is interested and the community  
thinks it would be useful.

Thoughts?
--
Barry Abrahamson | Systems Wrangler | Automattic
Blog: http://barry.wordpress.com