Surviving Digg?

Wed Apr 30 11:08:52 MSD 2008

On Tue, Apr 29, 2008 at 01:38:13PM -0700, Neil Sheth wrote:

> We hit the front page of digg the other night, and our servers didn't
> handle it well at all.  Here's a little of what happened, and perhaps
> someone has some suggestions on what to tweak!
> 
> Basic setup, nginx 0.5.35, serving up static image content, and then
> passing php requests to 2 backend servers running apache, all running
> red hat el4.
> 
> Looking at the nginx error log -
> 
> First, we saw a lot of entries like the following:
>  socket() failed (24: Too many open files) while connecting to upstream
>  accept() failed (24: Too many open files) while accepting new connection
>  open() "/var/www/html/images/imagefile.jpg" failed (24: Too many open files)
> 
> Running ulimit -n showed 1024, so set that to 32768 on all 3 servers.
> Also raised limit in /etc/security/limits.conf.

You need to tune your OS to raise its limits on open files, sockets, etc.
I cannot speak for Linux, but here is my tuning for FreeBSD/amd64 with 4G
of memory for a large number of sockets, etc.:
http://lists.freebsd.org/pipermail/freebsd-net/2008-April/017737.html
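
On the Linux side, the limit that matters is the one the running server
process actually sees, which is not necessarily what "ulimit -n" reports in
a login shell. As a rough sketch only (not part of the FreeBSD tuning above),
the following Python snippet uses the standard resource module to read a
process's open-files limit and raise the soft limit toward the hard one. The
function name raise_nofile_limit and the default target of 32768 (the value
mentioned in the quoted message) are illustrative assumptions.

```python
import resource


def raise_nofile_limit(target=32768):
    """Try to raise this process's soft open-files limit toward `target`.

    Returns the (soft, hard) pair in effect afterwards. An unprivileged
    process cannot raise the soft limit above the hard limit, so the
    target is capped at the hard limit.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    # RLIM_INFINITY means "no limit"; otherwise cap the request at `hard`.
    if hard == resource.RLIM_INFINITY:
        wanted = target
    else:
        wanted = min(target, hard)

    # Nothing to do if the soft limit is already unlimited or high enough.
    if soft == resource.RLIM_INFINITY or soft >= wanted:
        return soft, hard

    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
    except (ValueError, OSError):
        pass  # the kernel refused the change; keep the current limit

    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    soft, hard = raise_nofile_limit()
    print("open files: soft=%d hard=%d" % (soft, hard))
```

This changes the limit only for the process that runs it and for its
children; a daemon started from an init script needs the limit raised in
the environment that starts it.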

> Now, we started seeing the following:
>  upstream timed out (110: Connection timed out) while connecting to upstream
> 
> So, perhaps the 2 backend servers couldn't handle the load?  We were
> serving the page mostly out of memcache at this point.  In any case,
> couldn't figure out why that wasn't sufficient, so we replaced the
> page with a static html one.

Yes, it seems that your backends cannot handle the load.

> This seemed to help, but we were now seeing a lot of these:
>   connect() failed (113: No route to host) while connecting to upstream
>   no live upstreams while connecting to upstream
> 
> This wasn't on every request, but a significant percentage.  This, we
> couldn't figure out.  Why couldn't it connect to the backend servers?
> We ended up rebooting both of the backend servers, and these errors
> stopped.
> 
> Any thoughts / comments anyone has?  Thanks!

The "113: No route to host" is a network error; it may have appeared while
the backends were rebooting.
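
To see which of these failures dominates at a given moment, it can help to
tally the error log by error code. A minimal sketch in Python, assuming
only that each log line contains the "(NN: message)" part shown in the
quoted excerpts; the names ERRNO_RE and tally_errors are hypothetical, and
the sample lines are the ones from the quoted message:

```python
import re
from collections import Counter

# Matches the "(24: Too many open files)" part of an error-log line.
ERRNO_RE = re.compile(r"\((\d+): ([^)]+)\)")


def tally_errors(lines):
    """Count log lines per (errno, message) pair.

    Lines without an errno but containing "no live upstreams" are
    counted under (None, "no live upstreams"); other lines are ignored.
    """
    counts = Counter()
    for line in lines:
        match = ERRNO_RE.search(line)
        if match:
            counts[(int(match.group(1)), match.group(2))] += 1
        elif "no live upstreams" in line:
            counts[(None, "no live upstreams")] += 1
    return counts


SAMPLE = [
    "socket() failed (24: Too many open files) while connecting to upstream",
    "accept() failed (24: Too many open files) while accepting new connection",
    'open() "/var/www/html/images/imagefile.jpg" failed (24: Too many open files)',
    "upstream timed out (110: Connection timed out) while connecting to upstream",
    "connect() failed (113: No route to host) while connecting to upstream",
    "no live upstreams while connecting to upstream",
]

if __name__ == "__main__":
    for (code, message), count in tally_errors(SAMPLE).most_common():
        print(count, code, message)
```

In practice the lines would come from the error log file itself, for
example by passing an open file object to tally_errors.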

-- 
Igor Sysoev
http://sysoev.ru/en/