Surviving Digg?

Wed Apr 30 00:38:13 MSD 2008

We hit the front page of Digg the other night, and our servers didn't
handle it well at all.  Here's a little of what happened; perhaps
someone has suggestions on what to tweak!

Basic setup: nginx 0.5.35 serves static image content and passes PHP
requests to 2 backend servers running Apache.  All 3 servers run
Red Hat EL4.

Looking at the nginx error log:

First, we saw a lot of entries like the following:
 socket() failed (24: Too many open files) while connecting to upstream
 accept() failed (24: Too many open files) while accepting new connection
 open() "/var/www/html/images/imagefile.jpg" failed (24: Too many open files)

Running ulimit -n showed 1024, so we set that to 32768 on all 3
servers.  We also raised the limit in /etc/security/limits.conf.
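
One thing worth checking is the limit a process actually sees, since
ulimit -n only reports the value for the current shell.  A minimal
sketch, assuming Python 3 on Linux; nofile_limits and open_fd_count
are made-up helper names, not anything that ships with nginx:

```python
import os
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits of this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def open_fd_count():
    """Count descriptors this process holds, or None without /proc.

    The count includes the descriptor listdir() uses to read the
    directory, so treat it as approximate.
    """
    fd_dir = "/proc/self/fd"
    if not os.path.isdir(fd_dir):
        return None
    return len(os.listdir(fd_dir))


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print("soft limit:", soft, "hard limit:", hard)
    print("descriptors open now:", open_fd_count())
```

It reports the limits of the process it runs in, so it only says
something about the server if it is launched the same way the server
is (same user, same init script).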

After that, we started seeing the following:
 upstream timed out (110: Connection timed out) while connecting to upstream

So, perhaps the 2 backend servers couldn't handle the load?  We were
serving the page mostly out of memcache at this point.  In any case,
we couldn't figure out why that wasn't sufficient, so we replaced the
page with a static HTML one.

This seemed to help, but we were now seeing a lot of these:
 connect() failed (113: No route to host) while connecting to upstream
 no live upstreams while connecting to upstream

This didn't happen on every request, but on a significant percentage
of them.  We couldn't figure this one out.  Why couldn't nginx connect
to the backend servers?  We ended up rebooting both of the backend
servers, and these errors stopped.
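
To put a number on "a significant percentage", the error log can be
tallied by failure type.  A rough sketch, assuming Python 3; classify
and tally are hypothetical helpers, and the sample input is just the
message text quoted above rather than full log lines:

```python
import re
from collections import Counter

# Matches the "(errno: description)" part of an nginx error message.
ERRNO_RE = re.compile(r"\((\d+): ([^)]+)\)")

SAMPLE = [
    "socket() failed (24: Too many open files) while connecting to upstream",
    "accept() failed (24: Too many open files) while accepting new connection",
    'open() "/var/www/html/images/imagefile.jpg" failed (24: Too many open files)',
    "upstream timed out (110: Connection timed out) while connecting to upstream",
    "connect() failed (113: No route to host) while connecting to upstream",
    "no live upstreams while connecting to upstream",
]


def classify(line):
    """Return a short label for one error-log line."""
    match = ERRNO_RE.search(line)
    if match:
        return match.group(2)      # the errno description
    if "no live upstreams" in line:
        return "no live upstreams"
    return "other"


def tally(lines):
    """Count non-blank lines per label."""
    return Counter(classify(line) for line in lines if line.strip())


if __name__ == "__main__":
    counts = tally(SAMPLE)
    total = sum(counts.values())
    for label, n in counts.most_common():
        print("%-25s %5d  %5.1f%%" % (label, n, 100.0 * n / total))
```

Passing an open file object instead of SAMPLE works the same way,
since tally only iterates over lines.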

Does anyone have any thoughts or comments?  Thanks!