Surviving Digg?

Wed Apr 30 03:01:02 MSD 2008

On Tue, Apr 29, 2008 at 2:07 PM, Aleksandar Lazic <al-nginx at none.at> wrote:
> Hi Neil,
>
>
>  On Tue 29.04.2008 13:38, Neil Sheth wrote:
>
> >
> > We hit the front page of Digg the other night, and our servers didn't
> > handle it well at all.  Here's a little of what happened, and perhaps
> > someone has some suggestions on what to tweak!
> >
> > Basic setup: nginx 0.5.35 serving static image content and passing
> > PHP requests to 2 backend servers running Apache, all on Red Hat
> > EL4.
> >
>
>  What were/are the network settings on the machines?

What specific settings are you asking about?

>
>
> > Now, we started seeing the following:
> > upstream timed out (110: Connection timed out) while connecting to
> > upstream
> >
>
>
>  What was the load on the backends?
>  What are the Apache settings?
>  Have you taken a look at the output of
>
>  netstat -nt
>
>  How many connections in FIN* states do you have?

Right now it shows about 60.  I'm not sure what the count of FIN
connections was at the time of the Digg traffic.  I did run the
following (found in a forum somewhere) to get connection counts by IP:
netstat -ntu | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -nr
This showed almost 1000 connections to each of the backend servers.
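
A variant of that pipeline counts connections by TCP state rather than
by IP, which would answer the FIN question more directly.  This is only
a sketch: it assumes the Linux net-tools netstat layout, where the state
is the sixth column, and count_states is a made-up helper name, not
something used during the incident.

```shell
# count_states: hypothetical helper (name is an assumption).
# Reads `netstat -nt` output on stdin and prints the number of
# connections in each TCP state, most common first.
# Assumes the state is in column 6, as in Linux net-tools netstat.
count_states() {
    # Keep only tcp/tcp6 rows (skips the two header lines), take the
    # state column, then count and rank the distinct values.
    awk '$1 ~ /^tcp/ {print $6}' | sort | uniq -c | sort -nr
}

# Usage: netstat -nt | count_states
```

Run on the nginx box and on each backend, this should show at a glance
whether connections are piling up in states such as FIN_WAIT2 or
TIME_WAIT.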

>
>
>
> > So, perhaps the 2 backend servers couldn't handle the load?  We were
> > serving the page mostly out of memcache at this point.  In any case,
> > couldn't figure out why that wasn't sufficient, so we replaced the page
> > with a static html one.
> >
> > This seemed to help, but we were now seeing a lot of these:
> >  connect() failed (113: No route to host) while connecting to upstream
> >  no live upstreams while connecting to upstream
> >
>
>  Have you put names or IP addresses into the nginx config?

IP addresses.

>
>
> > This wasn't on every request, but a significant percentage.  This, we
> > couldn't figure out.  Why couldn't it connect to the backend servers?
> > We ended up rebooting both of the backend servers, and these errors
> > stopped.
> >
>
>  Again: load and netstat?

Load didn't actually look that bad, if I recall.  It probably peaked
around 4 while this was occurring, but was generally lower.

>  Cheers
>
>  Aleks
>
>
Thanks for the help!