All workers in 'D' state using sendfile

Fri May 25 04:57:39 UTC 2012

Hi Maxim,

Thanks for your reply and sorry for the delay in responding!

I've applied your suggested changes to three servers in the cluster -
hopefully that will give me an accurate idea of their effectiveness.  I'll
report back when I have more useful info.
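
For reference, the nginx-side suggestions quoted below combine into roughly
the following.  This is only a sketch, not the exact config in use: the
location path is a placeholder, the values are the suggested starting points
rather than tuned numbers, and aio will only work where the kernel supports
it and nginx was built with file AIO support (--with-file-aio).

    # Sketch only: "/downloads/" is a hypothetical location name.
    location /downloads/ {
        sendfile        off;      # suggestion 1: no readahead control with sendfile
        output_buffers  1 512k;   # suggestion 2: large output buffers, fewer seeks
        aio             on;       # suggestion 3: needs a newer kernel than 2.6.18
        directio        512;      # required alongside aio on Linux
    }

The I/O scheduler change (suggestion 4) is made at the OS level and is not
part of the nginx config.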

Thanks again,

Drew

On Sat, May 12, 2012 at 9:18 PM, Maxim Dounin <mdounin at mdounin.ru> wrote:

> Hello!
>
> On Sat, May 12, 2012 at 08:28:14PM +1000, Drew Wareham wrote:
>
> > Hello,
> >
> > I have tried to summarize this as much as possible but it's still a lot of
> > text.  I apologize but wanted to make sure that I provide enough
> > information to explain the issue properly.
> >
> > I'm hoping that somebody that uses nginx as a high traffic/concurrency
> > download server will be able to shed some light on this issue.  I've tried
> > as many things as I can think of and everything keeps pointing to it being
> > an issue with nginx, not the server - but I am of course more than willing
> > to try any suggestions provided.
> >
> > *Background:*
> > Approx. 1,500 - 5,000 concurrent connections (off-peak / peak),
> > Files vary in size from 5MB to 2GB,
> > All downloads; only very small dynamic content scripts run on these servers
> > and none take more than 1-3 seconds,
> > Files are hosted on direct-attached AoE storage with a dedicated 10GE link,
> > Server is running nginx-1.0.11, php-fpm 5.3 and CentOS 5.8x64
> > (2.6.18-308.4.1.el5.centos.plus).
> > Specs are: Dual Xeon E5649 (6 Core), 32GB RAM, 300GB 10k SAS HDD, AoE DAS
> > over 10GE
> > Download speeds are restricted by the PHP handoff using X-Accel-Redirect,
> > but obviously not when I'm testing ;)
> >
> > *Issue:*
> > After running for a short, but random period of time (5min ~ 90min) all
> > nginx workers will eventually end up in a 'D' state according to ps/top.
> > This causes all downloads to run extremely slowly (~25kb/s) but it doesn't
> > seem to be caused by I/O because an scp of the same file will complete at
> > the expected speed of ~750MB+/s.
> >
> > I usually run with worker_processes set to 13, but I've had to raise this
> > to 50 to prevent the issue.  This works short term, but I'm guessing
> > eventually I will need to restart nginx to fix it.
> >
> > *Config:*
> > I'm using sendfile with epoll, and using the following events / http
> > settings (I've removed the location block with the fastcgi handler, etc):
>
> With rotational disks you have to optimize iops to minimize seeks.
> This includes:
>
> 1. Switch off sendfile; it works badly on such workloads under Linux
> because there is no way to control readahead (and hence the blocks
> read from disk).
>
> 2. Use large output buffers, something like
>
>    output_buffers 1 512k;
>
> would be a good starting point.
>
> 3. Try using aio to ensure better disk concurrency (and note under
> linux it needs directio as well), i.e. something like this
>
>    aio on;
>    directio 512;
>
> (this will require a newer kernel, though; using 2.6.18 nowadays
> looks like a bad idea, at least if you need speed)
>
> 4. Try tuning the I/O scheduler; there have been reports that deadline
> might be better for such workloads.
>
> More details can be found here:
>
> http://nginx.org/r/output_buffers
> http://nginx.org/r/aio
> http://nginx.org/r/directio
>
> Maxim Dounin