All workers in 'D' state using sendfile

Drew Wareham m3rlin at gmail.com
Mon May 28 07:07:46 UTC 2012


Hi,

Unfortunately the changes were not the silver bullet I was hoping for.
The worker processes still end up sleeping and the servers are still
seeing very high loads, but the changes do seem to have helped -
downloads are no longer unable to start, which is great!
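
In case it's useful, this is how I've been spotting the stuck workers
(just a sketch - it assumes the worker processes are named "nginx";
the wchan column shows which kernel function a 'D' process is blocked
in):

    ps -C nginx -o pid,stat,wchan,comm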

I'm not sure if it's helpful, but an strace on a sleeping process looks
like this:

io_submit(47044328767488, 1, {{0x13dab860, 0, 0, 0, 66}}) = 1
epoll_wait(9, {{EPOLLOUT, {u32=1554189657, u64=47044330982745}}, {EPOLLOUT,
{u32=1554179536, u64=47044330972624}}}, 512, 36) = 2
writev(250,
[{"\321\4\vS\0313\237F\222\337\246\322\33(\30\245=g\352\271\2\361\244p\240\377Q\314\2371^\256"...,
161636}], 1) = 20440
writev(69,
[{"\235\337}\33\201\214)\321\343\233\22\346$z\374\2126T\\j\210\250L\250\331{\220\333\323\343\341J"...,
386840}], 1) = 130320
epoll_wait(9, {{EPOLLOUT, {u32=1554194624, u64=47044330987712}}}, 512, 9) =
1
writev(222,
[{"<\247\260`P\237\2455\236\244\352!\237s\223h\25\206N3[{\351f\31\275\344b\5\204\f\v"...,
396976}], 1) = 95568
epoll_wait(9, {{EPOLLOUT, {u32=1554205480, u64=47044330998568}}}, 512, 26)
= 1
writev(286,
[{"E\260q\214\346X[\376\305\5\275\352\344`\256q\327\344m\r\236\t\321\354\200\325\333\351E\340\374\232"...,
240116}], 1) = 64240
epoll_wait(9, {{EPOLLOUT, {u32=1554195361, u64=47044330988449}}}, 512, 25)
= 1
writev(133,
[{"\243Y\373y\10\0252\34\32\22\2\36\227\325e\345\333\372=\365./\340\34V\251U\0373\234\35\250"...,
13732}], 1) = 13732
io_submit(47044328767488, 1, {{0x12599628, 0, 0, 0, 209}}) = 1
epoll_wait(9, {{EPOLLOUT, {u32=1554199961, u64=47044330993049}}}, 512, 5) =
1
writev(49,
[{"+\347^\17\322\354\201\20=\35\246b\200\0\214K'z>\344k\331\272Svh\234`\334)|\205"...,
176592}], 1) = 84120
epoll_wait(9, {}, 512, 1)               = 0
epoll_wait(9, {}, 512, 4)               = 0
epoll_wait(9, {{EPOLLOUT, {u32=1554179905, u64=47044330972993}}}, 512, 14)
= 1
epoll_wait(9, {{EPOLLOUT, {u32=1554193521, u64=47044330986609}}}, 512, 10)
= 1
writev(137,
[{"\212\375\216\330'\315^\20|\350N\362\25j\272\304=v\227\210?\3539S\343\6D\265C-\360J"...,
336856}], 1) = 96360
epoll_wait(9, {{EPOLLOUT, {u32=1554181193, u64=47044330974281}}}, 512, 9) =
1
writev(79,
[{"\321\277\340\360E\323A\352\352\377\357w\357m_\377\377R\0\200\177\365l\200
\314D\24z\21U\0"..., 228056}], 1) = 128480
epoll_wait(9, {}, 512, 3)               = 0
epoll_wait(9, {}, 512, 8)               = 0
epoll_wait(9, {}, 512, 2)               = 0
epoll_wait(9, {{EPOLLOUT, {u32=1554204009, u64=47044330997097}}}, 512, 26)
= 1
writev(67, [{"\204-&
V\325?\375\33\202B\236\216\r\240\360\17\0103\25\274\3\300>\352\267\211BJ\265\23\327"...,
166588}], 1) = 26280
epoll_wait(9, {}, 512, 12)              = 0
epoll_wait(9, {{EPOLLIN, {u32=6779072, u64=6779072}}}, 512, 14) = 1
read(10, "\1\0\0\0\0\0\0\0", 8)         = 8
io_getevents(47044328767488, 1, 64, {{0x13dab860, 0x13dab820, 524288, 0}},
{0, 0}) = 1
writev(80,
[{"X\361N8\2\214\203n\263t\240\\\335\241k\212N\366\24\222\32\201u\267\272\32\v\326=\373\34\v"...,
524288}], 1) = 56608
epoll_wait(9, {{EPOLLOUT, {u32=1554187265, u64=47044330980353}}}, 512, 13)
= 1
epoll_wait(9, {{EPOLLOUT, {u32=1554183217, u64=47044330976305}}}, 512, 3) =
1
epoll_wait(9, {}, 512, 1)               = 0
epoll_wait(9, {{EPOLLOUT, {u32=1554181744, u64=47044330974832}}}, 512, 13)
= 1
writev(121,
[{"\371s\222d\231\313\17\t\227\31\33a\315\304\365NZ7\323\200\347\337\260\355\253\203\30\215N\313\260d"...,
331027}], 1) = 49640
epoll_wait(9, {}, 512, 2)               = 0
epoll_wait(9, {}, 512, 8)               = 0
epoll_wait(9, {{EPOLLOUT, {u32=1554192968, u64=47044330986056}}}, 512, 12)
= 1
epoll_wait(9, {{EPOLLIN, {u32=6779072, u64=6779072}}}, 512, 12) = 1
read(10, "\1\0\0\0\0\0\0\0", 8)         = 8
io_getevents(47044328767488, 1, 64, {{0x12599628, 0x125995e8, 524288, 0}},
{0, 0}) = 1
writev(133,
[{"&yj\373dw\335\364\232k\310\6\204\363\365=c{V\257\6:\225\354\233\253b\27*\221\4\264"...,
524288}], 1) = 114748
epoll_wait(9, {{EPOLLOUT, {u32=1554203825, u64=47044330996913}}}, 512, 6) =
1
epoll_wait(9, {{EPOLLOUT, {u32=1554179536, u64=47044330972624}}}, 512, 5) =
1
writev(69,
[{"X\4\250\274\3415\212A\20D\30\2122.\23\351n%\226\245\250\242b$\271\t\22/fX\303\263"...,
256520}], 1) = 130320
epoll_wait(9, {{EPOLLOUT, {u32=1554192784, u64=47044330985872}}}, 512, 27)
= 1
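
(For reference, a trace like the above can be captured with something
along these lines - the syscall filter is just the set of calls I'm
interested in here, and <worker_pid> is a placeholder:)

    strace -p <worker_pid> -e trace=writev,read,epoll_wait,io_submit,io_getevents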


An strace -c taken only when a process is in the 'D' state shows:

[root@HOST16 ~]# time strace -p 22334 -c
Process 22334 attached - interrupt to quit
Process 22334 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 70.65    0.008273         109        76           io_submit
 29.35    0.003437          10       360           writev
  0.00    0.000000           0        26           read
  0.00    0.000000           0         3           open
  0.00    0.000000           0         8           close
  0.00    0.000000           0         3           fstat
  0.00    0.000000           0         1           ioctl
  0.00    0.000000           0         1           socket
  0.00    0.000000           0         1         1 connect
  0.00    0.000000           0        11         3 recvfrom
  0.00    0.000000           0         1           getsockname
  0.00    0.000000           0         3           getsockopt
  0.00    0.000000           0        44           fcntl
  0.00    0.000000           0        26           io_getevents
  0.00    0.000000           0       180           epoll_wait
  0.00    0.000000           0         4           epoll_ctl
------ ----------- ----------- --------- --------- ----------------
100.00    0.011710                   748         4 total

real    0m8.570s
user    0m0.016s
sys     0m0.028s
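
(That summary was taken by attaching and interrupting by hand; a rough
way to make sure strace only attaches while the worker is currently in
'D' would be something like the below - the 10-second window is
arbitrary, and timeout(1) is an assumption since it isn't in the stock
CentOS 5 coreutils:)

    PID=22334   # worker PID, as above
    # field 3 of /proc/PID/stat is the process state ('D' = uninterruptible sleep)
    if [ "$(awk '{print $3}' /proc/$PID/stat)" = "D" ]; then
        timeout -s INT 10 strace -c -p "$PID"   # SIGINT makes strace detach and print its summary
    fi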

When a process is out of the sleeping state, about 25% of its time is
spent in epoll_wait.  I've also noticed that vmstat shows far less
frequent swapping: instead of swapping 5-10MB regularly, it now swaps
~100MB every 30+ seconds.
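
(I'm watching that with plain vmstat - the 5 second interval is just
what I happen to use; the si/so columns are the amount swapped in/out
per second, in KB by default:)

    vmstat 5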


Cheers,

Drew

On Fri, May 25, 2012 at 2:57 PM, Drew Wareham <m3rlin at gmail.com> wrote:

> Hi Maxim,
>
> Thanks for your reply and sorry for the delay in responding!
>
> I've applied your suggested changes to three servers in the cluster -
> hopefully that will give me an accurate idea of their effectiveness.  I'll
> report back when I have more useful info.
>
>
> Thanks again,
>
> Drew
>
>
>
> On Sat, May 12, 2012 at 9:18 PM, Maxim Dounin <mdounin at mdounin.ru> wrote:
>
>> Hello!
>>
>> On Sat, May 12, 2012 at 08:28:14PM +1000, Drew Wareham wrote:
>>
>> > Hello,
>> >
>> > I have tried to summarize this as much as possible, but it's still
>> > a lot of text.  I apologize, but I wanted to make sure I provide
>> > enough information to explain the issue properly.
>> >
>> > I'm hoping that somebody who uses nginx as a high
>> > traffic/concurrency download server will be able to shed some light
>> > on this issue.  I've tried as many things as I can think of and
>> > everything keeps pointing to it being an issue with nginx, not the
>> > server - but I am of course more than willing to try any
>> > suggestions provided.
>> >
>> > *Background:*
>> > Approx. 1,500 - 5,000 concurrent connections (peak / off-peak),
>> > Files vary in size from 5MB to 2GB,
>> > All downloads; only very small dynamic content scripts run on these
>> > servers, and none take more than 1-3 seconds,
>> > Files are hosted on direct-attached AoE storage with a dedicated
>> > 10GE link,
>> > Server is running nginx-1.0.11, php-fpm 5.3 and CentOS 5.8 x64
>> > (2.6.18-308.4.1.el5.centos.plus).
>> > Specs are: Dual Xeon E5649 (6 core), 32GB RAM, 300GB 10k SAS HDD,
>> > AoE DAS over 10GE.
>> > Download speeds are restricted by the PHP handoff using
>> > X-Accel-Redirect, but obviously not when I'm testing ;)
>> >
>> > *Issue:*
>> > After running for a short but random period of time (5min ~ 90min),
>> > all nginx workers will eventually end up in a 'D' state according
>> > to ps/top.  This causes all downloads to run extremely slowly
>> > (~25kb/s), but it doesn't seem to be caused by I/O, because an scp
>> > of the same file will complete at the expected speed of ~750MB+/s.
>> >
>> > I usually run with worker_processes set to 13, but I've had to
>> > raise this to 50 to prevent the issue.  This works short term, but
>> > I'm guessing eventually I will need to restart nginx to fix it.
>> >
>> > *Config:*
>> > I'm using sendfile with epoll, and the following events / http
>> > settings (I've removed the location block with the fastcgi handler,
>> > etc.):
>>
>> With rotational disks you have to optimize iops to minimize seeks.
>> This includes:
>>
>> 1. Switch off sendfile; it works badly on such workloads under Linux
>> because there is no way to control readahead (and hence the blocks
>> read from disk).
>>
>> 2. Use large output buffers; something like
>>
>>    output_buffers 1 512k;
>>
>> would be a good starting point.
>>
>> 3. Try using aio to ensure better disk concurrency (note that under
>> Linux it needs directio as well), i.e. something like this:
>>
>>    aio on;
>>    directio 512;
>>
>> (this will require a newer kernel, though; using 2.6.18 nowadays
>> looks like a bad idea, at least if you need speed)
>>
>> 4. Try tuning the I/O scheduler; there have been reports that
>> deadline might be better for such workloads.
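>>
>> For example (the device name is just an illustration - adjust it to
>> whatever block device actually backs the storage):
>>
>>    cat /sys/block/sda/queue/scheduler
>>    echo deadline > /sys/block/sda/queue/scheduler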
>>
>> More details can be found here:
>>
>> http://nginx.org/r/output_buffers
>> http://nginx.org/r/aio
>> http://nginx.org/r/directio
>>
>> Maxim Dounin
>>
>
>