upstream timeouts I can not explain

Руслан Закиров ruz at sports.ru
Wed Jan 11 19:25:31 UTC 2017


On 11 Jan 2017 at 21:40, "Maxim Dounin" <mdounin at mdounin.ru> wrote:

Hello!

On Wed, Jan 11, 2017 at 08:54:31PM +0300, Руслан Закиров wrote:

> On Wed, Jan 11, 2017 at 6:44 PM, Maxim Dounin <mdounin at mdounin.ru> wrote:
>
> > Hello!
> >
> > On Wed, Jan 11, 2017 at 06:07:01PM +0300, Руслан Закиров wrote:
> >
> > > On Mon, Jan 9, 2017 at 10:59 PM, Maxim Dounin <mdounin at mdounin.ru> wrote:
> > >
> > > > Typical kern.sched.quantum is about 100ms, so several
> > > > CPU-intensive tasks can delay processing of the events enough to
> > > > trigger a timeout if a context switch happens at a bad time.
> > > >
> > >
> > > Here what I see in truss' output:
> > >
> > > 38.820523207 0.000006568 kevent(28,{ },0,{
> > > 198,EVFILT_WRITE,EV_CLEAR,0x0,0x8218,0x821405071 },512,{ 6.215000000 }) = 1 (0x1)
> > > 39.783094188 0.000022875 kevent(28,{ },0,{
> > > 52,EVFILT_READ,0x0,0x0,0x30b,0x81f800068
> > > 204,EVFILT_WRITE,EV_CLEAR,0x0,0x8218,0x821401588
> > > 51,EVFILT_READ,0x0,0x0,0xec,0x81f800000
> > > 68,EVFILT_READ,EV_CLEAR,0x0,0x8bf,0x81f816580
> > > 7,EVFILT_READ,EV_CLEAR,0x0,0x27f,0x81f813869
> > > 57,EVFILT_READ,EV_CLEAR,0x0,0x767,0x81f817bd8
> > > 203,EVFILT_READ,EV_CLEAR|EV_EOF,0x0,0x248,0x81f8030c1
> > > 181,EVFILT_READ,EV_CLEAR|EV_EOF,0x0,0x9b77,0x81f80ea68
> > > 178,EVFILT_READ,EV_CLEAR,0x0,0x39d,0x81f8010a9
> > > 198,EVFILT_READ,EV_CLEAR,0x0,0x3d3,0x81f805071
> > > 204,EVFILT_READ,EV_CLEAR,0x0,0x9da,0x81f801588
> > > 190,EVFILT_READ,EV_CLEAR,0x0,0x4ff,0x81f80fc48
> > > 154,EVFILT_READ,EV_CLEAR,0x0,0x88e,0x81f8130b1
> > > 151,EVFILT_READ,EV_CLEAR|EV_EOF,0x0,0xc1db,0x81f814290
> > > 157,EVFILT_READ,EV_CLEAR|EV_EOF,0x0,0xe841,0x81f80c029
> > > 195,EVFILT_READ,EV_CLEAR,0x0,0x952,0x81f8090a1
> > > 194,EVFILT_READ,EV_CLEAR,0x0,0x929,0x81f809ac8
> > > 201,EVFILT_READ,EV_CLEAR,0x0,0x4ef,0x81f80c980
> > > 174,EVFILT_READ,EV_CLEAR,0x0,0x51e,0x81f816518
> > > 77,EVFILT_READ,EV_CLEAR|EV_EOF,0x0,0x1168,0x81f811c61 },512,{ 5.253000000 }) = 20 (0x14)
> > >
> > >
> > > A 1 second delay between the two syscalls. Then nginx goes nuts
> > > processing everything it missed during that second. I cannot tell
> > > from this output how much time was spent in the syscalls themselves.
> > > Can anyone?
> >
> > Using ktrace / kdump might be a better option; it shows both
> > syscall enter and syscall return with exact timestamps.
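
For reference, a capture along these lines should record entry and return of
every syscall with timestamps (the worker PID is a placeholder, and the exact
flags are worth double-checking against ktrace(1) and kdump(1)):

  # trace syscalls of one running worker into ./ktrace.out
  $ ktrace -t c -p <worker_pid>
  ... reproduce the stall ...
  # stop tracing, then dump the log with absolute timestamps
  $ ktrace -C
  $ kdump -T -f ktrace.out | less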
> >
>
> Tried. Found at least one issue with sendfile blocking:
>
>  66193 nginx    1484152162.182601 CALL  openat(AT_FDCWD,0x80639aac4,0x4<O_NONBLOCK>,<unused>0)
>  66193 nginx    1484152162.182607 NAMI  ...
>  66193 nginx    1484152162.182643 RET   openat 40/0x28
> ... gz lookup and stats ...
>  66193 nginx    1484152162.182683 CALL  setsockopt(0x211,0x6,0x4,0x7fffffffd96c,0x4)
>  66193 nginx    1484152162.182687 RET   setsockopt 0
>  66193 nginx    1484152162.182689 CALL  sendfile(0x28,0x211,0,0x806f,0x7fffffffe1d8,0x7fffffffe240,<invalid=0>)
>  66193 nginx    1484152163.541770 RET   sendfile 0
>
> Sendfile blocks for 1.3 seconds. However, the machine is:
>
> $ sysctl hw.model hw.machine hw.ncpu
> hw.model: Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
> hw.machine: amd64
> hw.ncpu: 16
>
> So sfbufs don't apply here, but the documentation doesn't say which
> resource does apply... or I couldn't find the right doc.

On amd64 sendfile() uses mbufs / mbuf clusters.  Try looking into
"vmstat -z" to see if there are enough mbuf clusters in various
zones.  Note well that sendfile() normally blocks on disk, and it
might simply mean that your disk subsystem is (occasionally)
overloaded.  Try gstat to see some details.
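
For example, something along these lines (stock FreeBSD tools; the grep
pattern is only a guess at the relevant zone names) should show whether the
mbuf cluster zones ever fail or sleep on allocation, and how busy the disks
get:

  # per-zone allocator stats; non-zero FAIL/SLEEP columns mean allocations stalled
  $ vmstat -z | egrep -i 'mbuf|cluster'
  # mbuf summary, including "requests for mbufs denied"
  $ netstat -m
  # per-disk I/O load; shows only providers that are at least 0.1% busy
  $ gstat -a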


It does go red periodically, so I should look into that further.

At this point I wonder whether turning off sendfile would help. Will nginx use
aio to serve static files from disk without blocking?

Also, the sendfile syscall has two flags on FreeBSD intended to avoid blocking.
Can these be used to avoid the blocking here?

Either way, blocking for 1.3s in a syscall perfectly explains why
300ms timeouts are sometimes triggered unexpectedly.  If a timer
is set after such a blocking call and just before kevent() (that
is, the connection is highly unlikely to be established before
kevent() returns), on the next event loop iteration nginx will
think that 1.3+ seconds have passed since the timer was set, and
will trigger a timeout (after processing all the events reported by
kevent()).

--
Maxim Dounin
http://nginx.org/
_______________________________________________
nginx mailing list
nginx at nginx.org
http://mailman.nginx.org/mailman/listinfo/nginx


