upstream timeouts I can not explain

Mon Jan 9 19:59:55 UTC 2017

Hello!

On Mon, Jan 09, 2017 at 09:56:54PM +0300, Руслан Закиров wrote:

> On Mon, Jan 9, 2017 at 6:14 PM, Maxim Dounin <mdounin at mdounin.ru> wrote:
> 
> > Hello!
> >
> > On Sun, Jan 08, 2017 at 05:04:47PM +0300, Руслан Закиров wrote:
> >
> > > Hello,
> > >
> > > nginx 1.10.2 [1] running on FreeBSD 10.3-RELEASE-p7
> > >
> > > We have quite a lot of "upstream timed out" errors. For example:
> > >
> > > 2017/01/08 16:20:54 [error] 82154#0: *494223064 upstream timed out (60:
> > > Operation timed out) while connecting to upstream, client: 192.168.1.43,
> > > server: q0.sports.ru, request: "POST /pub?topic=docs HTTP/1.1",
> > upstream: "
> > > http://192.168.1.206:4151/pub?topic=docs", host: "q0.sports.ru"
> > > Here is tcpdump:
> > >
> > > 16:20:53.218916 IP myself-fe.28957 > q0-2.4151: Flags [S], seq
> > 2479999411,
> > > win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 787079162 ecr 0],
> > > length 0
> > > 16:20:53.219721 IP q0-2.4151 > myself-fe.28957: Flags [S.], seq
> > 605231398,
> > > ack 2479999412, win 28960, options [mss 1460,sackOK,TS val 352522447 ecr
> > > 787079162,nop,wscale 7], length 0
> > > 16:20:53.219742 IP myself-fe.28957 > q0-2.4151: Flags [.], ack 1, win
> > 1040,
> > > options [nop,nop,TS val 787079163 ecr 352522447], length 0
> > > 16:20:54.572712 IP myself-fe.28957 > q0-2.4151: Flags [F.], seq 1, ack 1,
> > > win 1040, options [nop,nop,TS val 787080516 ecr 352522447], length 0
> > > 16:20:54.573173 IP q0-2.4151 > myself-fe.28957: Flags [F.], seq 1, ack 2,
> > > win 227, options [nop,nop,TS val 352522786 ecr 787080516], length 0
> > > 16:20:54.573195 IP myself-fe.28957 > q0-2.4151: Flags [.], ack 2, win
> > 1040,
> > > options [nop,nop,TS val 787080516 ecr 352522786], length 0
> > >
> > > What I don't see here is a packet with HTTP request.
> >
> > The "upstream timeout ... while connecting to upstream" suggests
> > that nginx wasn't able to see the connect event.
> >
> > [...]
> >
> 
> Ok, spent whole day learning and investigating.
> 
> > Any idea where I should look next?
> >
> > Some things to consider:
> >
> > - Make sure you are looking at tcpdump on the nginx host, and
> >   there are no firewalls on the host to interfere with.
> >
> 
> These were tcpdumps from nginx host. I have dump from other end and they
> are symmetrical. We have proxy_connect_timeout at 300ms at the top level of
> the config. When we first started to investigate it we increased timeout to
> 1s for
> this location. An hour ago I increased it to 5 seconds and finally couldn't
> reproduce
> the problem with a simple "bomber" script.
> 
> From dumps you can see that connection was established within 10ms. What
> can stop nginx from receiving the event for more than a second?
> 
> This happens on all served domains as pretty much everywhere connect
> timeout is 300ms. If I tail -F error log, count this error occurrences
> grouped by second then I see 1-3 seconds spikes: silence or <5 errors for
> 20-40 seconds then ~200 errors in a few seconds. Is there anything that may
> block events processing nginx for quite a while?

Typical kern.sched.quantum is about 100ms, so several 
CPU-intensive tasks can delay processing of the events enough to 
trigger a timeout if a context switch happens at a bad time.

Note well that various blocking operations in nginx itself - 
either disk or CPU-intensive ones - can also delay processing of 
various events, and this in turn can trigger unexpected timeouts 
when using timers comparable to a typical delay introduced on each 
event loop iteration.

-- 
Maxim Dounin
http://nginx.org/