upstream timeouts I can not explain

Mon Jan 9 18:56:54 UTC 2017

On Mon, Jan 9, 2017 at 6:14 PM, Maxim Dounin <mdounin at mdounin.ru> wrote:

> Hello!
>
> On Sun, Jan 08, 2017 at 05:04:47PM +0300, Руслан Закиров wrote:
>
> > Hello,
> >
> > nginx 1.10.2 [1] running on FreeBSD 10.3-RELEASE-p7
> >
> > We have quite a lot of "upstream timed out" errors. For example:
> >
> > 2017/01/08 16:20:54 [error] 82154#0: *494223064 upstream timed out (60:
> > Operation timed out) while connecting to upstream, client: 192.168.1.43,
> > server: q0.sports.ru, request: "POST /pub?topic=docs HTTP/1.1",
> upstream: "
> > http://192.168.1.206:4151/pub?topic=docs", host: "q0.sports.ru"
> > Here is tcpdump:
> >
> > 16:20:53.218916 IP myself-fe.28957 > q0-2.4151: Flags [S], seq
> 2479999411,
> > win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 787079162 ecr 0],
> > length 0
> > 16:20:53.219721 IP q0-2.4151 > myself-fe.28957: Flags [S.], seq
> 605231398,
> > ack 2479999412, win 28960, options [mss 1460,sackOK,TS val 352522447 ecr
> > 787079162,nop,wscale 7], length 0
> > 16:20:53.219742 IP myself-fe.28957 > q0-2.4151: Flags [.], ack 1, win
> 1040,
> > options [nop,nop,TS val 787079163 ecr 352522447], length 0
> > 16:20:54.572712 IP myself-fe.28957 > q0-2.4151: Flags [F.], seq 1, ack 1,
> > win 1040, options [nop,nop,TS val 787080516 ecr 352522447], length 0
> > 16:20:54.573173 IP q0-2.4151 > myself-fe.28957: Flags [F.], seq 1, ack 2,
> > win 227, options [nop,nop,TS val 352522786 ecr 787080516], length 0
> > 16:20:54.573195 IP myself-fe.28957 > q0-2.4151: Flags [.], ack 2, win
> 1040,
> > options [nop,nop,TS val 787080516 ecr 352522786], length 0
> >
> > What I don't see here is a packet with HTTP request.
>
> The "upstream timeout ... while connecting to upstream" suggests
> that nginx wasn't able to see the connect event.
>
> [...]
>

Ok, spent whole day learning and investigating.

> Any idea where I should look next?
>
> Some things to consider:
>
> - Make sure you are looking at tcpdump on the nginx host, and
>   there are no firewalls on the host to interfere with.
>

These were tcpdumps from nginx host. I have dump from other end and they
are symmetrical. We have proxy_connect_timeout at 300ms at the top level of
the config. When we first started to investigate it we increased timeout to
1s for
this location. An hour ago I increased it to 5 seconds and finally couldn't
reproduce
the problem with a simple "bomber" script.

>From dumps you can see that connection was established within 10ms. What
can stop nginx from receiving the event for more than a second?

This happens on all served domains as pretty much everywhere connect
timeout is 300ms. If I tail -F error log, count this error occurrences
grouped by second then I see 1-3 seconds spikes: silence or <5 errors for
20-40 seconds then ~200 errors in a few seconds. Is there anything that may
block events processing nginx for quite a while?

- Try collecting nginx debug logs, they should have enough
>   information to see if the event was reported by the kernel or
>   not.
>

As you can see from above it is reported by the kernel, but with quite big
delay. Not sure if delay is in kernel or nginx, no delay on the network.

> Is it some known issue?
>
> No.

>
> --
> Maxim Dounin
> http://nginx.org/
> _______________________________________________
> nginx mailing list
> nginx at nginx.org
> http://mailman.nginx.org/mailman/listinfo/nginx

-- 
Руслан Закиров
Руководитель отдела разработки веб-сервисов
+7(916) 597-92-69, ruz @  <http://www.sports.ru/>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mailman.nginx.org/pipermail/nginx/attachments/20170109/55a59d48/attachment.html>