upstream timeouts I can not explain
peter_booth at me.com
Tue Jan 10 19:13:37 UTC 2017
All hosts have characteristic stalls and blips, but the scale of the issue can vary 100x depending on the host's configuration. You can get some data about these stalls using Solarflare's sysjitter utility or Gil Tene's jHiccup.
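The core idea behind both tools can be sketched in a few lines (a toy illustration of the measurement technique, not the tools' actual code): repeatedly request a short fixed sleep and record how far each wakeup overshoots the request; the excess is scheduling jitter the kernel introduced.

```python
import time

def measure_hiccups(duration_s=1.0, sleep_s=0.001):
    """Sleep in small fixed increments and record how far each wakeup
    overshot the requested sleep -- a crude scheduling-jitter probe."""
    hiccups = []
    end = time.monotonic() + duration_s
    while time.monotonic() < end:
        t0 = time.monotonic()
        time.sleep(sleep_s)
        # Anything beyond sleep_s is delay the scheduler added.
        hiccups.append((time.monotonic() - t0) - sleep_s)
    return hiccups

if __name__ == "__main__":
    h = measure_hiccups()
    print(f"samples={len(h)} max_overshoot_ms={max(h) * 1000:.2f}")
```

On a loaded or badly tuned host, the maximum overshoot can reach hundreds of milliseconds, which is exactly the scale of the connect timeouts discussed below.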
Sent from my iPhone
On Jan 10, 2017, at 12:46 PM, Руслан Закиров <ruz at sports.ru> wrote:
>> > > The "upstream timeout ... while connecting to upstream" suggests
>> > > that nginx wasn't able to see the connect event.
>> > >
>> > > [...]
>> > >
>> > > Some things to consider:
>> > >
>> > > - Make sure you are looking at tcpdump on the nginx host, and
>> > > there are no firewalls on the host to interfere with.
>> > >
>> > These were tcpdumps from the nginx host. I have dumps from the other end
>> > and they are symmetrical. We have proxy_connect_timeout set to 300ms at
>> > the top level of the config. When we first started investigating, we
>> > increased the timeout to 1s for this location. An hour ago I increased it
>> > to 5 seconds and finally couldn't reproduce the problem with a simple
>> > "bomber" script.
>> > From the dumps you can see that the connection was established within
>> > 10ms. What can stop nginx from receiving the event for more than a second?
>> > This happens on all served domains, as the connect timeout is 300ms
>> > pretty much everywhere. If I tail -F the error log and count error
>> > occurrences grouped by second, I see 1-3 second spikes: silence or <5
>> > errors for 20-40 seconds, then ~200 errors within a few seconds. Is there
>> > anything that may block nginx's event processing for quite a while?
>> Typical kern.sched.quantum is about 100ms, so several
>> CPU-intensive tasks can delay processing of the events enough to
>> trigger a timeout if a context switch happens at a bad time.
>> Note well that various blocking operations in nginx itself -
>> either disk or CPU-intensive ones - can also delay processing of
>> various events, and this in turn can trigger unexpected timeouts
>> when using timers comparable to a typical delay introduced on each
>> event loop iteration.
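The mechanism described above can be illustrated with a toy example (my own sketch, not nginx code): a non-blocking connect to a local listener completes almost immediately, but if the event loop is stalled for 500ms before it polls, the process only learns about the event afterwards; with a 300ms connect timeout, that would be reported as an upstream connect timeout even though the TCP handshake finished long before.

```python
import select
import socket
import time

# A local listener stands in for the upstream server.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)

# Non-blocking connect, as nginx does for upstream connections.
c = socket.socket()
c.setblocking(False)
c.connect_ex(srv.getsockname())  # typically returns EINPROGRESS

start = time.monotonic()
time.sleep(0.5)  # simulate a blocking operation stalling the event loop

# The connect completed almost immediately, but we only poll now.
# With a 300ms connect timeout this would look like an upstream
# connect timeout, despite the handshake being long finished.
_, writable, _ = select.select([], [c], [], 0)
elapsed = time.monotonic() - start
print(f"connect event seen after {elapsed * 1000:.0f} ms, writable={bool(writable)}")
c.close()
srv.close()
```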
> We tuned the upstreams' parameters so that the backend servers are not both marked as unavailable during these spikes. This prevents the bogus errors.
> Also, in this particular service I'm experimenting with keepalive connections between nginx and the upstreams.
> The above steps don't address the root cause. Can you suggest further steps to localize the issue? I'm not sure how to tell whether it's a blocking operation in nginx, OS scheduling, or something else.
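For context, the kinds of tunables mentioned in this thread might look roughly like this in nginx configuration (a sketch with made-up addresses and values, not the poster's actual config):

```nginx
upstream backend {
    # Tolerate short error spikes before marking a server unavailable.
    server 10.0.0.1:8080 max_fails=10 fail_timeout=2s;
    server 10.0.0.2:8080 max_fails=10 fail_timeout=2s;
    keepalive 32;   # pool of idle keepalive connections to the upstream
}

server {
    location /api/ {
        proxy_pass http://backend;
        proxy_connect_timeout 1s;   # raised from the global 300ms
        proxy_http_version 1.1;     # required for upstream keepalive
        proxy_set_header Connection "";
    }
}
```

With keepalive connections, most requests skip the TCP handshake entirely, so the connect timeout is only in play when the pool is empty.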
> Руслан Закиров
> Head of Web Services Development
> +7(916) 597-92-69, ruz @
> nginx mailing list
> nginx at nginx.org