upstream timeouts I can not explain
ruz at sports.ru
Tue Jan 10 17:46:39 UTC 2017
> > > The "upstream timeout ... while connecting to upstream" suggests
> > > that nginx wasn't able to see the connect event.
> > >
> > > [...]
> > >
> > > Some things to consider:
> > >
> > > - Make sure you are looking at tcpdump on the nginx host, and
> > > there are no firewalls on the host to interfere with.
> > >
> > These were tcpdumps from the nginx host. I have a dump from the other end,
> > and they are symmetrical. We have proxy_connect_timeout set to 300ms at the
> > top level of the config. When we first started to investigate, we increased
> > the timeout to 1s for this location. An hour ago I increased it to 5 seconds
> > and finally reproduced the problem with a simple "bomber" script.
> > From the dumps you can see that the connection was established within 10ms.
> > What can stop nginx from receiving the event for more than a second?
> > This happens on all served domains, as the connect timeout is 300ms pretty
> > much everywhere. If I tail -F the error log and count occurrences of this
> > error grouped by second, I see spikes lasting 1-3 seconds: silence or <5
> > errors for 20-40 seconds, then ~200 errors within a few seconds. Is there
> > anything that can block event processing in nginx for that long?
> Typical kern.sched.quantum is about 100ms, so several
> CPU-intensive tasks can delay processing of the events enough to
> trigger a timeout if a context switch happens at a bad time.
> Note well that various blocking operations in nginx itself -
> either disk or CPU-intensive ones - can also delay processing of
> various events, and this in turn can trigger unexpected timeouts
> when using timers comparable to a typical delay introduced on each
> event loop iteration.
We tuned the upstreams' parameters to keep both backend servers from being
marked as unavailable during these spikes. This prevents bogus errors.
Also, in this particular service I'm experimenting with keepalive
connections between nginx and the upstreams.
These steps don't address the root cause. Can you suggest further steps to
localize the issue? I'm not sure how to tell whether it's a blocking
operation in nginx, OS scheduling, or something else.
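
For reference, the per-second counting mentioned in the quoted text can be
sketched roughly as below. This is only an illustration, not the exact
script we run: it assumes the default error_log line prefix
("YYYY/MM/DD HH:MM:SS") and matches on the "upstream timed out" message.
The function names and the threshold of 50 are made up for the example.

```python
import re
from collections import Counter

# Assumption: each error_log line starts with "YYYY/MM/DD HH:MM:SS ".
TS_RE = re.compile(r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) ")
NEEDLE = "upstream timed out"


def errors_per_second(lines, needle=NEEDLE):
    """Count log lines containing `needle`, grouped by timestamp (1s buckets)."""
    counts = Counter()
    for line in lines:
        if needle not in line:
            continue
        match = TS_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def spikes(counts, threshold=50):
    """Return (timestamp, count) pairs at or above `threshold`, in time order."""
    return sorted((ts, n) for ts, n in counts.items() if n >= threshold)
```

Calling spikes(errors_per_second(open(path))) on an error log (path being
wherever the log lives) lists the seconds in which the errors cluster, which
can then be lined up against scheduler or disk activity on the host.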
Head of Web Services Development
+7(916) 597-92-69, ruz @ <http://www.sports.ru/>