<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"></head><body dir="auto"><div>All hosts have characteristic stalls and blips, but the scale of this issue can vary by 100x depending on its configuration. You can get some data about these stalls using Solarflare's sysjitter utility or Gil Tene's jHiccup.</div><div><br>On Jan 10, 2017, at 12:46 PM, Руслан Закиров &lt;<a href="mailto:ruz@sports.ru">ruz@sports.ru</a>&gt; wrote:<br><br></div><blockquote type="cite"><div><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div class="h5">> > The "upstream timeout ... while connecting to upstream" suggests<br>

> > that nginx wasn't able to see the connect event.<br>

> ><br>

> > [...]<br>

> ><br>> > Some things to consider:<br>

> ><br>

> > - Make sure you are looking at tcpdump on the nginx host, and<br>

> >   there are no firewalls on the host that could interfere.<br>

> ><br>

><br>

> These were tcpdumps from the nginx host. I have a dump from the other end, and they<br>

> are symmetrical. We have proxy_connect_timeout set to 300ms at the top level of<br>

> the config. When we first started to investigate, we increased the timeout to<br>

> 1s for this location. An hour ago I increased it to 5 seconds and finally couldn't<br>

> reproduce the problem with a simple "bomber" script.<br>

><br>

> From the dumps you can see that the connection was established within 10ms. What<br>

> can stop nginx from receiving the event for more than a second?<br>

><br>

> This happens on all served domains, as the connect timeout is 300ms pretty much<br>

> everywhere. If I tail -F the error log and count occurrences of this error<br>

> grouped by second, I see spikes lasting 1-3 seconds: silence or fewer than 5 errors for<br>

> 20-40 seconds, then ~200 errors within a few seconds. Is there anything that may<br>

> block event processing in nginx for quite a while?<br>

<br>
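The per-second grouping described above can be sketched in a few lines of Python.<br>
This is only an illustration. It assumes the default error_log line format, where the<br>
first 19 characters are a "YYYY/MM/DD HH:MM:SS" timestamp. The function names, the<br>
matched substring and the threshold of 50 are made up for the example.<br>

```python
from collections import Counter

# Assumption: every log line starts with "YYYY/MM/DD HH:MM:SS" (19 characters).
TIMESTAMP_LEN = 19
# Assumption: the connect timeouts of interest contain this substring.
NEEDLE = "while connecting to upstream"


def count_per_second(lines, needle=NEEDLE):
    """Map each one-second timestamp to the number of matching log lines."""
    counts = Counter()
    for line in lines:
        if needle in line:
            counts[line[:TIMESTAMP_LEN]] += 1
    return counts


def spikes(counts, threshold=50):
    """Return (timestamp, count) pairs at or above threshold, oldest first.

    Sorting the timestamp strings gives time order because the format is
    zero-padded from year down to second.
    """
    return sorted((ts, n) for ts, n in counts.items() if n >= threshold)
```

For example, print(spikes(count_per_second(open("error.log")))) would list the seconds<br>
with 50 or more connect timeouts; the file name here is a placeholder.<br>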

</div></div>Typical kern.sched.quantum is about 100ms, so several<br>

CPU-intensive tasks can delay processing of the events enough to<br>

trigger a timeout if a context switch happens at a bad time.<br>
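One rough way to see whether the whole host stalls, rather than nginx alone, is to run<br>
a tiny process that sleeps for a short interval and records how late it wakes up. The<br>
sketch below is not sysjitter or jHiccup, only a simplified illustration of the same<br>
idea. The function name, its parameters and the 50ms threshold are assumptions, and the<br>
Python interpreter adds some jitter of its own.<br>

```python
import time


def measure_stalls(duration_s=10.0, interval_s=0.001, threshold_s=0.05,
                   clock=time.monotonic, sleep=time.sleep):
    """Sleep in short intervals and record wake-ups that arrive late.

    Returns a list of (seconds_since_start, overshoot_seconds), one entry
    for every wake-up delayed by at least threshold_s beyond interval_s.
    clock and sleep are injectable so the loop can be tested without
    real waiting.
    """
    stalls = []
    start = clock()
    deadline = start + duration_s
    before = start
    while deadline > before:
        sleep(interval_s)
        after = clock()
        # Time spent asleep beyond what was requested.
        overshoot = (after - before) - interval_s
        if overshoot >= threshold_s:
            stalls.append((before - start, overshoot))
        before = after
    return stalls
```

If the stalls it reports line up with the error spikes in the nginx log, the delay is<br>
more likely to come from the scheduler or the host than from nginx itself.<br>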

<br>

Note well that various blocking operations in nginx itself -<br>

either disk or CPU-intensive ones - can also delay processing of<br>

various events, and this in turn can trigger unexpected timeouts<br>

when using timers comparable to a typical delay introduced on each<br>

event loop iteration.</blockquote><div><br></div><div>We tuned the upstreams' parameters to avoid having both backend servers marked as unavailable during these spikes. This prevents bogus errors. </div></div><div class="gmail_extra"><br></div>Also, in this particular service I'm experimenting with keepalive connections between nginx and upstreams.</div><div class="gmail_extra"><br></div><div class="gmail_extra">The steps above don't address the root cause. Can you suggest further steps to localize the issue? I'm not sure how to detect whether it's a blocking operation in nginx, OS scheduling, or something else.</div><div class="gmail_extra"><br>-- <br><div class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div>Руслан Закиров</div><div>Head of Web Services Development</div><div><span>+7(916) 597-92-69</span>, <span>ruz @ <a href="http://www.sports.ru/" target="_blank"><img src="http://farm7.static.flickr.com/6235/6210250811_19a888dbba_o.jpg" width="43" height="14" style="vertical-align:bottom;margin-right:0px"></a></span></div></div></div>

</div></div>

</div></blockquote><blockquote type="cite"><div><span>_______________________________________________</span><br><span>nginx mailing list</span><br><span><a href="mailto:nginx@nginx.org">nginx@nginx.org</a></span><br><span><a href="http://mailman.nginx.org/mailman/listinfo/nginx">http://mailman.nginx.org/mailman/listinfo/nginx</a></span></div></blockquote></body></html>