"writev() failed (134: Transport endpoint is not connected)" when upstream down

Fri Apr 5 12:34:04 UTC 2013

Also, here is the log. nginx was compiled with --with-debug, and I set:
error_log <file> debug;

The log contains rate-limiting warnings, but the issue persists when I
disable rate limiting; the only difference in the debug log is that the
rate-limiting warnings go away.

Thanks,
Branden

On Thu, Apr 4, 2013 at 6:24 PM, Branden Visser <mrvisser at gmail.com> wrote:
> Hello, I've found that when some servers in my upstream group are
> unavailable, applying even a little load on the server (i.e., just me
> browsing around quickly, 2-3 req/s max) results in the following error,
> even for upstream servers that are up and healthy:
>
> 2013/04/04 22:02:21 [error] 4211#0: *2898 writev() failed (134:
> Transport endpoint is not connected) while sending request to
> upstream, client: 184.94.54.70, server: , request: "GET /api/ui/skin
> HTTP/1.1", upstream: "http://10.112.5.119:2001/api/ui/skin", host:
> "mysite.org", referrer: "http://mysite.org/search"
>
> In this particular example, I have 4 upstream servers, and 3 of them
> are shut down (all except 10.112.5.119). If I comment out the other 3
> upstream servers, I cannot reproduce this error.
>
> Running SmartOS (Joyent cloud)
>
> $ nginx -v
> nginx version: nginx/1.3.14
>
> These are things I tried, to no avail:
>
> * I used to have keepalive 64 on the upstream; I removed it
> * nginx used to run as a non-privileged user; I switched it to root
> (prctl reports that privileged users should be allowed 65,000
> nofiles)
> * I used to have worker_processes set to 5; I increased it to 16
> * The upstream server configuration originally had neither max_fails
> *nor* fail_timeout; I added them to try to limit how often nginx
> tried to contact the downed upstream servers
> * proxy_connect_timeout used to be unspecified, so it should have
> defaulted to 60s; I tried setting it to 1s
> * I tried commenting out all the rate-limiting directives
>
> The URLs I'm hitting in my tests all belong to the "tenantworkers" upstream.
>
> Any ideas? My first guess would be a resource limit, or a problem with
> the back-end server, but that doesn't explain why everything is fine
> once I comment out the downed upstream servers. My concern is that the
> system will crumble under real load as soon as even one upstream server
> becomes unavailable.
>
> Thanks,
> Branden
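For anyone trying to reproduce this, a hypothetical configuration along the lines described above might look like the following. It is a sketch, not the actual config from the thread: only the "tenantworkers" name, 10.112.5.119:2001, keepalive, max_fails, worker_processes and proxy_connect_timeout come from the messages. fail_timeout is nginx's companion server parameter to max_fails, and the other server addresses, the location path and all parameter values are placeholders.

```nginx
# Hypothetical sketch of the setup described in the thread.
# Addresses other than 10.112.5.119, the location path and all
# max_fails/fail_timeout values are placeholders.
upstream tenantworkers {
    # The only server that was running during the tests.
    server 10.112.5.119:2001 max_fails=3 fail_timeout=30s;

    # The three servers that were shut down (placeholder addresses).
    server 10.0.0.2:2001 max_fails=3 fail_timeout=30s;
    server 10.0.0.3:2001 max_fails=3 fail_timeout=30s;
    server 10.0.0.4:2001 max_fails=3 fail_timeout=30s;

    # keepalive 64;   # previously enabled, removed while debugging
}

server {
    listen 80;   # placeholder; the log shows an empty server name

    location /api/ {   # placeholder path covering /api/ui/skin
        # No URI on proxy_pass, so /api/ui/skin is passed through as-is,
        # matching the upstream URL shown in the error log.
        proxy_pass http://tenantworkers;

        # Lowered from the 60s default while debugging.
        proxy_connect_timeout 1s;
    }
}
```

Commenting out the three downed server lines in a block like this is the change that, per the report, makes the writev() errors disappear.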
-------------- next part --------------
A non-text attachment was scrubbed...
Name: nginx-error.log
Type: application/octet-stream
Size: 2326261 bytes
Desc: not available
URL: <http://mailman.nginx.org/pipermail/nginx/attachments/20130405/07d7e270/attachment-0001.obj>