Socket connection failures on 1.6.1~precise

Wed Sep 10 03:15:20 UTC 2014

Just closing the loop on this: what appeared to be happening was that, 
on newly created nodes, the nginx master process was not starting up 
with the custom ulimit set in /etc/security/limits.d/.  The workers were 
all fine, since worker_rlimit_nofile was set in nginx.conf, but a 
separate issue was preventing the nginx master process from inheriting 
the custom ulimit setting.
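
For anyone wanting to check the same thing, here is a minimal sketch (not 
what I actually ran) of reading the open-files limit a running process 
really has.  It assumes a Linux /proc filesystem; the function names and 
the proc_root parameter are made up for illustration.

```python
def parse_max_open_files(limits_text):
    """Return (soft, hard) 'Max open files' limits from /proc/<pid>/limits text.

    Each value is an int, or None when the kernel reports 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Columns: name (three words), soft limit, hard limit, units.
            soft, hard = line.split()[3:5]
            return tuple(None if v == "unlimited" else int(v) for v in (soft, hard))
    raise ValueError("no 'Max open files' row found")


def read_max_open_files(pid, proc_root="/proc"):
    """Read the limits of a running process (Linux only)."""
    with open(f"{proc_root}/{pid}/limits") as fh:
        return parse_max_open_files(fh.read())
```

Calling read_max_open_files() with the master's PID and again with a 
worker's PID should show whether the two processes ended up with 
different limits.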

Truth be told, I never quite nailed down an exact root cause; the fix 
was simply ensuring the nginx master process came up with the custom 
ulimit setting.  That would seem to indicate something was causing a 
spike in the number of open files for the master process, but I can 
look into that separately.
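
One possible way to watch for such a spike, sketched under the same 
assumptions (Linux /proc; hypothetical helper names; reading another 
user's process usually needs root), is to count the entries in 
/proc/<pid>/fd and compare that against the soft limit:

```python
import os


def count_open_fds(pid, proc_root="/proc"):
    """Count open file descriptors by listing /proc/<pid>/fd (Linux only)."""
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))


def fd_usage_ratio(open_fds, soft_limit):
    """Fraction of the soft open-files limit currently in use."""
    return open_fds / soft_limit
```

Sampling this periodically for the master PID would show how close it 
gets to the limit before the errors start.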

On 09/02/2014 03:35 PM, Jon Clayton wrote:
> I did see the changelog hadn't noted many changes and running a diff 
> of the versions shows what you mentioned regarding the 400 bad request 
> handling code.  I'm not necessarily stating that nginx is the problem, 
> but it would seem like something had changed enough to cause the 
> backend's backlog to fill more rapidly.
>
> That could be a completely bogus statement as I've been attempting to 
> find a way to track down exactly what backlog is being filled, but my 
> test of downgrading nginx back to 1.6.0 from the nginx ppa seemed to 
> also point at a change in nginx causing the issue since the errors did 
> not persist after downgrading.
>
> It's very possible that I'm barking up the wrong tree, but the fact 
> that only changing nginx versions back down to 1.6.0 from 1.6.1 
> eliminated the errors seems suspicious.  I'll keep digging, but I'm 
> open to any other suggestions.
>
>
> On 09/02/2014 02:14 PM, Maxim Dounin wrote:
>> Hello!
>>
>> On Tue, Sep 02, 2014 at 11:00:10AM -0500, Jon Clayton wrote:
>>
>>> I'm trying to track down an issue that is being presented only when I run
>>> nginx version 1.6.1-1~precise.  My nodes running 1.6.0-1~precise do not
>>> display this issue, but freshly created servers are getting floods of
>>> these socket connection issues a couple times a day.
>>>
>>> /connect() to unix:/tmp/unicorn.sock failed (11: Resource temporarily
>>> unavailable) while connecting to upstream/
>>>
>>> The setup I'm working with is nginx proxying requests to a unicorn socket
>>> powered by a ruby app.  As stated above, the error is NOT present on nodes
>>> running 1.6.0-1~precise, but any newly created node gets the newer
>>> 1.6.1-1~precise package installed and will inevitably have that error.
>>>
>>> All settings from nodes running 1.6.0 appear to be the same as newly
>>> created nodes on 1.6.1 in terms of sysctl settings, nginx settings, and
>>> unicorn settings.  All package versions are the same except for nginx.
>>> When I downgraded one of the newly created nodes to nginx 1.6.0 using the
>>> nginx ppa (ref: https://launchpad.net/~nginx/+archive/ubuntu/stable), the
>>> error was not present.
>>>
>>> Is there any advice, direction, or similar issue experienced that someone
>>> else might be able to help me track this down?
>>
>> Just some information:
>>
>> - In nginx itself, the difference between 1.6.0 and 1.6.1 is fairly
>>    minimal.  The only change affecting http is one code line added
>>    in the 400 Bad Request handling code
>>    (see http://hg.nginx.org/nginx/rev/b8188afb3bbb).
>>
>> - The message suggests that the backend's backlog is full.  This can
>>    easily happen on load spikes and/or if a backend is overloaded,
>>    and is usually unrelated to nginx itself.
>>