Issue with AWS NLB and nginx

Mon Nov 20 13:20:08 UTC 2017

Hello!

On Mon, Nov 20, 2017 at 12:31:59PM +0100, DreamWerx wrote:

> I was hoping someone might have an idea here..  I have a number of nginx
> instances doing load balancing behind AWS's network load balancers (TCP),
> which seem to only support TCP checks.
> 
> Recently a few have stopped working / frozen - they still seem to accept a
> tcp connection from the NLB - which leads the health check not to fail.
> But they cannot internally process the request and you cannot even ssh into
> the machine.  A reboot is required and that takes longer than normal.
> 
> I think the failure is related to a disk issue since the only errors in the
> entire logs were regarding the disk. (error logs below)
> 
> Ideally if nginx or the O/S fails it would be better if the port just
> closed.  I've considered writing a small daemon that monitors via http
> locally and keeps a port open if everything is ok.
> 
> These machines have been running for months now without any issues until
> now.
> 
> Anyone have an idea?

Once nginx is blocked on disk, it likely won't be able to do 
anything else, including closing ports or accepting connections.  
Native TCP checks will still see it as alive for some time though, 
as all they really check is that the port is still open: the kernel 
keeps completing handshakes on a listening socket even if the 
process never accepts them.  Such a check will probably recognize 
that the service is down only once the listen queue overflows.
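
To illustrate why a plain TCP check keeps passing, here is a minimal 
Python sketch.  The names (tcp_check, stuck) are made up for this 
example, and the listener that never calls accept() is only a 
stand-in for a blocked process, not a reproduction of the actual 
failure.  It assumes a typical Linux TCP stack.

```python
import socket

def tcp_check(host, port, timeout=2.0):
    """Return True if a TCP connection can be established.

    This is roughly all a plain TCP health check verifies.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# A listening socket whose owner never calls accept(), standing in
# for a process that is stuck (hypothetical stand-in, see above).
stuck = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
stuck.bind(("127.0.0.1", 0))      # port 0: let the OS pick a free port
stuck.listen(8)                   # small listen queue
stuck_port = stuck.getsockname()[1]

# The kernel completes the handshake on its own, so the check passes
# even though nothing will ever read from the connection.
tcp_alive = tcp_check("127.0.0.1", stuck_port)
print("TCP check passes:", tcp_alive)
```

The check would start failing only after enough unaccepted 
connections pile up to fill the listen queue.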

Given the above, it is generally a good idea to monitor not just 
ports, but some meaningful answers from a service.  You should be 
able to configure such checks in AWS.
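
As a rough sketch of what such a check could look like (this is not 
an AWS API, just an illustration in Python), the function below 
treats the service as healthy only if it returns HTTP 200 within a 
timeout.  The function name and the URL in the usage comment are 
assumptions made for the example.

```python
import http.client
import urllib.error
import urllib.request

# Bypass any proxy from the environment so the check talks to the
# service directly.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

def http_check(url, timeout=2.0):
    """Return True only if the service answers with HTTP 200 in time."""
    try:
        with _opener.open(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout while waiting for the response,
        # non-2xx status, or a malformed reply: all count as "down".
        return False

# Hypothetical usage; the URL is a placeholder:
#   healthy = http_check("http://127.0.0.1/healthz", timeout=2.0)
```

Unlike a TCP connect, this fails when the port is open but the 
process behind it never produces a response, because the read times 
out.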

-- 
Maxim Dounin
http://mdounin.ru/