In-flight HTTP requests fail during hot configuration reload (SIGHUP)

Mon Jun 1 15:28:21 UTC 2015

Hello!

On Mon, Jun 01, 2015 at 03:40:12PM +0100, Matthew O'Riordan wrote:

> We have recently migrated across from HAProxy to Nginx because 
> it supports true zero-downtime configuration reloads.   However, 
> we are occasionally getting 502 and 504 errors from our 
> monitoring systems during deployments.  Looking into this, I 
> have been able to consistently replicate the 502 and 504 errors 
> as follows.  I believe this is an error in how Nginx handles 
> in-flight requests, but wanted to ask the community in case I am 
> missing something obvious.
> 
> Note the set up of Nginx is as follows:
> * Ubuntu 14.04
> * Nginx version 1.9.1
> * Configuration for an HTTP listener:
>   map $http_upgrade $connection_upgrade {
>     default upgrade;
>     ''      close;
>   }
>   server {
>     listen 8080;
>     # pass on real client's IP
>     proxy_set_header  X-Real-IP         $remote_addr;
>     proxy_set_header  X-Forwarded-For   $proxy_add_x_forwarded_for;
>     access_log /var/log/nginx/access.ws-8080.log combined;
> 
>     location / {
>       proxy_pass http://server-ws-8080;
>       proxy_http_version 1.1;
>       proxy_set_header Upgrade $http_upgrade;
>       proxy_set_header Connection $connection_upgrade;
>     }
>   }
> 
>   upstream server-ws-8080 {
>     least_conn;
>     server 172.17.0.51:8080 max_fails=0;
>   }
> 
> 1. Telnet to the Nginx server on the HTTP port it is listening on.
> 
> 2. Send an HTTP/1.1 request, which Nginx proxies to the upstream
> server (172.17.0.51):
> GET /health HTTP/1.1
> Host: localhost
> Connection: Keep-Alive
> 
> This request succeeds and the response is valid.
> 
> 3. Start a new HTTP/1.1 request but don’t finish it, i.e. send 
> only the following line using telnet:
> GET /health HTTP/1.1
> 
> 4. Whilst that request is effectively in-flight, because it’s 
> not finished and Nginx is waiting for it to be completed, 
> reconfigure Nginx with a SIGHUP signal.  The only change made 
> to the config before sending the SIGHUP is that the upstream 
> server is different, i.e. we intentionally want all new 
> requests to go to the new upstream server.
> 
> 5. Terminate the old upstream server 172.17.0.51
> 
> 6. Complete the in-flight HTTP/1.1 request started in point 3 
> above with:
> Host: localhost
> Connection: Keep-Alive
> 
> 7. Nginx will consistently respond with a 502 if the old 
> upstream server rejects the request, or a 504 if there is no 
> response on that IP and port.  
> 
> I believe this behaviour is incorrect as Nginx, once it receives 
> the complete request, should direct the request to the currently 
> available upstream server.  However, it seems that Nginx is 
> instead deciding which upstream server to send the request to 
> before the request is completed and as such is directing the 
> request to a server that no longer exists.
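
For anyone who wants to reproduce steps 3 to 6 without typing into 
telnet, here is a minimal sketch.  The helper names, the 30-second 
pause and the address passed in are illustrative assumptions, not 
part of the original report; the SIGHUP and the shutdown of the old 
upstream still have to be done by hand during the pause.

```python
import socket
import time


def split_request(path, host):
    """Split a GET request into the request line and the remaining headers."""
    head = f"GET {path} HTTP/1.1\r\n".encode("ascii")
    rest = f"Host: {host}\r\nConnection: Keep-Alive\r\n\r\n".encode("ascii")
    return head, rest


def send_split_request(addr, path="/health", host="localhost", pause=30.0):
    """Send the request line, wait `pause` seconds, then send the headers.

    `addr` is a (host, port) tuple for the nginx listener (hypothetical,
    e.g. ("127.0.0.1", 8080)).  Returns the status line of the response.
    """
    head, rest = split_request(path, host)
    with socket.create_connection(addr, timeout=pause + 60) as sock:
        sock.sendall(head)
        # Window in which to send SIGHUP and stop the old upstream.
        time.sleep(pause)
        sock.sendall(rest)
        return sock.recv(4096).split(b"\r\n", 1)[0]
```

The pause should stay below nginx's client header timeout, otherwise 
the connection is closed before the request is completed.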

Your problem is in step (5).  While you've started new nginx 
workers to handle new requests in step (4), this doesn't guarantee 
that old upstream servers are no longer needed.  Only new 
connections will be processed by new worker processes with new 
nginx configuration.  Old workers continue to service requests 
started before you've reconfigured nginx, and will only terminate 
once all previously started requests are finished.  This includes 
requests already sent to an upstream server and reading a 
response, and requests not yet fully read from a client.  For these 
requests the previous configuration applies, and you shouldn't stop 
old upstream servers until the old worker processes have shut down.
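
One way to check this from a deployment script is to poll the 
process list until no old workers remain.  The sketch below is only 
an illustration under some assumptions: it relies on old workers 
showing the process title "nginx: worker process is shutting down" 
in `ps -eo pid,args` output (as they do on typical Linux builds), 
and the function names, poll interval and timeout are made up for 
the example.

```python
import subprocess
import time
from typing import List


def shutting_down_workers(ps_output: str) -> List[int]:
    """Return PIDs of nginx workers that are still draining old requests.

    Expects lines of the form "<pid> <command line>", as printed by
    `ps -eo pid,args`.
    """
    pids = []
    for line in ps_output.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # header or malformed line
        pid, args = parts
        # Assumed process title of an old worker after a reload.
        if args.startswith("nginx: worker process is shutting down"):
            pids.append(int(pid))
    return pids


def wait_for_old_workers(poll_seconds: float = 1.0, timeout: float = 300.0) -> bool:
    """Poll ps until no old worker remains; return False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        out = subprocess.run(
            ["ps", "-eo", "pid,args"],
            capture_output=True, text=True, check=True,
        ).stdout
        if not shutting_down_workers(out):
            return True
        time.sleep(poll_seconds)
    return False
```

Call `wait_for_old_workers()` a moment after sending SIGHUP, and 
stop the old upstream only once it returns True.  If it is called 
before the old workers have been told to shut down, it may return 
True too early.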

Some details about the reconfiguration process can be found here:

http://nginx.org/en/docs/control.html#reconfiguration

-- 
Maxim Dounin
http://nginx.org/