In-flight HTTP requests fail during hot configuration reload (SIGHUP)

Matthew O'Riordan matthew.oriordan at gmail.com
Mon Jun 1 20:21:12 UTC 2015


Hi Maxim,

Thanks for the reply.

A few comments below, with context for others reading this thread:

>> 1. Telnet to the Nginx server on the HTTP port it is listening on.
>> 
>> 2. Send an HTTP/1.1 request to the upstream server (172.17.0.51):
>> GET /health HTTP/1.1
>> Host: localhost
>> Connection: Keep-Alive
>> 
>> This request succeeds and the response is valid
>> 
>> 3. Start a new HTTP/1.1 request but don't finish it, 
>> i.e. send the following line using telnet:
>> GET /health HTTP/1.1
>> 
>> 4. Whilst that request is now effectively in-flight because it's 
>> not finished and Nginx is waiting for the request to be 
>> completed, reconfigure Nginx with a SIGHUP signal.  The only 
>> difference in the config preceding the SIGHUP signal is that the 
>> upstream server has changed i.e. we intentionally want all new 
>> requests to go to the new upstream server.
>> 
>> 5. Terminate the old upstream server 172.17.0.51
>> 
>> 6. Complete the in-flight HTTP/1.1 request started in point 3 
>> above with:
>> Host: localhost
>> Connection: Keep-Alive
>> 
>> 7. Nginx will consistently respond with a 502 if the old 
>> upstream server rejects the request, or a 504 if there is no 
>> response on that IP and port.  
> 
> Your problem is in step (5).  While you've started new nginx 
> workers to handle new requests in step (4), this doesn't guarantee 
> that old upstream servers are no longer needed.

I realise that is the problem, but I am not quite sure what the best strategy to correct it is.  We are hitting this in production because Nginx sits behind an Amazon ELB.  By default ELB maintains persistent connections both to the client (a browser, for example) and to the backend server (Nginx in this case).  What we seem to be experiencing is that because ELB opened its connection to Nginx earlier, Nginx has already tied that socket to a then-healthy upstream server.  So even after a SIGHUP, ELB's next request on that connection is always processed by the upstream server that was current when the connection to Nginx was opened.

As a result, to do rolling deployments we have to keep the old upstream server running for up to, say, 2 minutes to ensure requests on existing connections complete.  We have designed our upstream server to finish existing in-flight requests, but it treats a request as in-flight only once it is being responded to, not when a connection has merely been opened and the client has not yet sent any data on the socket.
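
For anyone following along, here is a rough Python sketch of the reproduction steps quoted above, so you don't have to hand-type into telnet.  The listener address is an assumption (adjust to your setup), and the SIGHUP plus the old upstream shutdown happen out-of-band while the script waits:

    import socket

    HOST, PORT = "127.0.0.1", 8080  # hypothetical nginx listener; adjust as needed

    s = socket.create_connection((HOST, PORT))

    # Step 2: a complete keep-alive request; this one succeeds.
    s.sendall(b"GET /health HTTP/1.1\r\n"
              b"Host: localhost\r\n"
              b"Connection: Keep-Alive\r\n\r\n")
    print(s.recv(4096).decode(errors="replace"))

    # Step 3: start a second request but leave it unfinished.
    s.sendall(b"GET /health HTTP/1.1\r\n")

    # Steps 4-5 happen elsewhere: SIGHUP the nginx master, then stop the
    # old upstream (172.17.0.51) while we sit here.
    input("Reload nginx and stop the old upstream, then press Enter... ")

    # Step 6: complete the in-flight request.
    s.sendall(b"Host: localhost\r\nConnection: Keep-Alive\r\n\r\n")

    # Step 7: nginx consistently answers 502 (request rejected) or 504
    # (no response on that IP/port).
    print(s.recv(4096).decode(errors="replace"))
    s.close()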

> Only new connections will be processed by new worker processes with new 
> nginx configuration.  Old workers continue to service requests 
> started before you've reconfigured nginx, and will only terminate 
> once all previously started requests are finished.  This includes 
> requests already sent to an upstream server that are awaiting a 
> response, and requests not yet read from a client.  For these 
> requests the previous configuration applies, and you shouldn't stop 
> old upstream servers until the old worker processes have shut down.

OK, but we do need a sensible timeout to ensure we actually shut down our old upstream servers too.  This is the problem I am finding with the strategy we currently have.
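
For what it's worth, the workaround we're trialling is to wait for the old workers to drain before terminating the old upstream, with a hard cap.  A minimal sketch, assuming nginx's default process titles (old workers show up in ps as "worker process is shutting down" after a reload):

    import subprocess
    import time

    DRAIN_TIMEOUT = 120  # seconds; the "up to say 2 minutes" upper bound

    def draining_workers():
        # List nginx workers that are still finishing old requests.
        ps = subprocess.run(["ps", "-eo", "args"],
                            capture_output=True, text=True)
        return [line for line in ps.stdout.splitlines()
                if "worker process is shutting down" in line]

    deadline = time.monotonic() + DRAIN_TIMEOUT
    while draining_workers() and time.monotonic() < deadline:
        time.sleep(1)

    # Only now terminate the old upstream; anything still draining past
    # the deadline will see its requests fail, which is the trade-off.
    print("workers still draining:", len(draining_workers()))
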
ELB, for example, pipelines requests over a single TCP connection in accordance with the HTTP/1.1 spec.  When a SIGHUP is sent to Nginx, how does it then deal with pipelined requests?  Will it process all the requests it has already received and then send a "Connection: close" header, or will it process only the current request and then close the connection?  If the former, it's quite possible that while those in-flight requests are being answered, another X requests will have arrived on the pipeline as well.
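
Failing a documented answer, a crude way to observe the behaviour might be something like this (same hypothetical listener as above; the timing of the reload relative to the pipeline is racy, so treat it as a probe rather than a test):

    import socket

    HOST, PORT, N = "127.0.0.1", 8080, 5  # hypothetical listener, 5 pipelined requests

    s = socket.create_connection((HOST, PORT))
    # Fire N requests back-to-back on one connection.
    s.sendall(b"GET /health HTTP/1.1\r\nHost: localhost\r\n\r\n" * N)

    input("SIGHUP nginx while the pipeline is in flight, then press Enter... ")

    s.settimeout(5)  # don't block forever if the connection stays open
    buf = b""
    try:
        while True:
            chunk = s.recv(4096)
            if not chunk:
                print("server closed the connection")
                break
            buf += chunk
    except socket.timeout:
        print("connection still open after 5s")

    # How many responses came back, and did any of them ask to close?
    print(buf.count(b"HTTP/1.1"), "status lines;",
          "close seen:", b"connection: close" in buf.lower())
    s.close()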

> Some details about reconfiguration process can be found here:
> http://nginx.org/en/docs/control.html#reconfiguration

I have read that page previously, but unfortunately it does not reveal much about how nginx handles keep-alive and pipelining across a reload.

Thanks again,
Matt
