Fwd: [ module ] Add http upstream keep alive timeout parameter

Maxim Dounin mdounin at mdounin.ru
Mon Nov 13 19:49:46 UTC 2017


Hello!

On Sun, Nov 12, 2017 at 11:25:20PM +1100, Wei Xu wrote:

> We are running Nginx and upstream on the same machine using docker, so
> there's no firewall.

Note that this isn't usually true.  Docker uses iptables 
implicitly, and unless you have specifically checked your iptables 
configuration, you are most likely using a firewall.

> I did a test locally and captured the network packages.
> 
> For the normal requests, the upstream sends a [FIN, ACK] to nginx after
> the keep-alive timeout (500 ms), nginx sends a [FIN, ACK] back, and then
> the upstream sends an [ACK] to close the connection completely.

[...]

> For a more detailed description of the test process, see my
> post at:
> https://theantway.com/2017/11/analyze-connection-reset-error-in-nginx-upstream-with-keep-alive-enabled/

The test demonstrates that it is indeed possible to trigger the 
problem in question.  Unfortunately, it doesn't prove that what 
you observed in production is the same issue.
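For illustration, the race being tested can be sketched with plain 
sockets (this is a hedged sketch, not the original client script, 
which isn't reproduced here): a toy backend closes keep-alive 
connections after a short idle timeout, and a client standing in 
for nginx reuses the cached connection just after that window and 
sees EOF or a reset.

```python
import socket
import threading
import time

IDLE_TIMEOUT = 0.2  # toy backend keep-alive timeout (the test above uses 500ms)

RESPONSE = (b"HTTP/1.1 200 OK\r\n"
            b"Content-Length: 3\r\n"
            b"Connection: keep-alive\r\n"
            b"\r\n"
            b"ok\n")

def backend(server_sock):
    """Accept one connection and close it after IDLE_TIMEOUT of inactivity."""
    conn, _ = server_sock.accept()
    conn.settimeout(IDLE_TIMEOUT)
    try:
        while True:
            data = conn.recv(4096)      # wait at most IDLE_TIMEOUT for a request
            if not data:
                break
            conn.sendall(RESPONSE)
    except socket.timeout:
        pass                            # idle too long: the backend closes first
    finally:
        conn.close()

server_sock = socket.create_server(("127.0.0.1", 0))
port = server_sock.getsockname()[1]
threading.Thread(target=backend, args=(server_sock,), daemon=True).start()

request = b"GET / HTTP/1.1\r\nHost: x\r\nConnection: keep-alive\r\n\r\n"

client = socket.create_connection(("127.0.0.1", port))
client.sendall(request)
first = client.recv(4096)               # fresh connection: works fine

time.sleep(IDLE_TIMEOUT * 2)            # stay idle past the backend's timeout

reuse_failed = False
try:
    client.sendall(request)             # reuse the cached connection...
    second = client.recv(4096)
    reuse_failed = (second == b"")      # ...and read EOF (backend sent FIN)
except (ConnectionResetError, BrokenPipeError):
    reuse_failed = True                 # ...or get a reset outright

print("first request ok:", first.startswith(b"HTTP/1.1 200"))
print("reused connection failed:", reuse_failed)
```

Real nginx hits the narrower window where the request is already in 
flight when the backend's FIN arrives, but the failure mode on the 
reused connection is the same.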

While it is more or less clear that the race condition in question 
is real, it seems very unlikely with typical workloads.  And even 
when it is triggered, in most cases nginx handles it well enough, 
retrying the request as configured by proxy_next_upstream.
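For reference, this retry behaviour is controlled in the proxied 
location; a sketch with the default value spelled out (the 
upstream name "u" matches the test configuration below, the rest 
is illustrative):

```nginx
    location / {
        proxy_pass http://u;
        proxy_http_version 1.1;
        proxy_set_header Connection "";

        # default: pass the request to the next server on connection
        # errors and timeouts; a reset on an idle cached connection
        # counts as such an error and is retried
        proxy_next_upstream error timeout;
    }
```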

Nevertheless, thank you for the detailed testing.  A simple test 
case that reliably demonstrates the race is appreciated, and I was 
able to reduce it to your client script plus nginx with the 
following trivial configuration:

    upstream u {
        server 127.0.0.1:8082;
        keepalive 10;
    }

    server {
        listen 8080;

        location / {
            proxy_pass http://u;
            proxy_http_version 1.1;
            proxy_set_header Connection "";
        }
    }

    server {
        listen 8082;

        keepalive_timeout 500ms;

        location / {
            return 200 ok\n;
        }
    }

> To fix the issue, I tried adding a timeout for kept-alive upstream
> connections; you can check the patch at:
> https://github.com/weixu365/nginx/blob/docker-1.13.6/docker/stretch/patches/01-http-upstream-keepalive-timeout.patch
> 
> The patch is what I'm currently testing, and I can prepare it in a
> different format if you need.

The patch looks good enough for testing, though there are various 
minor issues - notably, testing the timeout against 
NGX_CONF_UNSET_MSEC at runtime, and using the wrong type for the 
timeout during parsing (time_t instead of ngx_msec_t).

Also, I tend to think that using a separate keepalive_timeout 
directive would be easier, and we probably want to introduce some 
default value for it.
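With a separate directive, usage would look like this (a sketch: 
the 300ms value is illustrative, chosen to stay below the 
backend's 500ms keepalive_timeout from the test configuration 
above):

```nginx
    upstream u {
        server 127.0.0.1:8082;
        keepalive 10;

        # close cached connections well before the backend's own
        # keep-alive timeout (500ms above) can close them first
        keepalive_timeout 300ms;
    }
```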

Please take a look and check whether the following patch works for you:

# HG changeset patch
# User Maxim Dounin <mdounin at mdounin.ru>
# Date 1510601341 -10800
#      Mon Nov 13 22:29:01 2017 +0300
# Node ID 9ba0a577601b7c1b714eb088bc0b0d21c6354699
# Parent  6f592a42570898e1539d2e0b86017f32bbf665c8
Upstream keepalive: keepalive_timeout directive.

The directive configures the maximum time a connection can be kept in
the cache.  By configuring a time which is smaller than the
corresponding timeout on the backend side, one can avoid the race
between the backend closing a connection and nginx trying to use the
same connection to send a request at the same time.

diff --git a/src/http/modules/ngx_http_upstream_keepalive_module.c b/src/http/modules/ngx_http_upstream_keepalive_module.c
--- a/src/http/modules/ngx_http_upstream_keepalive_module.c
+++ b/src/http/modules/ngx_http_upstream_keepalive_module.c
@@ -12,6 +12,7 @@
 
 typedef struct {
     ngx_uint_t                         max_cached;
+    ngx_msec_t                         timeout;
 
     ngx_queue_t                        cache;
     ngx_queue_t                        free;
@@ -84,6 +85,13 @@ static ngx_command_t  ngx_http_upstream_
       0,
       NULL },
 
+    { ngx_string("keepalive_timeout"),
+      NGX_HTTP_UPS_CONF|NGX_CONF_TAKE1,
+      ngx_conf_set_msec_slot,
+      NGX_HTTP_SRV_CONF_OFFSET,
+      offsetof(ngx_http_upstream_keepalive_srv_conf_t, timeout),
+      NULL },
+
       ngx_null_command
 };
 
@@ -141,6 +149,8 @@ ngx_http_upstream_init_keepalive(ngx_con
 
     us->peer.init = ngx_http_upstream_init_keepalive_peer;
 
+    ngx_conf_init_msec_value(kcf->timeout, 60000);
+
     /* allocate cache items and add to free queue */
 
     cached = ngx_pcalloc(cf->pool,
@@ -261,6 +271,10 @@ found:
     c->write->log = pc->log;
     c->pool->log = pc->log;
 
+    if (c->read->timer_set) {
+        ngx_del_timer(c->read);
+    }
+
     pc->connection = c;
     pc->cached = 1;
 
@@ -339,9 +353,8 @@ ngx_http_upstream_free_keepalive_peer(ng
 
     pc->connection = NULL;
 
-    if (c->read->timer_set) {
-        ngx_del_timer(c->read);
-    }
+    ngx_add_timer(c->read, kp->conf->timeout);
+
     if (c->write->timer_set) {
         ngx_del_timer(c->write);
     }
@@ -392,7 +405,7 @@ ngx_http_upstream_keepalive_close_handle
 
     c = ev->data;
 
-    if (c->close) {
+    if (c->close || c->read->timedout) {
         goto close;
     }
 
@@ -485,6 +498,8 @@ ngx_http_upstream_keepalive_create_conf(
      *     conf->max_cached = 0;
      */
 
+    conf->timeout = NGX_CONF_UNSET_MSEC;
+
     return conf;
 }
 

-- 
Maxim Dounin
http://mdounin.ru/


More information about the nginx-devel mailing list