Fwd: [ module ] Add http upstream keep alive timeout parameter

Wei Xu weixu365 at gmail.com
Sun Nov 12 12:25:20 UTC 2017


We are running nginx and the upstream on the same machine using Docker, so
there is no firewall between them.

I ran a test locally and captured the network packets.

For the normal requests, the upstream sends a [FIN, ACK] to nginx after the
keep-alive timeout (500 ms) expires, nginx replies with a [FIN, ACK] of its
own, and the upstream then sends a final [ACK] to close the connection
completely.
No.  Time                        Source      Destination  Protocol  Length  Info
  1  2017-11-12 17:11:04.299146  172.18.0.3  172.18.0.2   TCP       74      48528 → 8000 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=32031305 TSecr=0 WS=128
  2  2017-11-12 17:11:04.299171  172.18.0.2  172.18.0.3   TCP       74      8000 → 48528 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=32031305 TSecr=32031305 WS=128
  3  2017-11-12 17:11:04.299194  172.18.0.3  172.18.0.2   TCP       66      48528 → 8000 [ACK] Seq=1 Ack=1 Win=29312 Len=0 TSval=32031305 TSecr=32031305
  4  2017-11-12 17:11:04.299259  172.18.0.3  172.18.0.2   HTTP      241     GET /_healthcheck HTTP/1.1
  5  2017-11-12 17:11:04.299267  172.18.0.2  172.18.0.3   TCP       66      8000 → 48528 [ACK] Seq=1 Ack=176 Win=30080 Len=0 TSval=32031305 TSecr=32031305
  6  2017-11-12 17:11:04.299809  172.18.0.2  172.18.0.3   HTTP      271     HTTP/1.1 200 OK  (text/html)
  7  2017-11-12 17:11:04.299852  172.18.0.3  172.18.0.2   TCP       66      48528 → 8000 [ACK] Seq=176 Ack=206 Win=30336 Len=0 TSval=32031305 TSecr=32031305
  8  2017-11-12 17:11:04.800805  172.18.0.2  172.18.0.3   TCP       66      8000 → 48528 [FIN, ACK] Seq=206 Ack=176 Win=30080 Len=0 TSval=32031355 TSecr=32031305
  9  2017-11-12 17:11:04.801120  172.18.0.3  172.18.0.2   TCP       66      48528 → 8000 [FIN, ACK] Seq=176 Ack=207 Win=30336 Len=0 TSval=32031355 TSecr=32031355
 10  2017-11-12 17:11:04.801151  172.18.0.2  172.18.0.3   TCP       66      8000 → 48528 [ACK] Seq=207 Ack=177 Win=30080 Len=0 TSval=32031355 TSecr=32031355


For the failed requests, the upstream received a new HTTP request at the
moment it had already decided to close the connection after the keep-alive
timeout (500 ms), but before it had a chance to send the [FIN] packet.
Because the connection was already closed from the upstream's perspective,
it replied to that request with a [RST].

No.   Time                        Source      Destination  Protocol  Length  Info
 433  2017-11-12 17:11:26.548449  172.18.0.3  172.18.0.2   TCP       74      48702 → 8000 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=32033530 TSecr=0 WS=128
 434  2017-11-12 17:11:26.548476  172.18.0.2  172.18.0.3   TCP       74      8000 → 48702 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=32033530 TSecr=32033530 WS=128
 435  2017-11-12 17:11:26.548502  172.18.0.3  172.18.0.2   TCP       66      48702 → 8000 [ACK] Seq=1 Ack=1 Win=29312 Len=0 TSval=32033530 TSecr=32033530
 436  2017-11-12 17:11:26.548609  172.18.0.3  172.18.0.2   HTTP      241     GET /_healthcheck HTTP/1.1
 437  2017-11-12 17:11:26.548618  172.18.0.2  172.18.0.3   TCP       66      8000 → 48702 [ACK] Seq=1 Ack=176 Win=30080 Len=0 TSval=32033530 TSecr=32033530
 438  2017-11-12 17:11:26.549173  172.18.0.2  172.18.0.3   HTTP      271     HTTP/1.1 200 OK  (text/html)
 439  2017-11-12 17:11:26.549230  172.18.0.3  172.18.0.2   TCP       66      48702 → 8000 [ACK] Seq=176 Ack=206 Win=30336 Len=0 TSval=32033530 TSecr=32033530
 440  2017-11-12 17:11:27.049668  172.18.0.3  172.18.0.2   HTTP      241     GET /_healthcheck HTTP/1.1
 441  2017-11-12 17:11:27.050324  172.18.0.2  172.18.0.3   HTTP      271     HTTP/1.1 200 OK  (text/html)
 442  2017-11-12 17:11:27.050378  172.18.0.3  172.18.0.2   TCP       66      48702 → 8000 [ACK] Seq=351 Ack=411 Win=31360 Len=0 TSval=32033580 TSecr=32033580
 443  2017-11-12 17:11:27.551182  172.18.0.3  172.18.0.2   HTTP      241     GET /_healthcheck HTTP/1.1
 444  2017-11-12 17:11:27.551294  172.18.0.2  172.18.0.3   TCP       66      8000 → 48702 [RST, ACK] Seq=411 Ack=526 Win=32256 Len=0 TSval=32033630 TSecr=32033630

When nginx receives the [RST] packet, it logs a 'Connection reset' error.
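In the error log this shows up as the familiar line (surrounding details
such as client and upstream addresses omitted here):

    recv() failed (104: Connection reset by peer) while reading response header from upstream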

I tested this by setting up the following environment:

Upstream (Node.js server):

   - Keep-alive timeout set to 500 ms

Test client:

   - Keeps sending requests with a delay between them
   - The delay starts at 500 ms and decreases by 0.1 ms after each request
     (see the sketch below)
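For illustration, a minimal Node.js sketch of this setup could look like the
code below. This is a reconstruction rather than my exact test code (the real
test sent the requests through nginx's proxy, which is where the RST
surfaces; here the client talks to the upstream directly, but the race is
the same):

// upstream.js -- upstream server with a 500 ms keep-alive timeout (Node.js >= 8)
const http = require('http');

const server = http.createServer((req, res) => {
  res.writeHead(200, { 'Content-Type': 'text/html' });
  res.end('OK');
});
server.keepAliveTimeout = 500;  // close idle kept-alive sockets after 500 ms
server.listen(8000);

// client.js -- reuses one keep-alive connection with a shrinking delay
const http = require('http');

const agent = new http.Agent({ keepAlive: true });
let delay = 500;  // start at the upstream's keep-alive timeout

function sendRequest() {
  http.get({ port: 8000, path: '/_healthcheck', agent }, (res) => {
    res.resume();                    // drain so the socket returns to the pool
    delay -= 0.1;                    // creep towards the timeout race
    setTimeout(sendRequest, delay);  // timers have ~1 ms granularity
  }).on('error', (err) => console.error(err.message));  // ECONNRESET shows up here
}

sendRequest();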

For a more detailed description of the test process, see my post at:
https://theantway.com/2017/11/analyze-connection-reset-error-in-nginx-upstream-with-keep-alive-enabled/

To fix the issue, I added a timeout for kept-alive upstream connections; you
can check the patch at:
https://github.com/weixu365/nginx/blob/docker-1.13.6/docker/stretch/patches/01-http-upstream-keepalive-timeout.patch

The patch is the one I am currently testing with; I can resubmit it in a
different format if you need.
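With such a patch applied, the intended usage would be something along these
lines (the directive name and syntax below are only my illustration of the
idea; see the patch itself for the exact form):

upstream nodejs_backend {
    server 172.18.0.2:8000;
    keepalive 16;

    # Illustrative only: close cached upstream connections after 300 ms
    # of idleness, i.e. before the upstream's own 500 ms timeout fires.
    keepalive_timeout 300ms;
}

The point is to keep nginx's idle timeout strictly below the upstream's, so
nginx never sends a request on a connection the upstream has already started
closing.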

Regards

Wei Xu


On Fri, Nov 10, 2017 at 4:07 AM, Maxim Dounin <mdounin at mdounin.ru> wrote:

> Hello!
>
> On Thu, Nov 02, 2017 at 08:41:16PM +1100, Wei Xu wrote:
>
> > Hi
> > I saw there's an issue talking about "implement keepalive timeout for
> > upstream <https://trac.nginx.org/nginx/ticket/1170>".
> >
> > I have a different scenario for this requirement.
> >
> > I'm using Node.js web server as upstream, and set keep alive time out to
> 60
> > second in nodejs server. The problem is I found more than a hundred
> > "Connection reset by peer" errors everyday.
> >
> > Because there's no any errors on nodejs side, I guess it was because of
> the
> > upstream has disconnected, and at the same time, nginx send a new
> request,
> > then received a TCP RST.
>
> Could you please trace what actually happens on the network level
> to confirm the guess is correct?
>
> Also, please check that there are no stateful firewalls between
> nginx and the backend.  A firewall which drops the state before
> the timeout expires looks like a much more likely cause for such
> errors.
>
> --
> Maxim Dounin
> http://mdounin.ru/