[PATCH] SO_REUSEPORT support for listen sockets (round 3)

Sepherosa Ziehau sepherosa at gmail.com
Tue Sep 3 02:31:55 UTC 2013

On Mon, Sep 2, 2013 at 10:49 PM, Maxim Dounin <mdounin at mdounin.ru> wrote:

> Hello!
> (Sorry again for late reply.  See below for comments.)
Thank you for the reply.

> On Fri, Aug 02, 2013 at 01:16:53PM +0800, Sepherosa Ziehau wrote:
> > Here is another round of SO_REUSEPORT support.  The plot is changed a
> > little bit to allow smooth configure reloading and binary upgrading.
> > Here is what happens when so_reuseport is enabled (this does not affect
> > single process model):
> > - Master creates the listen sockets w/ SO_REUSEPORT, but does not
> > configure them
> > - The first worker process will inherit the listen sockets created by
> > master and configure them
> > - After master forked the first worker process all listen sockets are
> > closed
> > - The rest of the workers will create their own listen sockets w/
> > - During binary upgrade, listen sockets are no longer passed through
> > environment variables, since new master will create its own listen
> > sockets.  Well, the old master actually does not have any listen
> > sockets opened :).
> >
> > The idea behind this plot is that at any given time, there is always
> > one listen socket left, which could inherit the syncaches and pending
> > sockets on the to-be-closed listen sockets.  The inheritance itself is
> > handled by the kernel; I implemented this inheritance for DragonFlyBSD
> > recently
> > (http://gitweb.dragonflybsd.org/dragonfly.git/commit/02ad2f0b874fb0a45eb69750219f79f5e8982272).
> > I am not tracking Linux's code, but I think the Linux side will
> > eventually get (or already got) the proper fix.
> >
> > The patch itself:
> > http://leaf.dragonflybsd.org/~sephe/ngx_soreuseport3.diff
> >
> > Configuration reloading and binary upgrading will not be interfered
> > with, as they were w/ the first 2 patches.
> >
> > Binary upgrading reverting method 1 ("Send the HUP signal to the old
> > master process. ...") will not be interfered with, as it was w/ the
> > first 2 patches.  There still could be some glitch (but not as bad as
> > w/ the first 2 patches) if binary upgrading reverting method 2 ("Send
> > the TERM signal to the new master process. ...") is used.  I think we
> > probably just need to mention that in the documentation.
> While this looks better than what we had with the previous patches
> (mostly due to the inheritance being handled by the kernel), it still
> looks very fragile to me.  In particular, I really dislike the trick of
> making the first worker process special.
Well, the idea is to keep at least one listen socket open.  Maybe I could
find another way in the kernel to make it less tricky.  However, that may
add an extra syscall or socket option.

> It probably should either be left in the state "nothing is
> guaranteed" (with some understanding of what will happen in
> various common situations like reconfiguration, upgrade, switching
> so_reuseport on/off) or some way should be found to make things
> less tricky.

To be frank, interfering with reconfiguration, at least, is probably not wanted.
And I don't want "nothing is guaranteed" (which probably is the first 2

> Additional question to consider is what happens with security
> checks?  Linux seems to require the process user id to match on
> SO_REUSEPORT sockets, and I would expect this to fail if there are

BSD's SO_REUSEPORT doesn't check the uid.  However, as far as I understand
the code, when an nginx worker creates its SO_REUSEPORT listen socket, the
uid has not been changed yet.

> sockets opened both in master and in worker processes; and
> privileged port checks might cause problems as well.

See the above comment.

> (We've also discussed this here in the office several times, and it
> seems that the general consensus is that SO_REUSEPORT for TCP balancing
> isn't really a good interface.  It would be much easier for everyone
> if normal workflow with inherited listen socket descriptors just
> worked.  Especially given the fact that in nginx case it's mostly
> about benchmarking, since in real life load distribution between
> worker processes is good enough.)

In DragonFly, SO_REUSEPORT is more than load balancing: it makes the
accepted sockets' network processing completely CPU localized (from user
land to kernel land on both the RX and TX paths).  This level of network
processing CPU localization could not be achieved by the old listen socket
inheritance usage model (even if I could divide the listen socket's
completion queue per CPU based on the RX hash, the level of CPU localization achieved by
SO_REUSEPORT still could not be achieved easily).  In addition to the CPU
localization, it also avoids nginx's accept mutex contention (I have not
measured the contention rate though, but no contention should be better,

Best Regards,