cache manager process exited with fatal code 2 and cannot be respawned

Peer Heinlein p.heinlein at heinlein-support.de
Fri Nov 9 19:36:28 UTC 2012


On 09.11.2012 19:33, Isaac Hailperin wrote:



I did several hours of testing today with Isaac and there are two problems.

PROBLEM/BUG ONE:

First of all: the customer has 1,000 SSL hosts on the nginx server, so
he needs 1,000 listeners on separate TCP ports. But the cache_manager
isn't able to open that many listeners; it crashes after 512 open
listeners. It looks very much like the cache_manager doesn't read the
worker_connections setting from nginx.conf.

We configured:

	worker_connections 10000;

there, but the cache_manager crashes with

2012/11/09 17:53:11 [alert] 9345#0: 512 worker_connections are not enough
2012/11/09 17:53:12 [alert] 9330#0: cache manager process 9344 exited
with fatal code 2 and cannot be respawned
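
For reference, this is roughly the kind of configuration involved (a
minimal sketch from my side; the cache zone name and paths are made
up, but the cache_manager only exists at all because a cache path such
as proxy_cache_path is configured):

	worker_processes  16;

	events {
	    worker_connections  10000;
	}

	http {
	    # the cache manager/loader processes are started because of this
	    proxy_cache_path  /var/cache/nginx  levels=1:2  keys_zone=cache1:64m;

	    # ... plus one server{} block per SSL host, see the sketch below
	}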


I did some testing: with 505 SSL hosts on the server (= 505 listener
sockets) everything works fine, but 515 listener sockets aren't
possible.

It's easy to reproduce: just define 515 SSL domains, each with its own
TCP port. :-)
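
A sketch of how such a repro config could look (hostnames, ports and
certificate paths are made up; repeat the block, counting the port up,
until you have 515 of them):

	server {
	    listen               10001 ssl;
	    server_name          host0001.example.com;
	    ssl_certificate      /etc/nginx/ssl/host0001.crt;
	    ssl_certificate_key  /etc/nginx/ssl/host0001.key;
	}

	server {
	    listen               10002 ssl;
	    server_name          host0002.example.com;
	    ssl_certificate      /etc/nginx/ssl/host0002.crt;
	    ssl_certificate_key  /etc/nginx/ssl/host0002.key;
	}

	# ... and so on, up to 515 server blocks / ports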

Looks like nobody ever had the idea that "somebody" (TM) could run more
than two /24 networks' worth of IP addresses on one single host. In
fact, this does not happen in normal life...

But for historical reasons (TM) our customer uses ONE IP address and
many TCP ports instead, so he has no problem running that many
different SSL hosts on one system -- and this is the special situation
in which we can see the bug (?): the cache_manager seems to ignore the
worker_connections setting (?) when it tries to open all the listeners
and the related cache files/sockets.

So: Looks like a bug? Who can help? We need help...


PROBLEM/BUG TWO:

With 16 workers for 1,000 SSL domains with 1,000 listeners, we see
16 * 1000 open TCP listeners on that system, because every worker
opens its own listeners (?). When we reach the magical barrier of
16,386 open listeners (lsof -i | grep -c nginx), nginx runs into some
kind of file limitation:

2012/11/09 20:32:05 [alert] 9933#0: socketpair() failed while spawning
"worker process" (24: Too many open files)
2012/11/09 20:32:05 [alert] 9933#0: socketpair() failed while spawning
"cache manager process" (24: Too many open files)
2012/11/09 20:32:05 [alert] 9933#0: socketpair() failed while spawning
"cache loader process" (24: Too many open files)

It's very easy to see that the limit kicks in at 16,386 open files and
sockets from nginx.

But I can't find where this limitation comes from. "ulimit -n" is set
to 100,000, everything looks fine and should allow many more open
files than just 16K.
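
The only nginx-side knob I know of in this area is
worker_rlimit_nofile, which raises RLIMIT_NOFILE for the worker
processes independently of the shell's "ulimit -n". Whether it also
covers the cache manager and whether it helps against this 16K
barrier, I can't say -- just a sketch:

	# in the main context of nginx.conf
	worker_rlimit_nofile  100000;

	events {
	    worker_connections  10000;
	}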

Could it be that "nobody" (TM) expected "somebody" (TM) to run more
than 1,000 SSL hosts with different TCP ports on 16 worker instances,
and that there's some kind of small-int problem in the nginx code?
Could it be that this isn't a limitation of the Linux system, but of
some too-small address space for this in nginx?

So: Looks like a bug? Who can help? We need help...


Peer


-- 
Heinlein Support GmbH
Schwedter Str. 8/9b, 10119 Berlin

http://www.heinlein-support.de

Tel: 030 / 405051-42
Fax: 030 / 405051-19

Mandatory disclosures per §35a GmbHG:
HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
Managing Director: Peer Heinlein -- Registered office: Berlin


