Lots of CLOSE_WAIT sockets, nginx+php (WordPress site)

Sun Feb 21 13:19:48 MSK 2010

Hi

I have a WordPress-mu site (a couple of personal and friends' blogs, very light traffic) which I migrated some months ago from lighttpd+php-fcgi to nginx+php-fcgi. Ever since the migration the site sometimes goes down. I never had the time to look into it, so I just wrote a script that monitors the site and restarts everything when it goes down.

We're going to start using WP-mu at work, so I've been looking into it lately, and the problem seems to be browser-server connections stuck in the CLOSE_WAIT state. With netstat -nap I get loads of these:

$ netstat -nap | grep CLOSE_WAIT
tcp        1      0 10.10.10.10:80        1.2.3.4:52132     CLOSE_WAIT  27672/nginx: worker
tcp        1      0 10.10.10.10:80        1.2.3.4:52133     CLOSE_WAIT  27672/nginx: worker
tcp        1      0 10.10.10.10:80        1.2.3.4:50857     CLOSE_WAIT  27672/nginx: worker
tcp        1      0 10.10.10.10:80        1.2.3.4:51348     CLOSE_WAIT  27673/nginx: worker
tcp        1      0 10.10.10.10:80        1.2.3.4:50846     CLOSE_WAIT  27672/nginx: worker
tcp        1      0 10.10.10.10:80        1.2.3.4:52126     CLOSE_WAIT  27672/nginx: worker
tcp        1      0 10.10.10.10:80        1.2.3.4:52354     CLOSE_WAIT  27672/nginx: worker
[...]

Here 10.10.10.10 is the web server and 1.2.3.4 is the browser. Right now I have 67 of these after restarting nginx and doing some admin stuff on wp for a couple of minutes (CPU-intensive stuff: uploading, scaling and watermarking images with the NextGEN Gallery plugin).
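
To see how the stuck sockets are spread across the nginx workers, the netstat output can be tallied per owning process. This is only a rough sketch in Python: the function name count_close_wait and the SAMPLE text are made up for illustration, and it assumes the column layout shown above (state in the sixth column, PID/program name in the seventh).

```python
from collections import Counter

# Sample text in the same layout as the `netstat -nap` output above.
SAMPLE = """\
tcp        1      0 10.10.10.10:80        1.2.3.4:52132     CLOSE_WAIT  27672/nginx: worker
tcp        1      0 10.10.10.10:80        1.2.3.4:52133     CLOSE_WAIT  27672/nginx: worker
tcp        1      0 10.10.10.10:80        1.2.3.4:51348     CLOSE_WAIT  27673/nginx: worker
tcp        0      0 127.0.0.1:9000          127.0.0.1:52917         TIME_WAIT   -
[...]
"""


def count_close_wait(netstat_output):
    """Count CLOSE_WAIT sockets per 'PID/program' column (hypothetical helper)."""
    counts = Counter()
    for line in netstat_output.splitlines():
        # Split into at most 7 fields so "27672/nginx: worker" stays in one piece.
        fields = line.split(None, 6)
        if len(fields) < 7 or not fields[0].startswith("tcp"):
            continue  # skip headers, "[...]" and anything that is not a TCP row
        state, owner = fields[5], fields[6].strip()
        if state == "CLOSE_WAIT":
            counts[owner] += 1
    return counts


if __name__ == "__main__":
    for owner, n in count_close_wait(SAMPLE).most_common():
        print(n, owner)
```

In practice the text would come from the real netstat output instead of SAMPLE, for example piped in and read from sys.stdin.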

The connections between nginx and php don't seem to get stuck: they go from active to TIME_WAIT and disappear from netstat normally, without lingering in the CLOSE_WAIT state:

$ netstat -nap | grep :9000
tcp        0      0 127.0.0.1:9000          0.0.0.0:*               LISTEN      27662/php5-fpm  
tcp        0      0 127.0.0.1:9000          127.0.0.1:52917         TIME_WAIT   -    
[...]

On Friday I moved from spawn-fcgi+php-cgi to php-fpm, to no avail. I've noticed some entries like these in php5-fpm.log at the moments when I'm working with wp and CLOSE_WAIT connections start to pile up:

Feb 21 10:48:45.080836 [NOTICE] fpm_got_signal(), line 48: received SIGCHLD
Feb 21 10:48:45.080918 [NOTICE] fpm_children_bury(), line 217: child 27665 (pool default) exited with code 0 after 35512.611171 seconds from start
Feb 21 10:48:45.089499 [NOTICE] fpm_children_make(), line 354: child 30370 (pool default) started

So I *guess* there might be a connection between the two. However, it is not a 1:1 ratio: right now I have 5 of those php SIGCHLD entries and 67 sockets in CLOSE_WAIT on nginx. Also, the php SIGCHLD entries correspond to moments when I got an error on wp (failed creating a thumbnail), while the CLOSE_WAIT connections are not related to application or connectivity errors.
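
To compare the two counts more carefully, the child-exit entries can be pulled out of php5-fpm.log and lined up against the CLOSE_WAIT numbers. Again, just a sketch: the function name parse_child_exits and the regular expression are my own and only match the "exited with code" line format shown above; other php-fpm messages are ignored.

```python
import re

# Matches only the "child ... exited with code ..." format seen in the log above.
CHILD_EXIT = re.compile(
    r"child (\d+) \(pool (\S+)\) exited with code (\d+) after ([\d.]+) seconds"
)

LOG_SAMPLE = """\
Feb 21 10:48:45.080836 [NOTICE] fpm_got_signal(), line 48: received SIGCHLD
Feb 21 10:48:45.080918 [NOTICE] fpm_children_bury(), line 217: child 27665 (pool default) exited with code 0 after 35512.611171 seconds from start
Feb 21 10:48:45.089499 [NOTICE] fpm_children_make(), line 354: child 30370 (pool default) started
"""


def parse_child_exits(log_text):
    """Return (pid, pool, exit_code, lifetime_seconds) for each child exit (hypothetical helper)."""
    exits = []
    for line in log_text.splitlines():
        match = CHILD_EXIT.search(line)
        if match:
            pid, pool, code, seconds = match.groups()
            exits.append((int(pid), pool, int(code), float(seconds)))
    return exits


if __name__ == "__main__":
    for pid, pool, code, seconds in parse_child_exits(LOG_SAMPLE):
        print(f"child {pid} (pool {pool}) exit code {code} after {seconds:.0f}s")
```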

I'm almost sure that, although the CLOSE_WAIT sockets belong to the browser-nginx connections, the problem lies in the nginx-php connection. At work we have a farm of nginx+Tomcat servers (via proxy_pass, not fastcgi_pass) and I haven't seen this behaviour there. And I think it has to do with PHP CPU use, as the site usually went down when hit simultaneously by a couple of visits and some search engines' spiders, and now I'm able to reproduce it by scaling and watermarking pics. But I don't know where else to look.

Has anybody else seen this behaviour?

Thanks in advance

Regards

-- 
  Vicente Aguilar <bisente at bisente.com> | http://www.bisente.com