Lots of CLOSE_WAIT sockets, nginx+php (WordPress site)

Vicente Aguilar bisente at bisente.com
Sun Feb 21 13:19:48 MSK 2010


I have a WordPress-mu site (a couple of personal and friends' blogs, very light traffic) which I migrated some months ago from lighttpd+php-fcgi to nginx+php-fcgi. Ever since the migration the site has occasionally gone down. I never had the time to look into it, so I just wrote a script that monitors the site and restarts everything when it goes down.

We're going to start using WP-mu at work, so I've been looking into it lately, and the problem seems to be browser-server connections stuck in the CLOSE_WAIT state. With netstat -nap I get loads of these:

$ netstat -nap | grep CLOSE_WAIT
tcp        1      0     CLOSE_WAIT  27672/nginx: worker
tcp        1      0     CLOSE_WAIT  27672/nginx: worker
tcp        1      0     CLOSE_WAIT  27672/nginx: worker
tcp        1      0     CLOSE_WAIT  27673/nginx: worker
tcp        1      0     CLOSE_WAIT  27672/nginx: worker
tcp        1      0     CLOSE_WAIT  27672/nginx: worker
tcp        1      0     CLOSE_WAIT  27672/nginx: worker

Here the local address is the web server and the foreign address is the browser. Right now I have 67 of these, after restarting nginx and doing some admin work in WP for a couple of minutes (CPU-intensive stuff: uploading, scaling and watermarking images with the NextGEN Gallery plugin).
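To watch that count evolve while reproducing the problem, a small helper can tally CLOSE_WAIT sockets per owning process from `netstat -nap` output. This is a minimal sketch, assuming the Linux net-tools output format shown above (state column followed by the "PID/Program name" column); `count_close_wait` and `read_netstat` are hypothetical names, not part of any tool.

```python
import subprocess
from collections import Counter

# TCP state names as printed by Linux net-tools netstat.
TCP_STATES = {
    "ESTABLISHED", "SYN_SENT", "SYN_RECV", "FIN_WAIT1", "FIN_WAIT2",
    "TIME_WAIT", "CLOSE", "CLOSE_WAIT", "LAST_ACK", "LISTEN", "CLOSING",
}


def count_close_wait(netstat_output: str) -> Counter:
    """Count CLOSE_WAIT sockets per "PID/Program name" (hypothetical helper)."""
    counts: Counter = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Only tcp/tcp6 rows; skips headers, udp and unix-socket rows.
        if not fields or not fields[0].startswith("tcp"):
            continue
        # Locate the state column; everything after it is the owner,
        # which may contain spaces (e.g. "27672/nginx: worker").
        for i, field in enumerate(fields):
            if field in TCP_STATES:
                if field == "CLOSE_WAIT":
                    owner = " ".join(fields[i + 1:]) or "-"
                    counts[owner] += 1
                break
    return counts


def read_netstat() -> str:
    """Run `netstat -nap` (root is needed to see other users' PIDs)."""
    result = subprocess.run(["netstat", "-nap"], capture_output=True,
                            text=True, check=True)
    return result.stdout
```

For example, `print(count_close_wait(read_netstat()).most_common())` run every few seconds would show whether the CLOSE_WAIT sockets pile up on one nginx worker or on all of them.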

The connections between nginx and PHP don't seem to get stuck in CLOSE_WAIT: they go from active to TIME_WAIT and then disappear from netstat normally:

$ netstat -nap | grep :9000
tcp        0      0*               LISTEN      27662/php5-fpm  
tcp        0      0         TIME_WAIT   -    
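The same comparison between the browser-facing side and the FastCGI side can also be made without netstat by reading the kernel's socket table directly. A minimal sketch, assuming Linux, where /proc/net/tcp and /proc/net/tcp6 list each socket with hex-encoded "address:port" fields and a hex state code (06 = TIME_WAIT, 08 = CLOSE_WAIT, 0A = LISTEN). Port 9000 comes from the output above; port 80 for the browser side is an assumption, and `states_by_port` / `read_proc_net_tcp` are hypothetical names.

```python
from collections import Counter

# Hex state codes used in /proc/net/tcp (the kernel's TCP state numbers).
STATE_NAMES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def states_by_port(proc_net_tcp: str, port: int) -> Counter:
    """Tally TCP states of sockets whose local or remote port equals `port`."""
    counts: Counter = Counter()
    for line in proc_net_tcp.splitlines():
        fields = line.split()
        # Data rows start with a slot number like "0:"; header rows do not.
        if len(fields) < 4 or not fields[0].endswith(":"):
            continue
        local, remote, state = fields[1], fields[2], fields[3]
        # Addresses look like "0100007F:2328": hex IP, colon, hex port.
        local_port = int(local.rsplit(":", 1)[1], 16)
        remote_port = int(remote.rsplit(":", 1)[1], 16)
        if port in (local_port, remote_port):
            counts[STATE_NAMES.get(state.upper(), state)] += 1
    return counts


def read_proc_net_tcp() -> str:
    """Concatenate the IPv4 and IPv6 socket tables (Linux only)."""
    text = ""
    for path in ("/proc/net/tcp", "/proc/net/tcp6"):
        try:
            with open(path) as f:
                text += f.read()
        except FileNotFoundError:
            pass
    return text
```

Calling `states_by_port(read_proc_net_tcp(), 80)` and `states_by_port(read_proc_net_tcp(), 9000)` side by side would show at a glance whether CLOSE_WAIT only accumulates on the browser side, as described above.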

On Friday I switched from spawn-fcgi+php-cgi to php-fpm, to no avail. I've noticed log entries like these in php5-fpm.log at the moments when I'm working in WP and CLOSE_WAIT connections start to pile up:

Feb 21 10:48:45.080836 [NOTICE] fpm_got_signal(), line 48: received SIGCHLD
Feb 21 10:48:45.080918 [NOTICE] fpm_children_bury(), line 217: child 27665 (pool default) exited with code 0 after 35512.611171 seconds from start
Feb 21 10:48:45.089499 [NOTICE] fpm_children_make(), line 354: child 30370 (pool default) started

So I *guess* there might be a connection between the two. However, it's not a 1:1 ratio: right now I have 5 of those PHP SIGCHLD entries and 67 nginx sockets in CLOSE_WAIT. Also, the PHP SIGCHLD entries coincide with moments when I got an error in WP (a failed thumbnail creation), while the CLOSE_WAIT connections don't correspond to any application or connectivity errors.
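To put numbers on that ratio for a given time window, the child-exit notices can be pulled out of php5-fpm.log. A minimal sketch, assuming the log line format shown above (other php-fpm versions may word these notices differently); `child_exits` and `ChildExit` are hypothetical names, and the log file path is left to the caller.

```python
import re
from typing import List, NamedTuple

# Matches the fpm_children_bury() notices shown above, e.g.
# "child 27665 (pool default) exited with code 0 after 35512.611171 seconds from start"
EXIT_RE = re.compile(
    r"child (\d+) \(pool (\S+)\) exited with code (\d+) "
    r"after ([\d.]+) seconds"
)


class ChildExit(NamedTuple):
    timestamp: str   # e.g. "Feb 21 10:48:45.080918" (the year is not logged)
    pid: int
    pool: str
    code: int
    uptime_s: float


def child_exits(log_text: str) -> List[ChildExit]:
    """Return one ChildExit per 'exited with code' notice in the log text."""
    exits = []
    for line in log_text.splitlines():
        m = EXIT_RE.search(line)
        if m:
            # The timestamp is everything before the " [NOTICE]" level tag.
            timestamp = line.split(" [", 1)[0]
            exits.append(ChildExit(timestamp, int(m.group(1)), m.group(2),
                                   int(m.group(3)), float(m.group(4))))
    return exits
```

Lining up the timestamps of these exits with periodic CLOSE_WAIT counts would show whether each burst of stuck sockets follows a child exit, or whether they grow independently.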

I'm almost sure that, although the CLOSE_WAIT sockets belong to the browser-nginx connections, the problem lies in the nginx-php connection. At work we have a farm of nginx+Tomcat servers (via proxy_pass, not fastcgi_pass) and I haven't seen this behavior there. I also think it has to do with PHP CPU usage: the site usually went down when hit simultaneously by a couple of visitors and some search engines' spiders, and now I'm able to reproduce it by scaling and watermarking pictures. But I don't know where else to look.

Has anybody else seen this behavior?

Thanks in advance


  Vicente Aguilar <bisente at bisente.com> | http://www.bisente.com
