Lots of CLOSE_WAIT sockets, nginx+php (WordPress site)
bisente at bisente.com
Sun Feb 21 13:19:48 MSK 2010
I have a WordPress-mu site (a couple of personal and friends' blogs, very light traffic) which I migrated some months ago from lighttpd+php-fcgi to nginx+php-fcgi. Ever since the migration the site sometimes goes down. I never had the time to look into it, so I just wrote a script that monitors the site and restarts everything when it goes down.
We're going to start using WP-mu at work, so I've been looking into it lately, and the problem seems to be browser-server connections stuck in the CLOSE_WAIT state. With netstat -nap I get loads of these:
$ netstat -nap | grep CLOSE_WAIT
tcp 1 0 10.10.10.10:80 220.127.116.11:52132 CLOSE_WAIT 27672/nginx: worker
tcp 1 0 10.10.10.10:80 18.104.22.168:52133 CLOSE_WAIT 27672/nginx: worker
tcp 1 0 10.10.10.10:80 22.214.171.124:50857 CLOSE_WAIT 27672/nginx: worker
tcp 1 0 10.10.10.10:80 126.96.36.199:51348 CLOSE_WAIT 27673/nginx: worker
tcp 1 0 10.10.10.10:80 188.8.131.52:50846 CLOSE_WAIT 27672/nginx: worker
tcp 1 0 10.10.10.10:80 184.108.40.206:52126 CLOSE_WAIT 27672/nginx: worker
tcp 1 0 10.10.10.10:80 220.127.116.11:52354 CLOSE_WAIT 27672/nginx: worker
Here 10.10.10.10 is the web server and 18.104.22.168 the browser. Right now I have 67 of these, after restarting nginx and doing some admin work in WordPress for a couple of minutes (CPU-intensive stuff: uploading, scaling and watermarking images with the NextGEN Gallery plugin).
The connections between nginx and PHP don't seem to get stuck: they go from active to TIME_WAIT and disappear from netstat normally, without lingering in the CLOSE_WAIT state:
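To see which process is holding the stuck sockets, the netstat output can be grouped by its last columns. This is only a rough sketch: the function name count_close_wait is made up for illustration, and it assumes the Linux `netstat -nap` column layout shown above (state in column 6, PID/program in column 7).

```shell
# Hypothetical helper: reads `netstat -nap` output on stdin and prints
# "<count> <pid/program>" for sockets in CLOSE_WAIT, busiest process first.
# Assumes the state is field 6 and PID/program is field 7, as in the
# listing above.
count_close_wait() {
    awk '$6 == "CLOSE_WAIT" { n[$7]++ } END { for (p in n) print n[p], p }' |
        sort -rn
}
```

It would be used as `netstat -nap | count_close_wait` (run as root so that netstat can show the PIDs of other users' processes).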
$ netstat -nap | grep :9000
tcp 0 0 127.0.0.1:9000 0.0.0.0:* LISTEN 27662/php5-fpm
tcp 0 0 127.0.0.1:9000 127.0.0.1:52917 TIME_WAIT -
On Friday I moved from spawn-fcgi+php-cgi to php-fpm, to no avail. I've noticed some entries like these in php5-fpm.log at the moments when I'm working with WordPress and CLOSE_WAIT connections start to pile up:
Feb 21 10:48:45.080836 [NOTICE] fpm_got_signal(), line 48: received SIGCHLD
Feb 21 10:48:45.080918 [NOTICE] fpm_children_bury(), line 217: child 27665 (pool default) exited with code 0 after 35512.611171 seconds from start
Feb 21 10:48:45.089499 [NOTICE] fpm_children_make(), line 354: child 30370 (pool default) started
So I *guess* there might be a connection between the two. However, it is not a 1:1 ratio: right now I have 5 of those PHP SIGCHLD entries and 67 sockets in CLOSE_WAIT with nginx. Also, the PHP SIGCHLD entries correspond to moments when I got an error in WordPress (failed creating a thumbnail), while the CLOSE_WAIT connections are not related to application or connectivity errors.
I'm almost sure that, although the CLOSE_WAIT sockets belong to the browser-nginx connections, the problem lies in the nginx-PHP connection. At work we have a farm of nginx+Tomcat servers (via proxy_pass, not fastcgi_pass) and I haven't seen this behaviour there. I also think it has to do with PHP's CPU use, as the site usually went down when hit simultaneously by a couple of visits and some search engines' spiders, and now I'm able to reproduce it by scaling and watermarking pictures. But I don't know where else to look.
Has anybody else seen this behaviour?
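The two numbers being compared here can be obtained with plain grep. A minimal sketch, assuming the log format quoted above; the function name count_sigchld and any log path passed to it are illustrative only:

```shell
# Hypothetical helper: counts the "received SIGCHLD" notices in the
# php-fpm log file given as the first argument.
count_sigchld() {
    grep -c 'received SIGCHLD' "$1"
}
```

For example, `count_sigchld /path/to/php5-fpm.log` (the path depends on the installation) can be set against the socket count from `netstat -nap | grep -c CLOSE_WAIT`.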
Thanks in advance
Vicente Aguilar <bisente at bisente.com> | http://www.bisente.com