Dropped connections on our Nginx

Wed Jun 20 02:52:36 UTC 2012

Hi list,

I have a problem with dropped connections on an Nginx cluster that
handles up to 100k requests per minute per Nginx instance. In roughly
1 out of 10,000 requests sent to our Nginx, the TCP connection simply
gets reset by the server. At first I suspected some values in
/etc/sysctl.conf, because we have modified several TCP-related
settings there. But after resetting all of them to their defaults, the
connection resets kept happening.

I suspect the problem is related to Nginx rather than to a kernel
setting: only around 25% of all requests in our traffic are POSTs and
the rest are GETs, but more than 90% of the requests where the problem
appears are POSTs. I don't think the kernel can be aware of whether a
request is a POST or a GET.
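
To put a number on this skew, here is a small back-of-the-envelope
calculation. It is only a sketch: the 25% and 90% inputs are the rough
figures from above, and the function name is made up for illustration.

```python
def relative_failure_rate(post_share_traffic, post_share_failures):
    """Estimate how much more likely a POST is to fail than a GET.

    Both arguments are fractions strictly between 0 and 1.
    """
    get_share_traffic = 1.0 - post_share_traffic
    get_share_failures = 1.0 - post_share_failures
    # The failure rate of a method is proportional to
    # (its share of the failures) / (its share of the traffic).
    post_rate = post_share_failures / post_share_traffic
    get_rate = get_share_failures / get_share_traffic
    return post_rate / get_rate


# Rough figures: 25% POSTs in all traffic, 90% POSTs among the failures.
ratio = relative_failure_rate(0.25, 0.90)
print(f"A POST is roughly {ratio:.0f}x more likely to be dropped than a GET")
```

With these inputs the ratio comes out at about 27, and since more than
90% of the failures are POSTs, the real factor would be at least that.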

The problem happens on many different URLs, mostly ones that we POST
to, so it does not seem to be related to any rewrite rules.

I have captured the problem with tcpdump and I can see that the
request was sent correctly by the client. But after Nginx has received
the request, it only sends back a packet with the ACK and FIN flags
set. So the connection gets closed and most browsers display an empty
page or a "zero sized reply" error. The fact that the FIN is sent by
the server makes me assume that the problem is not related to network
hardware. We also see this problem on all Nginx instances in the
cluster, which makes broken networking hardware even less likely.

When the problem happens, I see entries like this one in the access
log. As you can see, Nginx logs both the HTTP status code and the
length as 0:
<ip> - - [20/Jun/2012:04:13:23 +0200] "POST
/userProfile/rateResult?userId=<id>&_csrf_token=7e23ef60c67800c4765d32b0536fc536&rate=5
HTTP/1.1" 0 0 "<referer>" "Mozilla/5.0 (X11; U; Linux x86_64; en-US;
rv:1.9.1.6) Gecko/20091216 Mandriva Linux/1.9.1.6-0.1mdv2010.0
(2010.0) Firefox/3.5.6"
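
For anyone who wants to count such entries, the following is a rough
sketch of how the status 0 lines could be tallied by request method. It
assumes every log entry is on a single line in the layout shown above
(the entry above is only wrapped by the mail client). The sample lines,
IP addresses and the /index, /search and /login URLs are invented
placeholders, not real log data.

```python
import re
from collections import Counter

# Matches: <ip> - - [<time>] "<METHOD> <uri> <protocol>" <status> <bytes> ...
LINE_RE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>[A-Z]+) \S+ [^"]*" '
    r'(?P<status>\d+) \d+'
)


def count_dropped_by_method(lines):
    """Count log entries with status 0, grouped by HTTP method."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        # Lines that do not fit the assumed layout are skipped silently.
        if match and match.group("status") == "0":
            counts[match.group("method")] += 1
    return counts


# Invented sample lines, for illustration only.
sample = [
    '192.0.2.1 - - [20/Jun/2012:04:13:23 +0200] "POST /userProfile/rateResult?rate=5 HTTP/1.1" 0 0 "-" "Mozilla/5.0"',
    '192.0.2.2 - - [20/Jun/2012:04:13:24 +0200] "GET /index HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '192.0.2.3 - - [20/Jun/2012:04:13:25 +0200] "GET /search HTTP/1.1" 0 0 "-" "Mozilla/5.0"',
    '192.0.2.4 - - [20/Jun/2012:04:13:26 +0200] "POST /login HTTP/1.1" 0 0 "-" "Mozilla/5.0"',
]
print(count_dropped_by_method(sample))
```

On a real log one would pass the open file object instead of the
sample list, for example `count_dropped_by_method(open(path))`.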

What I also find very interesting is that the problem can happen at
any time, so it does not seem to be related to the load or the number
of requests on the Nginx. In the morning hours we have less than 5% of
the traffic of the evening hours, and I still sometimes see this
problem in the morning.
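
One way to check this more systematically would be to compare total
and dropped requests per hour of day. The sketch below assumes the
access log layout shown earlier, with one entry per line and status 0
marking a dropped request. The function name and the sample lines are
made up for illustration.

```python
import re
from collections import defaultdict

# Pulls the hour out of the [dd/Mon/yyyy:HH:MM:SS zone] timestamp and the
# status code that follows the quoted request line.
HOUR_STATUS_RE = re.compile(
    r'\[\d{2}/\w{3}/\d{4}:(?P<hour>\d{2}):\d{2}:\d{2} [^\]]+\] '
    r'"[^"]*" (?P<status>\d+) '
)


def drops_by_hour(lines):
    """Return {hour: (total_requests, dropped_requests)}, sorted by hour."""
    totals = defaultdict(int)
    dropped = defaultdict(int)
    for line in lines:
        match = HOUR_STATUS_RE.search(line)
        if not match:
            continue  # skip lines that do not fit the assumed layout
        hour = int(match.group("hour"))
        totals[hour] += 1
        if match.group("status") == "0":
            dropped[hour] += 1
    return {hour: (totals[hour], dropped[hour]) for hour in sorted(totals)}


# Invented sample lines, for illustration only.
hourly_sample = [
    '192.0.2.1 - - [20/Jun/2012:04:13:23 +0200] "POST /a HTTP/1.1" 0 0 "-" "ua"',
    '192.0.2.2 - - [20/Jun/2012:04:50:00 +0200] "GET /b HTTP/1.1" 200 10 "-" "ua"',
    '192.0.2.3 - - [20/Jun/2012:20:01:02 +0200] "GET /c HTTP/1.1" 200 10 "-" "ua"',
]
for hour, (total, bad) in drops_by_hour(hourly_sample).items():
    print(f"{hour:02d}h: {bad} dropped out of {total}")
```

If the dropped share per hour stays roughly constant while the totals
vary a lot, that would support the idea that load is not the trigger.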

My Nginx config is too long to post here in full, so I am only posting
the parts that I think might be important, without all the rewrite
rules:

user wwwrun www;
worker_processes 64;
worker_rlimit_nofile 524288;

events {
   worker_connections 32768;
   use epoll;
   multi_accept on;
}

http {
   sendfile on;
   tcp_nopush on;
   keepalive_requests 0;
   recursive_error_pages on;
   large_client_header_buffers 4 16k;
   # ... (rest of the http block omitted)
}

What I also found via tcpdump is that for the requests where this
problem appears, Nginx receives the incoming request, passes it on
correctly to the FastCGI backend, and the backend does send back a
correct answer. But before that answer arrives (which takes less than
300ms), Nginx has already reset the client's connection.

In case it matters, this is my sysctl.conf:

net.ipv4.icmp_echo_ignore_broadcasts = 1
net.ipv4.conf.all.rp_filter = 1
fs.inotify.max_user_watches = 65536
net.ipv4.conf.default.promote_secondaries = 1
net.ipv4.conf.all.promote_secondaries = 1
net.ipv4.ip_forward = 0
net.ipv4.conf.lo.arp_ignore = 1
net.ipv4.conf.lo.arp_announce = 2
net.ipv4.conf.all.arp_ignore = 1
net.ipv4.conf.all.arp_announce = 2
net.netfilter.nf_conntrack_max = 262144
net.nf_conntrack_max = 262144
net.ipv4.tcp_max_syn_backlog = 30000
net.ipv4.tcp_max_tw_buckets = 2000000
net.core.netdev_max_backlog = 50000
net.ipv4.tcp_tw_reuse = 0
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_fin_timeout = 3
net.ipv4.tcp_keepalive_time = 120
net.core.wmem_max = 8388608
net.core.rmem_max = 8388608
net.ipv4.tcp_rmem = 4096 87380 8388608
net.ipv4.tcp_wmem = 4096 87380 8388608
net.core.somaxconn = 1024
kernel.pid_max = 65536
net.ipv4.conf.all.log_martians = 0
net.ipv4.conf.default.log_martians = 0
net.ipv4.conf.lo.log_martians = 0
net.ipv4.conf.eth0.log_martians = 0
net.ipv4.conf.eth1.log_martians = 0

Our operating system is SuSE Linux Enterprise 11.0. The Nginx version
and configure arguments are the following:

nginx version: nginx/1.2.1
built by gcc 4.3.2 [gcc-4_3-branch revision 141291] (SUSE Linux)
configure arguments: --prefix=/usr/local/nginx-1.2.1
--error-log-path=/var/log/nginx/error.log
--http-log-path=/var/log/nginx/access.log
--with-http_stub_status_module --without-http_autoindex_module
--without-http_geo_module --without-http_map_module
--without-http_referer_module --without-http_limit_conn_module
--without-http_empty_gif_module --without-mail_pop3_module
--without-mail_imap_module --without-mail_smtp_module
--with-http_geoip_module --with-pcre=/usr/local/src/nginx/pcre-8.30
--add-module=3rd/agentzh-nginx-eval-module-4eb2a02
--add-module=3rd/ngx_http_log_request_speed
--add-module=3rd/replay-ngx_http_generate_secure_download_links-4c1a46a
--add-module=3rd/agentzh-memc-nginx-module-8befc56
--add-module=3rd/agentzh-echo-nginx-module-080c0a1
--add-module=3rd/replay-ngx_http_php_memcache_standard_balancer-4f7dcba
--add-module=3rd/masterzen-nginx-upload-progress-module-a788dea
--add-module=3rd/replay-ngx_http_php_session-30f69b3
--add-module=3rd/simpl-ngx_devel_kit-24202b4
--add-module=3rd/chaoslawful-lua-nginx-module-c5be5ff
--add-module=3rd/replay-ngx_http_lower_upper_case-44958e0
--add-module=3rd/gnosek-nginx-upstream-fair-a18b409

I cannot see anything suspicious in dmesg; there are no segfaults or
network-related messages.

I have already tried setting the Nginx error log to a more verbose log
level, but I didn't see anything related to my problem, even at times
when I knew the problem was happening.

Now I don't really know what else to check. I would be really glad if
somebody had some ideas.

Thanks for your help,

Mauro