Memory use flares up sharply, how to troubleshoot?
gthb
nginx-forum at nginx.us
Mon Jul 21 21:44:45 UTC 2014
Hi,
I finally reproduced this, with debug logging enabled. I found the
problematic request in the error log just before the kill signal, with a
warning that the upstream response was being buffered to a temporary file:
2014/07/21 11:39:39 [warn] 21182#0: *32332838 an upstream response is
buffered to a temporary file /var/cache/nginx/uwsgi_temp/9/90/0000186909
while reading upstream, client: x.x.x.x, server: foo.com, request: "GET
/api/nasty/troublemaker.csv?el=xyzzy!a:b&dates_as_dates=1 HTTP/1.1",
upstream: "uwsgi://123.45.67.89:3003", host: "foo.com"
2014/07/21 11:41:18 [alert] 16885#0: worker process 21182 exited on
signal 9
Retrying that request reproduces the problem: the nginx worker grows in size
without bound. (The request never made it to the access log because of the
OOM kill, which is why my earlier testing didn't catch it.)
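(Aside, for anyone trying to reproduce something like this: the debug log gets
very noisy under real traffic, so it can help to restrict debug-level logging
to the test client with debug_connection. A minimal sketch, with a placeholder
address; this requires an nginx built with --with-debug:

    events {
        # debug-level logging only for connections from this (placeholder) test client
        debug_connection 203.0.113.7;
    }

With that in place, replaying the request is about the only thing that shows
up at debug level.)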
So here are debug log segments, each covering a single epoll event: one during
a healthy streaming request (one that doesn't trigger this problem) and one
during the problematic request. These requests *ought* to behave just the same
(I'm not aware of any behavior difference in our upstream app, invoked via
uwsgi_pass, except that the CSV lines are a little longer in the problematic
response).
Healthy request:
epoll: fd:43 ev:0005 d:00000000023FF960
*1 http upstream request:
"/api/well-behaved/request.csv?el=fnord!a=1.2:b=3.4.5:c&dates_as_dates=1"
*1 http upstream process upstream
*1 pipe read upstream: 1
*1 readv: 3:4096
*1 pipe recv chain: 44
*1 readv: 3:4096
*1 readv() not ready (11: Resource temporarily unavailable)
*1 pipe recv chain: -2
*1 pipe buf free s:0 t:1 f:0 00000000020EED30, pos 00000000020EED30,
size: 2267 file: 0, size: 0
*1 pipe buf free s:0 t:1 f:0 000000000220B1C0, pos 000000000220B1C0,
size: 0 file: 0, size: 0
*1 pipe buf free s:0 t:1 f:0 0000000001FE5900, pos 0000000001FE5900,
size: 0 file: 0, size: 0
*1 pipe length: -1
*1 pipe write downstream: 1
*1 pipe write busy: 0
*1 pipe write: out:0000000000000000, f:0
*1 pipe read upstream: 0
*1 pipe buf free s:0 t:1 f:0 00000000020EED30, pos 00000000020EED30,
size: 2267 file: 0, size: 0
*1 pipe buf free s:0 t:1 f:0 000000000220B1C0, pos 000000000220B1C0,
size: 0 file: 0, size: 0
*1 pipe buf free s:0 t:1 f:0 0000000001FE5900, pos 0000000001FE5900,
size: 0 file: 0, size: 0
*1 pipe length: -1
*1 event timer: 43, old: 1405973335524, new: 1405973335560
*1 http upstream request:
"/api/well-behaved/request.csv?el=fnord!a=1.2:b=3.4.5:c&dates_as_dates=1"
*1 http upstream dummy handler
timer delta: 0
posted events 0000000000000000
worker cycle
accept mutex lock failed: 0
epoll timer: 500
For the problematic request, the epoll events *often* look just like that (the
same kinds of lines, sometimes with an extra pipe recv chain / readv line
pair, presumably just because data is streamed in slightly bigger chunks)
... but some of them have extra lines, which I've highlighted with an XXX
prefix here:
epoll: fd:42 ev:0005 d:000000000150DE70
*1 http upstream request:
"/api/nasty/troublemaker.csv?el=xyzzy!a:b&dates_as_dates=1"
*1 http upstream process upstream
*1 pipe read upstream: 1
*1 readv: 3:4096
*1 pipe recv chain: 519
XXX *1 input buf #135
*1 readv: 2:4096
*1 readv() not ready (11: Resource temporarily unavailable)
*1 pipe recv chain: -2
XXX *1 pipe buf in s:1 t:1 f:0 0000000001417550, pos 0000000001417550,
size: 8192 file: 0, size: 0
*1 pipe buf free s:0 t:1 f:0 0000000001447610, pos 0000000001447610,
size: 326 file: 0, size: 0
*1 pipe buf free s:0 t:1 f:0 00000000014305C0, pos 00000000014305C0,
size: 0 file: 0, size: 0
*1 pipe length: -1
*1 pipe write downstream: 1
*1 pipe write busy: 0
XXX *1 pipe write buf ls:1 0000000001417550 8192
XXX *1 pipe write: out:0000000001431878, f:0
XXX *1 http output filter
"/api/nasty/troublemaker.csv?el=xyzzy!a:b&dates_as_dates=1"
XXX *1 http copy filter:
"/api/nasty/troublemaker.csv?el=xyzzy!a:b&dates_as_dates=1"
XXX *1 http postpone filter
"/api/nasty/troublemaker.csv?el=xyzzy!a:b&dates_as_dates=1"
0000000001431878
XXX *1 http chunk: 8192
XXX *1 write new buf t:1 f:0 0000000001431888, pos 0000000001431888, size: 6
file: 0, size: 0
XXX *1 write new buf t:1 f:0 0000000001417550, pos 0000000001417550, size:
8192 file: 0, size: 0
XXX *1 write new buf t:0 f:0 00000000014317E0, pos 00000000004ADFFD, size: 2
file: 0, size: 0
XXX *1 http write filter: l:0 f:1 s:8200
XXX *1 http write filter limit 0
XXX *1 writev: 8200
XXX *1 http write filter 0000000000000000
XXX *1 http copy filter: 0
"/api/nasty/troublemaker.csv?el=xyzzy!a:b&dates_as_dates=1"
XXX *1 pipe write busy: 0
*1 pipe write: out:0000000000000000, f:0
*1 pipe read upstream: 0
*1 pipe buf free s:0 t:1 f:0 0000000001447610, pos 0000000001447610,
size: 326 file: 0, size: 0
*1 pipe buf free s:0 t:1 f:0 0000000001417550, pos 0000000001417550,
size: 0 file: 0, size: 0
*1 pipe buf free s:0 t:1 f:0 00000000014305C0, pos 00000000014305C0,
size: 0 file: 0, size: 0
*1 pipe length: -1
*1 event timer: 42, old: 1405971008168, new: 1405971008452
*1 http upstream request:
"/api/nasty/troublemaker.csv?el=xyzzy!a:b&dates_as_dates=1"
*1 http upstream dummy handler
timer delta: 1
posted events 0000000000000000
worker cycle
accept mutex lock failed: 0
epoll timer: 500
These extra lines *never* appear in the healthy requests, so I imagine they
point to the problem (though I am not at all familiar with nginx debug
output). In particular, those "write new buf" lines look relevant: they are
emitted right after ngx_alloc_chain_link is called.
All the possibly relevant Nginx config:
http {
    include /etc/nginx/mime.types;
    default_type application/octet-stream;
    log_format main '...';
    access_log /var/log/nginx/access.log main buffer=32k;
    sendfile on;
    keepalive_timeout 65;
    ssl_session_cache shared:SSL:10m;
    ssl_session_timeout 10m;
    gzip on;
    gzip_min_length 1280;
    gzip_types text/css application/json application/javascript text/csv
               text/xml text/turtle;
    log_format combined_upstream '...';
    log_format internal_proxy '...';
    uwsgi_hide_header X-...;
    uwsgi_buffers 64 8k;
    proxy_buffers 64 8k;
    include /etc/nginx/sites-enabled/*;
}

upstream fooapp.foo.com {
    server 123.45.67.89:3003 down;
    server 123.45.67.88:3003 down;
    server 123.45.67.87:3003;
    server 123.45.67.86:3003;
    least_conn;
}

server {
    listen 80;
    server_name foo.com;
    listen 443 ssl;
    ssl_certificate wildcard.foo.com.crt;
    ssl_certificate_key wildcard.foo.com.crt.key;
    ssl_protocols SSLv3 TLSv1 TLSv1.1 TLSv1.2;
    ssl_ciphers HIGH:!aNULL:!MD5;
    ssl_prefer_server_ciphers on;
    set_real_ip_from 123.45.67.89/27;
    access_log /var/log/nginx/foo.com.access.log combined_upstream
               buffer=8k flush=2s;
    client_max_body_size 32m;
    ...
    location /api/ {
        include uwsgi_params;
        uwsgi_pass fooapp-foo.com;
        uwsgi_read_timeout 180;
    }
}
There are no other occurrences of _buffer anywhere in the config:
$ sudo grep -r _buffer /etc/nginx/
/etc/nginx/nginx.conf: uwsgi_buffers 64 8k;
/etc/nginx/nginx.conf: proxy_buffers 64 8k;
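If it would help narrow things down, I could try taking the buffering path out
of the equation for this one location and see whether the worker still
balloons. An untested sketch (same location as above; directive names as I
read them in the nginx docs):

    location /api/ {
        include uwsgi_params;
        uwsgi_pass fooapp-foo.com;
        uwsgi_read_timeout 180;

        # experiment: stream the response straight to the client, no buffering at all
        uwsgi_buffering off;

        # alternative experiment: keep in-memory buffering but never spool to temp files
        #uwsgi_max_temp_file_size 0;
    }

If memory still grows without bound with buffering off, then presumably the
event-pipe buffering itself isn't the culprit.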
Beginning of output of well-behaved request:
HTTP/1.1 200 OK
Server: nginx/1.6.0
Date: Mon, 21 Jul 2014 20:48:50 GMT
Content-Type: text/csv; charset=iso-8859-1
Transfer-Encoding: chunked
Connection: keep-alive
Content-Disposition: attachment; filename=Foo-export.csv
Content-Language: en
Sex,Age group,ZIP code,Year-begin-date,Value
Female,Under 5 years,00601,2012-01-01,572
Female,Under 5 years,00602,2012-01-01,1132
Female,Under 5 years,00603,2012-01-01,1589
Female,Under 5 years,00606,2012-01-01,189
Female,Under 5 years,00610,2012-01-01,784
...
Beginning of output of problematic request:
HTTP/1.1 200 OK
Server: nginx/1.6.0
Date: Mon, 21 Jul 2014 20:49:07 GMT
Content-Type: text/csv; charset=iso-8859-1
Transfer-Encoding: chunked
Connection: keep-alive
Content-Disposition: attachment; filename=Foo-export.csv
Content-Language: en
Location,Measurement,Month-begin-date,Value
SHARJAH INTER. AIRP (AE000041196),Temperature (max),2010-01-01,30.6
SHARJAH INTER. AIRP (AE000041196),Temperature (max),2010-02-01,35.5
SHARJAH INTER. AIRP (AE000041196),Temperature (max),2010-03-01,41.4
SHARJAH INTER. AIRP (AE000041196),Temperature (max),2010-04-01,41.6
SHARJAH INTER. AIRP (AE000041196),Temperature (max),2010-05-01,44.9
...
Does this narrow down the problem? Can I provide anything further?
Cheers,
Gulli
Posted at Nginx Forum: http://forum.nginx.org/read.php?2,251964,251971#msg-251971