Strange behavior on proxy cache at high load spike

Mon May 4 21:07:32 UTC 2020

Hi,
This has been bugging me for some time now. I have nginx 1.16.0 with the
following proxy cache configuration:

proxy_cache_path           /dev/shm/nginx_cache levels=1:2
                           keys_zone=proxy:1024m max_size=1024m inactive=60m;
proxy_temp_path            /dev/shm/nginx_proxy_tmp;
proxy_cache_use_stale      updating;
proxy_cache_lock           on;
proxy_cache_lock_timeout   30s;

Most of the time everything works as expected. The deployment has one
peculiarity: expected spikes in requests (end clients updating daily data)
hit a few locations. Response sizes vary between 1M and 1.5M, non-gzipped.
Log snippet from such a spike:

[2020-05-03T00:00:44] "GET /api/34/guide?date=2020-05-03 HTTP/1.0" 200 445984  cache: HIT request time: 50.211 sec
[2020-05-03T00:00:44] "GET /api/34/guide?date=2020-05-03 HTTP/1.0" 200 780472  cache: HIT request time: 52.891 sec
[2020-05-03T00:00:44] "GET /api/34/guide?date=2020-05-03 HTTP/1.0" 200 85432  cache: HIT request time: 33.284 sec
[2020-05-03T00:00:44] "GET /api/34/guide?date=2020-05-03 HTTP/1.0" 200 57920  cache: HIT request time: 34.957 sec
[2020-05-03T00:00:44] "GET /api/34/guide?date=2020-05-03 HTTP/1.0" 200 401096  cache: HIT request time: 49.991 sec
[2020-05-03T00:00:44] "GET /api/34/guide?date=2020-05-03 HTTP/1.0" 200 244712  cache: HIT request time: 48.412 sec
[2020-05-03T00:00:44] "GET /api/34/guide?date=2020-05-03 HTTP/1.0" 200 101360  cache: HIT request time: 34.955 sec
[2020-05-03T00:00:44] "GET /api/34/guide?date=2020-05-03 HTTP/1.0" 200 102808  cache: HIT request time: 34.753 sec
...

[2020-03-24T00:02:16] "GET /api/34/guide?date=2020-05-03 HTTP/1.0" 200 1526025  cache: HIT request time: 48.671 sec
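
In case it helps to reproduce the numbers, here is a minimal sketch of how
the short responses can be counted per URI from the access log. It assumes
each entry sits on one line in the log file, in the custom format shown in
the snippet above. The names LINE_RE and summarize are made up for
illustration, and treating the largest observed size as the complete
response is only an assumption.

```python
import re
from collections import defaultdict

# Matches the custom access log format from the snippet above (an
# assumption: in the real log file each entry is a single line).
LINE_RE = re.compile(
    r'\[(?P<ts>[^\]]+)\]\s+"(?P<method>\S+)\s+(?P<uri>\S+)\s+[^"]*"\s+'
    r'(?P<status>\d{3})\s+(?P<size>\d+)\s+cache:\s+(?P<cache>\S+)\s+'
    r'request time:\s+(?P<rt>[\d.]+)\s+sec'
)


def summarize(lines):
    """Group response sizes by URI and count responses that look short.

    Assumption: the largest size seen for a URI is the complete response,
    so anything smaller is counted as "short".
    """
    sizes = defaultdict(list)
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue  # skip lines that do not match the format
        sizes[m["uri"]].append(int(m["size"]))

    report = {}
    for uri, seen in sizes.items():
        largest = max(seen)
        report[uri] = {
            "requests": len(seen),
            "largest": largest,
            "short": sum(1 for s in seen if s < largest),
        }
    return report
```

It could be fed with an open log file, for example
summarize(open("access.log")), where the file name is a placeholder.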

Monitoring du on the cache location shows a maximum of 1.1G, for example:
1.1G    /dev/shm/nginx_cache
0       /dev/shm/nginx_proxy_tmp
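
For sampling the cache size more often than by hand during a spike, a rough
sketch along these lines could be used. The function name
cache_apparent_size is hypothetical, and it sums file sizes (comparable to
du --apparent-size), so the result may differ slightly from the plain du
output above.

```python
import os


def cache_apparent_size(path):
    """Return the summed size in bytes of all regular files below path.

    This is the apparent size, not the allocated block count that plain
    du reports, so small differences are expected.
    """
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except FileNotFoundError:
                # Cache files can disappear between listing and stat.
                continue
    return total
```

Calling cache_apparent_size("/dev/shm/nginx_cache") in a loop with a short
sleep would give a time series to compare against max_size=1024m.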

After about 2 minutes the response 'stabilizes' at the correct size
(1526025 in this example). The problem is amplified because clients
validate the response and retry progressively if it is corrupted.

There are no unusual lines in the error log or in the Linux (CentOS)
messages log. There is also no cache 'updating' status, only hits, which I
guess rules out an upstream server issue. Is it possible that there is a
problem reading cached entries from /dev/shm during peak times?

I would kindly ask for hints on where to start looking and debugging.
Many thanks in advance.

Posted at Nginx Forum: https://forum.nginx.org/read.php?2,287951,287951#msg-287951