Cache manager occasionally stops deleting cached files

Thu Feb 25 10:46:32 UTC 2016

vedranf Wrote:
-------------------------------------------------------
> Hello,
> 
> I'm having an issue where nginx (1.8) cache manager suddenly just
> stops deleting content thus the disk soon ends up being full until I
> restart it by hand. After it is restarted, it works normally for a
> couple of days, but then it happens again. Cache has some 30-40k
> files, nothing huge. Relevant config lines are:
> 
>     proxy_cache_path    /home/cache/ levels=2:2 keys_zone=cache:25m
> inactive=7d max_size=2705g use_temp_path=on;
>     proxy_temp_path     /dev/shm/temp; # reduces parallel writes on
> the disk
>     proxy_cache_lock                on;
>     proxy_cache_lock_age        10s;
>     proxy_cache_lock_timeout    30s;
>     proxy_ignore_client_abort   on;
>     
> Server gets roughly 100 rps and normally cache manager deletes a
> couple of files every few seconds, however when it gets stuck this is
> all it does for 20-30 minutes or more, i.e. there are 0 unlinks (until
> I restart it and it rereads the on-disk cache):
> 
> ...
> epoll_wait(14, {}, 512, 1000)           = 0
> epoll_wait(14, {}, 512, 1000)           = 0
> epoll_wait(14, {}, 512, 1000)           = 0
> epoll_wait(14, {}, 512, 1000)           = 0
> gettid()                                = 11303
> write(24, "2016/02/18 08:22:02 [alert] 11303#11303: ignore long locked
> inactive cache entry 380d3f178017bcd92877ee322b006bbb, count:1\n",
> 123) = 123
> gettid()                                = 11303
> write(24, "2016/02/18 08:22:02 [alert] 11303#11303: ignore long locked
> inactive cache entry 7b9239693906e791375a214c7e36af8e, count:24\n",
> 124) = 124
> epoll_wait(14, {}, 512, 1000)           = 0
> ...
> 
> I assume the mentioned error is due to relatively often nginx restarts
> and is benign. There's nothing else in the error log (except for
> occasional upstream timeouts). I'm aware this likely isn't enough info
> to debug the issue, but do you at least have some ideas on what might
> be causing this issue, where to look? I'm wild guessing cache manager
> waits for some lock to be released, but it never gets released so it
> just waits indefinitely. 
> 
> Thanks,
> Vedran

We have the same problem, but I'm not sure that it is caused by frequent
nginx restarts.

As far as I know, the problem has existed since version 1.6 (maybe even
earlier; 1.4.6 from the Ubuntu repo is not affected) and is still present
in 1.9.9.

I've collected related forum posts (should help analyze the problem):
https://forum.nginx.org/read.php?21,258292,258292#msg-258292
https://forum.nginx.org/read.php?21,260990,260990#msg-260990
https://forum.nginx.org/read.php?2,263625,263625#msg-263625

Also, I think it's somehow related to a leak of writing connections (see
the image link):

https://s3.eu-central-1.amazonaws.com/drive-public-eu/nginx/betelgeuse_nginx_connections.PNG

Here is our standard nginx configuration (before January 28) with a 7-day
inactive time:

proxy_cache_path /mnt/cache1/nginx levels=2:2 keys_zone=a.d-1_cache:2143M
inactive=7d max_size=643G loader_sleep=1ms;

Every ~8 days (when the number of writing connections reaches the ~10k
mark) the cache starts growing and fills the disk. The drops in writing
connections on the graph are nginx restarts.
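
A rough sketch of how the on-disk cache size could be sampled to notice
when it starts growing past max_size. The function name cache_usage is
hypothetical, and it sums apparent file sizes, which may differ from what
nginx accounts for internally:

```python
import os


def cache_usage(cache_dir):
    """Return (file_count, total_bytes) for the files under cache_dir."""
    file_count = 0
    total_bytes = 0
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            try:
                st = os.lstat(os.path.join(root, name))
            except FileNotFoundError:
                # A file may be removed by nginx while the scan is running.
                continue
            file_count += 1
            total_bytes += st.st_size
    return file_count, total_bytes
```

Calling cache_usage("/mnt/cache1/nginx") periodically and comparing the
result with the configured max_size would show the moment the cache
manager stops keeping up.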

On January 28 I changed the inactive time to 8h. Once writing connections
hit the ~10k mark, nginx starts filling the logs with "ignore long locked
inactive cache entry" messages (1-2 messages per minute on average).
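
To put a number on the message rate, something like the following could
be run over the error log. It is only a sketch: the function name
alerts_per_minute is hypothetical, and the regular expression assumes the
log line format shown in the strace output quoted above:

```python
import re
from collections import Counter

# Matches e.g. "2016/02/18 08:22:02 [alert] 11303#11303: ignore long
# locked inactive cache entry <hash>, count:1" and captures the timestamp
# truncated to the minute.
ALERT_RE = re.compile(
    r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}):\d{2} \[alert\] .*"
    r"ignore long locked inactive cache entry [0-9a-f]+, count:\d+"
)


def alerts_per_minute(lines):
    """Count 'ignore long locked' alerts per minute in error log lines."""
    per_minute = Counter()
    for line in lines:
        match = ALERT_RE.match(line)
        if match:
            per_minute[match.group(1)] += 1
    return per_minute
```

Passing an open error log file object as lines works, since the function
only iterates over it line by line.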

As you can see, writing connections grow continuously. (By the time we had
to power off the machine, the count had reached ~60k.)

To count nginx connections we use the standard http_stub_status_module.
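
For anyone who wants to graph the same counter, here is a minimal sketch
of parsing the stub_status page. The function name parse_stub_status is
hypothetical; the page layout assumed is the usual four-line plain-text
output of the module:

```python
import re


def parse_stub_status(text):
    """Parse the plain-text stub_status page into a dict of integers."""
    active = re.search(r"Active connections:\s*(\d+)", text)
    # The line with three bare numbers: accepts, handled, requests.
    counters = re.search(
        r"^[ \t]*(\d+)[ \t]+(\d+)[ \t]+(\d+)[ \t]*$", text, re.MULTILINE
    )
    states = re.search(
        r"Reading:\s*(\d+)\s+Writing:\s*(\d+)\s+Waiting:\s*(\d+)", text
    )
    if not (active and counters and states):
        raise ValueError("unrecognized stub_status output")
    return {
        "active": int(active.group(1)),
        "accepts": int(counters.group(1)),
        "handled": int(counters.group(2)),
        "requests": int(counters.group(3)),
        "reading": int(states.group(1)),
        "writing": int(states.group(2)),
        "waiting": int(states.group(3)),
    }
```

The "writing" value is the one plotted in the graph linked above.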

I think the nginx "reference counter" could be broken, because the total
number of established TCP connections stays the same the whole time.
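
One way to make that comparison is to count established sockets on the
listening port and set the result next to the "Writing" value from
stub_status. The sketch below is Linux-only and IPv4-only (it expects the
contents of /proc/net/tcp; IPv6 sockets are listed in /proc/net/tcp6), and
the function name count_established is hypothetical:

```python
def count_established(proc_net_tcp_text, local_port):
    """Count ESTABLISHED sockets bound to local_port in /proc/net/tcp text."""
    count = 0
    for line in proc_net_tcp_text.splitlines():
        fields = line.split()
        # Data rows start with an index such as "0:"; skip the header.
        if len(fields) < 4 or not fields[0].endswith(":"):
            continue
        # local_address is HEXIP:HEXPORT, the fourth column is the state.
        port = int(fields[1].rsplit(":", 1)[1], 16)
        if port == local_port and fields[3] == "01":  # 01 = ESTABLISHED
            count += 1
    return count
```

If this number stays flat while "Writing" keeps climbing, the counter is
not tracking real connections.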

Posted at Nginx Forum: https://forum.nginx.org/read.php?2,264599,264819#msg-264819