Cache manager occasionally stops deleting cached files

Thu Feb 18 17:11:18 UTC 2016

Hello!

On Thu, Feb 18, 2016 at 11:20:55AM -0500, vedranf wrote:

> Hello,
> 
> I'm having an issue where nginx (1.8) cache manager suddenly just stops
> deleting content thus the disk soon ends up being full until I restart it by
> hand. After it is restarted, it works normally for a couple of days, but
> then it happens again. Cache has some 30-40k files, nothing huge. Relevant
> config lines are:
> 
>     proxy_cache_path    /home/cache/ levels=2:2 keys_zone=cache:25m
> inactive=7d max_size=2705g use_temp_path=on;
>     proxy_temp_path     /dev/shm/temp; # reduces parallel writes on the
> disk
>     proxy_cache_lock                on;
>     proxy_cache_lock_age        10s;
>     proxy_cache_lock_timeout    30s;
>     proxy_ignore_client_abort   on;
>     
> Server gets roughly 100 rps and normally cache manager deletes a couple of
> files every few seconds, however when it gets stuck this is all it does for
> 20-30 minutes or more, i.e. there are 0 unlinks (until I restart it and it
> rereads the on-disk cache):
> 
> ...
> epoll_wait(14, {}, 512, 1000)           = 0
> epoll_wait(14, {}, 512, 1000)           = 0
> epoll_wait(14, {}, 512, 1000)           = 0
> epoll_wait(14, {}, 512, 1000)           = 0
> gettid()                                = 11303
> write(24, "2016/02/18 08:22:02 [alert] 11303#11303: ignore long locked
> inactive cache entry 380d3f178017bcd92877ee322b006bbb, count:1\n", 123) =
> 123
> gettid()                                = 11303
> write(24, "2016/02/18 08:22:02 [alert] 11303#11303: ignore long locked
> inactive cache entry 7b9239693906e791375a214c7e36af8e, count:24\n", 124) =
> 124
> epoll_wait(14, {}, 512, 1000)           = 0
> ...
> 
> I assume the mentioned error is due to relatively often nginx restarts and
> is benign. There's nothing else in the error log (except for occasional
> upstream timeouts). I'm aware this likely isn't enough info to debug the
> issue, but do you at least have some ideas on what might be causing this
> issue, where to look? I'm wild guessing cache manager waits for some lock to
> be released, but it never gets released so it just waits indefinitely. 

The message is logged when nginx is about to remove an inactive 
cache entry but finds it still locked by some requests.  Unless 
the inactive time is very low (not your case), it indicates a 
problem somewhere else.

Such locked entries can't be removed from the cache.  Additionally, 
once there are enough such locked entries, nginx won't be able to 
purge the cache based on max_size.  That is, it's expected that 
nginx will have problems removing entries from the cache if you 
see such messages.
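
To get a feel for how many entries are affected, the alerts can be 
tallied from the error log.  The following is only a rough sketch 
in Python: the pattern is taken from the alert text quoted above, 
and the log path in the usage comment is a hypothetical example, 
not something known about your setup.

```python
import re
from collections import Counter

# Pattern based on the alert text quoted in this thread.
LOCKED_RE = re.compile(
    r"ignore long locked inactive cache entry ([0-9a-f]{32}), count:(\d+)"
)

def locked_entries(lines):
    """Map each cache key to the number of times it was reported."""
    seen = Counter()
    for line in lines:
        match = LOCKED_RE.search(line)
        if match:
            seen[match.group(1)] += 1
    return seen

# Usage (the path is an assumption, adjust to your error_log setting):
#   with open("/var/log/nginx/error.log") as log:
#       for key, hits in locked_entries(log).most_common(10):
#           print(key, hits)
```

A growing number of distinct keys between two runs suggests that 
locked entries keep accumulating.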

The most trivial reason for such messages is abnormally killed 
nginx processes.  That is, if some processes die due to bugs, or 
are killed by an unwary administrator or an incorrect script, the 
problem will appear sooner or later.

To further debug things, try the following:

- restart nginx and record pids of all nginx processes;

- once the problem starts to appear again, check whether the same 
  processes are still running;

- if some processes differ from the ones recorded, dig further to 
  find out why.
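
The steps above can be sketched roughly as follows.  This is an 
illustration only: it assumes a Linux system where the procps 
"pgrep" utility is available, and the function names are made up 
for this example.

```python
import subprocess

def nginx_pids():
    """Return the PIDs of processes whose name is exactly 'nginx'.

    Assumes procps pgrep is installed; returns an empty set if
    nothing matches.
    """
    result = subprocess.run(
        ["pgrep", "-x", "nginx"],
        capture_output=True, text=True, check=False,
    )
    return {int(pid) for pid in result.stdout.split()}

def compare_pids(recorded, current):
    """Report which recorded PIDs are gone and which PIDs are new."""
    recorded, current = set(recorded), set(current)
    return {
        "gone": sorted(recorded - current),
        "new": sorted(current - recorded),
    }

# Usage: save nginx_pids() right after the restart, then later run
#   print(compare_pids(saved_pids, nginx_pids()))
```

Note that a configuration reload also replaces worker processes, 
so a difference is only suspicious if nothing was reloaded or 
restarted in between.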

Some trivial things like looking into logs for "worker process 
exited ..." messages and checking if the problem persists without 
3rd party modules compiled in (see "nginx -V") may also help.
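
A small sketch for the log check, again in Python.  The exact 
wording matched here ("... process PID exited on signal N" or 
"... exited with code N") is my assumption about the usual form of 
these messages; adjust the pattern if your logs look different.

```python
import re

# Assumed message form: "<name> process <pid> exited on signal <n>"
# or "<name> process <pid> exited with code <n>".
EXIT_RE = re.compile(
    r"(worker|cache manager|cache loader) process (\d+) "
    r"exited (on signal \d+|with code \d+)"
)

def find_exits(lines):
    """Return (process kind, pid, exit reason) for each exit message."""
    exits = []
    for line in lines:
        match = EXIT_RE.search(line)
        if match:
            exits.append((match.group(1), int(match.group(2)), match.group(3)))
    return exits
```

Exits on a signal are the ones to look at first, as they point to 
crashed or killed processes.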

-- 
Maxim Dounin
http://nginx.org/