Cache doesn't expire as should
mdounin at mdounin.ru
Wed Dec 23 15:35:57 UTC 2015
On Wed, Dec 23, 2015 at 11:57:53AM +0300, Руслан Закиров wrote:
> On Tue, Dec 22, 2015 at 4:49 PM, Maxim Dounin <mdounin at mdounin.ru> wrote:
> > Hello!
> > On Tue, Dec 22, 2015 at 02:43:11PM +0300, Руслан Закиров wrote:
> > [...]
> > > Our error log has "ignore long locked inactive cache entry" alerts, but I
> > > really couldn't match it to "defreeze" event. Access log has
> > > STALE/UPDATING
> > > requests between the alert and EXPIRED (cache updating) request.
> > The "ignore long locked inactive cache entry" alerts indicate that
> > a cache entry was locked by some request, and wasn't unlocked for
a long time. The alert is expected to appear if a cache node
stays locked for the inactive time (as set by proxy_cache_path
inactive=, 10 minutes by default).
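For reference, the timer in question is the inactive= parameter of
proxy_cache_path; the values below are only an illustrative sketch,
not your actual configuration:

```nginx
# Illustrative only. An entry that is not accessed for inactive=
# becomes eligible for removal by the cache manager; if such an
# entry is also still locked, the "ignore long locked inactive
# cache entry" alert is logged instead.
proxy_cache_path /var/cache/nginx levels=1:2
                 keys_zone=my_cache:10m
                 inactive=10m max_size=1g;
```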
> Inactive is defined in the config, but it's set to default 10m.
> What happens with requests after this time? Do they hit backend and update
> cache? Do they use stale version?
As long as an entry is locked and not accessed for the inactive
time, it will cause the alert in question to be logged by the
cache manager. Nothing else happens: the entry remains locked,
and if it indicates that a response is currently being updated,
requests to this entry will return stale content with the
UPDATING status, as per proxy_cache_use_stale.
This alert may in theory happen in a normal situation, if a
backend response takes longer than the configured inactive time.
But in general it indicates that there is a problem.
(Note well that "locked" here should not be confused with
request-level locks as in proxy_cache_lock.)
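A sketch of the two directives being discussed, to make the
distinction concrete (the location and upstream name are assumptions,
not taken from the poster's configuration):

```nginx
# Illustrative only. "proxy_cache_use_stale updating" is what makes
# requests get stale content with the UPDATING status while one
# request refreshes the entry. "proxy_cache_lock" is a separate,
# request-level lock that merely limits concurrent cache fills and
# is unrelated to the cache-node lock behind the alert.
location / {
    proxy_pass http://backend;
    proxy_cache my_cache;
    proxy_cache_use_stale error timeout updating;
    proxy_cache_lock on;
}
```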
> I'm going to check "long locked" messages in the log to see how many was
> for "/" location.
> The hash should be the same if we didn't change cache key, right?
Any message indicates that there is a problem. If the "/"
location is requested often, it likely won't be mentioned in the
"ignore long locked inactive cache entry" alerts, since frequent
requests keep the entry from staying inactive.
> > Most likely reason is that a worker died or was killed
> > while holding a lock on a cache node (i.e., while a request was
> > waiting for a new response from a backend).
> Shouldn't be there a record in error log? Error log level at warn.
There should be something like "worker process exited" messages,
though at least some of them may be on the "notice" level (e.g.,
if you force an nginx worker to quit by sending SIGTERM to it).
> > Trivial things to consider:
> > - check logs for segmentation faults;
> no seg faults in logs
Any other alerts, crit or emerg messages? Any alerts about
socket leaks on configuration reloads ("open socket left..."),
sockets in CLOSED state? How long do relevant requests to the
upstream server take (i.e., make sure that the behaviour observed
isn't a result of an upstream server just being very slow)?
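The checks above can be sketched as a couple of grep commands; the
log path is an assumption for a typical install, adjust it to yours:

```shell
# Sketch: triage the nginx error log for the messages discussed above.
# Path is an assumption; "|| true" keeps the script going on no match.
LOG="${LOG:-/var/log/nginx/error.log}"

# Dead or killed workers (these can appear at "notice" level):
grep -E 'worker process .* exited' "$LOG" 2>/dev/null || true

# Severity spikes, socket leaks on reload, long-locked cache nodes:
grep -E '\[(alert|crit|emerg)\]|open socket .* left|ignore long locked inactive cache entry' \
    "$LOG" 2>/dev/null || true
```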
> - if you are using 3rd party modules / patches, try without them;
> from freebsd port, updated gist  with `nginx -V` output
>  https://gist.github.com/ruz/05456767750715f6b54e
The only suspicious thing I see there is "ssi on" and
"proxy_cache_lock" at the same time. There were some fixes
related to proxy_cache_lock with subrequests in more recent
versions.
> > - make sure you don't kill worker processes yourself or using some
> > automation scripts (in particular, don't try to terminate old
> > worker processes after a configuration reload).
> One recent appearance of the problem was at 1:30AM and I checked
> logs for crazy midnight deploys - nothing.
> Also, we don't use anything custom to restart nginx, just regular services
> management tools.
Just a regular kill will be enough to screw things up. On the
other hand, it may be an nginx bug as well.
Note, though, that the version you are using is quite old.
Debugging anything but 1.9.x hardly makes sense. You may want to
try upgrading to the latest nginx-devel port.
Note well that if a problem happens, it's a good idea to either
completely restart nginx or do binary upgrade procedure ("service
nginx upgrade") to make sure shared memory structures are intact.
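The binary upgrade procedure boils down to the signal sequence
below, which "service nginx upgrade" wraps on FreeBSD; this is a
hedged sketch and the pid-file path is an assumption:

```shell
# Sketch of nginx's binary-upgrade signals. USR2 forks a new master,
# so shared memory zones are created afresh; QUIT then gracefully
# retires the old master. Pid-file path is an assumption.
PID_FILE="${PID_FILE:-/var/run/nginx.pid}"

if [ -s "$PID_FILE" ]; then
    kill -USR2 "$(cat "$PID_FILE")"
    sleep 2
    # The old master's pid is kept in nginx.pid.oldbin during upgrade.
    { [ -s "$PID_FILE.oldbin" ] && kill -QUIT "$(cat "$PID_FILE.oldbin")"; } || true
else
    echo "no nginx pid file at $PID_FILE" >&2
fi
```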