Cache doesn't expire as should
mdounin at mdounin.ru
Wed Dec 23 15:35:57 UTC 2015
On Wed, Dec 23, 2015 at 11:57:53AM +0300, Руслан Закиров wrote:
> On Tue, Dec 22, 2015 at 4:49 PM, Maxim Dounin <mdounin at mdounin.ru> wrote:
> > Hello!
> > On Tue, Dec 22, 2015 at 02:43:11PM +0300, Руслан Закиров wrote:
> > [...]
> > > Our error log has "ignore long locked inactive cache entry" alerts, but I
> > > really couldn't match it to "defreeze" event. Access log has
> > > STALE/UPDATING
> > > requests between the alert and EXPIRED (cache updating) request.
> > The "ignore long locked inactive cache entry" alerts indicate that
> > a cache entry was locked by some request, and wasn't unlocked for
a long time. The alert is expected to appear if a cache node
stays locked for the inactive time (as set by proxy_cache_path
inactive=, 10 minutes by default).
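For reference, the timer in question is the inactive= parameter of
proxy_cache_path; the values below are only an illustrative sketch,
not your actual configuration:

```nginx
# Illustrative only. An entry that is not accessed for inactive=
# becomes eligible for removal by the cache manager; if such an
# entry is also still locked, the "ignore long locked inactive
# cache entry" alert is logged instead.
proxy_cache_path /var/cache/nginx levels=1:2
                 keys_zone=my_cache:10m
                 inactive=10m max_size=1g;
```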
> Inactive is defined in the config, but it's set to default 10m.
> What happens with requests after this time? Do they hit backend and update
> cache? Do they use stale version?
As long as an entry is locked and not accessed for the inactive
time, it will cause the alert in question to be logged by the
cache manager. Nothing else happens: the entry remains locked,
and if it indicates that a response is currently being updated,
requests to this entry will return stale content with the
UPDATING status, as per proxy_cache_use_stale.
This alert may in theory happen in a normal situation, if a
backend response takes longer than the configured inactive time.
But in general it indicates that there is a problem.
(Note well that "locked" here should not be confused with
request-level locks as in proxy_cache_lock.)
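A sketch of the two directives being discussed, to make the
distinction concrete (the location and upstream name are assumptions,
not taken from the poster's configuration):

```nginx
# Illustrative only. "proxy_cache_use_stale updating" is what makes
# requests get stale content with the UPDATING status while one
# request refreshes the entry. "proxy_cache_lock" is a separate,
# request-level lock that merely limits concurrent cache fills and
# is unrelated to the cache-node lock behind the alert.
location / {
    proxy_pass http://backend;
    proxy_cache my_cache;
    proxy_cache_use_stale error timeout updating;
    proxy_cache_lock on;
}
```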
> I'm going to check "long locked" messages in the log to see how many was
> for "/" location.
> The hash should be the same if we didn't change cache key, right?
Any message indicates that there is a problem. If the "/"
location is requested often, it likely won't be mentioned in the
"ignore long locked inactive cache entry" alerts, since frequent
requests keep the entry from staying inactive.
> > Most likely reason is that a worker died or was killed
> > while holding a lock on a cache node (i.e., while a request was
> > waiting for a new response from a backend).
> Shouldn't be there a record in error log? Error log level at warn.
There should be something like "worker process exited" messages,
though at least some of them may be on the "notice" level (e.g.,
if you force an nginx worker to quit by sending SIGTERM to it).
> > Trivial things to consider:
> > - check logs for segmentation faults;
> no seg faults in logs
Any other alerts, crit or emerg messages? Any alerts about
socket leaks on configuration reloads ("open socket left..."),
sockets in CLOSED state? How long do relevant requests to the
upstream server take (i.e., make sure that the behaviour observed
isn't a result of an upstream server just being very slow)?
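The checks above can be sketched as a couple of grep commands; the
log path is an assumption for a typical install, adjust it to yours:

```shell
# Sketch: triage the nginx error log for the messages discussed above.
# Path is an assumption; "|| true" keeps the script going on no match.
LOG="${LOG:-/var/log/nginx/error.log}"

# Dead or killed workers (these can appear at "notice" level):
grep -E 'worker process .* exited' "$LOG" 2>/dev/null || true

# Severity spikes, socket leaks on reload, long-locked cache nodes:
grep -E '\[(alert|crit|emerg)\]|open socket .* left|ignore long locked inactive cache entry' \
    "$LOG" 2>/dev/null || true
```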
> - if you are using 3rd party modules / patches, try without them;
> from freebsd port, updated gist  with `nginx -V` output
>  https://gist.github.com/ruz/05456767750715f6b54e
The only suspicious thing I see there is "ssi on" and
"proxy_cache_lock" at the same time. There were some fixes
related to proxy_cache_lock with subrequests in more recent
versions.
> > - make sure you don't kill worker processes yourself or using some
> > automation scripts (in particular, don't try to terminate old
> > worker processes after a configuration reload).
> One recent appearance of the problem was at 1:30AM and I checked
> logs for crazy midnight deploys - nothing.
> Also, we don't use anything custom to restart nginx, just regular services
> management tools.
Just a regular kill will be enough to screw things up. On the
other hand, it may be an nginx bug as well.
Note, though, that the version you are using is quite old.
Debugging anything but 1.9.x hardly makes sense. You may want to
try upgrading to the latest nginx-devel port.
Note well that if a problem happens, it's a good idea to either
completely restart nginx or do binary upgrade procedure ("service
nginx upgrade") to make sure shared memory structures are intact.
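The binary upgrade procedure boils down to the signal sequence
below, which "service nginx upgrade" wraps on FreeBSD; this is a
hedged sketch and the pid-file path is an assumption:

```shell
# Sketch of nginx's binary-upgrade signals. USR2 forks a new master,
# so shared memory zones are created afresh; QUIT then gracefully
# retires the old master. Pid-file path is an assumption.
PID_FILE="${PID_FILE:-/var/run/nginx.pid}"

if [ -s "$PID_FILE" ]; then
    kill -USR2 "$(cat "$PID_FILE")"
    sleep 2
    # The old master's pid is kept in nginx.pid.oldbin during upgrade.
    { [ -s "$PID_FILE.oldbin" ] && kill -QUIT "$(cat "$PID_FILE.oldbin")"; } || true
else
    echo "no nginx pid file at $PID_FILE" >&2
fi
```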