Cache doesn't expire as should

Wed Dec 23 16:39:25 UTC 2015

On Wed, Dec 23, 2015 at 6:35 PM, Maxim Dounin <mdounin at mdounin.ru> wrote:

> Hello!
>
> On Wed, Dec 23, 2015 at 11:57:53AM +0300, Руслан Закиров wrote:
>
> > On Tue, Dec 22, 2015 at 4:49 PM, Maxim Dounin <mdounin at mdounin.ru>
> wrote:
> >
> > > Hello!
> > >
> > > On Tue, Dec 22, 2015 at 02:43:11PM +0300, Руслан Закиров wrote:
> > >
> > > [...]
> > >
> > > > Our error log has "ignore long locked inactive cache entry" alerts,
> but I
> > > > really couldn't match it to "defreeze" event. Access log has
> > > STALE/UPDATING
> > > > requests between the alert and EXPIRED (cache updating) request.
> > >
> > > The "ignore long locked inactive cache entry" alerts indicate that
> > > a cache entry was locked by some request, and wasn't unlocked for
> > > a long time.  The alert is expected to appear if a cache node is
> > > locked for cache inactive time (as set by proxy_cache_path
> > > inactive=, 10 minutes by default).
> >
> >
> > Inactive is defined in the config, but it's set to default 10m.
> >
> > What happens with requests after this time? Do they hit backend and
> update
> > cache? Do they use stale version?
>
> As long as an entry is locked and not accessed for the inactive
> time, it will cause the alert in question to be logged by cache
> manager.  Nothing else happens - that is, the entry is still
> locked, and if it contains information that the entry is currently
> being updated, requests to this entry will return stale content
> with UPDATING status as per proxy_cache_use_stale.
>
> This alert may in theory happen in normal situation if a backend
> response takes longer than the inactive time set.  But in general
> it indicates that there is a problem.
>

uwsgi on backends has harakiri setting at 9 seconds and 10 seconds timeout
in
nginx. Index page takes 0.2-0.7 seconds to render.

(Note well that "locked" here should not be confused with
> request-level locks as in proxy_cache_lock.)
>
> > I'm going to check "long locked" messages in the log to see how many was
> > for "/" location.
> > The hash should be the same if we didn't change cache key, right?
>
> Any message indicate that there is a problem.  If the "/" location
> is requested often, it'll likely won't be mentioned in the "ignore long
> locked" messages.
>
> > > Most likely reason is that a worker died or was killed
> > > while holding a lock on a cache node (i.e., while a request was
> > > waiting for a new response from a backend).
> > >
> >
> > Shouldn't be there a record in error log? Error log level at warn.
>
> There should be something like "worker process exited" messages,
> though at least some of them may be on the "notice" level (e.g.,
> if you force nginx worker to quit by sending SIGTERM to it
> directly).
>
> > > Trivial things to consider:
> > >
> > > - check logs for segmentation faults;
> >
> > no seg faults in logs
>
> Any other alerts, crit or emerg messages?  Any alerts about
> socket leaks on configuration reloads ("open socket left..."),
> sockets in CLOSED state?  How long relevant requests to upstream
> server take (i.e., make sure that the behaviour observed isn't a
> result of an upstream server just being very slow)?
>

Looks like we have some misconfiguration and I get the following:

2015/12/17 01:49:13 [*crit*] 5232#0: *599273664 rename()
"/var/nginx/cache/0080683483" to
"/var/www/storage/img/ru-cyber/answers/10-5.jpg" failed (13: Permission
denied) while reading upstream, client: 188.115.151.67, server:
www.sports.ru, request: "GET /storage/img/ru-cyber/answers/10-5.jpg
HTTP/1.1", upstream: "
http://192.168.1.10:80/storage/img/ru-cyber/answers/10-5.jpg", host: "
www.sports.ru", referrer: "http://cyber.sports.ru/dota2/1034151992.html"

2015/12/17 01:50:17 [*crit*] 5229#0: *599313395 open()
"/var/www/storage/img/fantasy/shirts/ule/slo.png" failed (13: Permission
denied), client: 192.168.1.10, server: www.sports.ru, request: "GET
/storage/img/fantasy/shirts/ule/slo.png HTTP/1.0", host: "www.sports.ru",
referrer: "
http://fantasy-h2h.ru/analytics/fantasy_statistics/liga_europa_2015"

> > - if you are using 3rd party modules / patches, try without them;
> > >
> >
> > from freebsd port, updated gist [1] with `nginx -V` output
> > [1] https://gist.github.com/ruz/05456767750715f6b54e
>
> The only suspicious thing I see there is "ssi on" and
> "proxy_cache_lock" at the same time.  There were some fixes
> related to proxy_cache_lock with subrequests in more recent
> versions.
>
> > > - make sure you don't kill worker processes yourself or using some
> > >   automation scripts (in particular, don't try to terminate old
> > >   worker processes after a configuration reload).
> > >
> >
> > One recent appearance of the problem was at 1:30AM and I checked
> > logs for crazy midnight deploys - nothing.
> >
> > Also, we don't use anything custom to restart nginx, just regular
> services
> > management tools.
>
> Just a regular kill will be enough screw up things.  On the other
> hand, it may be an nginx bug as well.
>
> Note though, that the version you are using is quite old.
> Debugging anything but 1.9.x hardly make sense.  You may want to
> try upgrading to the latest nginx-devel port.
>

As I said before 1.8.0 made things worse. May be we will try nginx 1.9.x on
one of frontends and see how it goes.

>
> Note well that if a problem happens, it's a good idea to either
> completely restart nginx or do binary upgrade procedure ("service
> nginx upgrade") to make sure shared memory structures are intact.
>
> --
> Maxim Dounin
> http://nginx.org/
>
> _______________________________________________
> nginx mailing list
> nginx at nginx.org
> http://mailman.nginx.org/mailman/listinfo/nginx
>

-- 
Руслан Закиров
Руководитель отдела разработки веб-сервисов
+7(916) 597-92-69, ruz @  <http://www.sports.ru/>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mailman.nginx.org/pipermail/nginx/attachments/20151223/a67bddfd/attachment.html>