Mark stale cache content as "invalid" on non-cacheable responses

Wed Nov 18 17:15:04 UTC 2015

Hello!

On Tue, Nov 17, 2015 at 10:22:00PM +0000, Mindaugas Rasiukevicius wrote:

> Maxim Dounin <mdounin at mdounin.ru> wrote:
> > > Context: consider nginx used as a cache with proxy_cache_use_stale set
> > > to 'http_500' and the 'updating' parameter set i.e. it caches errors and
> > > serves the stale content while updating.  Suppose the upstream
> > > temporarily responds with HTTP 504 and Cache-Control being max-age=3.
> > > The error gets cached, but after 3 seconds it expires.  At this point,
> > > let's say the upstream server starts serving HTTP 200 responses, but
> > > with Cache-Control set to 'no-cache'.
> > > 
> > > The cache manager will not LRU the expired content immediately; it will
> > > stay in the EXPIRED state while subsequent requests will result in 200s.
> > > Problem: if there are multiple processes racing, the ones in the
> > > UPDATING state will serve stale 504s.  That results in sporadic errors,
> > > e.g.:
> > > 
> > > 200 EXPIRED
> > > 504 UPDATING
> > > 200 EXPIRED
> > > ...
> > > 
> > > At the very least, I think the stale cache content should be marked as
> > > "invalid" after the no-cache response (with the possibility to become
> > > valid again if it becomes cacheable).  Whether the object should be kept
> > > at all is something to debate.
> > > 
> > > Please find the preliminary patch attached.
> > 
> > I don't see how a response with "no-cache" is any different from an 
> > earlier error.  Consider a slightly different scenario:
> > 
> > - a response is cached and then expires,
> > 
> > - an attempt to fetch new response results in a non-cacheable 
> >   error.
> > 
> > In such a case, removing previously cached response is the worst 
> > thing we can possibly do.  We are expected to return previously 
> > cached stale responses in all cases we are configured to do so.
> > 
> > The change you've proposed completely rules out the possibility of 
> > correct handling of this scenario.
> > 
> 
> In your scenario, the upstream server requested such behaviour; it is a
> transition point.

It didn't request anything.  It merely returned an error.

> The "worst thing" also happens if the response would
> result in a temporary cacheable error.

And that's why returning a "temporary cacheable error" is a bad 
idea if you are using proxy_cache_use_stale.

> This is primarily a question of
> trusting/calibrating your upstream server (i.e. setting the Cache-Control
> headers) vs deliberately overriding it.  There is no "correct" handling
> in a general sense here, because this really depends on the caching layers
> you build or integrate with.

I agree: there is no correct handling if you don't know your 
upstream server behaviour.  By enabling use of stale responses you 
agree that your upstream server will behave accordingly.  In your 
scenario, the upstream server misbehaves, and this (expectedly) 
causes the problem.
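
For reference, the setup under discussion can be sketched roughly as 
follows.  This is a minimal sketch, not the actual configuration from 
the report: the zone name "example_cache", the cache path, and the 
upstream name and address are assumptions made for illustration.  Only 
the proxy_cache_use_stale parameters come from the description above.

```nginx
# Fragment for the http context.
# Hypothetical names: example_cache, /var/cache/nginx/example, backend.
proxy_cache_path /var/cache/nginx/example keys_zone=example_cache:10m;

upstream backend {
    server 127.0.0.1:8080;   # assumed upstream address
}

server {
    listen 80;

    location / {
        proxy_pass http://backend;
        proxy_cache example_cache;

        # As described in the report: use stale content on upstream
        # 500 responses and while a cache element is being updated.
        proxy_cache_use_stale http_500 updating;
    }
}
```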

> Also, I would argue that the expectation is to serve the stale content
> while the new content and its parameters are *unknown* (say, because, for
> instance, it is still being fetched).  The point here is that the upstream
> server has made it known by serving a 200 and indicating the desire for it
> to not be cached.  Let me put it this way: how else could the upstream
> server tell the cache in front that it has to exit the serve-stale state?
> Currently, nginx gets stuck -- the only way to eliminate those sporadic
> errors is to manually purge those stale files.

As of now, there is no way for the upstream server to control how 
previously cached responses will be used to serve stale responses 
(if nginx is configured to do so).

You suggest addressing it by making 200 + no-cache special, 
meaning something like "please remove anything cached".  This disagrees 
with the code you've provided though, as it makes any non-cacheable 
response special.  Additionally, this disagrees with various use 
cases where a non-cacheable response doesn't mean anything special, 
but is rather an error, even if returned with status 200.  Or, in 
some more complicated setups, it may be just a user-specific 
response (which shouldn't be cached, in contrast to generic 
responses to the same resource).

> > Trivial solutions to the problem you've described would be to 
> > disable use of stale responses completely (which is the default), 
> > or use "proxy_cache_use_stale http_504", or to avoid caching of 
> > 504 errors (and the latter is something the RFC suggests doing by 
> > default with any errors).
> > 
> > And while I agree that it would be good to behave better in the 
> > scenario you've described, I tend to disagree with the change 
> > suggested, and I'm not even sure a good solution exists.
> 
> Right, whether 504s specifically (and other timeouts) should be cached is
> something that can be debated.  The real question here is what the users
> want to achieve with proxy_cache_use_stale.  It is a mechanism provided
> to avoid the redundant requests to the upstream server, right?  And one
> aspect in particular is caching the errors for very short time to defend
> a struggling or failing upstream server.  I hope we can agree that it is
> rather practical to recover from such state.

Caching errors is not something proxy_cache_use_stale was 
introduced for.  And this case rather contradicts 
proxy_cache_use_stale assumptions about upstream server behaviour.  
That is, two basic options are to either change the behaviour, or 
to avoid using "proxy_cache_use_stale updating".
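
The second option might look like the sketch below.  It assumes a 
hypothetical cache zone "example_cache" and upstream "backend" defined 
elsewhere, and the parameter list is only an illustration to be adjusted 
to the failures one actually wants to cover.

```nginx
location / {
    proxy_pass http://backend;        # hypothetical upstream
    proxy_cache example_cache;        # hypothetical cache zone

    # No "updating" parameter: stale responses are used only when the
    # upstream fails in one of the listed ways, not while a cache
    # element is being refreshed.
    proxy_cache_use_stale error timeout http_504;
}
```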

> Sporadically serving errors makes users unhappy.  However, it is not even
> about the errors here.  You can also reproduce the problem with different
> content i.e. if the upstream server serves cacheable HTTP 200 (call it A)
> and then non-cacheable HTTP 200 (call it B).  Some clients will get A and
> some will get B (depending on who is winning the update race).  Hence the
> real problem is that nginx is not consistent: it serves different content
> based on a *race condition*.  How exactly is this beneficial or desirable?

This example is basically the same, so see above.

Again, I'm not saying the current behaviour is good.  It has an obvious 
limitation, and it would be good to resolve this limitation.  But 
the solution proposed doesn't look like a good one either.

-- 
Maxim Dounin
http://nginx.org/