Mark stale cache content as "invalid" on non-cacheable responses

Tue Nov 17 22:22:00 UTC 2015

Maxim Dounin <mdounin at mdounin.ru> wrote:
> > Context: consider nginx used as a cache with proxy_cache_use_stale set
> > to 'http_500' and the 'updating' parameter set, i.e. it caches errors and
> > serves the stale content while updating.  Suppose the upstream
> > temporarily responds with HTTP 504 and Cache-Control being max-age=3.
> > The error gets cached, but after 3 seconds it expires.  At this point,
> > let's say the upstream server starts serving HTTP 200 responses, but
> > with Cache-Control set to 'no-cache'.
> > 
> > The cache manager will not LRU the expired content immediately; it will
> > stay in the EXPIRED state while subsequent requests will result in 200s.
> > Problem: if there are multiple processes racing, the ones in the
> > UPDATING state will serve stale 504s.  That results in sporadic errors,
> > e.g.:
> > 
> > 200 EXPIRED
> > 504 UPDATING
> > 200 EXPIRED
> > ...
> > 
> > At the very least, I think the stale cache content should be marked as
> > "invalid" after the no-cache response (with the possibility to become
> > valid again if it becomes cacheable).  Whether the object should be kept
> > at all is something to debate.
> > 
> > Please find the preliminary patch attached.
> 
> I don't see how a response with "no-cache" is any different from an 
> earlier error.  Consider a slightly different scenario:
> 
> - a response is cached and then expires,
> 
> - an attempt to fetch new response results in a non-cacheable 
>   error.
> 
> In such a case, removing previously cached response is the worst 
> thing we can possibly do.  We are expected to return previously 
> cached stale responses in all cases we are configured to do so.
> 
> The change you've proposed completely rules out the possibility of 
> correct handling of this scenario.
> 

In your scenario, the upstream server requested such behaviour; it is a
transition point.  The "worst thing" also happens if the new response is
a temporary cacheable error.  This is primarily a question of
trusting/calibrating your upstream server (i.e. setting the Cache-Control
headers) vs deliberately overriding it.  There is no "correct" handling
in a general sense here, because this really depends on the caching layers
you build or integrate with.

Also, I would argue that the expectation is to serve the stale content
while the new content and its parameters are *unknown* (for instance,
because it is still being fetched).  The point here is that the upstream
server has made it known by serving a 200 and indicating the desire for it
to not be cached.  Let me put it this way: how else could the upstream
server tell the cache in front that it has to exit the serve-stale state?
Currently, nginx gets stuck -- the only way to eliminate those sporadic
errors is to manually purge those stale files.

> Trivial solutions to the problem you've described would be to 
> disable use of stale responses completely (which is the default), 
> or use "proxy_cache_use_stale http_504", or to avoid caching of 
> 504 errors (and the latter is something the RFC suggests to do by 
> default with any errors).
> 
> And while I agree that it would be good to behave better in the 
> scenario you've described, I tend to disagree with the change 
> suggested, and I'm not even sure a good solution exists.
> 

Right, whether 504s specifically (and other timeouts) should be cached is
something that can be debated.  The real question here is what the users
want to achieve with proxy_cache_use_stale.  It is a mechanism provided
to avoid redundant requests to the upstream server, right?  And one
aspect in particular is caching the errors for a very short time to defend
a struggling or failing upstream server.  I hope we can agree that it is
rather practical to recover from such a state.

Sporadically serving errors makes users unhappy.  However, it is not even
about the errors here.  You can also reproduce the problem with different
content, i.e. if the upstream server serves a cacheable HTTP 200 (call it
A) and then a non-cacheable HTTP 200 (call it B).  Some clients will get A
and some will get B (depending on who is winning the update race).  Hence the
real problem is that nginx is not consistent: it serves different content
based on a *race condition*.  How exactly is this beneficial or desirable?
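
To make the A/B race concrete, here is a toy model in Python.  It is only
a sketch, not nginx code: the names Entry, simulate, fetch_ticks and
mark_invalid are made up for illustration, the labels only loosely mirror
$upstream_cache_status, and I assume that a request hitting an "invalid"
entry simply goes upstream like any other expired one.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Entry:
    body: str           # stale cached body ("A")
    valid: bool = True  # hypothetical "invalid" flag from the proposal


def simulate(arrivals: List[int], fetch_ticks: int,
             mark_invalid: bool) -> List[Tuple[str, str]]:
    """Return (label, body) for each request arriving at the given ticks.

    The upstream always answers with a non-cacheable "B", so the stale
    entry is never replaced by a fresh one.
    """
    entry = Entry(body="A")
    lock_until: Optional[int] = None  # tick at which the in-flight update ends
    served: List[Tuple[str, str]] = []

    for t in arrivals:
        # An update that has finished releases the lock.  With the
        # proposed behaviour it also marks the stale entry as invalid.
        if lock_until is not None and t >= lock_until:
            lock_until = None
            if mark_invalid:
                entry.valid = False

        if not entry.valid:
            # Assumption: an invalid entry is never served stale.
            served.append(("EXPIRED", "B"))
        elif lock_until is None:
            # This request wins the race and performs the update itself.
            lock_until = t + fetch_ticks
            served.append(("EXPIRED", "B"))
        else:
            # Somebody else is updating: serve the stale body.
            served.append(("UPDATING", entry.body))

    return served


if __name__ == "__main__":
    ticks = [0, 1, 2, 3, 4, 5]
    print(simulate(ticks, fetch_ticks=2, mark_invalid=False))
    print(simulate(ticks, fetch_ticks=2, mark_invalid=True))
```

With one request per tick and a two-tick fetch, the first run keeps
alternating between B and A for as long as requests arrive.  The second
run serves A only once, while the first update is still in flight, and B
from then on.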

-- 
Mindaugas