Mark stale cache content as "invalid" on non-cacheable responses

Wed Nov 18 20:25:58 UTC 2015

Hello!

On Wed, Nov 18, 2015 at 06:56:38PM +0000, Mindaugas Rasiukevicius wrote:

> Maxim Dounin <mdounin at mdounin.ru> wrote:
> > <...>
> > > 
> > > In your scenario, the upstream server requested such behaviour; it is a
> > > transition point.
> > 
> > It didn't request anything.  It merely returned an error.
> > 
> 
> I am afraid I cannot agree with this.  Cache-Control is a directive which
> requests certain behaviour from a cache.  Think of 'no-cache' as a barrier
> marking the necessary transition point.  RFC 7234 section 4.2.4 ("Serving
> Stale Responses") seems to be clear on the stale case too (section 4 also
> makes an obvious point that the most recent response should be obeyed):
> 
>    A cache MUST NOT generate a stale response if it is prohibited by an
>    explicit in-protocol directive (e.g., by a "no-store" or "no-cache"
>    cache directive, a "must-revalidate" cache-response-directive, or an
>    applicable "s-maxage" or "proxy-revalidate" cache-response-directive;
>    see Section 5.2.2).

The response stored in cache has neither "no-cache" nor any other
directive in it, and this "MUST NOT" certainly doesn't apply to
it.

In the scenario I've described, the response in cache is a correct 
(but stale) response, as returned by an upstream server when it 
was up and running normally.

In the scenario you've described, the response in cache is a 
"temporary cacheable error", and it doesn't have any directives 
attached to it either.

> > > The "worst thing" also happens if the response would
> > > result in a temporary cacheable error.
> > 
> > And that's why returning a "temporary cacheable error" is a bad 
> > idea if you are using proxy_cache_use_stale.
> > 
> > > This is primarily a question of
> > > trusting/calibrating your upstream server (i.e. setting the
> > > Cache-Control headers) vs deliberately overriding it.  There is no
> > > "correct" handling in a general sense here, because this really depends
> > > on the caching layers you build or integrate with.
> > 
> > I agree: there is no correct handling if you don't know your 
> > upstream server behaviour.  By enabling use of stale responses you 
> > agree that your upstream server will behave accordingly.  In your 
> > scenario, the upstream server misbehaves, and this (expectedly) 
> > causes the problem.
> 
> Why is temporary caching of an error a bad idea?  The upstream server
> in my example had such configuration deliberately, it did not misbehave.
> For the given URI it does serve the dynamic content which must never be
> cached.  However, it has a more general policy asking to cache the errors
> for 3 seconds.  This is to defend the potentially struggling or failing
> origin.  It seems like a quite practical reason; I think it is something
> used quite commonly in the industry.

The problem is that such a configuration isn't compatible with 
"proxy_cache_use_stale updating" assumptions about the upstream 
behaviour.

> > > Also, I would argue that the expectation is to serve the stale content
> > > while the new content and its parameters are *unknown* (say, because,
> > > for instance, it is still being fetched).  The point here is that the
> > > upstream server has made it known by serving a 200 and indicating the
> > > desire for it to not be cached.  Let me put it this way: how else could
> > > the upstream server tell the cache in front that it has to exit the
> > > serve-stale state? Currently, nginx gets stuck -- the only way to
> > > eliminate those sporadic errors is to manually purge those stale files.
> > 
> > As of now, there is no way for an upstream server to control how
> > previously cached responses will be used to serve stale responses 
> > (if nginx is configured to do so).
> 
> Again, the way I interpret the RFC is that the Cache-Control header *is*
> the way.

The Cache-Control header allows you to control cacheability of a
particular response, and returning 504 errors with "Cache-Control:
no-cache" will resolve the problem in your scenario.  Though I see
no reason why Cache-Control on another response should be
applicable to a previously stored response - in general it's not
possible at all, as a response may be returned to a different
client.

> > You suggest addressing it by making 200 + no-cache special,
> > meaning something like "please remove anything cached".  This disagrees
> > with the code you've provided though, as it makes any non-cacheable
> > response special.  Additionally, this disagrees with various use
> > cases when a non-cacheable response doesn't mean anything special, 
> > but rather an error, even if returned with status 200.  Or, in 
> > some more complicated setups, it may be just a user-specific 
> > response (which shouldn't be cached, in contrast to generic 
> > responses to the same resource).
> 
> In the original case, nginx sporadically throws errors at users when there
> is no real error, while temporarily caching errors when they indeed happen
> is a beneficial and desired feature.  However, I do not think it really
> matters whether one of the responses is an error or not.  Let's talk about
> the generic case.  If we have a sequence of cacheable responses and then a
> response with the Cache-Control header set to 'no-cache', then I believe
> the cache must invalidate that content.  Because otherwise it does not obey
> the upstream server and does not preserve the consistency of the content.

As explained, the upstream server has no way to say something 
additional about a response it returned previously.

> Let's put it this way: what is your use case i.e. when is such behaviour
> problematic?  If you have a location (object or page) where the upstream
> server constantly mixes "cache me" and "don't cache me", then there is no
> point to cache it (i.e. it is inherently not cacheable content which just
> busts your cache anyway).

I've already described at least 2 use cases where the current
behaviour works fine, and the one you suggest is problematic.
Again:

Use case 1, a cache with possible errors:

A high traffic resource, which normally can be cached for a long 
time, but takes a long time to generate.  A response is stored in 
the cache, and "proxy_cache_use_stale updating" is used to prevent 
multiple clients from updating the cache at the same time.  If at 
some point a request to update the cache fails / times out, a
"degraded" version is returned with caching disabled (this can be
an error, or a normal response without some data).  The response
previously stored in the cache is preserved and will be returned
to other clients while we try to update the cache again.
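
As a rough sketch of this setup (the zone name "pages", the cache
path, and the upstream address are placeholders, not taken from a
real configuration):

```nginx
# http{} context.  All names, paths, and sizes are illustrative.
proxy_cache_path /var/cache/nginx/pages keys_zone=pages:10m;

server {
    listen 80;

    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_cache pages;

        # While one request is refreshing an expired entry, other
        # clients get the stale copy instead of going to the upstream.
        proxy_cache_use_stale updating;
    }
}
```

Here cacheability is decided by the upstream's own response
headers: a "degraded" response sent with "Cache-Control: no-cache"
is passed to the client that triggered the update but is not
stored, so the stale entry stays in place.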

Use case 2, a cache with non-cacheable private responses:

A resource has two versions: one is "general" and can/should be 
cached (e.g., "guest user" version of a page), and another one 
is private and should not be cached by nginx ("logged in user" 
version).  The "proxy_cache_bypass" directive is used to determine 
if a cached version can be returned, or a request to an upstream 
server is needed.  "Logged in" responses are returned with disabled 
caching, while "guest user" responses are cacheable.
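
A minimal sketch of such a configuration, assuming a hypothetical
"session" cookie that marks logged-in users and a cache zone named
"pages" defined elsewhere with proxy_cache_path:

```nginx
location / {
    # Placeholder upstream address.
    proxy_pass http://127.0.0.1:8080;
    proxy_cache pages;

    # A non-empty, non-"0" value makes nginx skip the cache lookup,
    # so requests with the (assumed) "session" cookie always go to
    # the upstream.
    proxy_cache_bypass $cookie_session;
}
```

Note that proxy_cache_bypass only affects the lookup.  The
"logged in" responses stay out of the cache because the upstream
returns them with caching disabled, e.g. "Cache-Control: no-cache".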

Both use cases are real.  The first one is basically the use case
"proxy_cache_use_stale updating" was originally introduced for.
The second one is something often seen on the mailing list as
configured by various nginx users.  Both will be broken by your
patch.

> > > Right, whether 504s specifically (and other timeouts) should be cached
> > > is something that can be debated.  The real question here is what the
> > > users want to achieve with proxy_cache_use_stale.  It is a mechanism
> > > provided to avoid the redundant requests to the upstream server,
> > > right?  And one aspect in particular is caching the errors for a very
> > > short time to defend a struggling or failing upstream server.  I hope
> > > we can agree that it is rather practical to recover from such state.
> > 
> > Caching errors is not something proxy_cache_use_stale was 
> > introduced for.  And this case rather contradicts 
> > proxy_cache_use_stale assumptions about upstream server behaviour.  
> > That is, two basic options are to either change the behaviour, or 
> > to avoid using "proxy_cache_use_stale updating".
> 
> Perhaps it was not, but it provides such an option and the option is used
> in the wild.  Again, the presence of error here does not matter much as
> the real problem is obeying the upstream server directives and preserving
> the consistency.

The two options suggested still apply: either change the upstream 
server behaviour to match "proxy_cache_use_stale updating" 
assumptions (basically, don't try to convert a cacheable resource
to a non-cacheable one), or switch it off.
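
For the second option, a sketch of what switching it off could
look like, assuming the current configuration also lists "error"
and "timeout" (that part is a guess about your setup):

```nginx
# Before (assumed):
#   proxy_cache_use_stale error timeout updating;

# After: stale responses are still used when the upstream fails,
# but no longer while an entry is being updated.
proxy_cache_use_stale error timeout;
```

The cost is that requests arriving while an expired entry is being
refreshed are no longer answered from the stale copy.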

> > > Sporadically serving errors makes users unhappy.  However, it is not
> > > even about the errors here.  You can also reproduce the problem with
> > > different content i.e. if the upstream server serves cacheable HTTP 200
> > > (call it A) and then non-cacheable HTTP 200 (call it B).  Some clients
> > > will get A and some will get B (depending on who is winning the update
> > > race).  Hence the real problem is that nginx is not consistent: it
> > > serves different content based on a *race condition*.  How exactly is
> > > this beneficial or desirable?
> > 
> > This example is basically the same, so see above.
> > 
> 
> Right, it is just a good illustration of the consistency problem.  I do not
> really see a conceptual difference between the current nginx behaviour and
> a database sporadically returning the result of some old transaction.  It's
> broken.

See above.  It's expected to be broken if you try to use it in
conditions it's not expected to be used in.

> > Again, I'm not saying the current behaviour is good.  It has an obvious
> > limitation, and it would be good to resolve this limitation.  But 
> > the solution proposed doesn't look like a good one either.
> 
> Okay, so what solution do you propose?

As I already wrote in the very first reply, I'm not even sure a
good solution exists.  Maybe some timeouts like the ones proposed by
RFC 5861 will work (though this will limit various "use stale" cases
considerably with low timeouts, and won't help much with high
ones).  Or maybe we can introduce some counters/heuristics to
detect cacheable->uncacheable transitions.  Maybe just enforcing
"inactive" time on such resources regardless of actual requests
will work (but unlikely, as an upstream server can be down for a
considerable time in some cases).

-- 
Maxim Dounin
http://nginx.org/