Failed disk + proxy_intercept_errors

Thu Feb 13 14:58:05 UTC 2020

Hello!

On Wed, Feb 12, 2020 at 10:36:54AM -0500, chocholo3 wrote:

> Hi,
> In our deployment we do have configuration of proxy cache with multiple hard
> drives. Because of performance we don't have any RAID on these devices. That
> means we have to handle even a situation when drive dies, sometime.
> 
> After disk failure of proxy_cache_path device nginx usually starts serving
> users with http500. So I've had an idea we may use proxy_intercept_errors
> but I end up with inconsistent state: ~60 files are handled as expected, but
> after that every connection is terminated prematurely without a single byte
> sent. In access.log there is http 200.
> 
> I broke just ext4 FS (dd if=/dev/zero of=/dev/sdc bs=1k count=$((1024*100)))
> and I'm using nginx 1.17.7 on Linux

[...]

> Am I doing something wrong or is this a bug? Because of the inconsistency I
> tend to the 2nd. But I'm not sure at all :-)

First of all, the proxy_intercept_errors directive is only 
relevant to errors returned by upstream servers.  If the error is 
generated by nginx itself, only the error_page directives are 
relevant: with error_page 500 configured, nginx will redirect 
processing of errors with code 500 accordingly.
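
As a minimal sketch of the distinction (the zone name, upstream 
name, and paths below are placeholders made up for illustration, 
not taken from your configuration, and the fragment is assumed to 
live inside the http block):

```nginx
# Hypothetical names: "cache_zone", "backend", /50x.html and all paths.
proxy_cache_path /mnt/disk1/cache keys_zone=cache_zone:10m;

server {
    listen 80;

    # Applies to 500 errors generated by nginx itself,
    # for example when a cache file cannot be read.
    error_page 500 /50x.html;

    location / {
        proxy_pass http://backend;
        proxy_cache cache_zone;

        # Only affects responses with codes >= 300 returned by the
        # upstream server; "off" is the default.
        proxy_intercept_errors off;
    }

    location = /50x.html {
        root /usr/share/nginx/html;
    }
}
```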

As for the inconsistency you observe, this depends on the exact 
moment the error happens.  For some errors nginx might be able to 
generate a friendly 500, for others it won't and will close the 
connection as soon as the error happens.

For example, if an error happens when reading the cache header, 
nginx should be able to return 500.  But if an error happens later, when 
reading the response body from the cache file, when the response 
headers are already processed (and either sent to the client or 
buffered due to postpone_output), it certainly won't be possible 
to return a friendly error page, so nginx will close the 
connection.

Given the nature of your test, I suspect that the inconsistency 
you observe is due to errors happening at different moments.

In real life, using "error_page 500" is certainly not enough 
to protect users from broken responses due to failing disks.  
Further, I don't think there is a way to fully protect users, 
except by providing redundancy at the disk level.  For example, 
consider an error when reading some response body data from disk, 
with 1GB of the response body already sent to the client.  There 
is more or less nothing to be done here, and the only option is to 
close the connection.

-- 
Maxim Dounin
http://mdounin.ru/