Failed disk + proxy_intercept_errors
mdounin at mdounin.ru
Thu Feb 13 14:58:05 UTC 2020
On Wed, Feb 12, 2020 at 10:36:54AM -0500, chocholo3 wrote:
> In our deployment we do have configuration of proxy cache with multiple hard
> drives. Because of performance we don't have any RAID on these devices. That
> means we have to handle even a situation when drive dies, sometime.
> After disk failure of proxy_cache_path device nginx usually starts serving
> users with http500. So I've had an idea we may use proxy_intercept_errors
> but I end up with inconsistent state: ~60 files are handled as expected, but
> after that every connection is terminated prematurely without a single byte
> sent. In access.log there is http 200.
> I broke just ext4 FS (dd if=/dev/zero of=/dev/sdc bs=1k count=$((1024*100)))
> and I'm using nginx 1.17.7 on Linux
> Am I doing something wrong or is this a bug? Because of the inconsistency I
> tend to the 2nd. But I'm not sure at all :-)
First of all, the proxy_intercept_errors directive is only
relevant to errors returned by upstream servers. If the
error is generated by nginx itself, only the error_page directive
is relevant: as long as you have error_page 500 configured,
nginx will appropriately redirect processing of errors with code
500.

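
To illustrate the distinction, here is a minimal configuration sketch. The
cache path, zone name, upstream address, and error page location are all
hypothetical placeholders, not taken from the original setup:

```nginx
http {
    # Hypothetical cache location and zone name; adjust to the deployment.
    proxy_cache_path /mnt/disk1/cache keys_zone=disk1:10m;

    server {
        listen 80;

        location / {
            proxy_cache disk1;
            proxy_pass  http://backend.example.com;

            # Only affects responses returned by the upstream server
            # with codes greater than or equal to 300.
            proxy_intercept_errors on;

            # Also covers a 500 generated by nginx itself, for example
            # when a cache file cannot be read.
            error_page 500 /50x.html;
        }

        location = /50x.html {
            # Hypothetical directory holding the static error page.
            root /usr/share/nginx/html;
        }
    }
}
```

Note that error_page can only help while nginx is still able to send a
complete error response to the client.
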
As for the inconsistency you observe, this depends on the exact
moment the error happens. For some errors nginx might be able to
generate a friendly 500 response, for some it won't and will close
the connection as soon as the error happens.

For example, if an error happens when reading the cache header, nginx
should be able to return 500. But if an error happens later, when
reading the response body from the cache file, after the response
headers have already been processed (and either sent to the client or
buffered due to postpone_output), it certainly won't be possible
to return a friendly error page, so nginx will close the
connection.

Given the nature of your test, I suspect that the inconsistency
you observe is due to errors happening at different moments.

In real life, using "error_page 500" is certainly not enough
to protect users from broken responses due to failing disks.
Further, I don't think there is a way to fully protect users, except
by providing redundancy at the disk level. For example, consider
an error when reading some response body data from disk, with 1GB
of the response body already sent to the client. There is more or
less nothing to be done here, and the only option is to close the
connection.