[PATCH 4 of 4] AIO operations now add timers (ticket #2162)
J Carter
jordanc.carter at outlook.com
Tue Jan 9 15:01:31 UTC 2024
Hello,
On Tue, 9 Jan 2024 08:59:14 +0300
Maxim Dounin <mdounin at mdounin.ru> wrote:
> Hello!
>
> On Mon, Jan 08, 2024 at 01:31:11PM +0000, J Carter wrote:
>
> > On Mon, 8 Jan 2024 11:25:55 +0000
> > J Carter <jordanc.carter at outlook.com> wrote:
> >
> > > Hello,
> > >
> > > On Mon, 27 Nov 2023 05:50:27 +0300
> > > Maxim Dounin <mdounin at mdounin.ru> wrote:
> > >
> > > > # HG changeset patch
> > > > # User Maxim Dounin <mdounin at mdounin.ru>
> > > > # Date 1701050170 -10800
> > > > # Mon Nov 27 04:56:10 2023 +0300
> > > > # Node ID 00c3e7333145ddb5ea0eeaaa66b3d9c26973c9c2
> > > > # Parent 61d08e4cf97cc073200ec32fc6ada9a2d48ffe51
> > > > AIO operations now add timers (ticket #2162).
> > > >
> > > > Each AIO (thread IO) operation being run is now accompanied by a 1-minute
> > > > timer. This timer prevents unexpected shutdown of the worker process while
> > > > an AIO operation is running, and logs an alert if the operation is running
> > > > for too long.
> > >
> > > Shouldn't this timer's duration be set to match worker_shutdown_timeout's
> > > duration rather than being hard-coded to 60s?
> >
> > Ah, never mind - I understand.
> >
> > These timers will either expire after the set 60s duration, or
> > will expire as worker_shutdown_timeout itself expires, kills the
> > connection and times out the associated timers (including the AIO timers).
> >
> > Setting it to worker_shutdown_timeout's duration would be pointless
> > (an 'infinite' timer would give the same result).
> >
> > So the only situation in which a different value for these AIO
> > timers would make sense is if these AIO operations are expected to
> > take longer than 60s, but less than worker_shutdown_timeout (in cases
> > where it has been increased from its own default of 60s).
> >
> > In that case the AIO operation's timeout would have to be set by one
> > (or more) dedicated directives of its own, with a value less than
> > worker_shutdown_timeout.
>
> Not really.
>
> When worker_shutdown_timeout expires, it tries to terminate the
> request, but it can't as long as an AIO operation is running.
> When the AIO operation completes, the request will be actually
> terminated and the worker process will be allowed to exit. So far
> so good.
>
> But if the AIO operation never completes, the timer will expire
> after 1 minute, will log an alert, and the worker process will
> be allowed to exit anyway (with the request still present). This
> might not actually be possible, though - for example,
> ngx_thread_pool_exit_worker() will just block waiting for all
> pending operations to complete.
>
> In theory, the situation when an AIO operation never completes
> should never happen, and just a counter of pending AIO
> operations can be used instead to delay shutdown (which is
> essentially equivalent to an infinite timer).
>
> In practice, though, using a large enough guard timer might be
> better: it provides an additional level of protection against bugs or
> system misbehaviour, and at least logs an alert if something
> really weird happens. It also looks more universal and in line
> with the current approach of using existing timers as an indicator
> that something is going on and shutdown should be delayed.
>
> The timer is expected to be "large enough", since we cannot do
> anything meaningful with an AIO operation which never completes,
> we can only complain loudly, so the timer should never expire
> unless there is something really wrong. This is not the case with
> worker_shutdown_timeout: it can be set to an arbitrarily low value,
> which is not expected to mean that something is really wrong if
> the timer expires, but rather means that nginx shouldn't try to
> process remaining requests, but should instead close them as soon
> as it can. That is, worker_shutdown_timeout does not fit
> semantically. Further, worker_shutdown_timeout is not set by
> default, so it simply cannot be used.
>
Good point - for whatever reason I had it in my mind that it was
set by default (and you're right, it doesn't fit in any case).
> The 1 minute was chosen as it matches the default send_timeout, which
> typically accompanies AIO operations when sending responses (and
> also delays shutdown, so no "open socket left" alerts are normally
> seen). Still, send_timeout is also quite different semantically,
> and therefore a hardcoded value is used instead.
>
> I don't think there are valid cases when AIO operations can take
> longer than 1 minute, as these are considered to be small and fast
> operations, which are normally done synchronously within the nginx
> event loop when not using AIO, and an operation which takes 1 minute
> would mean nginx is completely unresponsive. Still, if we find
> out that 1 minute is not large enough in some cases,
> we can just bump it to a larger value.
>
> [...]
>
Thanks for the detailed explanation - makes sense to me.
More information about the nginx-devel mailing list