[crit] 16665#0 unlink()

Mon May 6 14:21:29 UTC 2013

On 05/06/13 09:54, Maxim Dounin wrote:
> Hello!
>
> On Mon, May 06, 2013 at 09:01:45AM -0400, Jim Ohlstein wrote:
>
>> On 05/05/13 16:32, Maxim Dounin wrote:
>>> Hello!
>>>
>>> On Sat, May 04, 2013 at 07:08:55PM -0400, Jim Ohlstein wrote:
>>>
>>> [...]
>>>
>>>> I have just seen a similar situation using fastcgi cache. In my case
>>>> I am using the same cache (but only one cache) for several
>>>> server/location blocks. The system is a fairly basic nginx set up
>>>> with four upstream fastcgi servers and ip hash. The returned content
>>>> is cached locally by nginx. The cache is rather large but I wouldn't
>>>> think this would be the cause.
>>>
>>> [...]
>>>
>>>>      fastcgi_cache_path /var/nginx/fcgi_cache levels=1:2
>>>> keys_zone=one:512m max_size=250g inactive=24h;
>>>
>>> [...]
>>>
>>>> The other sever/location blocks are pretty much identical insofar as
>>>> fastcgi and cache are concerned.
>>>>
>>>> When I upgraded nginx using the "on the fly" binary upgrade method,
>>>> I saw almost 400,000 lines in the error log that looked like this:
>>>>
>>>> 2013/05/04 17:54:25 [crit] 65304#0: unlink()
>>>> "/var/nginx/fcgi_cache/7/2e/899bc269a74afe6e0ad574eacde4e2e7" failed
>>>> (2: No such file or directory)
>>>
>>> [...]
>>>
>>> After binary upgrade there are two cache zones - one in old nginx,
>>> and another one in new nginx (much like in originally posted
>>> configuration).  This may cause such errors if e.g. a cache file
>>> is removed by old nginx, and new nginx fails to remove the file
>>> shortly after.
>>>
>>> The 400k lines is a bit too many though.  You may want to check
>>> that the cache wasn't just removed by some (package?) script
>>> during the upgrade process.  Alternatively, it might indicate that
>>> you let old and new processes to coexist for a long time.
>>
>> I hadn't considered that there are two zones during that short time.
>> Thanks for pointing that out.
>>
>> To my knowledge, there are no scripts or packages which remove files
>> from the cache, or the entire cache. A couple of minutes after this
>> occurred there were a bit under 1.4 million items in the cache and
>> it was "full" at 250 GB. I did look in a few sub-directories at the
>> time, and most of the items were time stamped from before this
>> started so clearly the entire cache was not removed. During the time
>> period these entries were made in the error log, and in the two
>> minutes after, access log entries show the expected ratio of "HIT"
>> and "MISS" entries which further supports your point below that
>> these are harmless (although I don't really believe that I have a
>> cause).
>>
>> I'm not sure what you mean by a "long time" but all of these entries
>> are time stamped over over roughly two and a half minutes.
>
> Is it ok in your setup that 400k cache items are removed/expired
> from cache in two minutes?  If yes, then it's probably ok.

No, that is way more than expected. The box handles an average of 
300-500 requests/second during peak hours, spiking around 800-900. so 
that would be at most around 150,000 requests in three minutes. Even if 
150,000 requests were all cache-able and were all cache misses 
(resulting in them all expiring at the same time in the future) that 
could not explain all of those items. FWIW, this upgrade was done on a 
weekend. Peak times are "business hours" in Europe and North America. 
The box was relatively slow at that time.

>
>>> On the other hand, as discussed many times - such errors are more
>>> or less harmless as soon as it's clear what caused cache files to
>>> be removed.  At worst they indicate that information in a cache
>>> zone isn't correct and max_size might not be maintained properly,
>>> and eventually nginx will self-heal the cache zone.  It probably
>>> should be logged at [error] or even [warn] level instead.
>>>
>>
>> Why would max_size not be maintained properly? Isn't that the
>> responsibility cache manager process? Are there known issues/bugs?
>
> Cache manager process uses the same shared memory zone to maintain
> max_size.  And if nginx thinks a cache file is here, but the file
> was in fact already deleted (this is why alerts in question
> appear) - total size of the cache as recorded in the shared memory
> will be incorrect.  As a result cache manager will delete some
> extra files to keep (incorrect) size under max_size.
>
> In a worst case cache size will be again correct after inactive=
> time passes after cache files were deleted.
>

OK, that makes sense and is what I would expect. I'm still troubled by 
how many items there were in discrepancy.

I will watch and see what happens the next time I upgrade. I'll look at 
how many items are in the cache directory before and after, as well as 
the total size, which was on the mark after the upgrade this time, but 
perhaps not before.

-- 
Jim Ohlstein