[crit] 16665#0 unlink()

Mon May 6 13:54:19 UTC 2013

Hello!

On Mon, May 06, 2013 at 09:01:45AM -0400, Jim Ohlstein wrote:

> On 05/05/13 16:32, Maxim Dounin wrote:
> >Hello!
> >
> >On Sat, May 04, 2013 at 07:08:55PM -0400, Jim Ohlstein wrote:
> >
> >[...]
> >
> >>I have just seen a similar situation using fastcgi cache. In my case
> >>I am using the same cache (but only one cache) for several
> >>server/location blocks. The system is a fairly basic nginx set up
> >>with four upstream fastcgi servers and ip hash. The returned content
> >>is cached locally by nginx. The cache is rather large but I wouldn't
> >>think this would be the cause.
> >
> >[...]
> >
> >>     fastcgi_cache_path /var/nginx/fcgi_cache levels=1:2
> >>keys_zone=one:512m max_size=250g inactive=24h;
> >
> >[...]
> >
> >>The other server/location blocks are pretty much identical insofar as
> >>fastcgi and cache are concerned.
> >>
> >>When I upgraded nginx using the "on the fly" binary upgrade method,
> >>I saw almost 400,000 lines in the error log that looked like this:
> >>
> >>2013/05/04 17:54:25 [crit] 65304#0: unlink()
> >>"/var/nginx/fcgi_cache/7/2e/899bc269a74afe6e0ad574eacde4e2e7" failed
> >>(2: No such file or directory)
> >
> >[...]
> >
> >After binary upgrade there are two cache zones - one in old nginx,
> >and another one in new nginx (much like in originally posted
> >configuration).  This may cause such errors if e.g. a cache file
> >is removed by old nginx, and new nginx fails to remove the file
> >shortly after.
> >
> >The 400k lines is a bit too many though.  You may want to check
> >that the cache wasn't just removed by some (package?) script
> >during the upgrade process.  Alternatively, it might indicate that
> >you let old and new processes coexist for a long time.
> 
> I hadn't considered that there are two zones during that short time.
> Thanks for pointing that out.
> 
> To my knowledge, there are no scripts or packages which remove files
> from the cache, or the entire cache. A couple of minutes after this
> occurred there were a bit under 1.4 million items in the cache and
> it was "full" at 250 GB. I did look in a few sub-directories at the
> time, and most of the items were time stamped from before this
> started so clearly the entire cache was not removed. During the time
> period these entries were made in the error log, and in the two
> minutes after, access log entries show the expected ratio of "HIT"
> and "MISS" entries which further supports your point below that
> these are harmless (although I don't really believe that I have a
> cause).
> 
> I'm not sure what you mean by a "long time" but all of these entries
> are time stamped over roughly two and a half minutes.

Is it normal in your setup for 400k cache items to be removed or 
to expire from the cache within two minutes?  If so, it's probably ok.

> >On the other hand, as discussed many times - such errors are more
> >or less harmless as soon as it's clear what caused cache files to
> >be removed.  At worst they indicate that information in a cache
> >zone isn't correct and max_size might not be maintained properly,
> >and eventually nginx will self-heal the cache zone.  It probably
> >should be logged at [error] or even [warn] level instead.
> >
> 
> Why would max_size not be maintained properly? Isn't that the
> responsibility of the cache manager process? Are there known issues/bugs?

The cache manager process relies on the same shared memory zone to 
maintain max_size.  If nginx thinks a cache file is there, but the 
file was in fact already deleted (which is why the alerts in 
question appear), the total cache size recorded in shared memory 
will be too high.  As a result the cache manager will delete some 
extra files to keep the (incorrect) size under max_size.

In the worst case the cache size will be correct again once the 
inactive= time has passed after the cache files were deleted.
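
To illustrate the accounting drift described above, here is a toy 
model in Python.  It is only a sketch and not how nginx is 
implemented: the names (zone, disk, evict_to_limit, add_file) and 
the sizes are made up for illustration, and expiry via inactive= 
is not modelled, only least-recently-used eviction against max_size.

```python
# Toy model only: every name here is hypothetical, not an nginx internal.

def evict_to_limit(zone, disk, max_size):
    """Evict oldest entries until the *recorded* size fits max_size.

    zone: dict key -> size, oldest first (stands in for the shared memory zone).
    disk: dict key -> size (stands in for the files really on disk).
    Returns the keys whose file was already gone (a failed unlink()).
    """
    failed = []
    recorded = sum(zone.values())
    for key in list(zone):
        if recorded <= max_size:
            break
        recorded -= zone.pop(key)
        if disk.pop(key, None) is None:
            failed.append(key)  # the case that gets logged as unlink() failed
    return failed


def add_file(zone, disk, key, size, max_size):
    """Store a new cache item, then enforce max_size on the recorded size."""
    zone[key] = size
    disk[key] = size
    return evict_to_limit(zone, disk, max_size)


MAX_SIZE = 100
zone = {f"f{i}": 10 for i in range(10)}  # recorded: 100
disk = dict(zone)                        # actual:   100

# Something other than this manager removes two files behind its back.
del disk["f8"], disk["f9"]

# Adding one item makes the manager evict based on the too-high recorded size.
failed = add_file(zone, disk, "new", 20, MAX_SIZE)
recorded_after_drift = sum(zone.values())
actual_after_drift = sum(disk.values())

# Keep adding items; the stale entries eventually reach the head of the queue.
for i in range(8):
    failed += add_file(zone, disk, f"g{i}", 10, MAX_SIZE)

recorded_final = sum(zone.values())
actual_final = sum(disk.values())
```

In this model the first add_file() call evicts two files even 
though the files really on disk did not exceed MAX_SIZE, because 
the recorded total still counts the two missing files.  The recorded 
and actual totals agree again only after the stale entries have 
been evicted, which is the point at which the failed unlink() 
would show up.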

-- 
Maxim Dounin
http://nginx.org/en/donation.html