bot-taming: which rules should apply, and in which order?
k9157 at operamail.com
Sun Jul 29 16:20:03 UTC 2012
I'm attempting to tame (minimize or eliminate) Yandex bot access.
I'd like to understand the application/precedence of the rules I apply.
To my site config I've added
map $http_user_agent $bad_bot {
  default 0;
  ~(Yandex|YandexBot) 1;
}
map $http_referrer $bad_referrer {
  default 0;
  ~*(yandex) 1;
}
valid_referers mydomain.com *.mydomain.com localhost 127.0.0.1 [::1];
location / {
  if ($bad_bot) {return 403;}
  if ($bad_referrer) {return 403;}
  if ($invalid_referer) {return 444;}
  ...
}
and
cat /robots.txt
User-agent: *
Disallow: /
cat /robot_ssl.txt
User-agent: *
Disallow: /
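For context, a crawler can only honor these Disallow rules if it is able to fetch /robots.txt in the first place. A minimal sketch, assuming the file lives under the server's existing root and that no server-level checks reject the request first: an exact-match location serves the file as a plain static file, and the "if" checks inside "location /" do not run for it, because nginx handles each request in exactly one location.

```nginx
# Sketch only: exact-match location for the robots file.
# With no directives inside, the file is served from the inherited root.
# The "if" checks in "location /" do not apply to requests handled here.
location = /robots.txt {
}
```

Whether this is desirable depends on the goal: it lets the bot read the Disallow rules instead of being rejected outright.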
In my logs I see repeating '444' rejections:
100.43.83.148 - - [28/Jul/2012:06:02:14 -0500] GET /robots.txt HTTP/1.1 "444" 0 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)" "-"
100.43.83.148 - - [28/Jul/2012:06:06:23 -0500] GET /robots.txt HTTP/1.1 "444" 0 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)" "-"
With my rules above, I'd expect that to be a '403' rejection, as
specified for the "$bad_bot" check.
Why am I seeing the '444' instead of the '403'?
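As a sketch of how the precedence could be made explicit (not a confirmed diagnosis of the logs above): nginx runs rewrite-module directives such as "if" and "return" at server level before it selects a location, and "if" blocks inside "location /" only run for requests that this location actually handles. The example below moves the three checks to server level so that they apply, in the written order, to every URI including /robots.txt. It assumes the $bad_bot map stays as posted in the http context. It also keys the referrer map on $http_referer, since the request header is spelled "Referer" and $http_referrer would look for a header named "Referrer". Note too that the logged Referer is "-", and without the "none" parameter valid_referers treats a missing Referer as invalid.

```nginx
# http context (sketch): same map as in the post, keyed on the "Referer" header
map $http_referer $bad_referrer {
  default 0;
  ~*(yandex) 1;
}

server {
  # listen, server_name, root, etc. omitted

  # No "none" parameter, as in the post: a request without a Referer
  # header sets $invalid_referer.
  valid_referers mydomain.com *.mydomain.com localhost 127.0.0.1 [::1];

  # Server-level checks are evaluated in order, before location selection.
  if ($bad_bot)         { return 403; }
  if ($bad_referrer)    { return 403; }
  if ($invalid_referer) { return 444; }

  location / {
    # rest of the original location
  }
}
```

With this ordering a YandexBot request would reach the $bad_bot check first and get the 403, regardless of which location would otherwise have served it.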