bot-taming: which rules should apply, and in which order?

Sun Jul 29 16:20:03 UTC 2012

I'm attempting to tame (minimize or eliminate) Yandex bot access.

I'd like to understand the application/precedence of the rules I apply.

To my site config I've added:

	map $http_user_agent $bad_bot {
		default 0;
		~(Yandex|YandexBot) 1;
	}

	map $http_referer $bad_referrer {
		default 0;
		~*(yandex) 1;
	}

	valid_referers mydomain.com *.mydomain.com localhost 127.0.0.1 [::1];

	location / {
		if ($bad_bot)         {return 403;}
		if ($bad_referrer)    {return 403;}
		if ($invalid_referer) {return 444;}
		...
	}

and I have the following robots files in place:

	cat /robots.txt
		User-agent: *
		Disallow: /

	cat /robot_ssl.txt
		User-agent: *
		Disallow: /

In my logs I see repeated '444' rejections:

	100.43.83.148 - - [28/Jul/2012:06:02:14 -0500] GET /robots.txt
	HTTP/1.1 "444" 0 "-" "Mozilla/5.0 (compatible; YandexBot/3.0;
	+http://yandex.com/bots)" "-"
	100.43.83.148 - - [28/Jul/2012:06:06:23 -0500] GET /robots.txt
	HTTP/1.1 "444" 0 "-" "Mozilla/5.0 (compatible; YandexBot/3.0;
	+http://yandex.com/bots)" "-"

With my rules above, I'd expect that to be a '403' rejection: the
User-Agent contains "YandexBot", so it should match the "$bad_bot" map,
and that check comes first in the location block.

Why am I seeing the '444' instead of the '403'?
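
One way to narrow this down would be to log the value of each variable
next to the status code, so the logs show which check is true for a
given request. A minimal sketch, assuming a log format name ("botdebug")
and a log path that are placeholders and not part of the config above:

	# http{} context: "botdebug" is a hypothetical format name
	log_format botdebug '$remote_addr "$request" $status '
	                    'ua="$http_user_agent" ref="$http_referer" '
	                    'bad_bot=$bad_bot bad_referrer=$bad_referrer '
	                    'invalid_referer=$invalid_referer';

	# server{} context: the path is an assumption, adjust as needed
	access_log /var/log/nginx/botdebug.log botdebug;

If a '444' line shows bad_bot=1, the request is presumably not being
handled by the "location /" block shown above; if it shows bad_bot=0,
the map itself is not matching.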