limit_rate based on User-Agent; how to exempt /robots.txt ?
Cameron Kerr
cameron.kerr at otago.ac.nz
Tue Aug 7 02:45:02 UTC 2018
Hi all,

I’ve recently deployed a rate-limiting configuration aimed at protecting myself from spiders.
nginx version: nginx/1.15.1 (RPM from nginx.org)
I did this based on the excellent Nginx blog post at https://www.nginx.com/blog/rate-limiting-nginx/ and have consulted the documentation for limit_req and limit_req_zone.
I understand that you can have multiple zones in play, and that the most restrictive of all matches applies to any given request. I want to go the other way, though: I want to exempt /robots.txt from being rate-limited when it is requested by spiders.
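(To illustrate what I mean by multiple zones in play, here is a minimal sketch with made-up zone names, not my actual config; my understanding is that when several limit_req directives apply, the most restrictive one wins:)

# (inside the http {} context)
limit_req_zone $binary_remote_addr zone=per_ip:10m    rate=10r/s;
limit_req_zone $http_user_agent    zone=per_agent:1m  rate=100r/m;

server {
    location / {
        # Both zones apply here; a request must pass both limits.
        limit_req zone=per_ip burst=20;
        limit_req zone=per_agent;
        proxy_pass http://routing_layer_http/;
    }
}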
To put this in context, here is the gist of the relevant config, which aims to implement a caching (and rate-limiting) layer in front of a much more complex request routing layer (httpd).
http {
    map $http_user_agent $user_agent_rate_key {
        default                          "";
        "~our-crawler"                   "wanted-robot";
        "~*(bot/|crawler|robot|spider)"  "robot";
        "~ScienceBrowser/Nutch"          "robot";
        "~Arachni/"                      "robot";
    }

    limit_req_zone $user_agent_rate_key zone=per_spider_class:1m rate=100r/m;
    limit_req_status 429;

    server {
        limit_req zone=per_spider_class;

        location / {
            proxy_pass http://routing_layer_http/;
        }
    }
}
Option 1: (working, but has issues)
Should I instead put the limit_req inside the "location / {}" stanza, and have a separate "location /robots.txt {}" stanza (or some generalised form using a map) that does not have limit_req inside it?
That would mean that any other configuration inside the location stanzas would get duplicated, which would be a manageability concern. I just want to override the limit_req.
server {
    location /robots.txt {
        proxy_pass http://routing_layer_http/;
    }

    location / {
        limit_req zone=per_spider_class;
        proxy_pass http://routing_layer_http/;
    }
}
I've tested this, and it works.
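(To make the duplication concern concrete, here is a sketch with a couple of hypothetical shared directives; anything like these would have to be repeated in every exempted location:)

server {
    location /robots.txt {
        # No limit_req here, but shared proxy settings must be duplicated.
        proxy_set_header Host            $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_pass http://routing_layer_http/;
    }

    location / {
        limit_req zone=per_spider_class;
        proxy_set_header Host            $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_pass http://routing_layer_http/;
    }
}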
Option 2: (working, but has issues)
Should I create a "location /robots.txt {}" stanza that has a limit_req with a high burst, say burst=500? It's not a whitelist, but perhaps it is still useful?
But I still end up with replicated location stanzas... I don't think I like this approach.
server {
    limit_req zone=per_spider_class;

    location /robots.txt {
        limit_req zone=per_spider_class burst=500;
        proxy_pass https://routing_layer_https/;
    }

    location / {
        proxy_pass https://routing_layer_https/;
    }
}
Option 3: (does not work)
Some other way... perhaps I need to create a map that takes the path and produces a $path_exempt-style decision, and then somehow combine that with $user_agent_rate_key: returning "" when the path is exempt, and $user_agent_rate_key otherwise.
map $http_user_agent $user_agent_rate_key {
    default                          "";
    "~otago-crawler"                 "wanted-robot";
    "~*(bot/|crawler|robot|spider)"  "robot";
    "~ScienceBrowser/Nutch"          "robot";
    "~Arachni/"                      "robot";
}

map $uri $rate_for_spider_exempting {
    default        $user_agent_rate_key;
    "/robots.txt"  "";
}

#limit_req_zone $user_agent_rate_key zone=per_spider_class:1m rate=100r/m;
limit_req_zone $rate_for_spider_exempting zone=per_spider_class:1m rate=100r/m;
However, this does not work: the second map does not appear to return the value of $user_agent_rate_key, and the effect is that non-robots are limited as well (the load-balancer health probes start getting rate-limited). I'm guessing my understanding of how this works is incorrect, or there is a limitation or some sort of implicit ordering issue between the maps.
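(One way I could presumably confirm what the second map evaluates to would be to expose it in a response header while testing; a sketch only, added to the existing server {} block, with made-up X-Debug-* header names:)

server {
    # Expose the map results on responses so I can see what key limit_req sees;
    # "always" so the headers also appear on the 429 responses.
    add_header X-Debug-UA-Key   $user_agent_rate_key       always;
    add_header X-Debug-Rate-Key $rate_for_spider_exempting always;
}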
Option 4: (does not work)
http://nginx.org/en/docs/http/ngx_http_core_module.html#limit_rate
I see that there is a variable $limit_rate that can be set, and this would seem to be the cleanest approach, except that in testing it doesn't seem to work (I still get 429 responses when requesting with a bot User-Agent).
server {
    limit_req zone=per_spider_class;

    location /robots.txt {
        set $limit_rate 0;
    }

    location / {
        proxy_pass http://routing_layer_http/;
    }
}
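(For reference, the example in the linked documentation uses $limit_rate to throttle response bandwidth, roughly like the sketch below with a hypothetical /download/ location; I may be misreading how, or whether, it interacts with limit_req at all:)

server {
    location /download/ {
        # Per the ngx_http_core_module docs, $limit_rate caps the response
        # transfer rate for a request (bytes per second), here 4 KB/s.
        set $limit_rate 4k;
        proxy_pass http://routing_layer_http/;
    }
}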
I'm still fairly new to Nginx, so I want something that decomposes cleanly into an Nginx configuration. I would quite like to have just one place where I specify the map of URLs I wish to exempt (I imagine there could be others, such as ~/.well-known/something, that could pop up).
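(Something along these lines is the single place I have in mind; a sketch only, and the variable name is just illustrative:)

map $uri $rate_limit_exempt_uri {
    default             0;
    "/robots.txt"       1;
    "~^/\.well-known/"  1;
}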
Thank you very much for your time.
--
Cameron Kerr
Systems Engineer, Information Technology Services
University of Otago