limit_rate based on User-Agent; how to exempt /robots.txt ?
Maxim Dounin
mdounin at mdounin.ru
Tue Aug 7 11:58:22 UTC 2018
Hello!
On Tue, Aug 07, 2018 at 02:45:02AM +0000, Cameron Kerr wrote:
> Hi all, I’ve recently deployed a rate-limiting configuration
> aimed at protecting myself from spiders.
>
> nginx version: nginx/1.15.1 (RPM from nginx.org)
>
> I did this based on the excellent Nginx blog post at
> https://www.nginx.com/blog/rate-limiting-nginx/ and have
> consulted the documentation for limit_req and limit_req_zone.
>
> I understand that you can have multiple zones in play, and that
> the most restrictive of all matches will apply to any matching
> request. I want to go the other way though: I want to exempt
> /robots.txt from the rate limiting applied to spiders.
>
> To put this in context, here is the gist of the relevant config,
> which aims to implement a caching (and rate-limiting) layer in
> front of a much more complex request routing layer (httpd).
>
> http {
>     map $http_user_agent $user_agent_rate_key {
>         default                          "";
>         "~our-crawler"                   "wanted-robot";
>         "~*(bot/|crawler|robot|spider)"  "robot";
>         "~ScienceBrowser/Nutch"          "robot";
>         "~Arachni/"                      "robot";
>     }
>
>     limit_req_zone $user_agent_rate_key zone=per_spider_class:1m rate=100r/m;
>     limit_req_status 429;
>
>     server {
>         limit_req zone=per_spider_class;
>
>         location / {
>             proxy_pass http://routing_layer_http/;
>         }
>     }
> }
>
> Option 1: (working, but has issues)
>
> Should I instead put the limit_req inside the "location / {}"
> stanza, and have a separate "location /robots.txt {}" (or some
> generalised form using a map) that does not have limit_req inside
> it?
>
> That would mean that any other configuration inside the location
> stanzas would get duplicated, which would be a manageability
> concern. I just want to override the limit_req.
>
> server {
>     location /robots.txt {
>         proxy_pass http://routing_layer_http/;
>     }
>
>     location / {
>         limit_req zone=per_spider_class;
>         proxy_pass http://routing_layer_http/;
>     }
> }
>
> I've tested this, and it works.
This is the simplest and most nginx-like way: provide exact
configurations in the particular locations. And this is what I
would recommend using.
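
If the duplicated proxy settings are a manageability concern, a
minimal sketch of one way to keep them in a single place is the
include directive; the file name "proxy_common.conf" here is just
an illustrative choice:

    # proxy_common.conf would hold the shared directives, e.g.:
    #     proxy_pass http://routing_layer_http/;

    server {
        location /robots.txt {
            include proxy_common.conf;   # no limit_req here
        }

        location / {
            limit_req zone=per_spider_class;
            include proxy_common.conf;
        }
    }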
[...]
> Option 3: (does not work)
>
> Some other way... perhaps I need to create some map that takes
> the path and produces a $path_exempt variable, and then somehow
> use that with the $user_agent_rate_key, returning "" when
> $path_exempt, or $user_agent_rate_key otherwise.
>
> map $http_user_agent $user_agent_rate_key {
>     default                          "";
>     "~otago-crawler"                 "wanted-robot";
>     "~*(bot/|crawler|robot|spider)"  "robot";
>     "~ScienceBrowser/Nutch"          "robot";
>     "~Arachni/"                      "robot";
> }
>
> map $uri $rate_for_spider_exempting {
>     default        $user_agent_rate_key;
>     "/robots.txt"  "";
> }
>
> #limit_req_zone $user_agent_rate_key zone=per_spider_class:1m rate=100r/m;
> limit_req_zone $rate_for_spider_exempting zone=per_spider_class:1m rate=100r/m;
>
> However, this does not work because the second map is not
> returning $user_agent_rate_key; the effect is that non-robots
> are affected (and the load-balancer health-probes start getting
> rate-limited).
>
> I'm guessing my reasoning of how this works is incorrect, or
> there is a limitation or some sort of implicit ordering issue.
This approach is expected to work fine (assuming you've used
limit_req somewhere), and I've just tested the exact configuration
snippet provided to be sure. If it doesn't work for you, the
problem is likely elsewhere.
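
For reference, a minimal sketch of the whole approach with
limit_req actually applied (zone parameters are taken from the
snippet above; applying limit_req at server level is just one
possible placement):

    map $http_user_agent $user_agent_rate_key {
        default                          "";
        "~*(bot/|crawler|robot|spider)"  "robot";
    }

    # The second map falls back to the first map's result, so
    # /robots.txt yields an empty key, and requests with an empty
    # key are not accounted by limit_req_zone.
    map $uri $rate_for_spider_exempting {
        default        $user_agent_rate_key;
        "/robots.txt"  "";
    }

    limit_req_zone $rate_for_spider_exempting zone=per_spider_class:1m rate=100r/m;
    limit_req_status 429;

    server {
        # Without this directive the zone is defined but never used.
        limit_req zone=per_spider_class;

        location / {
            proxy_pass http://routing_layer_http/;
        }
    }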
> Option 4: (does not work)
>
> http://nginx.org/en/docs/http/ngx_http_core_module.html#limit_rate
>
> I see that there is a variable $limit_rate that can be used, and
> this would seem to be the cleanest approach, except that in
> testing it doesn't seem to work (a User-Agent that is a bot still
> gets 429 responses).
The limit_rate directive (and the $limit_rate variable) controls
response bandwidth, that is, how fast a response is sent to the
client. It is completely unrelated to the limit_req module, which
limits the rate of requests.
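
To illustrate the difference, a rough sketch (the values and the
/downloads/ location are arbitrary examples):

    # limit_rate throttles how fast a single response is sent;
    # requests themselves are never rejected.
    location /downloads/ {
        limit_rate 50k;
    }

    # limit_req rejects (or delays) excess requests; with
    # limit_req_status 429 the excess requests get 429 responses.
    location / {
        limit_req zone=per_spider_class;
    }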
--
Maxim Dounin
http://mdounin.ru/