limit_rate based on User-Agent; how to exempt /robots.txt ?

Peter Booth peter_booth at me.com
Tue Aug 7 05:56:03 UTC 2018


So it’s very easy to get caught up in the trap of having unrealistic mental models of how web servers work. If your host is a recent (< 5 years old) single-socket host, then you can probably support 300,000 requests per second for your robots.txt file. That’s because the file will be served from your Linux file cache (memory).

Sent from my iPhone

> On Aug 6, 2018, at 10:45 PM, Cameron Kerr <cameron.kerr at otago.ac.nz> wrote:
> 
> Hi all, I’ve recently deployed a rate-limiting configuration aimed at protecting myself from spiders.
> 
> nginx version: nginx/1.15.1 (RPM from nginx.org)
> 
> I did this based on the excellent Nginx blog post at https://www.nginx.com/blog/rate-limiting-nginx/ and have consulted the documentation for limit_req and limit_req_zone.
> 
> I understand that you can have multiple zones in play, and that the most-restrictive of all matches will apply for any matching request. I want to go the other way though. I want to exempt /robots.txt from being rate limited by spiders.
> 
> To put this in context, here is the gist of the relevant config, which aims to implement a caching (and rate-limiting) layer in front of a much more complex request routing layer (httpd).
> 
> http {
>    map $http_user_agent $user_agent_rate_key {
>        default "";
>        "~our-crawler" "wanted-robot";
>        "~*(bot/|crawler|robot|spider)" "robot";
>        "~ScienceBrowser/Nutch" "robot";
>        "~Arachni/" "robot";
>    }
> 
>    limit_req_zone $user_agent_rate_key zone=per_spider_class:1m rate=100r/m;
>    limit_req_status 429;
> 
>    server {
>        limit_req zone=per_spider_class;
> 
>        location / {
>            proxy_pass http://routing_layer_http/;
>        }
>    }
> }
> 
> 
> 
> Option 1: (working, but has issues)
> 
> Should I instead put the limit_req inside the "location / {}" stanza, and have a separate "location /robots.txt {}" (or some generalised form using a map) that does not have limit_req inside it?
> 
> That would mean that any other configuration inside the location stanzas would get duplicated, which would be a manageability concern. I just want to override the limit_req.
> 
>    server {
>        location /robots.txt {
>            proxy_pass http://routing_layer_http/;
>        }
> 
>        location / {
>            limit_req zone=per_spider_class;
>            proxy_pass http://routing_layer_http/;
>        }
>    }
> 
> I've tested this, and it works.
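> 
> (As an aside, one way I can see to keep that duplication manageable, sketched here with a hypothetical shared snippet file, would be to move the common proxy settings into a file and include it from both locations:)
> 
>    # Contents of a hypothetical shared snippet, e.g. routing_layer_proxy.conf:
>    #     proxy_pass http://routing_layer_http/;
> 
>    server {
>        location /robots.txt {
>            include routing_layer_proxy.conf;   # shared proxy settings, no limit_req here
>        }
> 
>        location / {
>            limit_req zone=per_spider_class;
>            include routing_layer_proxy.conf;   # same shared proxy settings
>        }
>    }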
> 
> 
> Option 2: (working, but has issues)
> 
> Should I create a "location /robots.txt {}" stanza that has a limit_req with a high burst, say burst=500? It's not a whitelist, but perhaps something still useful?
>    
> But I still end up with replicated location stanzas... I don't think I like this approach.
> 
>    server {
>        limit_req zone=per_spider_class;
> 
>        location /robots.txt {
>            limit_req zone=per_spider_class burst=500;
>            proxy_pass https://routing_layer_https/;
>        }
> 
>        location / {
>            proxy_pass https://routing_layer_https/;
>        }
>    }
> 
> 
> Option 3: (does not work)
> 
> Some other way... perhaps I need to create some map that takes the path and produces a $path_exempt variable, and then somehow use that together with $user_agent_rate_key, returning "" when $path_exempt is set, or $user_agent_rate_key otherwise.
> 
>    map $http_user_agent $user_agent_rate_key {
>        default "";
>        "~otago-crawler" "wanted-robot";
>        "~*(bot/|crawler|robot|spider)" "robot";
>        "~ScienceBrowser/Nutch" "robot";
>        "~Arachni/" "robot";
>    }
> 
>    map $uri $rate_for_spider_exempting {
>        default $user_agent_rate_key;
>        "/robots.txt" "";
>    }
> 
>    #limit_req_zone $user_agent_rate_key zone=per_spider_class:1m rate=100r/m;
>    limit_req_zone $rate_for_spider_exempting zone=per_spider_class:1m rate=100r/m;
> 
> 
> However, this does not work because the second map is not returning $user_agent_rate_key; the effect is that non-robots are affected (and the load-balancer health-probes start getting rate-limited).
> 
> I'm guessing my reasoning about how this works is incorrect, or there is a limitation or some sort of implicit ordering issue.
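> 
> For reference, this is roughly how I picture the whole of option 3 fitting together, with the $http_user_agent map from above unchanged (an untested sketch; it relies on a map value being allowed to reference another variable, which the map documentation says is supported):
> 
>    map $uri $rate_for_spider_exempting {
>        default       $user_agent_rate_key;
>        "/robots.txt" "";
>    }
> 
>    # Keyed on the chained map, so /robots.txt should end up with an empty
>    # (i.e. unlimited) key regardless of User-Agent.
>    limit_req_zone $rate_for_spider_exempting zone=per_spider_class:1m rate=100r/m;
> 
>    server {
>        limit_req zone=per_spider_class;
> 
>        location / {
>            proxy_pass http://routing_layer_http/;
>        }
>    }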
> 
> 
> Option 4: (does not work)
> 
> http://nginx.org/en/docs/http/ngx_http_core_module.html#limit_rate
> 
> I see that there is a variable $limit_rate that can be used, and this would seem to be the cleanest, except that in testing it doesn't seem to work (I still get 429 responses when requesting with a User-Agent that matches a bot).
> 
>    server {
>        limit_req zone=per_spider_class;
> 
>        location /robots.txt {
>            set $limit_rate 0;
>        }
> 
>        location / {
>            proxy_pass http://routing_layer_http/;
>        }
>    }
> 
> 
> I'm still fairly new with Nginx, so I'm after something that decomposes cleanly into an Nginx configuration. I would quite like to be able to have just one place where I specify the map of URLs I wish to exempt (I imagine there could be others, such as ~/.well-known/something, that could pop up).
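> 
> To make that concrete, the kind of single map I have in mind is something like this (an untested sketch; the .well-known entry is only an example of a future exemption):
> 
>    map $uri $rate_for_spider_exempting {
>        default             $user_agent_rate_key;
>        "/robots.txt"       "";
>        "~^/\.well-known/"  "";   # regex entry for any well-known URI
>    }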
> 
> Thank you very much for your time.
> 
> -- 
> Cameron Kerr
> Systems Engineer, Information Technology Services
> University of Otago
> 
> _______________________________________________
> nginx mailing list
> nginx at nginx.org
> http://mailman.nginx.org/mailman/listinfo/nginx

