limit_rate based on User-Agent; how to exempt /robots.txt ?
Maxim Dounin
mdounin at mdounin.ru
Tue Aug 7 11:58:22 UTC 2018
Hello!
On Tue, Aug 07, 2018 at 02:45:02AM +0000, Cameron Kerr wrote:
> Hi all, I’ve recently deployed a rate-limiting configuration
> aimed at protecting myself from spiders.
>
> nginx version: nginx/1.15.1 (RPM from nginx.org)
>
> I did this based on the excellent Nginx blog post at
> https://www.nginx.com/blog/rate-limiting-nginx/ and have
> consulted the documentation for limit_req and limit_req_zone.
>
> I understand that you can have multiple zones in play, and that
> the most restrictive of all matches will apply to any matching
> request. I want to go the other way though: I want to exempt
> /robots.txt from the rate limiting applied to spiders.
>
> To put this in context, here is the gist of the relevant config,
> which aims to implement a caching (and rate-limiting) layer in
> front of a much more complex request routing layer (httpd).
>
> http {
>     map $http_user_agent $user_agent_rate_key {
>         default                          "";
>         "~our-crawler"                   "wanted-robot";
>         "~*(bot/|crawler|robot|spider)"  "robot";
>         "~ScienceBrowser/Nutch"          "robot";
>         "~Arachni/"                      "robot";
>     }
>
>     limit_req_zone $user_agent_rate_key zone=per_spider_class:1m rate=100r/m;
>     limit_req_status 429;
>
>     server {
>         limit_req zone=per_spider_class;
>
>         location / {
>             proxy_pass http://routing_layer_http/;
>         }
>     }
> }
>
> Option 1: (working, but has issues)
>
> Should I instead put the limit_req inside the "location / {}"
> stanza, and have a separate "location /robots.txt {}" (or some
> generalised form using a map) that does not have limit_req inside
> it?
>
> That would mean that any other configuration inside the location
> stanzas would get duplicated, which would be a manageability
> concern. I just want to override the limit_req.
>
> server {
>     location /robots.txt {
>         proxy_pass http://routing_layer_http/;
>     }
>
>     location / {
>         limit_req zone=per_spider_class;
>         proxy_pass http://routing_layer_http/;
>     }
> }
>
> I've tested this, and it works.
This is the simplest and most nginx-like way: provide exact
configurations in the particular locations. And this is what I
would recommend using.
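
If the duplicated proxy settings are a manageability concern, a
minimal sketch of one way to keep them in a single place is the
include directive; the file name "proxy_common.conf" here is just
an illustrative choice:

    # proxy_common.conf would hold the shared directives, e.g.:
    #     proxy_pass http://routing_layer_http/;

    server {
        location /robots.txt {
            include proxy_common.conf;   # no limit_req here
        }

        location / {
            limit_req zone=per_spider_class;
            include proxy_common.conf;
        }
    }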
[...]
> Option 3: (does not work)
>
> Some other way... perhaps I need to create some map that takes
> the path and produces a $path_exempt variable, and then somehow
> use that with the $user_agent_rate_key, returning "" when
> $path_exempt, or $user_agent_rate_key otherwise.
>
> map $http_user_agent $user_agent_rate_key {
>     default                          "";
>     "~otago-crawler"                 "wanted-robot";
>     "~*(bot/|crawler|robot|spider)"  "robot";
>     "~ScienceBrowser/Nutch"          "robot";
>     "~Arachni/"                      "robot";
> }
>
> map $uri $rate_for_spider_exempting {
>     default        $user_agent_rate_key;
>     "/robots.txt"  "";
> }
>
> #limit_req_zone $user_agent_rate_key zone=per_spider_class:1m rate=100r/m;
> limit_req_zone $rate_for_spider_exempting zone=per_spider_class:1m rate=100r/m;
>
> However, this does not work because the second map is not
> returning $user_agent_rate_key; the effect is that non-robots
> are affected (and the load-balancer health-probes start getting
> rate-limited).
>
> I'm guessing my reasoning of how this works is incorrect, or
> there is a limitation or some sort of implicit ordering issue.
This approach is expected to work fine (assuming you've used
limit_req somewhere), and I've just tested the exact configuration
snippet provided to be sure. If it doesn't work for you, the
problem is likely elsewhere.
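
For reference, a minimal sketch of the whole approach with
limit_req actually applied (zone parameters are taken from the
snippet above; applying limit_req at server level is just one
possible placement):

    map $http_user_agent $user_agent_rate_key {
        default                          "";
        "~*(bot/|crawler|robot|spider)"  "robot";
    }

    # The second map falls back to the first map's result, so
    # /robots.txt yields an empty key, and requests with an empty
    # key are not accounted by limit_req_zone.
    map $uri $rate_for_spider_exempting {
        default        $user_agent_rate_key;
        "/robots.txt"  "";
    }

    limit_req_zone $rate_for_spider_exempting zone=per_spider_class:1m rate=100r/m;
    limit_req_status 429;

    server {
        # Without this directive the zone is defined but never used.
        limit_req zone=per_spider_class;

        location / {
            proxy_pass http://routing_layer_http/;
        }
    }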
> Option 4: (does not work)
>
> http://nginx.org/en/docs/http/ngx_http_core_module.html#limit_rate
>
> I see that there is a variable $limit_rate that can be used, and
> this would seem to be the cleanest approach, except that in
> testing it doesn't seem to work (a User-Agent that is a bot still
> gets 429 responses).
The limit_rate directive (and the $limit_rate variable) controls
response bandwidth, that is, how fast a response is sent to the
client. It is completely unrelated to the limit_req module, which
limits the rate of requests.
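
To illustrate the difference, a rough sketch (the values and the
/downloads/ location are arbitrary examples):

    # limit_rate throttles how fast a single response is sent;
    # requests themselves are never rejected.
    location /downloads/ {
        limit_rate 50k;
    }

    # limit_req rejects (or delays) excess requests; with
    # limit_req_status 429 the excess requests get 429 responses.
    location / {
        limit_req zone=per_spider_class;
    }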
--
Maxim Dounin
http://mdounin.ru/