limit_req based on User-Agent; how to exempt /robots.txt?

Cameron Kerr cameron.kerr at otago.ac.nz
Tue Aug 7 02:45:02 UTC 2018


Hi all, I’ve recently deployed a rate-limiting configuration aimed at protecting myself from spiders.

nginx version: nginx/1.15.1 (RPM from nginx.org)

I did this based on the excellent Nginx blog post at https://www.nginx.com/blog/rate-limiting-nginx/ and have consulted the documentation for limit_req and limit_req_zone.

I understand that you can have multiple zones in play, and that the most restrictive of all matching limits applies to any given request. I want to go the other way, though: I want to exempt /robots.txt from being rate-limited, even for spiders.
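For example, as I understand it, several limit_req directives may be specified in the same context (since 1.7.1) and every matching zone is enforced independently, so in practice the strictest one is the one that bites. Zone names here are made up for illustration:

    limit_req zone=per_spider_class;    # keyed on the User-Agent class, 100r/m
    limit_req zone=per_ip burst=20;     # a hypothetical second zone keyed per client IP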

To put this in context, here is the gist of the relevant config, which aims to implement a caching (and rate-limiting) layer in front of a much more complex request routing layer (httpd).

http {
    map $http_user_agent $user_agent_rate_key {
        default "";
        "~our-crawler" "wanted-robot";
        "~*(bot/|crawler|robot|spider)" "robot";
        "~ScienceBrowser/Nutch" "robot";
        "~Arachni/" "robot";
    }

    limit_req_zone $user_agent_rate_key zone=per_spider_class:1m rate=100r/m;
    limit_req_status 429;

    server {
        limit_req zone=per_spider_class;

        location / {
            proxy_pass http://routing_layer_http/;
        }
    }
}
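For anyone wondering why ordinary browsers sail through: if I've read the limit_req_zone documentation correctly, requests whose key evaluates to an empty string are not accounted at all, so the map's default of "" is effectively the whitelist. Trimmed down from the config above:

    map $http_user_agent $user_agent_rate_key {
        default "";                                  # browsers, probes, etc. -> empty key -> not counted
        "~*(bot/|crawler|robot|spider)" "robot";     # matched spiders share one 100r/m bucket
    }
    limit_req_zone $user_agent_rate_key zone=per_spider_class:1m rate=100r/m;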



Option 1: (working, but has issues)

Should I instead put the limit_req inside the "location / {}" stanza, and have a separate "location /robots.txt {}" stanza (or some generalised form using a map) with no limit_req inside it?

That would mean that any other configuration inside the location stanzas would get duplicated, which would be a manageability concern. I just want to override the limit_req.

    server {
        location /robots.txt {
            proxy_pass http://routing_layer_http/;
        }

        location / {
            limit_req zone=per_spider_class;
            proxy_pass http://routing_layer_http/;
        }
    }

I've tested this, and it works.
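To mitigate the duplication, I suppose the shared proxying configuration could be pulled out into an include file, so that only the limit_req differs between the two stanzas. A minimal sketch (the file name is just an example):

    # proxy-routing-layer.inc -- shared by both locations
    proxy_pass http://routing_layer_http/;
    # ...plus whatever common proxy_set_header / caching directives belong here

    server {
        location /robots.txt {
            include proxy-routing-layer.inc;     # no limit_req here
        }

        location / {
            limit_req zone=per_spider_class;
            include proxy-routing-layer.inc;
        }
    }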


Option 2: (working, but has issues)

Should I create a "location /robots.txt {}" stanza that has a limit_req with a high burst, say burst=500? It's not a whitelist, but perhaps it's still useful?

But I still end up with replicated location stanzas... I don't think I like this approach.

    server {
        limit_req zone=per_spider_class;

        location /robots.txt {
            limit_req zone=per_spider_class burst=500;
            proxy_pass https://routing_layer_https/;
        }

        location / {
            proxy_pass https://routing_layer_https/;
        }
    }
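One wrinkle with this approach: as I read the limit_req documentation, excess requests within the burst are queued and delayed so that they still conform to the 100r/m rate, so burst=500 smooths a spider's fetches rather than exempting them. Adding nodelay would at least forward the burst immediately, something like:

        location /robots.txt {
            limit_req zone=per_spider_class burst=500 nodelay;   # burst slots are used, but requests are not delayed
            proxy_pass https://routing_layer_https/;
        }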


Option 3: (does not work)

Some other way... perhaps I need to create a map that takes the path and produces a $path_exempt variable, and then somehow combine that with $user_agent_rate_key, returning "" when $path_exempt is set, and $user_agent_rate_key otherwise.

    map $http_user_agent $user_agent_rate_key {
        default "";
        "~otago-crawler" "wanted-robot";
        "~*(bot/|crawler|robot|spider)" "robot";
        "~ScienceBrowser/Nutch" "robot";
        "~Arachni/" "robot";
    }

    map $uri $rate_for_spider_exempting {
        default $user_agent_rate_key;
        "/robots.txt" "";
    }

    #limit_req_zone $user_agent_rate_key zone=per_spider_class:1m rate=100r/m;
    limit_req_zone $rate_for_spider_exempting zone=per_spider_class:1m rate=100r/m;


However, this does not work: the second map does not appear to return the value of $user_agent_rate_key, and the effect is that non-robots are rate-limited as well (the load-balancer health probes start getting rate-limited).

I'm guessing my reasoning about how this works is incorrect, or there is a limitation or some sort of implicit ordering issue.
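According to the map documentation, the resulting value is allowed to contain variables, so returning $user_agent_rate_key from the second map should at least be legal syntax on 1.15.1. To check what the maps are actually producing per request, I suppose I could temporarily expose them in a response header while testing (the header name is made up, debugging only):

        location / {
            # temporary debugging aid: show which key limit_req_zone would see
            add_header X-Debug-Rate-Key "ua=$user_agent_rate_key exempt=$rate_for_spider_exempting" always;
            proxy_pass http://routing_layer_http/;
        }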


Option 4: (does not work)

http://nginx.org/en/docs/http/ngx_http_core_module.html#limit_rate

I see that there is a variable $limit_rate that can be set, and this would seem to be the cleanest approach, except that in testing it doesn't seem to work (requests with a bot User-Agent still get 429 responses).

    server {
        limit_req zone=per_spider_class;

        location /robots.txt {
            set $limit_rate 0;
        }

        location / {
            proxy_pass http://routing_layer_http/;
        }
    }
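I suspect I've conflated two different modules here: as far as I can tell from the ngx_http_core_module documentation, $limit_rate throttles the bandwidth of the response sent back to the client, and has nothing to do with ngx_http_limit_req_module's request-rate accounting, so setting it can't exempt anything from the 429s (and as written, that /robots.txt location has no proxy_pass either). For comparison, this is the kind of thing it is meant for (the path is just an example):

        location /large-downloads/ {
            set $limit_rate 100k;                    # cap response transfer speed, not request rate
            proxy_pass http://routing_layer_http/;
        }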


I'm still fairly new to Nginx, so I'm looking for something that decomposes cleanly into an Nginx configuration. I would quite like to have just one place where I specify the map of URLs I wish to exempt (I imagine others could pop up, such as ~/.well-known/something).
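For illustration, this is roughly the shape I'd like to end up with if something along the lines of Option 3 can be made to work, with all of the exempt paths declared in one map (the entries are only examples):

    map $uri $rate_for_spider_exempting {
        default              $user_agent_rate_key;
        "/robots.txt"        "";
        "~^/\.well-known/"   "";    # regex entry for anything under /.well-known/
    }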

Thank you very much for your time.

-- 
Cameron Kerr
Systems Engineer, Information Technology Services
University of Otago


