limit_rate based on User-Agent; how to exempt /robots.txt ?

Cameron Kerr cameron.kerr at otago.ac.nz
Tue Aug 7 22:27:23 UTC 2018


Hi Maxim, that's very helpful...

> -----Original Message-----
> From: nginx [mailto:nginx-bounces at nginx.org] On Behalf Of Maxim Dounin
> On Tue, Aug 07, 2018 at 02:45:02AM +0000, Cameron Kerr wrote:


> > Option 3: (does not work)

> This approach is expected to work fine (assuming you've used limit_req
> somewhere), and I've just tested the exact configuration snippet provided
> to be sure.  If it doesn't work for you, the problem is likely elsewhere.

Thank you for the confirmation. I've retried it and, testing with ab, it now seems to work, so I'm not sure what I was doing wrong previously.

I like the pattern of chaining maps; it's nicely functional, to my way of thinking.

For the sake of others, my configuration looks like the following:

http {

    # First map: classify the client by User-Agent; crawlers get the
    # key "robot", everything else gets an empty key.
    map $http_user_agent $user_agent_rate_key {
        default "";
        "~*(bot[/-]|crawler|robot|spider)" "robot";
        "~ScienceBrowser/Nutch" "robot";
        "~Arachni/" "robot";
    }

    # Second map, chained onto the first: for /robots.txt the key is
    # forced back to empty, so crawlers are not limited on that URI.
    map $uri $rate_for_spider_exempting {
        default $user_agent_rate_key;
        "/robots.txt" '';
    }

    # Requests with an empty key are not accounted by limit_req_zone,
    # so only requests keyed "robot" share this 100 requests/minute bucket.
    limit_req_zone $rate_for_spider_exempting zone=per_spider_class:1m rate=100r/m;

    limit_req_status 429;
    server_tokens off;

    server {
        limit_req zone=per_spider_class;

        location / {
            proxy_pass http://routing_layer_http/;
        }
    }
}
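
As an aside, if you want to see which key a given request was bucketed under, the chained map result can be logged. This is an untested sketch; the log_format name and log path are placeholders. In the http block:

    # Placeholder format name and path; the variables are the ones
    # defined by the maps above plus standard nginx variables.
    log_format spider_debug '$remote_addr "$http_user_agent" '
                            'key="$rate_for_spider_exempting" status=$status';

and then, inside the existing server block:

    access_log /var/log/nginx/spider_debug.log spider_debug;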


And my testing:

// spider with non-exempted (ie. rate-limited for spiders) URI

$ ab -H 'User-Agent: spider' -n100 https://.../hostname | grep -e '^Complete requests:' -e '^Failed requests:'
Complete requests:      100
Failed requests:        98

// spider with exempted (ie. no-rate-limiting for spiders) URI

$ ab -H 'User-Agent: spider' -n100 https://.../robots.txt | grep -e '^Complete requests:' -e '^Failed requests:'
Complete requests:      100
Failed requests:        0

// non-spider with exempted (ie. no-rate-limiting for spiders) URI

$ ab -n100 https://.../robots.txt | grep -e '^Complete requests:' -e '^Failed requests:'
Complete requests:      100
Failed requests:        0

// non-spider with non-exempted (ie. rate-limited for spiders) URI

$ ab -n100 https://.../hostname | grep -e '^Complete requests:' -e '^Failed requests:'
Complete requests:      100
Failed requests:        0
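

For a quick spot check of the actual status codes (rather than ab's aggregate failure counts), a loop like the one below should work; I haven't run this exact loop, and the URLs are the same elided ones as above. Depending on how quickly the requests go out relative to the 100r/m rate, most of the spider requests after the first should come back as 429.

// spider repeatedly hitting the non-exempted URI: expect 200 first, then mostly 429

$ for i in $(seq 5); do curl -s -o /dev/null -w '%{http_code}\n' -H 'User-Agent: spider' https://.../hostname; done

// spider hitting the exempted URI: expect 200 every time

$ for i in $(seq 5); do curl -s -o /dev/null -w '%{http_code}\n' -H 'User-Agent: spider' https://.../robots.txt; done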


Thanks again for your feedback

Cheers,
Cameron


