How to block fake Google spider and fake web browser access?

Payam Chychi pchychi at gmail.com
Tue May 5 13:19:52 UTC 2015


Hey, 

Why not just compare their X-Forwarded-For against the connecting IP? If they don't match and the request claims to be a bot, drop it.
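
Something like this, as a rough, untested sketch: it assumes the real crawlers connect to you directly, so a crawler user agent arriving with any X-Forwarded-For at all is suspect. The variable names ($claims_crawler, $drop_fake) are just placeholders.

# at http{} level
map $http_user_agent $claims_crawler {
    default       0;
    "~*googlebot" 1;
    "~*bingbot"   1;
}

# crawler UA plus a non-empty X-Forwarded-For means it came through a proxy
map "$claims_crawler:$http_x_forwarded_for" $drop_fake {
    default  0;
    "~^1:.+" 1;
}

# inside server{} or location{}
if ($drop_fake) {
    return 444;
}

444 closes the connection without sending a response, same as you are already returning.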



-- 
Payam Chychi
Network Engineer / Security Specialist


On Tuesday, May 5, 2015 at 5:38 AM, meteor8488 wrote:

> Hi All,
> 
> Recently I found that some people are trying to mirror my website. They are
> doing this in two ways:
> 
> 1. Pretending to be Google spiders. Access logs are as follows:
> 
> 89.85.93.235 - - [05/May/2015:20:23:16 +0800] "GET /robots.txt HTTP/1.0" 444
> 0 "http://www.example.com" "Mozilla/5.0 (compatible; Googlebot/2.1;
> +http://www.google.com/bot.html)" "66.249.79.138"
> 79.85.93.235 - - [05/May/2015:20:23:34 +0800] "GET /robots.txt HTTP/1.0" 444
> 0 "http://www.example.com" "Mozilla/5.0 (compatible; Googlebot/2.1;
> +http://www.google.com/bot.html)" "66.249.79.154"
> 
> The $http_x_forwarded_for addresses are Google addresses.
> 
> 2. Pretending to be a normal web browser.
> 
> 
> I'm trying to use the configuration below to block their access.
> 
> For 1 above, I check the X-Forwarded-For address: if the user agent claims
> to be a spider and X-Forwarded-For is not empty, then I block the request.
> I'm using:
> 
> # non-empty X-Forwarded-For maps to 1, empty to 0
> map $http_x_forwarded_for $xf {
>     default 1;
>     ""      0;
> }
> 
> # crawler-like user agent with X-Forwarded-For present = fake bot
> map $http_user_agent $fakebots {
>     default    0;
>     "~*bot"    $xf;
>     "~*bing"   $xf;
>     "~*search" $xf;
> }
> 
> if ($fakebots) {
>     return 444;
> } 
> 
> With this configuration, it seems the fake Google spiders can't access the
> root of my website, but they can still access my PHP files, and they can't
> access any JS or CSS files. Very strange. I don't know what's wrong.
> 
> 2. For user agents that declare they are not spiders, I use ngx_lua to
> generate a random value and add the value to a cookie, then check
> whether they can send this value back or not. If they can't send it back,
> that means they are robots, and I block access.
> 
> map $http_user_agent $ifbot {
>     default                 0;
>     "~*Yahoo"               1;
>     "~*archive"             1;
>     "~*search"              1;
>     "~*Googlebot"           1;
>     "~Mediapartners-Google" 1;
>     "~*bingbot"             1;
>     "~*msn"                 1;
>     "~*rogerbot"            3;
>     "~*ChinasoSpider"       3;
> }
> 
> if ($ifbot = "0") {
>     set $humanfilter 1;
> }
> 
> # the section below excludes flash uploads from the cookie check
> if ($request_uri !~ "mod=swfupload&action=swfupload") {
>     set $humanfilter "${humanfilter}1";
> }
> 
> if ($humanfilter = "11") {
>     rewrite_by_lua '
>         -- reuse the random value from the cookie, or generate a new one
>         local random = ngx.var.cookie_random
>         if random == nil then
>             random = math.random(999999)
>         end
>         -- the token ties the random value to the client address
>         local token = ngx.md5("hello" .. ngx.var.remote_addr .. random)
>         if ngx.var.cookie_token ~= token then
>             -- hand out the cookies and redirect; a real browser
>             -- will send them back on the next request
>             ngx.header["Set-Cookie"] = {"token=" .. token, "random=" .. random}
>             return ngx.redirect(ngx.var.scheme .. "://" .. ngx.var.host ..
>                 ngx.var.request_uri)
>         end
>     ';
> } 
> But it seems that with the above configuration, Googlebot is also blocked,
> while it shouldn't be.
> 
> 
> Can anyone help?
> 
> Thanks
> 
> Posted at Nginx Forum: http://forum.nginx.org/read.php?2,258659,258659#msg-258659
> 
> _______________________________________________
> nginx mailing list
> nginx at nginx.org
> http://mailman.nginx.org/mailman/listinfo/nginx
> 
> 

