How to block fake google spider and fake web browser access?

meteor8488 nginx-forum at nginx.us
Tue May 5 12:38:16 UTC 2015


Hi All,

Recently I found that some people are trying to mirror my website. They are
doing this in two ways:

1. Pretending to be Google spiders. The access logs look like this:

89.85.93.235 - - [05/May/2015:20:23:16 +0800] "GET /robots.txt HTTP/1.0" 444
0 "http://www.example.com" "Mozilla/5.0 (compatible; Googlebot/2.1;
+http://www.google.com/bot.html)" "66.249.79.138"
79.85.93.235 - - [05/May/2015:20:23:34 +0800] "GET /robots.txt HTTP/1.0" 444
0 "http://www.example.com" "Mozilla/5.0 (compatible; Googlebot/2.1;
+http://www.google.com/bot.html)" "66.249.79.154"

The $http_x_forwarded_for values (the last field) are Google addresses.
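
For reading these entries: the field layout matches the "main" log format
from the sample nginx.conf shipped with nginx, whose last quoted field is
$http_x_forwarded_for. Assuming that format is the one in use here (the
access_log path is a placeholder), the definition would look roughly like:

```nginx
# http context
log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                '$status $body_bytes_sent "$http_referer" '
                '"$http_user_agent" "$http_x_forwarded_for"';

access_log logs/access.log main;   # placeholder path
```

Under that assumption, 89.85.93.235 is the address that actually connected
($remote_addr), and 66.249.79.138 is only what the client put in its
X-Forwarded-For header.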

2. Pretending to be a normal web browser.


I'm trying to block their access with the configuration below.

For 1 above, I check the X-Forwarded-For header. If the user agent claims to
be a spider and X-Forwarded-For is not empty, the request is blocked.
I'm using:

map $http_x_forwarded_for $xf {
    default 1;
    ""      0;
}
map $http_user_agent $fakebots {
    default     0;
    "~*bot"     $xf;
    "~*bing"    $xf;
    "~*search"  $xf;
}
if ($fakebots) {
    return 444;
}

With this configuration, the fake Google spider can't access the root of my
website. But it can still access my php files, while it can't access any js
or css files. Very strange. I don't know what's wrong.
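
One arrangement that would produce exactly this pattern, assuming the "if"
sits inside "location /" while PHP is handled by a separate regex location:
requests for .php URIs are matched by the PHP location and never evaluate the
"if", whereas the root page and static files end up in "location /" and get
444. A minimal sketch of a layout that checks at server level instead (the
server_name, the PHP location and the fastcgi_pass address are placeholders,
not taken from the configuration above):

```nginx
http {
    # map is only valid in the http context
    map $http_x_forwarded_for $xf {
        default 1;
        ""      0;
    }
    map $http_user_agent $fakebots {
        default     0;
        "~*bot"     $xf;
        "~*bing"    $xf;
        "~*search"  $xf;
    }

    server {
        listen 80;
        server_name www.example.com;       # placeholder

        # Evaluated at server level, before a location is selected,
        # so it applies to php, js, css and everything else alike.
        if ($fakebots) {
            return 444;
        }

        location / {
            # static files
        }

        location ~ \.php$ {
            include fastcgi_params;
            fastcgi_pass 127.0.0.1:9000;   # placeholder backend
        }
    }
}
```

If the real layout differs from the assumption, this does not explain the
behaviour.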

For 2 above, i.e. user agents that do not declare themselves as spiders, I
use ngx_lua to generate a random value, put it into a cookie, and then check
whether the client sends this value back. If it doesn't, I treat the client
as a robot and block access.

map $http_user_agent $ifbot {
    default                  0;
    "~*Yahoo"                1;
    "~*archive"              1;
    "~*search"               1;
    "~*Googlebot"            1;
    "~Mediapartners-Google"  1;
    "~*bingbot"              1;
    "~*msn"                  1;
    "~*rogerbot"             3;
    "~*ChinasoSpider"        3;
}

if ($ifbot = "0") {
    set $humanfilter 1;
}
# the section below excludes flash upload
# (no leading "~" inside the pattern: the !~ operator already makes it a regex)
if ($request_uri !~ "mod\=swfupload\&action\=swfupload") {
    set $humanfilter "${humanfilter}1";
}

if ($humanfilter = "11") {
    rewrite_by_lua '
        local random = ngx.var.cookie_random
        if (random == nil) then
            random = math.random(999999)
        end
        local token = ngx.md5("hello" .. ngx.var.remote_addr .. random)
        if (ngx.var.cookie_token ~= token) then
            ngx.header["Set-Cookie"] = {"token=" .. token, "random=" .. random}
            return ngx.redirect(ngx.var.scheme .. "://" .. ngx.var.host ..
                                ngx.var.request_uri)
        end
    ';
}

But with the above configuration, it seems that Googlebot is also blocked,
which it shouldn't be.


Can anyone help?

Thanks

Posted at Nginx Forum: http://forum.nginx.org/read.php?2,258659,258659#msg-258659


