<div id="reply-content">
Hey,

Why not just compare their X-Forwarded-For against the connecting IP? If they
don't match and the request claims to be a bot, drop it.
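
Roughly, something like the untested sketch below ("example.com" and the
crawler pattern are only placeholders, and the $suspect flag is just one way
to combine several conditions without nesting "if" blocks):

# 1 when the User-Agent claims to be a well-known crawler.
map $http_user_agent $claims_bot {
    default                0;
    "~*googlebot|bingbot"  1;
}

server {
    listen      80;
    server_name example.com;

    location / {
        set $suspect "";

        # The User-Agent says it is a crawler.
        if ($claims_bot) {
            set $suspect "B";
        }

        # An X-Forwarded-For header is present...
        if ($http_x_forwarded_for ~ ".") {
            set $suspect "${suspect}X";
        }
        # ...but it matches the connecting address, so it is fine after all.
        if ($http_x_forwarded_for = $remote_addr) {
            set $suspect "";
        }

        # Self-declared crawler arriving through a proxy whose forwarded
        # address differs from the connecting one: close the connection.
        if ($suspect = "BX") {
            return 444;
        }

        # ... your normal proxy_pass / fastcgi_pass config goes here ...
    }
}

A genuine Googlebot fetching the site directly sends no X-Forwarded-For at
all, so it never collects the "X" flag and is left alone.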
<div id="8E63405F9D8241C18F251763C84C6E93"><div><br></div>-- <br>Payam Chychi<br>Network Engineer / Security Specialist<div><br></div></div>
<p style="color: #A0A0A8;">On Tuesday, May 5, 2015 at 5:38 AM, meteor8488 wrote:</p>
<div id="quoted-message-content"><div><div>Hi All,</div><div><br></div><div>Recently I found that someguys are trying to mirror my website. They are</div><div>doing this in two ways:</div><div><br></div><div>1. Pretend to be google spiders . Access logs are as following:</div><div><br></div><div>89.85.93.235 - - [05/May/2015:20:23:16 +0800] "GET /robots.txt HTTP/1.0" 444</div><div>0 "http://www.example.com" "Mozilla/5.0 (compatible; Googlebot/2.1;</div><div>+http://www.google.com/bot.html)" "66.249.79.138"</div><div>79.85.93.235 - - [05/May/2015:20:23:34 +0800] "GET /robots.txt HTTP/1.0" 444</div><div>0 "http://www.example.com" "Mozilla/5.0 (compatible; Googlebot/2.1;</div><div>+http://www.google.com/bot.html)" "66.249.79.154"</div><div><br></div><div>The http_x_forwarded_for address are google addresses.</div><div><br></div><div>2. Pretend to be a normal web browser.</div><div><br></div><div><br></div><div>I'm trying to use below configuration to block their access:</div><div><br></div><div><br></div><div><br></div><div>For 1 above, I'll check X_forward_for address. If user agent is spider, and</div><div>X_forward_for is not null. Then block.</div><div>I'm using</div><div><br></div><div>map $http_x_forwarded_for $xf {</div><div>default 1;</div><div>"" 0;</div><div>}</div><div>map $http_user_agent $fakebots {</div><div>default 0;</div><div>"~*bot" $xf;</div><div>"~*bing" $xf;</div><div>"~*search" $xf;</div><div>}</div><div>if ($fakebots) {</div><div>return 444;</div><div>} </div><div><br></div><div>With this configuration, it seems the fake google spider can't access the</div><div>root of my website. But they can still access my php files, and they can't</div><div>access and js or css files. Very strange. I don't know what's wrong.</div><div><br></div><div>2. For user-agent who declare they are not spiders. I'll use ngx_lua to</div><div>generate a random value and add the value into cookie, and then check</div><div>whether they can send this value back or not. If they can't send it back,</div><div>then it means that they are robot and block access.</div><div><br></div><div>map $http_user_agent $ifbot {</div><div>default 0;</div><div>"~*Yahoo" 1;</div><div>"~*archive" 1;</div><div>"~*search" 1;</div><div>"~*Googlebot" 1;</div><div>"~Mediapartners-Google" 1;</div><div>"~*bingbot" 1;</div><div>"~*msn" 1;</div><div>"~*rogerbot" 3;</div><div>"~*ChinasoSpider" 3;</div><div>}</div><div><br></div><div>if ($ifbot = "0") {</div><div>set $humanfilter 1;</div><div>}</div><div>#below section is to exclude flash upload</div><div>if ( $request_uri !~ "~mod\=swfupload\&action\=swfupload" ) {</div><div>set $humanfilter "${humanfilter}1";</div><div>}</div><div><br></div><div>if ($humanfilter = "11"){</div><div>rewrite_by_lua '</div><div>local random = ngx.var.cookie_random</div><div>if(random == nil) then</div><div>random = math.random(999999)</div><div>end</div><div>local token = ngx.md5("hello" .. ngx.var.remote_addr .. random)</div><div>if (ngx.var.cookie_token ~= token) then</div><div>ngx.header["Set-Cookie"] = {"token=" .. token, "random=" .. random}</div><div>return ngx.redirect(ngx.var.scheme .. "://" .. 
ngx.var.host ..</div><div>ngx.var.request_uri)</div><div>end</div><div>';</div><div>} </div><div>But it seems that with above configuration, google bot is also blocked while</div><div>it shouldn't.</div><div><br></div><div><br></div><div>Any one can help?</div><div><br></div><div>Thanks</div><div><br></div><div>Posted at Nginx Forum: http://forum.nginx.org/read.php?2,258659,258659#msg-258659</div><div><br></div><div>_______________________________________________</div><div>nginx mailing list</div><div>nginx@nginx.org</div><div>http://mailman.nginx.org/mailman/listinfo/nginx</div></div></div>