Help: How to deal with content scrapers?

Jonathan Vanasco nginx at 2xlp.com
Thu Apr 23 05:41:07 MSD 2009


Some tips I learned from fighting email spam:

	Sometimes the best thing you can do in this situation isn't to block,  
but to identify and throttle + change content.
		If you block, they'll just try again and again before swapping  
IPs or moving on to the next victim.
		If you throttle to something crazy like 1 byte/second, most bot  
operators won't notice.  You'll also end up tying up their connections.
		You can also send them alternate content -- a mixture of  
gibberish and text that identifies them as a scraper, or that would  
drop their search relevance.
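The throttle + alternate-content idea above can be sketched in nginx config. A minimal sketch, assuming the scraper can be recognized by its User-Agent; the "scraperbot" pattern, the rate, and the /gibberish.html page are all illustrative (limit_rate is in bytes per second):

```nginx
server {
    listen 80;

    location / {
        root /var/www/site;

        # "scraperbot" is a hypothetical pattern -- match whatever
        # actually identifies the bot (User-Agent, IP range, etc.)
        if ($http_user_agent ~* "scraperbot") {
            set $limit_rate 10;               # ~10 bytes/sec: ties up the bot's connection
            rewrite ^ /gibberish.html break;  # serve alternate content instead
        }
    }
}
```

Setting the $limit_rate variable throttles only the matched requests; normal visitors are unaffected.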



On Apr 22, 2009, at 8:17 PM, davidr wrote:

> Guys,
>
> I need some help. In the past few months, the site that I administer  
> (it's a large medical non-profit/charity) has been attacked by  
> content-scraping bots. Basically, these content thieves scrape our  
> sites and then repost the information on their own domains,  
> interspersed with malware and ads. They quite often rank fairly high  
> on Google because of it, and when a user gets infected, they blame  
> us. I've been asking Google to delist these sites, but that takes  
> days/weeks.
>
> These scrapers obviously don't care about robots.txt; they just  
> indiscriminately scrape the content and ignore all the rules. I've  
> been blocking these scrapers manually, but by the time I'm aware of  
> the problem, it's already too late. They inflict a lot of damage on  
> our database performance, and many users complain that the site is  
> too slow at times. When we correlate the data, we see that the  
> slowdowns occur while these thieves are scraping the site.
>
> What's the best way to limit the number of requests an IP can make  
> in, say, a 15-minute period? Is there a way to block them at the  
> webserver (nginx) layer rather than the application layer, since  
> app-layer blocking incurs too much of a performance hit? I'm looking  
> for something that would simply count the number of requests over a  
> particular time period and add the IP to iptables if it ever crosses  
> the limit.
>
> Any advice is much appreciated!!
>
> Thank you,
>
> Dave
>
> Posted at Nginx Forum: http://forum.nginx.org/read.php?2,1361,1361#msg-1361
>
>
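For the per-IP request counting Dave asks about, nginx's ngx_http_limit_req_module (available since 0.7.21) can do it at the webserver layer, without touching the application. A sketch -- the zone name, rate, and burst are illustrative, not recommendations:

```nginx
http {
    # 10 MB shared-memory zone keyed by client IP,
    # allowing 2 requests/second on average per IP
    limit_req_zone  $binary_remote_addr  zone=scrapers:10m  rate=2r/s;

    server {
        location / {
            # absorb short bursts of up to 20 requests,
            # then reject further requests instead of hitting the backend
            limit_req  zone=scrapers  burst=20;
        }
    }
}
```

limit_req rejects the excess requests itself rather than dropping connections, so it protects the database without any iptables involvement.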

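As for automatically adding offenders to iptables: a cron job that counts requests per IP in the access log and blocks anything over a threshold will do. A rough sketch -- the over_limit name, log path, and threshold are made up for illustration, and the demo runs against a tiny synthetic log:

```shell
#!/bin/sh
# Sketch: find IPs exceeding a request threshold in an nginx access log.
# In the default combined log format, the client IP is the first field.
# over_limit LOGFILE LIMIT -> prints offending IPs, one per line.
over_limit() {
    awk -v limit="$2" '
        { hits[$1]++ }
        END { for (ip in hits) if (hits[ip] > limit) print ip }
    ' "$1"
}

# Demo on a synthetic log (real use: point it at your access.log slice
# and run it from cron every 15 minutes).
printf '1.2.3.4 - -\n1.2.3.4 - -\n1.2.3.4 - -\n5.6.7.8 - -\n' > /tmp/demo.log
over_limit /tmp/demo.log 2    # prints 1.2.3.4

# Offenders could then be blocked, e.g.:
#   over_limit /var/log/nginx/access.log 1000 | \
#       while read ip; do iptables -A INPUT -s "$ip" -j DROP; done
```

Rotate or truncate the log slice between runs so the counts really cover one window, or old traffic will keep an IP over the limit forever.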
// Jonathan Vanasco

e. jonathan at 2xlp.com
w. http://findmeon.com/user/jvanasco
blog. http://destructuring.net

|   -   -   -   -   -   -   -   -   -   -
|   Founder/CEO - FindMeOn, Inc.
|      FindMeOn.com - The cure for Multiple Web Personality Disorder
|   -   -   -   -   -   -   -   -   -   -
|   CTO - ArtWeLove, LLC
|      ArtWeLove.com - Explore Art On Your Own Terms
|   -   -   -   -   -   -   -   -   -   -
|   Founder - SyndiClick
|      RoadSound.com - Tools for Bands, Stuff for Fans
|   -   -   -   -   -   -   -   -   -   -







