Help: How to deal with content scrapers?
Jonathan Vanasco
nginx at 2xlp.com
Thu Apr 23 05:41:07 MSD 2009
Some tips I learned from fighting email spam:
Sometimes the best thing you can do in this situation isn't to block,
but to identify the scraper, then throttle it and change the content it gets.
If you block, they'll try and try again before swapping IPs or
moving on to the next victim.
If you throttle to something crazy like 1 byte/second, most bot
operators won't notice, and you'll also end up tying up their connections.
You can also send them alternate content, like a mixture of
gibberish and text that identifies them as a spammer or that would
drop their search relevance.
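A minimal sketch of the identify, throttle, and alternate-content idea as an nginx configuration fragment. Everything specific here is an assumption for illustration and not taken from the original setup: the 192.0.2.0/24 range (a documentation-only address block standing in for a scraper you have already identified), the server name, the document root, and the /decoy.html page. How you identify the scrapers in the first place is left open.

```nginx
# Goes inside the http { } block.

# Identify: flag client addresses already known to be scrapers.
# 192.0.2.0/24 is a placeholder range; substitute real offenders.
geo $is_scraper {
    default        0;
    192.0.2.0/24   1;
}

server {
    listen       80;
    server_name  example.org;      # hypothetical host name
    root         /var/www/site;    # hypothetical document root

    location / {
        # Change content: flagged clients are rewritten to a decoy page
        # instead of being blocked, so they see no error to react to.
        if ($is_scraper) {
            rewrite ^ /decoy.html last;
        }
        # Normal handling for everyone else (static files, proxy_pass,
        # fastcgi_pass, ...) goes here.
    }

    location = /decoy.html {
        # Reachable only through the rewrite above, never directly.
        internal;
        # Throttle: 1 byte per second per connection, mirroring the
        # "something crazy" figure suggested above.
        limit_rate 1;
    }
}
```

Because the decoy is a static file served by nginx itself, flagged requests never reach the application or the database. The limit_rate cap applies per connection, so a scraper that opens many connections in parallel still ties up each of them separately.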
On Apr 22, 2009, at 8:17 PM, davidr wrote:
> Guys,
>
> I need some help. In the past few months, the site that I administer
> (it's a large medical non-profit/charity) has been attacked by
> content scraping bots. Basically, these content thieves scrape our
> sites and then repost the information on their own domains and also
> intersperse it with malware and ads. They quite often rank fairly high
> on Google because of it and when a user gets infected, they blame
> us. I've been asking Google to delist these sites but that takes
> days/weeks.
>
> These scrapers obviously don't care about robots.txt and they just
> indiscriminately scrape the content and ignore all the rules. I've
> been blocking these scrapers manually but by the time I'm aware of
> the problem, it's already too late. They really inflict a lot of
> damage to our database performance and many users complain that the
> site is too slow at times. When we correlate the data, we see that
> the slowdown occurs while these thieves are scraping the site.
>
> What's the best way to limit the number of requests an IP can make
> in, say, a 15-minute period? Is there a way to block
> them at the webserver (nginx) layer and move it away from the
> application layer, since app-layer blocking incurs too much of a
> performance hit? I'm looking for something that would simply count
> the number of requests over a particular time period and just
> add the IP to iptables if it ever crosses the limit.
>
> Any advice is much appreciated!!
>
> Thank you,
>
> Dave
>
> Posted at Nginx Forum: http://forum.nginx.org/read.php?2,1361,1361#msg-1361
>
>
// Jonathan Vanasco
e. jonathan at 2xlp.com
w. http://findmeon.com/user/jvanasco
blog. http://destructuring.net
| - - - - - - - - - -
| Founder/CEO - FindMeOn, Inc.
| FindMeOn.com - The cure for Multiple Web Personality Disorder
| - - - - - - - - - -
| CTO - ArtWeLove, LLC
| ArtWeLove.com - Explore Art On Your Own Terms
| - - - - - - - - - -
| Founder - SyndiClick
| RoadSound.com - Tools for Bands, Stuff for Fans
| - - - - - - - - - -