Help: How to deal with content scrapers?

davidr nginx-forum at nginx.us
Thu Apr 23 04:17:41 MSD 2009


Guys,

I need some help. In the past few months, the site I administer (a large medical non-profit/charity) has been attacked by content-scraping bots. These content thieves scrape our sites, repost the information on their own domains, and intersperse it with malware and ads. The copies quite often rank fairly high on Google, and when a user gets infected, they blame us. I've been asking Google to delist these sites, but that takes days or weeks.

These scrapers obviously don't care about robots.txt; they scrape the content indiscriminately and ignore all the rules. I've been blocking them manually, but by the time I'm aware of the problem, it's already too late. They put a real strain on our database, and many users complain that the site is too slow at times. When we correlate the data, the slowdowns line up with the periods when these thieves are scraping the site.

What's the best way to limit the number of requests an IP can make in a given time period, say 15 minutes? Is there a way to block them at the web server (nginx) layer rather than at the application layer, since application-layer blocking incurs too much of a performance hit? I'm looking for something that would simply count the number of requests per IP over a particular time period and add the IP to iptables if it crosses the limit.
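From what I can tell from the docs, nginx's limit_req module might already cover the counting part. Here is a minimal sketch of what I have in mind -- the zone name "scrapers", the 1r/s rate, and the burst value are just placeholders I made up, not recommendations:

    http {
        # Keyed on the client IP; a 10 MB shared-memory zone holds
        # the per-IP state. The rate caps sustained requests per IP.
        limit_req_zone  $binary_remote_addr  zone=scrapers:10m  rate=1r/s;

        server {
            listen 80;

            location / {
                # Allow short bursts above the rate; anything beyond
                # the burst is rejected (nginx answers with a 503).
                limit_req  zone=scrapers  burst=20;
            }
        }
    }

For the iptables half, I assume something like fail2ban could tail the nginx error log for the "limiting requests" entries and ban the offending IPs for a while, but I haven't tried that yet.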

Any advice is much appreciated!!

Thank you,

Dave

Posted at Nginx Forum: http://forum.nginx.org/read.php?2,1361,1361#msg-1361