Screen Scraping Websites – Anything I can do?

The term “screen scraping” refers to automated processes that visit a website and scan or copy its content for the operator’s own use. You may have heard about Ryanair’s problems with screen scrapers in the past. Essentially, a third party takes content that someone else produces and publishes on their website and uses it for its own benefit.

Another example, as far as I’m concerned, was the recent controversy where a third-party company developed an iPhone application based on the information presented on the website about where bikes were and were not available.

My question, though, is probably more for the technically minded people amongst you.

I believe that someone has started screen scraping some particular content on the website. It’s the only explanation I can come up with based on the recent web visitor statistics.

It’s someone connecting from an NTL/UPC internet connection, and they hit one particular page approximately once every hour during the day.

I don’t think it can be a person doing this, because there is very little content on the one page that’s being visited.
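One rough way to test that hunch is to look at the gaps between visits from the same address: a person browsing tends to be irregular, while a scheduled job is not. Below is a minimal sketch, assuming the visit times for one IP address have already been pulled out of the access log. The function name `looks_periodic`, the five-minute tolerance and the 80% threshold are my own illustrative choices, not anything a stats package provides.

```python
from datetime import datetime, timedelta


def looks_periodic(timestamps, period=timedelta(hours=1),
                   tolerance=timedelta(minutes=5), min_hits=4):
    """Return True if most consecutive visits are about `period` apart.

    `timestamps` is a list of datetime objects for a single IP address.
    The thresholds are illustrative assumptions, not established values.
    """
    times = sorted(timestamps)
    if len(times) < min_hits:
        # Too few visits to say anything about regularity.
        return False
    # Time elapsed between each visit and the next one.
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    # Count the gaps that fall within `tolerance` of the expected period.
    near = sum(1 for gap in gaps if abs(gap - period) <= tolerance)
    # Allow a few irregular gaps (an overnight pause, for example).
    return near / len(gaps) >= 0.8
```

A result of True only suggests a scheduled job of some kind. It says nothing about what the visitor does with the page afterwards.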

I know that Ryanair are talking about taking those that scrape their website to court, but is there something simple that I can do here on my site that would thwart whoever is trying to steal my content?


3 Responses to Screen Scraping Websites – Anything I can do?

  1. Damien January 23, 2010 at 00:07 #

    Until you have evidence otherwise, I’d assume it’s someone doing this purely for their own benefit/curiosity. (Of course, I don’t know what the page is.) If it’s not hurting you in bandwidth (and it sounds like it’s not), I wouldn’t waste time worrying about it. On the other hand, if they are profiting from or republishing the information (Google is an easy way to check this: search for some of the text in quotes), then there are a variety of technical and nontechnical approaches. Technical ones involve IP address blocking, referer blocking, captchas, etc. (I would read up on these; some have drawbacks and all can potentially be worked around by a determined ‘scraper’.) There are also legal approaches of varying success depending on where the content is hosted.

  2. Richard C January 25, 2010 at 17:37 #

    Is it a bot/spider?
    If so, you could add rules to the robots.txt file to tell it to go away, but it depends on whether the bot respects the robots.txt instructions.

  3. Donncha O Caoimh January 31, 2010 at 12:26 #

    Make sure each page on your site has plenty of links back to your own site. At least then when someone steals your content they’ll probably be too lazy to replace the links. You’ll get free link backs to your site.

    Otherwise, don’t worry about it unless it impacts the performance of your server. It’s much easier to scrape content from your site’s RSS feed than by visiting a page on your site.
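To make the robots.txt suggestion from the comments concrete, here is a minimal sketch of what such rules might look like and how a well-behaved bot would interpret them, using Python’s standard `urllib.robotparser` module. The user-agent name `ExampleScraper`, the `/private/` path and the `example.com` URLs are hypothetical placeholders. As noted above, this only helps against bots that choose to honour the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: shut out one named bot entirely and keep
# every other bot out of /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleScraper
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if a compliant bot with this user agent may fetch `url`."""
    parser = RobotFileParser()
    # parse() takes the file's contents as a list of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

For example, `allowed("ExampleScraper", "http://example.com/page")` is False under these rules, while the same URL is allowed for any other user agent. A scraper that ignores robots.txt, or that sends a browser-like user agent, is unaffected, which is where the IP blocking mentioned earlier would come in.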
