
Screen Scraping Websites – Anything I can do?

The term “screen scraping” refers to automated processes that some people run which visit a website and scan or copy its content for their own use. You may have heard about Ryanair’s problems with screen scrapers in the past. Essentially, a third party takes the content that someone else produces and publishes on their website and uses it for its own benefit.

Another example, as far as I’m concerned, was the recent controversy where a third-party company developed an iPhone application based on the bike availability information published on the DublinBikes.ie website.

My question, though, is probably one for the more technically minded among you.

I believe that someone has started screen scraping some particular content on the ValueIreland.com website. It’s the only possibility that I can come up with based on some of the recent web visitor statistics I’ve been noticing.

It’s someone logging on to ValueIreland.com from an NTL/UPC internet connection, and they hit one particular page approximately once every hour during the day.

I don’t think it can be a person that’s doing this because there is very little content on the one particular page that’s being visited.
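For anyone who wants to check their own logs for the same pattern, here is a rough sketch of how it might be spotted. It assumes the web server writes Apache/Nginx-style access log lines; the function name `find_regular_visitors` and its thresholds are made up for illustration and are not something ValueIreland.com actually runs. It groups requests by IP address and page, then flags any pair where most of the gaps between visits are close to the same length (an overnight pause would not hide an hourly daytime visitor).

```python
import re
from collections import defaultdict
from datetime import datetime
from statistics import median

# Start of a Common/Combined Log Format line (an assumed log format):
# client IP, two ignored fields, [timestamp], "METHOD path ...
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(?:GET|HEAD) (\S+)')


def find_regular_visitors(lines, min_hits=5, max_jitter=0.1, min_share=0.8):
    """Return (ip, path, typical_gap_seconds) for clients that request
    the same page at near-constant intervals.

    All thresholds are illustrative guesses, not tuned values.
    """
    hits = defaultdict(list)
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        ip, stamp, path = match.groups()
        when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
        hits[(ip, path)].append(when)

    suspects = []
    for (ip, path), times in hits.items():
        if len(times) < min_hits:
            continue
        times.sort()
        gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
        typical = median(gaps)
        if typical <= 0:
            continue
        # Count the gaps that fall within max_jitter of the typical gap.
        near = sum(1 for g in gaps if abs(g - typical) <= max_jitter * typical)
        if near / len(gaps) >= min_share:
            suspects.append((ip, path, typical))
    return suspects
```

Something like `find_regular_visitors(open("access.log"))` would then list the suspects, where `access.log` stands in for wherever your host keeps its logs. A regular pattern only suggests automation; it does not show what the visitor does with the page.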

I know that Ryanair are talking about taking those that scrape their website to court, but is there something simple that I can do here on my site that would thwart whoever is trying to steal my content?

3 comments On Screen Scraping Websites – Anything I can do?

  • Until you have evidence otherwise, I’d assume it’s someone doing this purely for their own benefit/curiosity. (Of course, I don’t know what the page is.) If it’s not hurting you in bandwidth (and it sounds like it’s not), I wouldn’t waste time worrying about it. On the other hand, if they are profiting from or republishing the information (Google is an easy way to check this; search for some of the text in quotes), then there are a variety of technical and non-technical approaches. Technical ones involve IP address blocking, referer blocking, captchas, etc. (I would read up on these; some have drawbacks and all can potentially be worked around by a determined ‘scraper’.) There are also legal approaches of varying success depending on where the content is hosted.

  • Is it a bot/spider?
    If so, you could add rules to the robots.txt file to tell it to go away, but it depends on whether the bot respects the robots.txt instructions.

  • Make sure each page on your site has plenty of links back to your own site. At least then when someone steals your content they’ll probably be too lazy to replace the links. You’ll get free link backs to your site.

    Otherwise, don’t worry about it unless it impacts the performance of your server. It’s much easier to scrape content from your site’s RSS feed than by visiting a page on your site.
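To make the robots.txt suggestion above a little more concrete, here is a minimal sketch of what such rules could look like and how a well-behaved bot would read them. The bot name `ExampleScraper` and the path `/some-page/` are invented for illustration; the real user agent and page are not known. The check uses Python's standard `urllib.robotparser`, which is what a polite crawler might do before fetching a page. As the comment says, a scraper that ignores robots.txt will not be stopped by any of this.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: ban one named bot from the whole
# site and keep every other bot away from a single page.
ROBOTS_TXT = """\
User-agent: ExampleScraper
Disallow: /

User-agent: *
Disallow: /some-page/
"""


def build_parser(robots_txt):
    """Parse robots.txt text the way a compliant crawler would."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# The named bot is asked to stay away from everything.
print(parser.can_fetch("ExampleScraper", "https://www.valueireland.com/"))
# Any other bot may fetch the home page but not the protected page.
print(parser.can_fetch("OtherBot", "https://www.valueireland.com/"))
print(parser.can_fetch("OtherBot", "https://www.valueireland.com/some-page/"))
```

The file itself would be served as plain text from the root of the site, at `/robots.txt`.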


Copyright 2003-2018 ValueIreland.com