Friday, February 23, 2007

Using web fetching to create novel websites

Collecting information from a website and presenting it to your visitors in a different, more useful/interesting format can be a valuable service. It can help attract new visitors and get existing visitors coming back for more.

The best way to collect data from a website is to use their web service if they provide one. Google, Yahoo, MSN, Amazon, eBay, Technorati and numerous others all offer web services. These are well-documented interfaces through which you can easily collect data. The problems come when you want to collect data from a site that doesn’t provide a web service.

If a site doesn’t offer a web service or some other well-structured version of their data then the only option remaining is web scraping. This can be messy, often requires extensive testing and debugging, and will break whenever the target website changes its design. With all these problems it should only ever be a last resort.

The reason it is so difficult is that you are collecting the data you need from the HTML used to display the website to each visitor, and that HTML was written for browsers rather than for your script. Each site is unique, so each web scraper is also unique.

If the site we are interested in doesn’t offer a web service or an RSS feed, and we can’t get the data we need from anywhere else, then how do we go about building a web scraper?

The process can be broken down into three parts: get the HTML of the page, extract the data we need, and then do something with that data.

Depending on how the site you are fetching information from is set up, fetching the HTML could be very easy or exceedingly complicated. At its simplest, all you will need to do is open the URL just as you would open a file located on your local server. If you don’t need to log in to view the data you want on the site then it may well be this easy. For your sake I hope it is this simple.
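To give a feel for the simple case, here is a minimal sketch in Python using only the standard library. The function name fetch_html, the User-Agent string and the 10 second timeout are my own choices for illustration, not anything a particular site requires.

```python
from urllib.request import Request, urlopen


def fetch_html(url, timeout=10):
    """Return the document at `url` as text (hypothetical helper)."""
    # Identify the script politely; the header value here is only an example.
    request = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(request, timeout=timeout) as response:
        # Use the charset the server declares, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

You would call it as fetch_html("http://example.com/") and get the page back as one long string. The same function also accepts file:// URLs, which is handy for testing against a saved copy of a page before pointing the script at the live site.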

If you need to log in to the site to access the data you need then things can get complicated, really complicated. I’m currently trying to collect and develop scripts to fetch the contact lists from webmail services such as Hotmail. Here the obstacles you must face include cookies and variables in the URL. These problems can be overcome but you’ll need plenty of time.
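The cookie part of the problem can be sketched like this in Python. It assumes the simplest possible case, where the site accepts a plain form post and then tracks the session with cookies. The helper names, the login URL and the form field names are all hypothetical, and real sites often add hidden form fields, redirects or tokens in the URL that you will have to handle separately.

```python
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, build_opener


def make_session():
    """Build an opener that stores cookies and sends them back on later requests."""
    jar = CookieJar()
    opener = build_opener(HTTPCookieProcessor(jar))
    return opener, jar


def encode_form(fields):
    """Encode a dict of form fields as the body of a POST request."""
    return urlencode(fields).encode("ascii")


def log_in(opener, login_url, fields, timeout=10):
    """Post the login form; any cookies set by the site end up in the jar."""
    # Passing `data` makes this a POST request rather than a GET.
    with opener.open(login_url, data=encode_form(fields), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

After a successful log_in, every later opener.open(...) call on the same opener sends the session cookies along, so the pages behind the login can be fetched in the usual way.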

Once you have the HTML for the page you need to clean it up so that you have just the data you need with no extra text. The best way to do this is with regular expressions. By defining patterns for the data you want you can cut out the chaff and just keep what you need. Unless you are extremely comfortable with regular expressions you may find it easier to use several simpler patterns one after another. Your script may take a little longer to run but it will be far easier to develop.
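Here is a small Python sketch of the "several simpler patterns" approach. The sample HTML, the table id and the class names are invented for the example; your target page will look different and the patterns will need adjusting to match.

```python
import re
from html import unescape

# Made-up page standing in for the HTML you fetched.
SAMPLE_HTML = """
<html><body>
<h1>Price list</h1>
<table id="prices">
  <tr><td class="name">Widget</td><td class="price">$4.50</td></tr>
  <tr><td class="name">Gadget &amp; Co</td><td class="price">$12.00</td></tr>
</table>
<p>Footer text</p>
</body></html>
"""


def extract_prices(html):
    """Return a list of (name, price) tuples found in the prices table."""
    # Step 1: throw away everything outside the table we care about.
    table = re.search(r'<table id="prices">(.*?)</table>', html, re.DOTALL)
    if table is None:
        return []
    # Step 2: split the table into rows.
    rows = re.findall(r"<tr>(.*?)</tr>", table.group(1), re.DOTALL)
    results = []
    for row in rows:
        # Step 3: pull the two cells out of each row.
        name = re.search(r'<td class="name">(.*?)</td>', row)
        price = re.search(r'<td class="price">\$([\d.]+)</td>', row)
        if name and price:
            # unescape turns entities such as &amp; back into plain characters.
            results.append((unescape(name.group(1).strip()), float(price.group(1))))
    return results
```

Each pattern is short enough to read at a glance, and when the site changes its design you can usually see which of the three steps has stopped matching.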

Finally, you want to do something with the data you gather. This may be as simple as storing it in a text file or displaying it immediately to a visitor to your site. You may also decide to do some more complex tasks with the data. The important thing is that once you have the data you have the choice of what to do with it.
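Storing the results in a text file might look like the following Python sketch. I have assumed the data is a list of (name, price) rows and picked CSV as the file format; both the column names and the helper names are only examples.

```python
import csv


def save_rows(path, rows, header=("name", "price")):
    """Write the scraped rows to a CSV text file, one row per line."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)


def load_rows(path):
    """Read the rows back, skipping the header line. All values come back as strings."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header if the file has one
        return [tuple(row) for row in reader]
```

Keeping a saved copy like this also means your own pages can be built from the file, so a visitor to your site doesn’t trigger a fresh fetch of the target site every time.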

Web scraping isn’t easy but the benefits can be considerable. If a web service is available, though, save yourself some time and use that instead.




http://a1articles.com/article_123924_4.html