This was more of a test to see how easy it would be to scrape some data using Ruby, as I usually use PHP for this. With PHP, though, it always seems to end up as a mess of explodes and regexes to get what I want, so I wanted to see how other languages do it.
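For comparison, here is roughly what that explode-and-regex style looks like when carried over to Ruby with nothing but the standard string methods. This is only a sketch: the sample markup, the scrape_rows name and the item/price fields are all made up for illustration, and it is not how any of the libraries below work.

```ruby
# Hypothetical sample markup; a real page would be fetched over HTTP first.
HTML = <<~PAGE
  <table>
    <tr><td class="item">Widget</td><td class="price">4.99</td></tr>
    <tr><td class="item">Gadget</td><td class="price">12.50</td></tr>
  </table>
PAGE

# Split the page into rows (the Ruby counterpart of PHP's explode),
# then pull the cell contents out of each row with a regex.
def scrape_rows(html)
  html.split("</tr>").filter_map do |chunk|
    cells = chunk.scan(%r{<td[^>]*>(.*?)</td>}m).flatten
    next if cells.empty? # skip chunks with no cells, e.g. the closing </table>
    { item: cells[0], price: cells[1].to_f }
  end
end

rows = scrape_rows(HTML)
rows.each { |row| puts "#{row[:item]}: #{row[:price]}" }
```

It works on tidy markup like this, but it breaks as soon as the page nests tables or changes its layout, which is exactly why a selector-based scraping library is appealing.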

So first things first, I found out which packages/libraries were around for Ruby. Admittedly there are quite a few, but at first look the ones that appear to be the better ones don’t work out of the box. That is a big shame, as I spent a good few hours trying to get scRUBYt! working on my Windows XP development machine, and in the end I had to admit defeat. It just wasn’t going to work properly, even though I had followed every guide I could find for getting it working in a Windows environment. I was a bit disheartened, as this appeared to be the best one out there: its first example was exactly what I was looking for.

Alas, I moved on and had a quick look at Hpricot, but it didn’t seem to do what I wanted easily. After trying to find some examples of how they all worked I found scrAPI, which is very similar to scRUBYt!. However, no matter how hard I looked, I could only find one example, which was its own eBay one. It had a nice installation, just a gem, which worked first time (as one usually expects), and from there I tried their example. I had read on their website that it uses Tidy to do the HTML cleaning, but that you can use it without Tidy if you tell it to use the built-in parser. It warns you that this should only really be used for testing, and as that was what I was doing, I told it exactly that:

Scraper::Base.parser :html_parser

To cut to the chase: nearly 3 hours, much googling and frustration later, it was still refusing to do the simplest of processes and scrape part of a table I already had. Telling it to use the built-in parser instead of Tidy was my downfall. The second I switched it back to Tidy (plus a quick re-ordering of the loading code to stop it trying to load the Linux version first), it was working perfectly. This was someone’s comment about it all…

Phases of scrAPI usage:

1. Elation – Wow, this is so easy and powerful. I’m gonna scrape the world!!!!!

2. Despair – What the hell is the syntax for the selectors, I’m so confused, and there are no docs

3. Elation – scrAPI has great test coverage, you can learn everything you need to know about the selectors from the tests.

Couldn’t agree more, although instead of getting help from the test cases I got it from a scrAPI cheat sheet I found. Without that I would probably have given up and tried the next one in line.

So I’m now testing and doing more complex scraping with scrAPI, but if you are a beginner to Ruby you may find this task quite challenging, and probably beyond your knowledge to begin with. The lack of documentation and examples made it much harder than it should have been, and most of it was done by guesswork about how it was interpreting the HTML source code.

Bottom line: if you can get scRUBYt! working, go with that. It’s very powerful and, from the examples I have seen, very programmer friendly, and it will get you to the data you want fast. But if you can’t get it going, there are many others available, including plenty I haven’t mentioned, and Google knows all! (apart from examples of scrAPI :P )