The virtues of HPricot: scraping DZone

A couple of days ago I was looking at the XML parsing solutions available in Ruby. I played a bit with REXML, a conformant but kind of dull XML processor, and it works great. When you're working with less conformant, XML-like data (like websites), though, it might just not be what you're looking for. Luckily, Ruby demi-god and respected colleague Remco pointed me to HPricot.
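
To make the REXML point a bit more concrete, here is a small sketch of what happens when a strict XML parser meets typical website markup. The sample strings are made up for illustration; the only assumption is that REXML is available (it ships with Ruby, as a bundled gem in recent versions).

```ruby
require 'rexml/document'

# Well-formed XML is no problem for REXML.
xml = REXML::Document.new('<links><link rating="4">Using Rails without a database</link></links>')
first_link = xml.root.elements['link']
puts "#{first_link.text} (rating #{first_link.attributes['rating']})"

# Typical website markup: the <br> is never closed, so this is not well-formed XML.
html = '<p>A simple <br> test string.</p>'
parsed = nil
begin
  REXML::Document.new(html)
  parsed = true
rescue REXML::ParseException => e
  parsed = false
  puts "REXML refused the markup: #{e.class}"
end
```

A parser that is forgiving about this kind of tag soup is exactly what you want for scraping.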

Hpricot is a very flexible HTML parser, based on Tanaka Akira's HTree and John Resig's jQuery, but with the scanner recoded in C.

Like most Ruby libraries, HPricot is not hard to install: 'gem install hpricot', or 'gem install hpricot --source http://code.whytheluckystiff.net' for the current development version. After that, to use HPricot, just require 'hpricot'.

You can parse files, strings, etc. by passing them to HPricot:

RUBY:
  # parse a string
  doc = Hpricot("<p>A simple <b>test</b> string.</p>")
  # parse a local file
  doc = open("index.html") { |f| Hpricot(f) }
  # parse a web page (needs open-uri)
  require 'open-uri'
  doc = Hpricot(open("http://qwantz.com/"))

Once you have your HPricot document, you can use XPath or CSS selectors (!!!!!) to traverse it.

Now say I want to scrape the details of the links I posted to DZone recently, which are displayed in a rich HTML page.

The divs containing the information about each link can be identified by their CSS class, 'linkblock'. If I want to obtain this part of the page, I can simply do this:

RUBY:
  require 'rubygems'
  require 'open-uri'
  require 'hpricot'

  doc = Hpricot(open("http://www.dzone.com/links/users/links/193149.html"))
  # print the inner HTML of the first element with CSS class 'linkblock'
  puts((doc/'.linkblock').first.inner_html)

This will print the HTML content of the first node matching the CSS selector. If you don't like the slash syntax, (doc/'.linkblock'), it's identical to calling doc.search('.linkblock'). Now, let's start scraping some information:

RUBY:
  require 'rubygems'
  require 'open-uri'
  require 'hpricot'

  doc = Hpricot(open("http://www.dzone.com/links/users/links/193149.html"))

  (doc/'.linkblock').each do |lb|
    title = (lb/'//a[@rel="bookmark"]').first.inner_html # xpath as well! Get the title of the bookmark
    rating = (lb/'a.upcount').first.inner_html           # the first 'a' element with CSS class 'upcount'
    details = (lb/'p.fineprint')[1].inner_html           # the second 'p' element with CSS class 'fineprint'
    puts "#{title} => rating: #{rating}, details: #{details}"
  end

This prints a list of the links I recently posted on DZone, along with some details like ratings:

Using Rails without a database => rating: 4, details: Submitted: Dec 05 / 07:26. Views: 63, Clicks: 24
Using propertyMissing to enhance Date (in Groovy) => rating: 3, details: Submitted: Dec 02 / 16:37. Views: 56, Clicks: 9
Extending LDAP in Java with JLDAP => rating: 2, details: Submitted: Nov 16 / 17:11. Views: 99, Clicks: 30
CPD with maven2 and PMD => rating: 1, details: Submitted: Nov 14 / 15:12. Views: 66, Clicks: 9
multiproject maven2: getting the site to work => rating: 2, details: Submitted: Nov 14 / 14:47. Views: 86, Clicks: 22
Getting started with multiproject maven2 => rating: 12, details: Published: Nov 14 / 19:32. Views: 614, Clicks: 302
Getting started with JBoss Seam (using seam-gen) => rating: 6, details: Published: Oct 20 / 12:23. Views: 961, Clicks: 408
Starting with JBoss Seam (using seam-gen) => rating: 1, details: Submitted: Oct 18 / 14:35. Views: 184, Clicks: 40
Howto: package level annotations => rating: 2, details: Submitted: Oct 13 / 15:28. Views: 155, Clicks: 29
How to start doing behaviour driven development using RSpec => rating: 6, details: Published: Oct 12 / 07:39. Views: 747, Clicks: 393
Testing a Grails taglib using XmlSlurper => rating: 6, details: Submitted: Oct 02 / 15:11. Views: 135, Clicks: 17
LazyList implementation for Googles' Collection Library => rating: 12, details: Published: Oct 04 / 07:48. Views: 639, Clicks: 303
Quick look at the Google Collection Library => rating: 13, details: Published: Oct 01 / 01:00. Views: 1752, Clicks: 1088
Getting started with TDD in Grails => rating: 12, details: Published: Sep 28 / 03:48. Views: 514, Clicks: 125
ActiveForm plugin released => rating: 3, details: Submitted: Sep 25 / 02:33. Views: 97, Clicks: 22
Quick take on Wicket Web Beans => rating: 9, details: Published: Sep 15 / 06:28. Views: 604, Clicks: 249
Unittesting e-mail sending in Spring => rating: 3, details: Submitted: Aug 29 / 03:37. Views: 86, Clicks: 20
Exposing services to SOAP in Grails => rating: 8, details: Published: Aug 31 / 12:12. Views: 559, Clicks: 143

Wow, with only 6 lines of actual code! Great! For more tips and tricks I really recommend the HPricot website; the documentation is superb.
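
One possible next step: the details come back as a single string, so if you want actual numbers you still have to pick it apart. Below is a rough sketch using a plain regular expression. The helper name parse_details, the DETAILS_PATTERN constant and the hash keys are my own invention, and it assumes the details text looks exactly like the output above (if the real inner_html contains extra markup, the pattern would need adjusting).

```ruby
# Hypothetical pattern for strings like
# "Submitted: Dec 05 / 07:26. Views: 63, Clicks: 24"
DETAILS_PATTERN = %r{\A(Submitted|Published): (\w{3} \d{2}) / (\d{2}:\d{2})\. Views: (\d+), Clicks: (\d+)\z}

# Hypothetical helper: turns a details string into a hash,
# or returns nil when the string does not match the expected layout.
def parse_details(details)
  match = DETAILS_PATTERN.match(details.strip)
  return nil unless match

  {
    status: match[1],
    date:   match[2],
    time:   match[3],
    views:  match[4].to_i,
    clicks: match[5].to_i
  }
end

info = parse_details("Published: Oct 01 / 01:00. Views: 1752, Clicks: 1088")
puts "#{info[:status]} on #{info[:date]}: #{info[:clicks]} clicks out of #{info[:views]} views"
```

Inside the scraping loop you could call parse_details(details) instead of printing the raw string, and skip entries where it returns nil.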


About

Welcome to the weblog of Peter Maas. Here you'll find various posts related to stuff I like (like my kids and espresso) and stuff I do (like developing software).
