Harish Mallipeddi's Blog

Avid Pythonista with a secret love for Erlang.

Tue, Feb 6th

Python Screen-scraper in 59 seconds

In response to Ilya’s “Ruby Screen-scraper in 60 seconds” post, here’s my Python version. The script uses ElementSoup to parse the page and, as Ilya suggested, I used Firebug to grab the XPath. It collects the blog URLs listed on sgblog.com. It’s slightly more involved than a plain scrape because the links in the HTML source are not the blogs’ real addresses: each one redirects to the true URL. So there’s some extra code that requests each link and prints the redirect target without actually fetching the destination page.


#! /usr/bin/env python
# -*- coding: utf8 -*-

"""
Scrapes blog urls from sgblog.com

@author: Harish Mallipeddi
@organization: http://poundbang.in/
@copyright: Copyright 2005 Harish Mallipeddi
@license: GNU GPLv2 or Later
@contact: harish.mallipeddi@gmail.com
"""

import ElementSoup # download from: http://effbot.org/zone/element-soup.htm
import urllib, urlparse, urllib2

BLOGCOUNT = 50
SGBLOGURL = "http://www.sgblog.com/?number%5B1%5D=" + str(BLOGCOUNT) + "&thefield%5B1%5D=hits"

class MyRedirectHandler(urllib2.HTTPRedirectHandler):
    """
    Prints the new url and raises HTTPError to avoid fetching the page from the new url.
    """
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        print newurl
        raise urllib2.HTTPError(req.get_full_url(), code, msg, headers, fp)

def grabBlogUrls():
    """
    Fetches the HTML page, finds the anchor tags and prints the url each one redirects to.
    """
    html = ElementSoup.parse(urllib.urlopen(SGBLOGURL))
    # Build the opener once; MyRedirectHandler prints every redirect target it sees.
    opener = urllib2.build_opener(MyRedirectHandler)
    for anchor in html.findall(".//form/a"):
        href = urlparse.urljoin(SGBLOGURL, anchor.get("href"))
        # Skip links that point back at the listing page itself.
        if not href.startswith(SGBLOGURL):
            try:
                opener.open(urllib2.Request(href))
            except urllib2.HTTPError:
                # Expected: MyRedirectHandler raises this to stop the redirect being followed.
                pass

def main():
    grabBlogUrls()    

if __name__ == "__main__":
    main()
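
The script above targets Python 2 (`urllib2`, `urlparse`, the `print` statement). For anyone on Python 3, here is a rough sketch of the same two steps: pull the `form/a` links out of the markup, then capture where each one redirects without following it. The names `extract_links`, `RecordingRedirectHandler` and `resolve_redirect` are made up for this illustration, and the parsing step assumes well-formed markup so that the standard library's `xml.etree.ElementTree` can stand in for ElementSoup; real-world HTML would still need a lenient parser.

```python
import urllib.error
import urllib.request
import xml.etree.ElementTree as ET
from urllib.parse import urljoin


def extract_links(markup, base_url):
    """Return absolute URLs for every <a> that is a direct child of a <form>.

    Assumes `markup` is well-formed; ElementTree is not a tag-soup parser.
    """
    root = ET.fromstring(markup)
    return [
        urljoin(base_url, anchor.get("href"))
        for anchor in root.findall(".//form/a")
        if anchor.get("href")
    ]


class RecordingRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Remembers redirect targets instead of following them (hypothetical name)."""

    def __init__(self):
        self.targets = []

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        self.targets.append(newurl)
        # Returning None tells urllib not to follow the redirect; the opener
        # then surfaces the 3xx response as an HTTPError.
        return None


def resolve_redirect(url, timeout=10):
    """Return the URL that `url` redirects to, or None if no redirect is seen.

    Network failures (urllib.error.URLError) are left to propagate.
    """
    handler = RecordingRedirectHandler()
    opener = urllib.request.build_opener(handler)
    try:
        opener.open(url, timeout=timeout).close()
    except urllib.error.HTTPError:
        pass  # expected whenever a redirect was intercepted
    return handler.targets[0] if handler.targets else None
```

Calling `resolve_redirect(link)` for each result of `extract_links(page_source, SGBLOGURL)` would roughly mirror what `grabBlogUrls()` does, except that the targets are returned rather than printed.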