BeautifulSoup
Posted at 4:56pm on Tuesday, November 22nd, 2005
So then. The first of an ever-increasing number of technical posts, I fear… Having chucked in the world of Microsoft Office and mobile telephones for the more edifying (if rather brain-melting) world of vim, Python, Zope and Plone, I am a) going to have less time to browse cack on the web and b) going to have considerably more of my brain consumed by technical issues, so there is a fair risk that offmessage will slowly turn into yet another Python blog. Sorry about that.
Anyway, new job = new projects and the first was a real stinker: taking a web site with some 16,000 pages, all written in old-school non-semantic, non-CSS, non-XHTML, tables-for-layout HTML, and migrating the content to sexy new fully CSS/XHTML compliant pages. Yeah. Nice. What a start. Particularly as I’ve never actually written Python in anger before; a couple of CherryPy test sites and some scripts for stuff inside Plone, but nothing of any size or scale.
Luckily there is a thing called BeautifulSoup that is designed specifically to cope with poorly formed HTML and allow you to grab the content, regardless of the quality, as long as you can find some rule to traverse the mess. Very powerful, very robust and very clever (and with some very nice class names if that’s your cup of tea…)
Two gotchas I’ve found so far… Although a tag object exposes its attributes in the form of dictionary keys, it does not support the .has_key() method, so:
>>> from BeautifulSoup import BeautifulSoup
>>> foo = '<a href="thingy.html">linky linky</a>'
>>> bar = BeautifulSoup(foo)
>>> print bar.a['href']
thingy.html
>>> print bar.a.has_key('href')
Null
Took me a fair old while to work that one out, I can say… Patch for BeautifulSoup.py here if you so desire. It doesn’t handle the 'href' in bar.a scenario, but it does provide .has_key() and stops you resorting to KeyError exceptions or any other hacks.
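For the curious, here is a rough sketch of the idea behind a patch like that. It is not the patch itself and not the real BeautifulSoup code: SketchTag is a made-up stand-in class, and keeping the attributes as a list of (key, value) pairs is an assumption made purely for illustration. The point is only that an explicitly defined has_key() gives you a straight True or False. The __contains__ line goes a step further than the patch described above, so treat it as an optional extra.

```python
class SketchTag(object):
    """Hypothetical stand-in for a parsed tag; NOT the real BeautifulSoup Tag."""

    def __init__(self, name, attrs=None):
        self.name = name
        # Assumption for this sketch: attributes are (key, value) pairs.
        self.attrs = list(attrs or [])

    def __getitem__(self, key):
        # Dictionary-style access, as in bar.a['href'].
        for k, v in self.attrs:
            if k == key:
                return v
        raise KeyError(key)

    def has_key(self, key):
        # Defined explicitly, so the answer is always a plain boolean.
        return any(k == key for k, _ in self.attrs)

    # Extra beyond the patch described above: supports "'href' in tag".
    __contains__ = has_key


link = SketchTag('a', [('href', 'thingy.html')])
has_href = link.has_key('href')    # True
has_title = link.has_key('title')  # False
```

With something along those lines in place, a check such as link.has_key('href') can sit in an ordinary if statement without a try/except around it.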
With the second one it is easier to see what’s going on, although I don’t have a patch for it (the work-arounds are too easy). In certain instances, when it comes across an &, it will automatically treat the text that follows as an HTML special character (so in my example R&D turned into R&D;). I couldn’t work out exactly which cases cause this to happen, but it appears to be somewhere around the assignment of a string to Tag.string.
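One work-around of the easy sort might be to escape any stray ampersands yourself before the string goes anywhere near a tag. The sketch below is standard library only; escape_bare_ampersands is a made-up helper name, and it assumes (untested against BeautifulSoup itself) that a properly written &amp; survives the trip where a bare & does not.

```python
import re

# An ampersand that does NOT already begin an entity such as
# &amp;, &#169; or &#xA9; (name or number, followed by a semicolon).
BARE_AMPERSAND = re.compile(
    r'&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)'
)


def escape_bare_ampersands(text):
    """Turn stray '&' into '&amp;' and leave existing entities alone."""
    return BARE_AMPERSAND.sub('&amp;', text)


escaped = escape_bare_ampersands('R&D')          # 'R&amp;D'
untouched = escape_bare_ampersands('R&amp;D')    # 'R&amp;D', already escaped
```

Because existing entities are left alone, running the helper twice over the same text should do no harm.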