[twill] A fix, a fix, to BeautifulSoup 3.0 parsing

Titus Brown titus at caltech.edu
Tue Oct 17 19:03:46 PDT 2006


On Tue, Oct 17, 2006 at 11:08:32PM +0000, John J Lee wrote:
-> On Tue, 17 Oct 2006, Titus Brown wrote:
-> [...]
-> > ---
-> > File "/disk/u/t/dev/twill/twill/other_packages/BeautifulSoup.py", line
-> > 1057, in endData
-> >    currentData = ''.join(self.currentData)
-> > UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
-> > ---
-> >
-> > and ended up changing the BeautifulSoup code to do a
-> >
-> >    currentData = ''.join(str(self.currentData))
-> >                          ^^^
-> >
-> > I don't understand unicode well enough to know whether or not this is
-> > going to cause huge problems, but it was the only way to get mechanize
-> > and BS 3.0 to play nice.
-> 
-> Why?

Without it, I got that error on many pages.  It was never clear to me
why, but my guess is a bad interaction between mechanize (or the HTML
itself) and BeautifulSoup's defaults.  Basically, BS kept trying to
treat the data as ASCII, even when the encoding *supplied* to BS was
latin-1 or something else.
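
As a rough illustration of the error in the traceback (a Python 3
sketch, not BeautifulSoup's code): byte 0xa0 is a non-breaking space
in latin-1 but is not valid ASCII, so decoding it as ASCII fails with
the same message.  Python 2 attempted that ASCII decode implicitly
when byte strings and unicode strings were joined together.  The
helper name decode_or_error is made up for this example.

```python
# Byte taken from the traceback: 0xa0 is a non-breaking space in latin-1.
raw = b"\xa0"


def decode_or_error(data, encoding):
    """Hypothetical helper: decode bytes, or return the error text instead."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return "error: %s" % exc


# ASCII only covers bytes 0x00-0x7f, so this decode fails.
ascii_result = decode_or_error(raw, "ascii")

# latin-1 maps every byte to a code point, so this decode succeeds.
latin1_result = decode_or_error(raw, "latin-1")

print(ascii_result)
print(repr(latin1_result))
```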

After looking through the code, I passed the encoding reported by
mechanize through to the BeautifulSoup setup, and it made no
difference unless I also did the 'str' call.  I make no claim to
understanding how mechanize deals with unicode, so I'm not sure how it
all works.  But it *does* seem to work ok.
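
One possible reason the 'str' call silences the error, sketched in
Python 3 with made-up data and assuming self.currentData is a list of
text chunks: str() of a list returns the list's printable
representation, and ''.join() over a string just reassembles that same
string.  In Python 2 that representation escapes non-ASCII characters,
so the join has nothing left to decode, but the result is the list's
repr and not the concatenated text.

```python
# Made-up stand-in for the parser's buffered text chunks (an assumption).
chunks = ["Hello", " ", "world"]

# Original code path: concatenate the chunks.
intended = "".join(chunks)

# Patched code path: str() turns the list into its repr first, and
# joining the characters of that string gives the repr back unchanged.
patched = "".join(str(chunks))

print(intended)
print(patched)
```

If that is what happens in the patched BeautifulSoup, the parsed text
would carry brackets and quotes from the list repr, so it may be worth
checking the output on a page with non-ASCII content.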

cheers,
--titus


