[twill] A fix, a fix, to BeautifulSoup 3.0 parsing
Titus Brown
titus at caltech.edu
Tue Oct 17 19:03:46 PDT 2006
On Tue, Oct 17, 2006 at 11:08:32PM +0000, John J Lee wrote:
-> On Tue, 17 Oct 2006, Titus Brown wrote:
-> [...]
-> > ---
-> > File "/disk/u/t/dev/twill/twill/other_packages/BeautifulSoup.py", line 1057, in endData
-> > currentData = ''.join(self.currentData)
-> > UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
-> > ---
-> >
-> > and ended up changing the BeautifulSoup code to do a
-> >
-> > currentData = ''.join(str(self.currentData))
-> >                       ^^^
-> >
-> > I don't understand unicode well enough to know whether or not this is
-> > going to cause huge problems, but it was the only way to get mechanize
-> > and BS 3.0 to play nice.
->
-> Why?
Without it, I got that error on many pages. It was never clear to me
why, but my guess is a bad interaction between what mechanize (or the
HTML itself) supplies and BeautifulSoup's defaults. Basically, BS kept
trying to treat things as ASCII, even when the encoding *supplied* to
BS was latin-1 or something else.
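One plausible way to hit this exact error, sketched for Python 3 (the
original code ran on Python 2, where joining a byte string with a unicode
string triggers an implicit ASCII decode of the bytes). The chunk values
and the helper names decode_error_message and join_chunks are made up for
illustration and are not part of BeautifulSoup or mechanize.

```python
# Hypothetical stand-ins for the pieces accumulated in currentData:
# one raw byte string (0xa0 is a non-breaking space in latin-1)
# and one chunk that is already decoded text.
raw_chunk = b"\xa0"
text_chunk = "price"


def decode_error_message(data, encoding):
    """Return the decode error message, or None if decoding succeeds."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        return str(exc)
    return None


# Decoding the byte as ASCII fails; decoding it as latin-1 succeeds.
ascii_failure = decode_error_message(raw_chunk, "ascii")
latin1_failure = decode_error_message(raw_chunk, "latin-1")


def join_chunks(chunks, encoding):
    """Decode byte chunks with the supplied encoding, then join as text."""
    return "".join(
        c.decode(encoding) if isinstance(c, bytes) else c for c in chunks
    )


joined = join_chunks([raw_chunk, text_chunk], "latin-1")
```

If the real cause is a mix of raw bytes and decoded text in the list, then
decoding each byte chunk with the supplied encoding before the join would
avoid the ASCII fallback. That is an assumption about the cause, not
something confirmed here.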
After looking through the code, I passed the encoding specified by
mechanize into the BeautifulSoup setup, and it made no difference --
unless I also did the 'str' call. As I make no claim to understand
how mechanize deals with unicode, I'm not sure how it all works. But
it *does* seem to work ok.
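A small sketch of what the 'str' call does to a list before the join,
written for Python 3 with a made-up chunk list standing in for
self.currentData. str() on a list returns the list's printable
representation, brackets and quotes included, so the join reassembles
that representation rather than concatenating the chunks.

```python
# Hypothetical chunk list standing in for self.currentData.
chunks = ["caf\xe9", " ", "au lait"]

# Plain join: concatenates the chunks into one string.
intended = "".join(chunks)

# Patched form: str(chunks) is the list's repr, and joining a string
# simply reassembles it character by character.
patched = "".join(str(chunks))

same_as_repr = patched == repr(chunks)
```

On Python 2 the representation escapes non-ASCII bytes, which would
explain why the error goes away. Whether the changed value matters for how
twill uses the result is not something this sketch shows.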
cheers,
--titus