[twill] A fix, a fix, to BeautifulSoup 3.0 parsing
John J Lee
jjl at pobox.com
Tue Oct 17 16:08:32 PDT 2006
On Tue, 17 Oct 2006, Titus Brown wrote:
> File "/disk/u/t/dev/twill/twill/other_packages/BeautifulSoup.py", line
> 1057, in endData
> currentData = ''.join(self.currentData)
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
> and ended up changing the BeautifulSoup code to do a
> currentData = ''.join(str(self.currentData))
> I don't understand unicode well enough to know whether or not this is
> going to cause huge problems, but it was the only way to get mechanize
> and BS 3.0 to play nice.
It may well be that the BS support in mechanize needs major surgery to
work with BS 3 (I certainly concluded I wouldn't do it myself). I'm sure
it gets encoding stuff wrong with BS 2.0 also, in fact. It's certainly
true that mechanize should be fixed to use unicode strings. And my simple
BS 2 hack, deriving from BS classes, might not work -- e.g. it might be
best to write a new class that does the work of _AbstractFormParser &
friends. But your hack is certainly not the only way to get mechanize to
work with BS 3.0.
Perhaps you know all this already, but: it looks like self.currentData
contains a mixture of bytestrings and unicode strings. Because the
BeautifulSoup support in mechanize has not been updated, the bytestrings
are not being explicitly decoded using the appropriate encoding before
''.join is called (and they should probably be decoded before they get
into .currentData in the first place -- but I don't recall how BS 3 works).
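
A minimal sketch of what "explicitly decode before joining" could look like. It is written in Python 3 terms, where bytes and str play the roles of bytestring and unicode string. The helper name join_decoded and the "latin-1" encoding are assumptions made for illustration only; they are not part of BeautifulSoup or mechanize, and the real encoding would have to come from the page itself (HTTP headers or a meta tag).

```python
def join_decoded(chunks, encoding):
    """Join a mix of bytes and str chunks, decoding the bytes explicitly.

    Hypothetical helper for illustration; not a BeautifulSoup or
    mechanize function.
    """
    decoded = []
    for chunk in chunks:
        if isinstance(chunk, bytes):
            # Decode with the page's encoding, never an implicit default.
            chunk = chunk.decode(encoding)
        decoded.append(chunk)
    return "".join(decoded)


# A mixed list like the one self.currentData appears to hold.
current_data = [b"\xa0", "price: ", b"10"]

# "latin-1" is an assumed encoding for this example.
text = join_decoded(current_data, "latin-1")
```

The point of the design is that the encoding is always passed in by the caller, so nothing is left for the interpreter to guess.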
Almost always ('almost' because string formatting is a weird case),
whenever bytestrings and unicode strings meet, the result is a unicode
string. When that happens, Python decodes the bytestring(s) using the
value of sys.getdefaultencoding() as the encoding, which is usually going
to be the wrong thing to do for web pages.
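
To illustrate why the default encoding is the wrong choice here, the sketch below decodes the byte from the traceback (0xa0) twice: once as ASCII, which is what the 'ascii' codec in the traceback indicates the implicit decode used, and once as Latin-1. The helper try_decode and the choice of Latin-1 are assumptions for the example; a real page could just as well be in some other encoding.

```python
raw = b"\xa0"  # the byte reported in the traceback


def try_decode(data, encoding):
    """Return the decoded text, or a short failure message.

    Hypothetical helper, used only to show both outcomes side by side.
    """
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return "failed: %s" % exc.reason


# What the implicit decode effectively attempted.
ascii_result = try_decode(raw, "ascii")

# An explicit decode with an assumed page encoding.
latin1_result = try_decode(raw, "latin-1")
```

Here the ASCII attempt fails with the same "ordinal not in range(128)" reason seen in the traceback, while the Latin-1 decode yields U+00A0, a non-breaking space.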
Note I've bundled BS 2 with mechanize 0.1.4b, BTW.