[twill] A fix, a fix, to BeautifulSoup 3.0 parsing

Tue Oct 17 16:08:32 PDT 2006

On Tue, 17 Oct 2006, Titus Brown wrote:
[...]
> ---
> File "/disk/u/t/dev/twill/twill/other_packages/BeautifulSoup.py", line
> 1057, in endData
>    currentData = ''.join(self.currentData)
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
> ---
>
> and ended up changing the BeautifulSoup code to do a
>
>    currentData = ''.join(str(self.currentData))
>                          ^^^
>
> I don't understand unicode well enough to know whether or not this is
> going to cause huge problems, but it was the only way to get mechanize
> and BS 3.0 to play nice.

Why?

It may well be that the BS support in mechanize needs major surgery to 
work with BS 3 (I certainly concluded I wouldn't do it myself).  I'm sure 
it gets encoding stuff wrong with BS 2.0 also, in fact.  It's certainly 
true that mechanize should be fixed to use unicode strings.  And my simple 
BS 2 hack, deriving from BS classes, might not work -- e.g. it might be 
best to write a new class that does the work of _AbstractFormParser & 
friends.  But your hack is certainly not the only way to get mechanize to 
work with BS 3.0.

Perhaps you know all this already, but: It looks like self.currentData 
contains a mixture of bytestrings and unicode strings, and you (by means 
of failing to update the BeautifulSoup support of mechanize) are failing 
to explicitly decode the bytestrings using the appropriate encoding before 
calling ''.join (and probably before the strings even get into 
.currentData in the first place -- but I don't recall how BS 3 works). 
Almost always ('almost' because string formatting is a weird case), 
whenever bytestrings and unicode strings meet, the result is a unicode 
string.  When that happens, Python decodes the bytestring(s) using the 
value of sys.getdefaultencoding() as the encoding, which is usually going 
to be the wrong thing to do for web pages.
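
To make that concrete, here is a minimal sketch of decoding each 
bytestring explicitly before joining.  It is not BeautifulSoup or 
mechanize code: join_decoded and its encoding argument are hypothetical 
names, and 'latin-1' is only an assumption for the example (the real 
value should come from the page's declared encoding).

```python
def join_decoded(chunks, encoding):
    """Join text chunks, decoding any bytestrings explicitly first.

    Hypothetical helper for illustration only; not part of
    BeautifulSoup or mechanize.
    """
    decoded = []
    for chunk in chunks:
        if isinstance(chunk, bytes):
            # Decode with the page's encoding instead of relying on
            # sys.getdefaultencoding().
            chunk = chunk.decode(encoding)
        decoded.append(chunk)
    return u''.join(decoded)


# 0xa0 is the byte from the traceback; the latin-1 encoding is assumed.
mixed = [b'\xa0', u'some text']
result = join_decoded(mixed, 'latin-1')
```

Here the result is always a unicode string, and the encoding used is one 
you chose rather than the interpreter default.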

Note I've bundled BS 2 with mechanize 0.1.4b, BTW.

John