[twill] html entities and latin-1 problem

Michele Simionato michele.simionato at gmail.com
Thu Mar 9 00:48:24 PST 2006

On 3/5/06, Titus Brown <titus at caltech.edu> wrote:
> Short answer -- unicode support in mechanize is still young ;(.
> I have one or two other unicode issues to look at today, too.

There is something wrong with the 0.8.3 release. I was testing a Plone site with
the previous versions of twill and everything was fine. However, now I get
a UnicodeEncodeError when I run 'formvalue' against that page. The page
starts as follows:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"

    <meta http-equiv="Content-Type"
          content="text/html;charset=utf-8" />


and contains the Plone login form.

I get this traceback:
  File "/usr/lib/python2.4/site-packages/twill-0.8.3-py2.4.egg/twill/commands.py", line 386, in formvalue
    form = browser.get_form(formname)
  File "/usr/lib/python2.4/site-packages/twill-0.8.3-py2.4.egg/twill/browser.py", line 254, in get_form
    forms = self._browser.forms()
  File "/usr/lib/python2.4/site-packages/twill-0.8.3-py2.4.egg/twill/other_packages/mechanize/_mechanize.py", line 244, in forms
    return self._factory.forms()
  File "/usr/lib/python2.4/site-packages/twill-0.8.3-py2.4.egg/twill/utils.py", line 307, in forms
    self._forms = parse_fn(response, self._encoding)
  File "/usr/lib/python2.4/site-packages/twill-0.8.3-py2.4.egg/twill/other_packages/mechanize/_html.py", line 218, in parse_response
  File "/usr/lib/python2.4/site-packages/twill-0.8.3-py2.4.egg/twill/other_packages/ClientForm.py", line 870, in ParseResponse
  File "/usr/lib/python2.4/site-packages/twill-0.8.3-py2.4.egg/twill/other_packages/ClientForm.py", line 906, in ParseFile
  File "/usr/lib/python2.4/sgmllib.py", line 95, in feed
  File "/usr/lib/python2.4/sgmllib.py", line 184, in goahead
  File "/usr/lib/python2.4/site-packages/twill-0.8.3-py2.4.egg/twill/other_packages/ClientForm.py", line 667, in handle_entityref
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2014' in position 0: ordinal not in range(256)

Notice that \u2014 is the em-dash character, and that twill is using
Latin-1 even though the content-type is utf-8.
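
For reference, the failure can be reproduced outside twill. The sketch
below is not ClientForm's actual code; the helper names encode_entity and
try_encode are made up for illustration, and it is written for Python 3
(html.entities) rather than the Python 2.4 in the traceback. It only shows
that an entity such as &mdash; resolves to U+2014, which Latin-1 cannot
represent but UTF-8 can.

```python
from html.entities import name2codepoint


def encode_entity(name, encoding):
    """Resolve a named HTML entity and encode it in the given charset.

    Hypothetical helper for illustration; not part of twill or ClientForm.
    """
    char = chr(name2codepoint[name])  # "mdash" resolves to U+2014
    return char.encode(encoding)


def try_encode(name, encoding):
    """Return the encoded bytes, or the UnicodeEncodeError that was raised."""
    try:
        return encode_entity(name, encoding)
    except UnicodeEncodeError as exc:
        return exc


# Latin-1 only covers code points below 256, so the em-dash cannot be encoded.
latin1_result = try_encode("mdash", "latin-1")

# With the charset the page actually declares, the same entity encodes fine.
utf8_result = try_encode("mdash", "utf-8")

print(repr(latin1_result))
print(repr(utf8_result))
```

If the parser were handed the page's declared charset (utf-8) instead of
Latin-1, the entity would presumably encode without error, as the second
call suggests.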
