[twill] html entities and latin-1 problem

Michele Simionato michele.simionato at gmail.com
Thu Mar 9 00:48:24 PST 2006


On 3/5/06, Titus Brown <titus at caltech.edu> wrote:
> Short answer -- unicode support in mechanize is still young ;(.
>
> I have one or two other unicode issues to look at today, too.

There is something wrong with the 0.8.3 release. I was testing a Plone site with
previous versions of twill and everything was fine. However, now I get
a Unicode error when I try to use 'formvalue' on that page. The page
starts as follows:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">



<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"
      lang="en">

  <head>
    <meta http-equiv="Content-Type"
          content="text/html;charset=utf-8" />

    <title>
        Portal
        &mdash;
        Portal
    </title>

and contains the Plone login form.

I get
  File "/usr/lib/python2.4/site-packages/twill-0.8.3-py2.4.egg/twill/commands.py", line 386, in formvalue
    form = browser.get_form(formname)
  File "/usr/lib/python2.4/site-packages/twill-0.8.3-py2.4.egg/twill/browser.py", line 254, in get_form
    forms = self._browser.forms()
  File "/usr/lib/python2.4/site-packages/twill-0.8.3-py2.4.egg/twill/other_packages/mechanize/_mechanize.py", line 244, in forms
    return self._factory.forms()
  File "/usr/lib/python2.4/site-packages/twill-0.8.3-py2.4.egg/twill/utils.py", line 307, in forms
    self._forms = parse_fn(response, self._encoding)
  File "/usr/lib/python2.4/site-packages/twill-0.8.3-py2.4.egg/twill/other_packages/mechanize/_html.py", line 218, in parse_response
    ignore_errors=self.ignore_errors
  File "/usr/lib/python2.4/site-packages/twill-0.8.3-py2.4.egg/twill/other_packages/ClientForm.py", line 870, in ParseResponse
    encoding,
  File "/usr/lib/python2.4/site-packages/twill-0.8.3-py2.4.egg/twill/other_packages/ClientForm.py", line 906, in ParseFile
    fp.feed(ch)
  File "/usr/lib/python2.4/sgmllib.py", line 95, in feed
    self.goahead(0)
  File "/usr/lib/python2.4/sgmllib.py", line 184, in goahead
    self.handle_entityref(name)
  File "/usr/lib/python2.4/site-packages/twill-0.8.3-py2.4.egg/twill/other_packages/ClientForm.py", line 667, in handle_entityref
    self.handle_data(table[fullname].encode(self._encoding))
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2014' in position 0: ordinal not in range(256)

Notice that \u2014 is the em-dash character, and that twill is using Latin-1
even though the content-type is utf-8.
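
For anyone who wants to reproduce the failing step outside twill, here is a
minimal sketch. It is not the twill or ClientForm code: the helper names
declared_charset and encode_entity are made up for illustration, and the
"latin-1" default is only an assumption chosen to mirror the traceback. It is
written for a current Python 3 (on Python 2.4 the entity table lives in
htmlentitydefs). It shows that the mdash entity encodes fine with the charset
the page declares, but not with Latin-1.

```python
from email.message import Message
from html.entities import name2codepoint  # "htmlentitydefs" in Python 2


def declared_charset(content_type, default="latin-1"):
    """Return the charset named in a Content-Type value, or the default.

    Hypothetical helper; the default is an assumption, not twill's behaviour.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(default)


def encode_entity(name, encoding):
    """Resolve a named HTML entity and encode it with the given codec.

    Hypothetical helper that mimics the encode call in the last traceback frame.
    """
    return chr(name2codepoint[name]).encode(encoding)


# The value from the page's <meta http-equiv="Content-Type"> tag.
charset = declared_charset("text/html;charset=utf-8")

# With the declared charset the em-dash (U+2014) encodes without trouble.
utf8_bytes = encode_entity("mdash", charset)

# With Latin-1 the same call raises, as in the traceback above.
try:
    encode_entity("mdash", "latin-1")
    latin1_error = None
except UnicodeEncodeError as exc:
    latin1_error = exc
```

If this sketch matches what happens inside the parser, the encoding handed to
the form parser is probably not being taken from the page's declared charset.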


