[twill] Error with showforms again

William K. Volkman wkvsf at users.sourceforge.net
Tue Dec 13 22:28:19 PST 2005


Hi Titus,
On Tue, 2005-12-13 at 15:28 -0800, Titus Brown wrote:
> If you just send me the relevant source (not even a functioning patch is
> required...) it would help me overcome my inertia.  So, whatever you can
> send would be great.
You really make it easy for a guy to be lazy :-D  Well as usual I'm
pressed for time so here is the (albeit) hacky version I use, if you
want I can clean it up.  If you can't wait well...here you have it.

import re
from mechanize import Browser, Link, LinkNotFoundError
import ClientForm
from BeautifulSoup import BeautifulSoup, Null, Tag

class MyBrowser(Browser):
    def _parse_html(self, response):
        self.form = None
        self._title = None
        if not self.viewing_html():
            # nothing to see here
            return
        # set ._forms, ._links
        try:
            self._forms = ClientForm.ParseResponse(response)
        except ClientForm.ParseError, err:
            print str(err)
            self._forms = []
            url = response.geturl()
            response.seek(0)
            p = BeautifulSoup(response)
            forms = p('form')
            for form in forms:
                tmp = ClientForm.StringIO()
                tmp.write(str(form))
                tmp.seek(0)
                try:
                    f = ClientForm.ParseFile(tmp, url)
                except ClientForm.ParseError:
                    continue
                self._forms.extend(f)
        response.seek(0)
        p = BeautifulSoup(response)
        b = p('base')
        if len(b) == 0:
            base = response.geturl()
        else:
            base = b[0].get('href')
            if not base:
                base = response.geturl()
        self._links = []
        for l in p('a', {'href':re.compile('.+')}):
            url = l.get('href')
            if not url:
                # Strange we filtered for it
                continue
            if l.string != Null:
                text = l.string
            else:
                text = ''
                for t in l.contents:
                    if isinstance(t, str):
                        text = text + t.strip()
                    elif isinstance(t, Tag):
                        if t.name == 'img':
                            if t.get('alt'):
                                text = text + t.get('alt')+'[IMG]'
                            else:
                                text = text + '[IMG]'
                text = text.strip()
            link = Link(base, url, text, 'a', l.attrs)
            self._links.append(link)
        response.seek(0)

Note that the above contains a implementation of handling the "BASE"
tag, which is the reason links are done here instead of letting
mechanize take care of them.  I made a small patch to BeautifulSoup to
recognize the "BASE" tag as an unterminated tag and sent it in but
haven't since gone back and checked on it.  If needed I can supply
that also.  Also the "BASE" tag handling above should apply to the
form actions, it just wasn't necessary in my case.  A more straight
forward implementation might be to use BeautifulSoup always instead of
trying ClientForm first however I was in a hurry at the time.

HTH,
William.




More information about the twill mailing list