[twill] Error with showforms again
Titus Brown
titus at caltech.edu
Tue Dec 13 22:49:03 PST 2005
thanks -- I'll probably go through this & check it in this weekend.
--titus
On Tue, Dec 13, 2005 at 11:28:19PM -0700, William K. Volkman wrote:
-> Hi Titus,
-> On Tue, 2005-12-13 at 15:28 -0800, Titus Brown wrote:
-> > If you just send me the relevant source (not even a functioning patch is
-> > required...) it would help me overcome my inertia. So, whatever you can
-> > send would be great.
-> You really make it easy for a guy to be lazy :-D Well as usual I'm
-> pressed for time so here is the (albeit) hacky version I use, if you
-> want I can clean it up. If you can't wait well...here you have it.
->
-> import re
-> from mechanize import Browser, Link, LinkNotFoundError
-> import ClientForm
-> from BeautifulSoup import BeautifulSoup, Null, Tag
->
-> class MyBrowser(Browser):
-> def _parse_html(self, response):
-> self.form = None
-> self._title = None
-> if not self.viewing_html():
-> # nothing to see here
-> return
-> # set ._forms, ._links
-> try:
-> self._forms = ClientForm.ParseResponse(response)
-> except ClientForm.ParseError, err:
-> print str(err)
-> self._forms = []
-> url = response.geturl()
-> response.seek(0)
-> p = BeautifulSoup(response)
-> forms = p('form')
-> for form in forms:
-> tmp = ClientForm.StringIO()
-> tmp.write(str(form))
-> tmp.seek(0)
-> try:
-> f = ClientForm.ParseFile(tmp, url)
-> except ClientForm.ParseError:
-> continue
-> self._forms.extend(f)
-> response.seek(0)
-> p = BeautifulSoup(response)
-> b = p('base')
-> if len(b) == 0:
-> base = response.geturl()
-> else:
-> base = b[0].get('href')
-> if not base:
-> base = response.geturl()
-> self._links = []
-> for l in p('a', {'href':re.compile('.+')}):
-> url = l.get('href')
-> if not url:
-> # Strange we filtered for it
-> continue
-> if l.string != Null:
-> text = l.string
-> else:
-> text = ''
-> for t in l.contents:
-> if isinstance(t, str):
-> text = text + t.strip()
-> elif isinstance(t, Tag):
-> if t.name == 'img':
-> if t.get('alt'):
-> text = text + t.get('alt')+'[IMG]'
-> else:
-> text = text + '[IMG]'
-> text = text.strip()
-> link = Link(base, url, text, 'a', l.attrs)
-> self._links.append(link)
-> response.seek(0)
->
-> Note that the above contains a implementation of handling the "BASE"
-> tag, which is the reason links are done here instead of letting
-> mechanize take care of them. I made a small patch to BeautifulSoup to
-> recognize the "BASE" tag as an unterminated tag and sent it in but
-> haven't since gone back and checked on it. If needed I can supply
-> that also. Also the "BASE" tag handling above should apply to the
-> form actions, it just wasn't necessary in my case. A more straight
-> forward implementation might be to use BeautifulSoup always instead of
-> trying ClientForm first however I was in a hurry at the time.
->
-> HTH,
-> William.
->
More information about the twill
mailing list