[twill] general query re form parsing.

Titus Brown titus at caltech.edu
Tue Jan 24 11:34:33 PST 2006


-> Hi Titus,

-> On Tue, 2006-01-24 at 12:18, Titus Brown wrote:
-> > Peri's problem with badly formatted pages has raised the question of how
-> > robust or tolerant twill should be to really cruddy HTML.
-> <snip>
-> > I could include BeautifulSoup with twill, too.
-> > 
-> > I could also modify ClientForm to be tolerant to ParseErrors of the sort
-> > that Peri encounters.
-> > 
-> > Right now I'm leaning towards including BS and modifying ClientForm.
-> > Thoughts?
-> 
-> This is the gist of the code I sent you; sorry I never got
-> back to putting it into twill (and I haven't checked whether
-> you got it in there).

Actually, John Lee (mechanize author) beat me to the punch.  It's now
available directly through mechanize!

-> Basically you have two objectives.  Those testing their
-> own site need to know they have broken HTML so that
-> they can fix it.  Those attempting to automate access
-> to other web sites can't fix the HTML, so using
-> BeautifulSoup to fix up the poor HTML and then feeding the
-> result to ClientForm lets them work with bad pages.
-> 
-> Perhaps a "strict" or "relaxed" parsing flag to choose
-> the behavior?

Good points.  Unless I can get a patch to toggle behavior like this into
ClientForm, though, I'll be left maintaining an unofficial branch of
ClientForm -- not complaining, just pointing it out ;).
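
To make the strict/relaxed idea concrete, here is a minimal sketch using
only the standard library of a current Python.  It is not ClientForm's API:
parse_forms, FormCollector, FormParseError and the strict flag are all
hypothetical names, and the particular problems it checks for are just
examples of what "strict" might reject.

```python
from html.parser import HTMLParser


class FormParseError(Exception):
    """Raised in strict mode when the form markup is malformed."""


class FormCollector(HTMLParser):
    """Collect the field names of each <form>, noting markup problems."""

    def __init__(self):
        super().__init__()
        self.forms = []        # one list of field names per form
        self.problems = []     # descriptions of bad markup seen so far
        self._current = None   # fields of the form being parsed, if any

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            if self._current is not None:
                # A new <form> opened before the previous one closed.
                self.problems.append("nested <form>")
                self.forms.append(self._current)
            self._current = []
        elif tag in ("input", "select", "textarea"):
            name = dict(attrs).get("name")
            if self._current is None:
                self.problems.append("<%s> outside any <form>" % tag)
            elif name:
                self._current.append(name)

    def handle_endtag(self, tag):
        if tag == "form":
            if self._current is None:
                self.problems.append("stray </form>")
            else:
                self.forms.append(self._current)
                self._current = None

    def close(self):
        super().close()
        if self._current is not None:
            # End of input reached with a form still open.
            self.problems.append("unclosed <form>")
            self.forms.append(self._current)
            self._current = None


def parse_forms(html, strict=True):
    """Return field names per form; raise on bad markup only if strict."""
    collector = FormCollector()
    collector.feed(html)
    collector.close()
    if strict and collector.problems:
        raise FormParseError("; ".join(collector.problems))
    return collector.forms
```

Someone testing their own site would call parse_forms(page) and get an
error pointing at the broken markup; someone scripting a third-party site
would call parse_forms(page, strict=False) and get whatever forms could be
recovered.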

So, lessee: the proposed options would be:

 * toggle tidy use (currently possible);
 * toggle BeautifulSoup use;
 * toggle "relaxed" ClientForm parsing.

Goodness, that would solve all sorts of problems, I think ;).
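
One way the three toggles might fit together, as a rough sketch only.
ParseOptions, preprocess and the two cleanup callables are hypothetical
names, not twill settings or commands; the defaults and the tidy-before-
BeautifulSoup ordering are assumptions.  The cleanup steps are passed in
as plain functions so the sketch does not depend on either tool being
installed.

```python
from dataclasses import dataclass


@dataclass
class ParseOptions:
    """Hypothetical bundle of the three proposed toggles."""
    use_tidy: bool = False
    use_beautifulsoup: bool = False
    # Not used below: would be handed to whatever form parser is in use.
    relaxed_forms: bool = False


def preprocess(html, options, tidy_cleanup, soup_cleanup):
    """Run only the enabled cleanup steps over the page, in order.

    tidy_cleanup and soup_cleanup are caller-supplied functions that take
    an HTML string and return a cleaned-up HTML string.
    """
    if options.use_tidy:
        html = tidy_cleanup(html)
    if options.use_beautifulsoup:
        html = soup_cleanup(html)
    return html
```

With every toggle off the page passes through untouched, which is the
behavior someone checking their own site for broken HTML would want.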

--titus


