[twill] check_links doesn't follow

Titus Brown titus at caltech.edu
Mon Jan 29 11:04:28 PST 2007

On Mon, Jan 29, 2007 at 07:46:42PM +0100, Lars Stavholm wrote:
-> Titus Brown wrote:
-> > Hi, Lars,
-> > 
-> > neither tidy nor BeautifulSoup like the conditionals in the HTML on this
-> > page; the culprit on jonitec.se appears to be this:
-> > 
-> > ===
-> > 
-> > <!-- CorrectPNG! Module : compliance patch for microsoft browsers -->
-> > <!--[if gte IE 5.5000]>
-> > <!--[if lte IE 7]><script language="JavaScript" src="http://www.jonitec.se/mambots/system/botcorrectpng/correctpng.js"></script><![endif]-->
-> > <![endif]-->
-> > 
-> > ===
-> > 
-> > That is, if I remove that from the page, showlinks works fine. The
-> > <![endif]--> is specifically what's causing the problem; if you put
-> > <!--[endif]--> link parsing works.
-> > 
-> > Do you have any thoughts on how to deal with this?  It's obviously
-> > incorrect HTML but it shouldn't be breaking things this badly ;).
-> Thanks Titus! Problem solved in the sense that I fixed the parsing
-> problem by simply making sure that correct HTML is produced following
-> your advice and findings. I'll remember to try and make sure that I
-> have correct HTML before running twill again.
-> From an overall perspective, I'd say that we're done. One should really
-> require correct HTML to be produced.
-> However, is there any way one could tell twill/tidy/BeautifulSoup to
-> produce an error message when incorrect HTML is discovered? That would
-> be handy in a testing tool like twill. As it stands now, I got a bit
-> lost.

So, I do have the 'tidy_ok' command in there, but tidy is pretty strict.
"Correct HTML" is kind of an interesting concept: there's *pedantically*
correct, and then there's "can be parsed, kinda".

That is, I welcome ideas on how to detect and report "bad" HTML!
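
As a starting point, here is a rough sketch of one narrow check: flag
comment terminators that end up outside any comment, which is what the
stray <![endif]--> above amounts to. This is not something twill does
today; the function name and the abbreviated sample page are made up for
illustration, it is written for Python 3, and the regex approach will
give false positives on pages whose scripts contain a literal "-->".

```python
import re

# A well-formed comment: from "<!--" to the *first* following "-->".
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
# Terminators that should never appear once comments are removed.
STRAY_RE = re.compile(r"<!\[endif\]-->|-->")


def find_stray_comment_closers(html):
    """Return (line_number, text) pairs for terminators outside comments."""
    def blank(match):
        # Overwrite the comment with spaces but keep newlines,
        # so reported line numbers still match the original page.
        return re.sub(r"[^\n]", " ", match.group(0))

    stripped = COMMENT_RE.sub(blank, html)
    problems = []
    for match in STRAY_RE.finditer(stripped):
        line_no = stripped.count("\n", 0, match.start()) + 1
        problems.append((line_no, match.group(0)))
    return problems


# Abbreviated version of the offending markup (hypothetical sample).
SAMPLE = """<!-- CorrectPNG! Module -->
<!--[if gte IE 5.5000]>
<!--[if lte IE 7]><script src="correctpng.js"></script><![endif]-->
<![endif]-->
<a href="/next">next</a>
"""

if __name__ == "__main__":
    for line_no, text in find_stray_comment_closers(SAMPLE):
        print("line %d: stray %r outside any comment" % (line_no, text))
```

On the sample this reports the last <![endif]--> on line 4; with that
line changed to <!--[endif]--> it reports nothing.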

I think for the moment I'm going to add two things:

 - the ability to output the HTML *after* tidy is done with it, which in
   this case would have highlighted the problem immediately.

 - the ability to output a 'plain text' rendering of HTML, so that
   humans can see the structure more easily.
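
To make the second item a bit more concrete, here is a minimal sketch of
what a 'plain text' structural rendering could look like, using only the
Python 3 standard library's html.parser. The class and function names
are invented for this example and are not part of twill; it simply
prints one indented line per tag (with the href, if any) and per chunk
of text, and it assumes the end tags are reasonably balanced.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class OutlineRenderer(HTMLParser):
    """Collect an indented, one-item-per-line outline of a page."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        label = tag if href is None else "%s -> %s" % (tag, href)
        self.lines.append("  " * self.depth + label)
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            # Never go below zero, even if the page has extra end tags.
            self.depth = max(0, self.depth - 1)

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace runs
        if text:
            self.lines.append("  " * self.depth + '"%s"' % text)


def render_outline(html):
    """Return the outline of `html` as a single newline-joined string."""
    parser = OutlineRenderer()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)


if __name__ == "__main__":
    print(render_outline('<ul><li><a href="/a">A</a></li></ul>'))
```

A page where the links have been swallowed by a broken comment would
show up here as an outline with no "a -> ..." lines at all, which is
much easier to spot than in the raw HTML.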

Any thoughts?

