[twill] check_links doesn't follow
Titus Brown
titus at caltech.edu
Mon Jan 29 11:04:28 PST 2007
On Mon, Jan 29, 2007 at 07:46:42PM +0100, Lars Stavholm wrote:
-> Titus Brown wrote:
-> > Hi, Lars,
-> >
-> > neither tidy nor BeautifulSoup likes the conditionals in the HTML on this
-> > page; the culprit on jonitec.se appears to be this:
-> >
-> > ===
-> >
-> > <!-- CorrectPNG! Module : compliance patch for microsoft browsers -->
-> > <!--[if gte IE 5.5000]>
-> > <!--[if lte IE 7]><script language="JavaScript" src="http://www.jonitec.se/mambots/system/botcorrectpng/correctpng.js"></script><![endif]-->
-> > <![endif]-->
-> >
-> > ===
-> >
-> > That is, if I remove that from the page, showlinks works fine. The
-> > <![endif]--> is specifically what's causing the problem; if you put
-> > <!--[endif]--> in its place, link parsing works.
-> >
-> > Do you have any thoughts on how to deal with this? It's obviously
-> > incorrect HTML but it shouldn't be breaking things this badly ;).
->
-> Thanks Titus! Problem solved in the sense that I fixed the parsing
-> problem by simply making sure that correct HTML is produced following
-> your advice and findings. I'll remember to try and make sure that I
-> have correct HTML before running twill again.
->
-> From an overall perspective, I'd say that we're done. One should really
-> require correct HTML to be produced.
->
-> However, is there any way one could tell twill/tidy/BeautifulSoup to
-> produce an error message when incorrect HTML is discovered? That would
-> be handy in a testing tool like twill. As it stands now, I got a bit
-> lost.
So, I do have the 'tidy_ok' command in there, but tidy is pretty strict.
"Correct HTML" is kind of an interesting concept: there's *pedantically*
correct, and then there's "can be parsed, kinda".
That is, I welcome ideas on how to detect and report "bad" HTML!
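
One possible starting point, as a rough sketch only and not something twill does today: strip every well-formed <!-- ... --> comment from the page, then look for comment delimiters left behind. On the jonitec.se snippet quoted above, the outer conditional is closed by the inner <![endif]-->, so the final <![endif]--> is left dangling outside any comment. The function name and both regular expressions below are hypothetical, and the check is deliberately naive (it would also fire on a literal "-->" inside script text).

```python
import re

# Non-greedy match of a complete HTML comment, across line breaks.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
# Delimiters that should never survive once complete comments are removed.
STRAY_RE = re.compile(r"<!\[endif\]|-->")


def find_stray_comment_markers(html):
    """Return comment delimiters left over after removing complete comments.

    Hypothetical helper: an empty list means no dangling markers were seen.
    """
    remainder = COMMENT_RE.sub("", html)
    return STRAY_RE.findall(remainder)


BAD = """<!-- CorrectPNG! Module : compliance patch for microsoft browsers -->
<!--[if gte IE 5.5000]>
<!--[if lte IE 7]><script language="JavaScript" src="http://www.jonitec.se/mambots/system/botcorrectpng/correctpng.js"></script><![endif]-->
<![endif]-->
"""

# Same snippet with the last line changed as described in the thread.
GOOD = BAD.replace("\n<![endif]-->\n", "\n<!--[endif]-->\n")

if __name__ == "__main__":
    print(find_stray_comment_markers(BAD))
    print(find_stray_comment_markers(GOOD))
```

A non-empty result could be reported as a warning before link extraction runs, which is roughly the error message Lars was asking for.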
I think for the moment I'm going to add two things:
- the ability to output the HTML *after* tidy is done with it, which in
  this case would have highlighted the problem immediately.
- the ability to output a 'plain text' rendering of HTML, so that
  humans can see the structure more easily.
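
To make the second idea concrete, here is a minimal sketch of what a 'plain text' rendering could look like: an indented outline of tags, text, and comments, built on the standard library's html.parser. It is an illustration only and not twill's implementation; OutlineParser, render_outline, and the VOID set are hypothetical names, and the nesting depth is approximate when end tags are missing.

```python
from html.parser import HTMLParser

# Elements that never take an end tag, so they must not increase the depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "param", "source", "track", "wbr"}


class OutlineParser(HTMLParser):
    """Collect an indented, one-item-per-line outline of an HTML page."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def _emit(self, text):
        self.lines.append("  " * self.depth + text)

    def handle_starttag(self, tag, attrs):
        self._emit("<" + tag + ">")
        if tag not in VOID:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID:
            # Never go below zero, even if the page has extra end tags.
            self.depth = max(self.depth - 1, 0)

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text:
            self._emit(text)

    def handle_comment(self, data):
        # Show comments too, so content swallowed by a comment is visible.
        self._emit("<!-- " + " ".join(data.split()) + " -->")


def render_outline(html):
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)


if __name__ == "__main__":
    print(render_outline(
        "<html><body><p>Hi <a href='x'>link</a></p><br></body></html>"))
```

Printing comments as their own lines is a deliberate choice here: a script or link that ends up inside a comment line in the outline is a quick visual hint that the parser saw the page differently than the author intended.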
Any thoughts?
cheers,
--titus