[twill] check_links doesn't follow

Lars Stavholm stava at telcotec.se
Mon Jan 29 11:16:44 PST 2007


Titus Brown wrote:
> On Mon, Jan 29, 2007 at 07:46:42PM +0100, Lars Stavholm wrote:
> -> Titus Brown wrote:
> -> > Hi, Lars,
> -> > 
> -> > neither tidy nor BeautifulSoup likes the conditionals in the HTML on this
> -> > page; the culprit on jonitec.se appears to be this:
> -> > 
> -> > ===
> -> > 
> -> > <!-- CorrectPNG! Module : compliance patch for microsoft browsers -->
> -> > <!--[if gte IE 5.5000]>
> -> > <!--[if lte IE 7]><script language="JavaScript" src="http://www.jonitec.se/mambots/system/botcorrectpng/correctpng.js"></script><![endif]-->
> -> > <![endif]-->
> -> > 
> -> > ===
> -> > 
> -> > That is, if I remove that from the page, showlinks works fine. The
> -> > <![endif]--> is specifically what's causing the problem; if you put
> -> > <!--[endif]--> link parsing works.
> -> > 
> -> > Do you have any thoughts on how to deal with this?  It's obviously
> -> > incorrect HTML but it shouldn't be breaking things this badly ;).
> -> 
> -> Thanks Titus! Problem solved in the sense that I fixed the parsing
> -> problem by simply making sure that correct HTML is produced following
> -> your advice and findings. I'll remember to try and make sure that I
> -> have correct HTML before running twill again.
> -> 
> -> From an overall perspective, I'd say that we're done. One should really
> -> require correct HTML to be produced.
> -> 
> -> However, is there any way one could tell twill/tidy/BeautifulSoup to
> -> produce an error message when incorrect HTML is discovered? That would
> -> be handy in a testing tool like twill. As it stands now, I got a bit
> -> lost.
> 
> So, I do have the 'tidy_ok' command in there, but tidy is pretty strict.
> "Correct HTML" is kind of an interesting concept: there's *pedantically*
> correct, and then there's "can be parsed, kinda".

Speaking for myself, I would settle for strict since I'm
developing these sites myself, and I want the result to
be correct and validated HTML.

On the other hand, "can be parsed, kinda" could be useful
as well, but it's like opening a can of worms: what are the
boundaries for any approximations? Sounds a bit too difficult
to me (but I'm not that savvy on these things).
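
For the specific failure discussed in this thread, a narrow check may be
enough. The following is only a sketch in plain Python, not anything twill,
tidy or BeautifulSoup provides: the function name find_stray_endifs and the
sample page are made up for illustration. It removes every well-formed
<!-- ... --> comment and reports any <![endif]...> left over, which is what
remains when conditional comments are nested as on jonitec.se. Note that it
would also flag Microsoft's "downlevel-revealed" <![endif]> form, so treat a
hit as a warning rather than proof of broken HTML.

```python
import re

# A well-formed comment runs from "<!--" to the first "-->" that follows.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
# Anything like "<![endif]-->" or "<![endif]>" that survives comment removal.
STRAY_ENDIF_RE = re.compile(r"<!\[endif\]-*>", re.IGNORECASE)


def find_stray_endifs(html):
    """Return the '<![endif]...>' fragments that sit outside any comment."""
    without_comments = COMMENT_RE.sub("", html)
    return STRAY_ENDIF_RE.findall(without_comments)


# Hypothetical sample, shortened from the snippet quoted in this thread.
page = '''<!--[if gte IE 5.5000]>
<!--[if lte IE 7]><script src="correctpng.js"></script><![endif]-->
<![endif]-->
<a href="/about">About</a>'''

if __name__ == "__main__":
    for fragment in find_stray_endifs(page):
        print("suspicious markup outside a comment:", fragment)
```

The inner "-->" closes the outer comment early, so the last <![endif]--> is
no longer inside a comment and gets reported. Changing that line to
<!--[endif]--> makes the check come back empty.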

Forgive my ignorance. I put back the HTML error and tried the following:

go http://www.jonitec.se
tidy_ok
extend_with check_links
check_links www\.jonitec\.se

...and subsequently got:

>> EXECUTING FILE jonitec.se.twill
==> at http://www.jonitec.se
Imported extension module 'check_links'.
(at /usr/lib/python2.4/site-packages/twill/extensions/check_links.pyc)

in check_links
no links to check!?
--
1 of 1 files SUCCEEDED.

...i.e. same situation as before.

Am I using tidy_ok incorrectly?

> That is, I welcome ideas on how to detect and report "bad" HTML!
> 
> I think for the moment I'm going to add two things:
> 
>  - the ability to output the HTML *after* tidy is done with it, which in
>    this case would have highlighted the problem immediately.

Sounds like a plan.

>  - the ability to output a 'plain text' rendering of HTML, so that
>    humans can see the structure more easily.

Even better.
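
To make the 'plain text' idea concrete, here is a rough sketch of what such a
rendering could look like, using only Python's standard html.parser module.
It is not twill code and not a proposal for the actual implementation; the
names OutlineParser and outline are invented for this example. It prints one
indented line per tag, text node and comment, with link targets shown next
to the tag.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change the indent.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class OutlineParser(HTMLParser):
    """Collect an indented, one-line-per-node outline of an HTML page."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def _emit(self, text):
        self.lines.append("  " * self.depth + text)

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        self._emit("<%s>%s" % (tag, (" -> " + href) if href else ""))
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth = max(0, self.depth - 1)

    def handle_data(self, data):
        if data.strip():
            self._emit(repr(data.strip()))

    def handle_comment(self, data):
        # Collapse whitespace and truncate so long comments stay on one line.
        self._emit("[comment] " + " ".join(data.split())[:60])


def outline(html):
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)


if __name__ == "__main__":
    print(outline('<p>Hi <a href="/x">link</a></p>'))
```

With a view like this, a page whose links have been swallowed by a runaway
comment would show one long [comment] line where the <a> entries ought to be.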

> Any thoughts?

Above.
/Lars


