[twill] Using Beautiful Soup to Find Images

Terry Peppers peppers at gmail.com
Wed Jul 12 12:23:16 PDT 2006


I have a question for the group about the copy of Beautiful Soup that
is packaged with Twill.

I'm trying to get away from using a regex to pull all of the images
out of an HTML page. I figured I would use Beautiful Soup, since it's
included with Twill and it's made for parsing HTML, but I'm getting
some seriously weird results.

Basically, if I try something like this, findAll() comes back with Null:

>>> from twill.commands import *
>>> from twill import get_browser
>>> from BeautifulSoup import BeautifulSoup
>>> u = "http://somedomain.com"
>>> go(u)
>>> p = get_browser().get_html()
>>> soup = BeautifulSoup(p)
>>> soup.findAll('img')
Null

Wasn't sure if I was doing something wrong, so I installed the
Beautiful Soup egg and did the following:

>>> from BeautifulSoup import BeautifulSoup
>>> string = """
... <html><body><img src="foo.gif"/><img src="bar.jpg"/></body></html>
... """
>>> soup = BeautifulSoup(string)
>>> soup.findAll('img')
[<img src="foo.gif" />, <img src="bar.jpg" />]

So I'm not sure whether Twill comes with a scaled-back version of
Beautiful Soup or whether I'm just approaching the problem incorrectly.
(If I were a productive member of the open-source community, I would
offer Titus a patch that would just pull all the images in.)
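
For what it's worth, here is a rough sketch of the kind of "pull all
the images" helper I have in mind. It is only an illustration under
some assumptions: it uses the standard library's html.parser (Python 3)
instead of Beautiful Soup, and the names ImageCollector and
find_image_sources are made up for this example. They are not part of
Twill or Beautiful Soup.

```python
# Sketch only: ImageCollector and find_image_sources are hypothetical names,
# not part of twill or Beautiful Soup. Assumes Python 3's html.parser.
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collect the src attribute of every <img> tag the parser sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value) pairs.
        # Self-closing tags such as <img ... /> are also routed here by default.
        if tag == "img":
            src = dict(attrs).get("src")
            if src is not None:
                self.sources.append(src)


def find_image_sources(html):
    """Return the src values of all <img> tags in the given HTML string."""
    parser = ImageCollector()
    parser.feed(html)
    parser.close()
    return parser.sources
```

In principle the string returned by get_browser().get_html() could be
passed straight to find_image_sources(), though I have not tried that
against Twill itself.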

Anyone?


