[twill] Using Beautiful Soup to Find Images
Terry Peppers
peppers at gmail.com
Wed Jul 12 12:23:16 PDT 2006
I have a question for the group about the copy of Beautiful Soup that
is packaged with Twill.
I'm trying to get away from using a regex to pull out all of the
images in an HTML page. I figured I would use Beautiful Soup, since it's
included with Twill and it's made for parsing HTML, but I'm getting
some seriously weird results.
Basically, when I try something like the following, all I get back is Null:
>>> from twill.commands import *
>>> from twill import get_browser
>>> from BeautifulSoup import BeautifulSoup
>>> u = "http://somedomain.com"
>>> go(u)
>>> p = get_browser().get_html()
>>> soup = BeautifulSoup(p)
>>> soup.findAll('img')
Null
I wasn't sure whether I was doing something wrong, so I installed the
Beautiful Soup egg separately and ran the following:
>>> from BeautifulSoup import BeautifulSoup
>>> string = """
... <html><body><img src="foo.gif"/><img src="bar.jpg"/></body></html>
... """
>>> soup = BeautifulSoup(string)
>>> soup.findAll('img')
[<img src="foo.gif" />, <img src="bar.jpg" />]
So I'm not sure whether Twill comes with a scaled-back version of
BeautifulSoup or whether I'm just approaching the problem incorrectly. (If
I were a productive member of the open source community I would offer
Titus a patch that would just pull all the images in...)
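One thing I could do to narrow this down is check which copy of the
module each session is actually importing. The snippet below is only a
rough sketch: describe_module is a helper name I made up, and I'm
assuming the module exposes __file__ and __version__, which the copy
bundled with Twill may not.

```python
import importlib


def describe_module(name):
    """Return (file path, version) for an importable module.

    Either value is None when the module does not define the attribute.
    """
    mod = importlib.import_module(name)
    path = getattr(mod, "__file__", None)        # where the import resolved to
    version = getattr(mod, "__version__", None)  # assumed attribute, may be absent
    return path, version


# Hypothetical usage, run once in each session and compared:
#   print(describe_module("BeautifulSoup"))
```

If the two sessions report different paths or versions, that would at
least suggest the difference is in the module and not in my code.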
Anyone?
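P.S. As a regex-free fallback while I sort this out, something like the
following might do. It is a sketch that uses Python 3's standard
html.parser module, not Beautiful Soup or anything Twill provides, and
the names ImgCollector and image_sources are my own.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # tag arrives lowercased; attrs is a list of (name, value) pairs.
        # Self-closing tags such as <img ... /> are routed here as well.
        if tag == "img":
            src = dict(attrs).get("src")
            if src is not None:
                self.sources.append(src)


def image_sources(html):
    """Return the src values of all <img> tags in an HTML string."""
    parser = ImgCollector()
    parser.feed(html)
    parser.close()
    return parser.sources
```

Feeding it the page source as a string should give back a plain list of
src values, in document order.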