BeautifulSoup

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
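The snippets on this page all assume a `soup` object already exists. A minimal sketch of how one might be built is shown here; the sample HTML and the variable names are made up for illustration, and `html.parser` (Python's built-in parser) is just one possible choice.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used only for illustration.
html = """
<html><head><title>Sample page</title></head>
<body>
<div id="author">Jane Doe</div>
<p class="notice">First notice</p>
</body></html>
"""

# "html.parser" ships with Python; "lxml" or "html5lib" can be used
# instead if those packages are installed.
soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.string        # text inside <title>
first_p = soup.find("p").get_text()   # text of the first <p>
print(title_text, "|", first_p)
```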

Basic Selectors

#All elements named <div>

soup.select('div')

#The element with an id attribute of author

soup.select('#author')

#All elements that use a CSS class attribute named notice

soup.select('.notice')

#All elements named <span> that are within an element named <div>

soup.select('div span')

#All elements named <span> that are directly within an element named <div>, with no other element in between

soup.select('div > span')

#All elements named <input> that have a name attribute with any value

soup.select('input[name]')

#All elements named <input> that have an attribute named type with value button

soup.select('input[type="button"]')
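To see what each selector above returns, here is a small sketch run against a made-up HTML fragment (the markup and variable names are assumptions for illustration). Note that `select()` always returns a list, while `select_one()` returns the first match or `None`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup chosen so that every selector above matches something.
html = """
<div id="author">Jane</div>
<div class="notice"><p><span>nested</span></p><span>direct</span></div>
<form>
  <input name="q" type="text">
  <input type="button" value="Go">
</form>
"""
soup = BeautifulSoup(html, "html.parser")

divs = soup.select("div")                                   # both <div> elements
author = soup.select_one("#author")                         # single element or None
notices = soup.select(".notice")                            # elements with class "notice"
all_spans = [s.get_text() for s in soup.select("div span")]       # any depth inside <div>
direct_spans = [s.get_text() for s in soup.select("div > span")]  # direct children only
named_inputs = soup.select("input[name]")                   # inputs that have a name attribute
buttons = soup.select('input[type="button"]')               # inputs with type="button"

print(all_spans, direct_spans)
```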

Most Useful Selectors in BeautifulSoup

1. To extract the URL of the "next" link

soup.find("a",attrs={"class":"next"}).get('href')
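The one-liner above returns the `href` of the first `<a>` with class `next` and raises `AttributeError` if no such link exists. A hedged sketch of a safer variant, plus a way to collect every link on the page, might look like this (the sample HTML and variable names are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical pagination markup, used only for illustration.
html = """
<a href="/page/1">1</a>
<a href="/page/2" class="next">Next</a>
<a name="anchor-only">no href</a>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns None when nothing matches, so guard before calling .get().
next_link = soup.find("a", attrs={"class": "next"})
next_url = next_link.get("href") if next_link is not None else None

# href=True keeps only <a> tags that actually carry an href attribute.
all_urls = [a["href"] for a in soup.find_all("a", href=True)]

print(next_url, all_urls)
```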

2. To get the canonical URL

soup.find_all('link', {'rel': 'canonical'})[0]['href']
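Indexing with `[0]` raises `IndexError` when the page has no canonical link. One possible defensive version, sketched against a made-up `<head>` (the URL and variable names are placeholders, not from the original), is:

```python
from bs4 import BeautifulSoup

# Hypothetical <head> with a stylesheet link and a canonical link.
html = """
<html><head>
<link rel="stylesheet" href="/style.css">
<link rel="canonical" href="https://example.com/article">
</head><body></body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag, or None if there is none.
canonical = soup.find("link", attrs={"rel": "canonical"})
canonical_url = canonical["href"] if canonical is not None else None

print(canonical_url)
```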

3. To remove selected elements from the DOM

<div class="sample">Required this text alone.<span> I don't want this text</span></div>

>>soup.text => Required this text alone. I don't want this text

>>[i.extract() for i in soup('span')]

<div class="sample">Required this text alone.</div>

>>soup.text => Required this text alone.
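The same steps can be written as a self-contained script. This is a minimal sketch that reuses the sample `<div>` above; the variable names are chosen for illustration. `extract()` removes a tag from the tree and returns it, so the removed elements stay available if needed.

```python
from bs4 import BeautifulSoup

html = ('<div class="sample">Required this text alone.'
        "<span> I don't want this text</span></div>")
soup = BeautifulSoup(html, "html.parser")

before = soup.get_text()   # text of the whole tree, <span> included

# extract() detaches each <span> from the tree and hands it back.
removed = [span.extract() for span in soup.find_all("span")]

after = soup.get_text()    # text with the <span> content gone
remaining_markup = str(soup)

print(before)
print(after)
```

If the removed tags are not needed afterwards, `decompose()` can be called on each tag instead; it removes the tag and destroys it rather than returning it.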
