Web Scraping




There are several tools available for scraping (extracting) data from websites. The same tools can also be used for functional or unit testing of web projects.
1. iMacros
2. Selenium
3. SimpleTest
4. PHP cURL


Web scraping sites that require JavaScript support

You could certainly write a XUL application with Mozilla (run it with Firefox, XULRunner, etc.) that scripts a web browser. JavaScript is normally used for such tasks.

The tricky part, in my experience, is suppressing all the dialogue boxes the browser would otherwise create. You effectively have to override the behaviour of the XPCOM server classes that are invoked for each type of dialogue, and there are a lot of different ones (for example, the warning shown when a site redirects to an https site with an expired certificate).

Of course you should NOT use such a mechanism to violate any site's policy on use by robots. Normally you should never submit a form with a robot.
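
One way to respect a site's robot policy is to check its robots.txt before fetching anything. The following is only a minimal sketch using Python's standard urllib.robotparser module. The robots.txt content, the "my-scraper" user agent, the example.com URLs and the helper names build_parser and is_allowed are all made up for illustration; a real scraper would download the file from the site it intends to visit.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs without network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /search
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser (illustrative helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser, user_agent, url):
    """Return True if the parsed rules allow user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
for url in ("https://example.com/articles/1",
            "https://example.com/private/data"):
    print(url, is_allowed(parser, "my-scraper", url))
```

A scraper could call is_allowed before each request and skip any URL for which it returns False. Note that robots.txt covers only part of a site's policy; the terms of use may add further restrictions.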


XUL: an XML-based markup language, scripted with JavaScript, for building the user interface of cross-platform applications such as Firefox.
XULRunner: a runtime environment developed by the Mozilla Foundation to provide a common back-end for XUL-based applications. It replaced the Gecko Runtime Environment, a stalled project with a similar purpose.


Screen Scraping from a web page with a lot of Javascript

Screen scraping through AJAX and javascript

How do I implement a screen scraper in PHP?

What's a good tool to screen-scrape with Javascript support?

Are there command line or library tools for rendering webpages that use JavaScript?

Command-line URL fetch with JavaScript capability



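In Python:
$ pip install --user selenium
$ pip install --user nltk # optional and unused below: nltk no longer supports HTML cleanup
$ pip install --user beautifulsoup4
$ python

from selenium import webdriver
from bs4 import BeautifulSoup
import time

# Load the page in a real browser so that its JavaScript runs before we read the HTML
driver = webdriver.Chrome()
driver.get('https://t.co/lw242oZvUz')
# time.sleep(5)  # uncomment if the page needs extra time to finish rendering
html_source = driver.page_source
driver.quit()  # close the browser once the HTML has been captured

soup = BeautifulSoup(html_source, 'html.parser')

# page title (named "title" rather than "str" to avoid shadowing the built-in)
title = soup.title.string
print(title)

# raw text of the page, still including the contents of script and style elements
print(soup.get_text())

# kill all script and style elements
for element in soup(["script", "style"]):
    element.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines (separated by two spaces) into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

# in Python 3, print the string directly; encoding to UTF-8 first would print a bytes literal
print(text)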
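
The cleanup steps above (drop script and style content, strip each line, split multi-headlines, drop blanks) can be tried without a browser or BeautifulSoup. The sketch below swaps in the standard library's html.parser for BeautifulSoup, so it is not the same parser and may handle malformed HTML differently. The class VisibleTextParser, the function visible_text and the SAMPLE page are hypothetical names and data introduced only for this illustration.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self):
        return "".join(self._parts)


def visible_text(html):
    """Return the page text, one non-empty chunk per line."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    # strip leading and trailing space on each line
    lines = (line.strip() for line in parser.text().splitlines())
    # split multi-headlines (separated by two spaces) into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    return "\n".join(chunk for chunk in chunks if chunk)


# Made-up sample page standing in for driver.page_source
SAMPLE = """<html><head><title>Demo</title>
<style>body { color: red; }</style>
<script>var x = 1;</script></head>
<body><h1>Headline one  Headline two</h1>
<p>  Some text.  </p></body></html>"""

print(visible_text(SAMPLE))
```

In the Selenium session above, visible_text(html_source) could be called on the captured page source in the same way.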









