There are multiple options for scraping or extracting data from websites. The same utilities can also be used to test web projects.
What I've found tricky is suppressing all the kinds of dialogue boxes the browser would otherwise create: you effectively have to override the behaviour of the XPCOM classes invoked for each type of dialogue, and there are a lot of them (for example, when your target site redirects to an HTTPS URL with an expired certificate).
Of course you should NOT use such a mechanism to violate any site's policy on use by robots. Normally you should never submit a form with a robot.
XULRunner was a runtime environment developed by the Mozilla Foundation to provide a common back-end for XUL-based applications. It replaced the Gecko Runtime Environment, a stalled project with a similar purpose, and has itself since been discontinued.
How do I implement a screen scraper in PHP?
$ pip install --user selenium
$ pip install --user nltk # no longer supports html cleanup
$ pip install --user beautifulsoup4
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com")  # load the page you want to scrape
htmlSource = driver.page_source
driver.quit()
from bs4 import BeautifulSoup
soup = BeautifulSoup(htmlSource, 'html.parser')
title = soup.title.string  # just the page title
text = soup.get_text()     # all visible text
# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()  # rip it out
# get text
text = soup.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines (phrases separated by runs of spaces) into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
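Putting the cleanup steps above together, here is a self-contained sketch; the HTML is an inline sample rather than a fetched page, so it runs without a browser:

```python
from bs4 import BeautifulSoup

def visible_text(html: str) -> str:
    """Strip script/style elements and collapse whitespace, as above."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.extract()
    text = soup.get_text()
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    return "\n".join(chunk for chunk in chunks if chunk)

sample = """<html><head><title>Demo</title>
<style>body { color: red; }</style></head>
<body><h1>Hello</h1>
<script>var x = 1;</script>
<p>World</p></body></html>"""

print(visible_text(sample))  # Demo / Hello / World, one per line
```

The same function can be applied directly to the `htmlSource` obtained from Selenium above.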