RSSFeedCrawler-Python

RSSFeedCrawler-Python is a crawler for multiple RSS feed sites, written in Python. Both text and images can be scraped via HTML parsing.

Description:

CSS selector expressions are used to specify the DOM locations of the text and image paths.
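As an illustration of the idea, here is a minimal sketch of selector-based extraction using BeautifulSoup's select() method (beautifulsoup4 is listed under Dependencies). The sample HTML, the selectors, and the extract() helper are all assumptions made for this example; they are not taken from RSSFeedCrawler.py.

```python
from bs4 import BeautifulSoup

# Hypothetical article page; real pages differ from site to site.
SAMPLE_HTML = """
<html><body>
  <div class="article">
    <p class="body">First paragraph.</p>
    <p class="body">Second paragraph.</p>
    <img class="lead" src="/images/photo.jpg">
  </div>
</body></html>
"""


def extract(html, text_selector, image_selector):
    """Return (text, image_paths) found by the two CSS selectors."""
    soup = BeautifulSoup(html, "html.parser")
    # Join the text of every node matched by the text selector.
    text = "\n".join(
        node.get_text(strip=True) for node in soup.select(text_selector)
    )
    # Collect the src attribute of every matched image element.
    images = [
        img["src"] for img in soup.select(image_selector) if img.has_attr("src")
    ]
    return text, images


text, images = extract(SAMPLE_HTML, "div.article p.body", "div.article img.lead")
print(text)
print(images)
```

Keeping the selectors outside the code means a new site can be supported by changing configuration only, which is presumably why they are read from a file.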

A crawl_conf.xml file should be provided to specify the feed channels and the CSS selector syntax for the text and image content in a DOM tree.
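The schema of the configuration file is not described on this page. The sketch below shows one way such a file could be laid out and read with the standard library; every element name (crawl_conf, channel, url, text_selector, image_selector) and the load_channels() helper are hypothetical and may not match the real file.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout; the actual crawl_conf.xml schema may differ.
SAMPLE_CONF = """
<crawl_conf>
  <channel>
    <url>http://example.com/rss.xml</url>
    <text_selector>div.article p</text_selector>
    <image_selector>div.article img</image_selector>
  </channel>
</crawl_conf>
"""


def load_channels(xml_text):
    """Parse the assumed layout into a list of per-channel dicts."""
    root = ET.fromstring(xml_text.strip())
    channels = []
    for channel in root.findall("channel"):
        channels.append({
            "url": channel.findtext("url", default="").strip(),
            "text_selector": channel.findtext("text_selector", default="").strip(),
            "image_selector": channel.findtext("image_selector", default="").strip(),
        })
    return channels


channels = load_channels(SAMPLE_CONF)
for channel in channels:
    print(channel["url"], channel["text_selector"], channel["image_selector"])
```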

SHA256 is used instead of MD5 to digest URLs.
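A minimal sketch of digesting a URL with SHA-256 using Python's hashlib follows. The url_digest() name, the UTF-8 encoding, and the example URLs are assumptions; this page does not say how the crawler uses the digest, though a fixed-length key for spotting already-crawled URLs is one plausible use.

```python
import hashlib


def url_digest(url):
    """Return the SHA-256 digest of a URL as a hexadecimal string."""
    # hashlib works on bytes, so the URL is encoded first (UTF-8 assumed).
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


first = url_digest("http://example.com/feed/item-1")
second = url_digest("http://example.com/feed/item-2")

print(len(first))       # a SHA-256 hex digest is always 64 characters long
print(first != second)  # different URLs give different digests here
```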

Usage:

python RSSFeedCrawler.py <DBUser> <DBPass> <DBCleanUp> <crawl_conf_path> <data_dir_path>

Parameters:

DBUser: User name for the MySQL database

DBPass: Password for the MySQL database

DBCleanUp: Whether to create a new database for crawling

crawl_conf_path: Path to the crawl configuration file (crawl_conf.xml), which contains the URLs of the RSS sites to fetch and the corresponding CSS selectors for their text and image content

data_dir_path: Directory in which the crawled data is saved

Example:

python RSSFeedCrawler.py root 1234 false RSSFeedSites.xml data

Dependencies:

MySQL-python-1.2.4 or later

beautifulsoup4-4.1.3 or later

python-dateutil-1.5 or later

Download:

RSSFeedCrawler-Python.zip

-----------------------------------

Author: Mingjie Qian

Version: 1.0

Date: Dec. 21st, 2012
