RSSFeedCrawler-Python
RSSFeedCrawler-Python is a crawler for multiple RSS feed sites, written in Python. Both text and images can be scraped via HTML parsing.
Description:
CSS selector expressions are used to specify the DOM locations of the text and the image paths.
A crawl_conf.xml file should be provided to specify the feed channels and the CSS selectors for the text and image content in the DOM tree.
SHA256 is used instead of MD5 to digest URLs.
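The URL digest step can be sketched with Python's standard hashlib module. This is a minimal illustration, not the crawler's actual code: the function name url_digest and the sample URL are hypothetical, and the UTF-8 encoding of the URL is an assumption.

```python
import hashlib


def url_digest(url):
    """Return the SHA-256 hex digest of a URL (always 64 hex characters).

    Hypothetical helper: the name and the UTF-8 encoding are assumptions.
    """
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # The same URL always yields the same fixed-length digest.
    print(url_digest("http://example.com/feed/item-1"))
```

A fixed-length digest like this is convenient as a lookup key for a URL that has already been seen, although the source does not say how the crawler uses it.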
Usage:
python RSSFeedCrawler.py <DBUser> <DBPass> <DBCleanUp> <crawl_conf_path> <data_dir_path>
Parameters:
DBUser: User name for the MySQL database
DBPass: Password for the MySQL database
DBCleanUp: Whether to create a new database for crawling
crawl_conf_path: File path of crawl_conf.xml, which contains the URLs of the RSS sites to be fetched and their corresponding CSS selectors for text and image content
data_dir_path: The directory in which the crawled data is saved
e.g.,
python RSSFeedCrawler.py root 1234 false RSSFeedSites.xml data
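This README does not show the layout of the configuration file passed as crawl_conf_path. As a purely illustrative sketch, the snippet below assumes a hypothetical layout with one feed element per RSS channel and reads it with the standard xml.etree.ElementTree module. Every element name, attribute name, and selector here is an assumption; check the real crawl_conf.xml for the actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration: the element and attribute names below are
# assumptions for illustration, not the schema used by RSSFeedCrawler.py.
SAMPLE_CONF = """\
<feeds>
  <feed url="http://example.com/rss.xml">
    <text_selector>div.article-body p</text_selector>
    <image_selector>div.article-body img</image_selector>
  </feed>
</feeds>
"""


def load_feeds(xml_text):
    """Parse the hypothetical config into a list of dicts, one per feed."""
    root = ET.fromstring(xml_text)
    feeds = []
    for feed in root.findall("feed"):
        feeds.append({
            "url": feed.get("url"),
            "text_selector": feed.findtext("text_selector", default="").strip(),
            "image_selector": feed.findtext("image_selector", default="").strip(),
        })
    return feeds


if __name__ == "__main__":
    for entry in load_feeds(SAMPLE_CONF):
        print(entry["url"], entry["text_selector"], entry["image_selector"])
```

With beautifulsoup4 installed, selector strings such as these could be passed to BeautifulSoup's select() method to locate the text and image elements in a fetched page.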
Dependencies:
MySQL-python-1.2.4 or later
beautifulsoup4-4.1.3 or later
python-dateutil-1.5 or later
Download:
-----------------------------------
Author: Mingjie Qian
Version: 1.0
Date: Dec. 21st, 2012