RSSFeedCrawler

RSSFeedCrawler is a crawler for multiple RSS feed sites, written in Java. Both text and images can be scraped via HTML parsing.

Description:

CSS selector expressions are used to specify the DOM locations of the text and the image paths.

An XML file must be provided that lists the feed channels and the CSS selectors for the text and image content in each site's DOM tree.

SHA256 is used instead of MD5 to digest URLs.
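
As an illustration of the URL digest step, the following is a minimal sketch using only the JDK's java.security.MessageDigest. The class name UrlDigest and the method sha256Hex are hypothetical, and the hex encoding of the result is an assumption; the crawler's actual code may encode or use the digest differently.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not taken from the RSSFeedCrawler sources.
class UrlDigest {

    // Returns the SHA-256 digest of the URL as a 64-character lowercase hex string.
    static String sha256Hex(String url) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] hash = md.digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                // Mask to an unsigned value so every byte prints as two hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every compliant Java platform is required to provide SHA-256.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sha256Hex("https://github.com/MingjieQian/RSSFeedCrawler"));
    }
}
```

A SHA-256 digest has a fixed length regardless of the URL length, and, unlike MD5, no practical collision attacks against it are known.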

SourceForge:

https://sourceforge.net/projects/rssfeedcrawler

GitHub:

https://github.com/MingjieQian/RSSFeedCrawler

Usage:

java -jar path_of_RSSFeedCrawler.jar -sys_conf path_of_sys_conf.txt

e.g.,

java -jar /home/czhai/mqian2/Toolbox/Java/RSSFeedCrawler/RSSFeedCrawler.jar -sys_conf /home/czhai/mqian2/Toolbox/Java/RSSFeedCrawler/sys_conf.txt

To clean up the old database and create a new one before crawling, append the -db_clean_up option:

java -jar /home/czhai/mqian2/Toolbox/Java/RSSFeedCrawler/RSSFeedCrawler.jar -sys_conf /home/czhai/mqian2/Toolbox/Java/RSSFeedCrawler/sys_conf.txt -db_clean_up
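
The two options shown above could be read from the argument array roughly as follows. This is only a sketch of one plausible parsing scheme, not the crawler's actual code; the class CrawlerArgs and its fields are hypothetical. It assumes that -sys_conf and its path are separate, space-delimited tokens and that -db_clean_up is a flag without a value.

```java
// Hypothetical argument holder, not taken from the RSSFeedCrawler sources.
class CrawlerArgs {
    final String sysConfPath;
    final boolean dbCleanUp;

    CrawlerArgs(String sysConfPath, boolean dbCleanUp) {
        this.sysConfPath = sysConfPath;
        this.dbCleanUp = dbCleanUp;
    }

    static CrawlerArgs parse(String[] args) {
        String path = null;
        boolean cleanUp = false;
        for (int i = 0; i < args.length; i++) {
            if (args[i].equals("-sys_conf")) {
                // The path must follow the flag as its own token.
                if (i + 1 >= args.length) {
                    throw new IllegalArgumentException("-sys_conf requires a path");
                }
                path = args[++i];
            } else if (args[i].equals("-db_clean_up")) {
                cleanUp = true;
            } else {
                throw new IllegalArgumentException("Unknown argument: " + args[i]);
            }
        }
        if (path == null) {
            throw new IllegalArgumentException("Missing -sys_conf <path>");
        }
        return new CrawlerArgs(path, cleanUp);
    }
}
```

Under this scheme, a flag and path fused into one token (for example -sys_conf/home/user/sys_conf.txt) is rejected as an unknown argument.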

Configurations:

All parameters for the crawler are read from a file named sys_conf.txt, which specifies:

1. The saving path for the crawled data

2. The path of an XML file containing the URLs of the RSS sites and the CSS selectors for their text and image content

3. The username for the MySQL database

4. The password for the MySQL database

e.g.,

data_dir = ***/RSSFeedCrawler/data

crawl_conf = ***/RSSFeedCrawler/conf/crawl-sites.xml

db_user = root

db_pass = 1234
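
A file in this "key = value" form can be read with the JDK's java.util.Properties. The sketch below shows one way to do so and is not the crawler's own parser; the class SysConfSketch and its methods are hypothetical, while the four keys are the ones listed above.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical reader for sys_conf.txt-style content, for illustration only.
class SysConfSketch {

    // Parses "key = value" lines; blank lines are ignored by Properties.load.
    static Properties parse(String text) {
        Properties props = new Properties();
        try (Reader in = new StringReader(text)) {
            props.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Returns the trimmed value for a key, failing fast when it is missing.
    static String require(Properties props, String key) {
        String value = props.getProperty(key);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException("Missing setting: " + key);
        }
        return value.trim();
    }

    public static void main(String[] args) {
        String text = "data_dir = /tmp/RSSFeedCrawler/data\n\n"
                + "crawl_conf = /tmp/RSSFeedCrawler/conf/crawl-sites.xml\n\n"
                + "db_user = root\n\n"
                + "db_pass = 1234\n";
        Properties conf = parse(text);
        System.out.println(require(conf, "data_dir"));
        System.out.println(require(conf, "crawl_conf"));
    }
}
```

To read an actual file, pass a java.io.FileReader to Properties.load instead of the StringReader. Note that Properties treats the backslash as an escape character, so Windows-style paths would need forward slashes or doubled backslashes under this approach.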

Dependencies:

jsoup-*.*.*.jar

mysql-connector-java-*.*.**-bin.jar

Download:

RSSFeedCrawler.zip

-----------------------------------

Author: Mingjie Qian

Version: 1.0

Date: Dec. 18th, 2012