CNN and FOX News

We crawled news articles from the CNN and FOX News websites from Jan. 1st, 2014 to Apr. 4th, 2014. The category information contained in the RSS feeds for each news article can be viewed as reliable ground truth. Titles, abstracts, and body text are extracted as the text view data, and the image associated with each article is stored as the image view data. Since the vocabulary has a very long-tailed word distribution, we filtered out words that occur five times or fewer. All text content is stemmed with portStemmer, and we use l2-normalized TF-IDF vectors as text features. For image features, we use seven groups of color features (RGB dominant color, HSV dominant color, RGB color moment, HSV color moment, RGB color histogram, HSV color histogram, and color coherence vector) and five textural features (the four Tamura textural features of coarseness, contrast, directionality, and line-likeness, plus the Gabor transform).
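The text preprocessing described above can be sketched roughly as follows. This is only an illustration, not the code used to build the dataset: the function name `tfidf_features`, the `min_count` parameter, and the specific TF-IDF variant (raw term count times log(N / document frequency)) are assumptions, and the input is assumed to be already tokenized and stemmed.

```python
import math
from collections import Counter


def tfidf_features(docs, min_count=6):
    """Build l2-normalized TF-IDF vectors from tokenized documents.

    docs: list of token lists (assumed already stemmed).
    min_count: keep terms seen at least this many times in the corpus;
               6 corresponds to dropping words that occur 5 times or fewer.
    Returns (vocab, vectors), where vectors[i] aligns with docs[i].
    """
    # Total corpus frequency of each term.
    total = Counter(tok for doc in docs for tok in doc)
    vocab = sorted(t for t, c in total.items() if c >= min_count)
    index = {t: i for i, t in enumerate(vocab)}

    # Document frequency: number of documents containing each kept term.
    df = Counter(t for doc in docs for t in set(doc) if t in index)
    n_docs = len(docs)

    vectors = []
    for doc in docs:
        tf = Counter(t for t in doc if t in index)
        vec = [0.0] * len(vocab)
        for t, f in tf.items():
            # Assumed weighting: raw count times log inverse document frequency.
            vec[index[t]] = f * math.log(n_docs / df[t])
        # l2 normalization; all-zero vectors are left unchanged.
        norm = math.sqrt(sum(v * v for v in vec))
        if norm > 0:
            vec = [v / norm for v in vec]
        vectors.append(vec)
    return vocab, vectors
```

Calling `tfidf_features(docs)` on a list of stemmed token lists returns the retained vocabulary and one unit-length vector per document. A term that appears in every document receives zero weight under this assumed IDF definition.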

Download:

CNN and FOX

If you use the datasets in your research and find them useful, please cite our work:

@inproceedings{qian2014unsupervised,
  title={Unsupervised Feature Selection for Multi-View Clustering on Text-Image Web News Data},
  author={Qian, Mingjie and Zhai, Chengxiang},
  booktitle={Proceedings of the 23rd ACM International Conference on Information and Knowledge Management},
  pages={1963--1966},
  year={2014},
  organization={ACM}
}