CNNTop and NPR News

We crawled CNN top stories and National Public Radio (NPR) news. For each article, we extracted the title, abstract, and body text, and stored the associated image. The text is stemmed, and each text subdocument is represented by an l2-normalized TF-IDF vector. For image features, we use RGB dominant color, HSV dominant color, RGB color moment, HSV color moment, RGB color histogram, HSV color histogram, four Tamura textural features (coarseness, contrast, directionality, linelikeness), and Gabor transform. For CNN top stories, 142 text-image pairs were collected from Feb. 21 to Apr. 17, 2011, with 8682 terms and 10 classes. For NPR news, 603 articles were collected from Apr. 7 to May 7, 2013, with 17692 terms and 8 classes.
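As an illustration of the text representation described above, here is a minimal sketch of l2-normalized TF-IDF in plain Python. The exact weighting used to build the datasets is not stated here, so the raw term frequency times log(N / df) formula, the function name `tfidf_l2`, and the toy stemmed tokens are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_l2(docs):
    """Return one {term: weight} dict per document, scaled to unit l2 norm.

    docs is a list of token lists that are assumed to be stemmed already.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed weighting: raw term count times log(N / df).
        vec = {t: c * math.log(n_docs / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        if norm > 0:
            vec = {t: w / norm for t, w in vec.items()}
        vectors.append(vec)
    return vectors


# Hypothetical stemmed documents, for illustration only.
docs = [
    ["senat", "vote", "budget", "vote"],
    ["team", "win", "final"],
    ["budget", "cut", "team"],
]
vectors = tfidf_l2(docs)
```

After normalization every document vector has unit length, so a dot product between two vectors is their cosine similarity regardless of article length.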

Download:

CNNTop and NPR

If you use these datasets in your research and find them useful, please cite our work:

@incollection{qian2014text,
  title={Text-Image Topic Discovery for Web News Data},
  author={Qian, Mingjie},
  booktitle={Advances in Information Retrieval},
  pages={675--680},
  year={2014},
  publisher={Springer}
}