News Headline Collection

This is a headline dataset collected over three years (January 2014 to December 2016) by our AcT crawler at the Computer Science & Engineering Department, Indian Institute of Technology, Roorkee.

The dataset contains about 12.5 million news headlines collected from the domain web pages (such as Sports, Entertainment, and Business) of about 35 news sources.

The dataset is distributed in .sql format. It has two tables, Megatable and Source_Info, which are described below.


Megatable (contains information about each headline)

  • id: unique identifier for the headline
  • newsheadline: text of the news headline
  • start_time_stamp: timestamp of the first occurrence of the news headline
  • end_time_stamp: timestamp of the last occurrence of the news headline
  • URL: URL of the news article for the headline
  • source_id: foreign key referencing the Source_Info table
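
The two timestamps can be used to estimate how long a headline stayed visible on its source page. The following is a minimal sketch, assuming the timestamps are stored as "YYYY-MM-DD HH:MM:SS" strings; the storage format is not stated here, so inspect the dump before relying on it. The function name and the sample values are hypothetical and are not taken from the dataset.

```python
from datetime import datetime

# Assumed timestamp layout; adjust after inspecting the actual dump.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def headline_lifetime_hours(start_time_stamp: str, end_time_stamp: str) -> float:
    """Return the hours between the first and last occurrence of a headline."""
    start = datetime.strptime(start_time_stamp, TIMESTAMP_FORMAT)
    end = datetime.strptime(end_time_stamp, TIMESTAMP_FORMAT)
    if end < start:
        raise ValueError("end_time_stamp precedes start_time_stamp")
    return (end - start).total_seconds() / 3600.0


# Made-up example values, not a real row from Megatable.
print(headline_lifetime_hours("2014-01-05 08:00:00", "2014-01-05 14:30:00"))
```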


The Megatable is split by year into three dumps. The download links for the dumps are as follows.

  1. Year: 2014; Headline count: 3.7M; Size: 830MB; link
  2. Year: 2015; Headline count: 4.0M; Size: 930MB; link
  3. Year: 2016; Headline count: 4.8M; Size: 1.10GB; link


Source_Info (contains information about domain web pages)

  • id: id of the domain web page
  • URL: URL of the domain web page
  • Category: category (Sports, Entertainment, Business, etc.) of the domain web page

Source_Info: download link
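
Because source_id links each headline to a row in Source_Info, the two tables can be joined to group headlines by category. The sketch below illustrates this with an in-memory SQLite database and a few made-up rows. The column types, the example.com URLs, and the helper names are assumptions for illustration only; the actual dumps may target a different database engine and use different types.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory stand-in for the two tables (assumed types)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Source_Info (
            id INTEGER PRIMARY KEY,
            URL TEXT,
            Category TEXT
        );
        CREATE TABLE Megatable (
            id INTEGER PRIMARY KEY,
            newsheadline TEXT,
            start_time_stamp TEXT,
            end_time_stamp TEXT,
            URL TEXT,
            source_id INTEGER REFERENCES Source_Info(id)
        );
        """
    )
    # Made-up rows, not taken from the dataset.
    conn.executemany(
        "INSERT INTO Source_Info VALUES (?, ?, ?)",
        [
            (1, "http://example.com/sports", "Sports"),
            (2, "http://example.com/business", "Business"),
        ],
    )
    conn.executemany(
        "INSERT INTO Megatable VALUES (?, ?, ?, ?, ?, ?)",
        [
            (1, "Team wins final", "2014-01-05 08:00:00", "2014-01-05 14:30:00",
             "http://example.com/sports/a", 1),
            (2, "Coach resigns", "2014-01-06 09:00:00", "2014-01-06 10:00:00",
             "http://example.com/sports/b", 1),
            (3, "Markets close higher", "2014-01-06 16:00:00", "2014-01-06 18:00:00",
             "http://example.com/business/a", 2),
        ],
    )
    return conn


def headlines_per_category(conn: sqlite3.Connection) -> dict:
    """Count headlines per Source_Info category by joining on source_id."""
    rows = conn.execute(
        """
        SELECT s.Category, COUNT(*)
        FROM Megatable AS m
        JOIN Source_Info AS s ON m.source_id = s.id
        GROUP BY s.Category
        ORDER BY s.Category
        """
    ).fetchall()
    return dict(rows)


print(headlines_per_category(build_demo_db()))
```

The same SELECT statement should carry over to the loaded dumps with little change, provided the table and column names match those listed above.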

Using these resources

These resources are released under a CC-BY 2.5 IN license.

Please cite the following paper in any published work that uses these resources:

  • Sahisnu Mazumder, Bazir Bishnoi, and Dhaval Patel. 2014. News Headlines: What They Can Tell Us? In Proceedings of the 6th IBM Collaborative Academia Research Exchange Conference (I-CARE 2014). ACM, DOI: http://dx.doi.org/10.1145/2662117.2662121


Related Publications

  1. Jayendra Barua and Dhaval Patel. 2016. Discovery, Enrichment, and Disambiguation of Acronyms. In Proceedings of the 18th International Conference on Big Data Analytics and Knowledge Discovery (DaWaK 2016). Springer, DOI: 10.1007/978-3-319-43946-4_23
  2. Kanik Gupta, Vishal Mittal, Bazir Bishnoi, Siddharth Maheshwari, and Dhaval Patel. 2016. AcT: Accuracy-aware crawling techniques for cloud-crawler. World Wide Web 19, 1 (January 2016), DOI: http://dx.doi.org/10.1007/s11280-015-0328-2
  3. Avinash Kumar, Dhaval Patel, and Nikita Jain. 2016. Lightweight System for NE-tagged News Headlines corpus creation. In Proceedings of the Big Data and Natural Language Processing workshop hosted at IEEE Big Data 2016.


Contributors

  • Dr. Dhaval Patel
  • Jayendra Barua
  • Bazir Bishnoi
  • Kanik Gupta