This is the Headline Dataset, collected over three years (January 2014 to December 2016) by our AcT crawler at the Computer Science & Engineering Department, Indian Institute of Technology, Roorkee.
The dataset contains about 12.5 million news headlines collected from the domain webpages (such as Sports, Entertainment, and Business) of about 35 news sources.
The dataset is distributed in .sql format. It has two tables, Megatable and source_info, described below.
Megatable (contains information about each headline)
The Megatable is split by year into three dumps. The download links for the dumps are listed below.
Year: 2014; Headline count: 3.7M; Size: 830MB; link
Year: 2015; Headline count: 4.0M; Size: 930MB; link
Year: 2016; Headline count: 4.8M; Size: 1.10GB; link
Source_Info (contains information about domain web pages)
Source_Info: download link
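Because the dumps are plain .sql files, one way to confirm which tables a downloaded file defines, before importing it into a database, is to scan it for CREATE TABLE statements. The following is a minimal sketch under stated assumptions: it assumes the dumps contain standard CREATE TABLE statements, the helper name tables_in_dump is hypothetical, and the sample lines (including the column name) are made up for illustration rather than taken from the actual dumps.

```python
import re
from typing import Iterable, List

# Matches the table name in a CREATE TABLE statement, allowing an optional
# IF NOT EXISTS clause and optional backticks or double quotes around the name.
CREATE_TABLE = re.compile(
    r"CREATE\s+TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?[`\"]?(\w+)[`\"]?",
    re.IGNORECASE,
)


def tables_in_dump(lines: Iterable[str]) -> List[str]:
    """Return table names found in CREATE TABLE statements, in order of appearance."""
    names: List[str] = []
    for line in lines:
        match = CREATE_TABLE.search(line)
        if match and match.group(1) not in names:
            names.append(match.group(1))
    return names


# Tiny in-memory sample standing in for a real dump.
# The column name below is hypothetical, not the dataset's real schema.
sample = [
    "DROP TABLE IF EXISTS `Megatable`;",
    "CREATE TABLE `Megatable` (",
    "  `id` int NOT NULL",
    ");",
    "CREATE TABLE IF NOT EXISTS source_info (",
]
print(tables_in_dump(sample))
```

For a downloaded file, the same function can be given an open file handle, since it reads line by line and does not load the whole dump (up to about 1.10GB) into memory. The encoding of the dumps is not stated here, so opening the file with errors="replace" may be a reasonable precaution.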
These resources are subject to a CC-BY 2.5 IN license.
Please cite the following paper in any published work that uses these resources:
Dr. Dhaval Patel,
Jayendra Barua,
Bazir Bishnoi,
Kanik Gupta