Data
Most social media data is not sharable, or at least not sharable in its raw format. Below are links to some data sets available for use. For more information about specific data sets, see the link next to the data set to obtain author contact information.
Capitol Siege Information Operation
A fairly full recording of the Capitol Siege information operation, starting with the White House Media Summit in July of 2019. The entire set contains nearly 220 million tweets, the bulk of that being interactions with sixteen individuals who attended the July 2019 event. From Neal Rauhauser. [LINK]
Sets contain the numeric IDs per Twitter's policy on academic publication. The original data remains extant in our ArangoDB/Elasticsearch cluster.
Here are one-line descriptions of each of the eighteen datasets included in this project:
Bannon14 – Steve Bannon and close associates.
Chans9 – 8chan leadership.
Dirtydozen14 – Republican politicians who received messaging support from IRA.
Disloyal121 – Republican House members who signed the Texas AG suit against Pennsylvania.
Dsoldiers6 – Six attendees of the Digital Soldiers conference who are not in any other set.
Dumont25 – Priscilla Adams-Dumont, creator of the Q persona’s legend.
Flynn12 – Mike Flynn and close associates.
Flynnoath209 – accounts that post videos of themselves “taking the oath”.
Julyfourth8 – Eight U.S. Congress members who spent Independence Day 2018 in Moscow.
Mindyresearch34 – Mindy Waite, part of Qanon core, and close associates.
Power10new30 – Users of Roger Stone’s Power10 social botnet, which was destroyed in 2019.
Proudboys106 – Proud Boys accounts collected in late 2020.
Qanonops16 – Core Qanon operators per The Thinkin’ Project.
Qappanon22 – qmap.pub domain operator and close associates.
Qcongress68 – Congressional candidates who expressed support for Qanon conspiracy theories.
Stone – Roger Stone’s personal account.
Thestorm111 – Accounts that were pushing the Qanon trope of The Storm.
Whmediasummit16 – White House Media Summit Attendees.
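Because the sets above contain only numeric IDs, a typical first step is to read an ID list, discard malformed entries, and group the IDs into batches for whatever lookup tool is used to retrieve the full records. The following is a minimal sketch under stated assumptions, not the project's own tooling: it assumes one ID per line (the actual file layout is not described here), and the function names and the batch size of 100 are hypothetical choices made for illustration.

```python
from typing import Iterable, Iterator, List


def parse_ids(lines: Iterable[str]) -> List[str]:
    """Return unique numeric IDs in first-seen order.

    Assumes one ID per line (an assumption, not a documented format).
    Blank lines and non-numeric tokens are skipped.
    """
    seen = set()
    ids = []
    for line in lines:
        token = line.strip()
        # Keep IDs as strings: they are identifiers, not quantities.
        if token.isascii() and token.isdigit() and token not in seen:
            seen.add(token)
            ids.append(token)
    return ids


def batches(ids: List[str], size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` IDs (size is a hypothetical default)."""
    if size < 1:
        raise ValueError("size must be positive")
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


if __name__ == "__main__":
    # Made-up sample values, used only to show the behaviour.
    sample = ["1001", "1002", "", "not-an-id", "1001", "1003"]
    for chunk in batches(parse_ids(sample), size=2):
        print(",".join(chunk))
```

With a real file, `parse_ids(open(path, encoding="utf-8"))` would work the same way, since a file object iterates over its lines.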
Amazon, Yelp, TripAdvisor review datasets -
http://shebuti.com/collective-opinion-spam-detection/
http://cs.unm.edu/~aminnich/trueview/
Buzzfeed Election Data Set2 - This data set, gathered during the months leading up to the 2016 United States Presidential Election, is a collection of real and fake news stories with the highest Facebook engagement. Buzzfeed News gathered this data using keyword searches on the content analysis tool BuzzSumo (Horne & Adali, 2017). [LINK]
Buzzfeed Hyperpartisan Facebook Page Dataset - (Granik & Mesyura, 2017; Potthast, Kiesel, Reinartz, Bevendorff, & Stein, 2017). Not to be confused with the previous Buzzfeed dataset, this dataset contains a series of articles published on Facebook over the span of a week in late September 2016. Each article was fact-checked by 5 Buzzfeed journalists. The corpus includes 1,627 articles: 826 from mainstream news agencies, 256 from left-wing sources, and 545 from right-wing sources. [LINK]
Dataworld - There are 6 datasets on disinformation. [LINK]
Media Cloud 2016 election data - [LINK]
Obama Administration Social Media Archives - [LINK]
PLOS ONE: ISIS - De-identified Twitter data collected to identify networks of actors who were members of or supported ISIS. [LINK]
ProQuest Congressional Government Social Media - Search Facebook and Twitter posts by members of Congress and government agencies going back to 2013. [LINK]
Statista - Data and facts about fake news. [LINK]
Social Media for Public Health - Flu Vaccination Tweets, Vaccination Sentiment and Relevance Tweets, and Zika Conspiracy Tweets Data Sets [LINK]
TAMU Twitter honeypot dataset - [LINK]
Twitter synchronized malicious behavior data - [LINK]
Wikipedia hoax dataset - by Srijan Kumar, Robert West and Jure Leskovec. [LINK]
Wikipedia personal attack dataset - by Ellery Wulczyn. A collection of data sets on Wikipedia Talk page discussions. [LINK]
Wikipedia vandals - by Srijan Kumar, Francesca Spezzano and V.S. Subrahmanian. [LINK]