Code & Datasets

MDD&BD Risk (NAACL 2024)

Under the supervision of a psychiatrist, three trained annotators labeled 1,025 users and their 7,346 anonymized Reddit posts using the open-source text annotation tool Doccano. During annotation, we mainly considered two label categories: (i) Diagnosis Type (e.g., MDD, BD) and (ii) BD Mood Level, on a scale ranging from -3 to 3. Whenever the annotators' labels conflicted, all annotators discussed the case and reached an agreement under psychiatric supervision.
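As a rough illustration of the conflict step described above, the sketch below flags items on which annotators disagree so they can be sent to discussion. The data layout (a dict per annotator mapping item ids to labels), the function names, and the sample ids are hypothetical and are not taken from the released dataset or from Doccano's export format.

```python
# Hypothetical sketch: flag items whose labels differ across annotators.
# The input layout and all names here are assumptions for illustration.

MOOD_MIN, MOOD_MAX = -3, 3  # BD Mood Level scale stated in the description


def is_valid_mood(level):
    """Check that a mood level is an integer on the -3..3 scale."""
    return isinstance(level, int) and MOOD_MIN <= level <= MOOD_MAX


def find_conflicts(labels_by_annotator):
    """Return the sorted item ids on which annotators do not all agree.

    labels_by_annotator maps annotator name -> {item_id: label}.
    An item that one annotator left unlabeled also counts as a conflict.
    """
    item_ids = set()
    for labels in labels_by_annotator.values():
        item_ids.update(labels)

    conflicts = []
    for item_id in sorted(item_ids):
        seen = {labels.get(item_id) for labels in labels_by_annotator.values()}
        if len(seen) > 1:
            conflicts.append(item_id)
    return conflicts


example = {
    "annotator_1": {"u1": 2, "u2": -1},
    "annotator_2": {"u1": 2, "u2": 0},
    "annotator_3": {"u1": 2, "u2": -1},
}
to_discuss = find_conflicts(example)
```

Here only `u2` would be sent to discussion, since the three annotators agree on `u1`.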

The dataset consists of non-COVID-19 news claims published before 21 January 2020, the official date of the COVID-19 outbreak, collected from two popular fact-checking services, Snopes and Politifact. In addition, for each claim we collected the titles and descriptions of relevant YouTube videos uploaded within two weeks before or after the claim's publication date.

SceneDAPR (WWW 2024)

SceneDAPR is a novel scene-level sketch dataset that can be used to automatically analyze the Draw-A-Person-in-the-Rain (DAPR) test, a psychological drawing assessment used for identifying stressful experiences and coping behavior.

This dataset contains YouTube videos and the recommendation relations among them. Using the YouTube APIs, we crawled the most popular videos in 12 categories in the United States for one week starting Aug. 1, 2020.

The BD dataset, clinically validated by psychiatrists, includes 14 years of posts on bipolar-related subreddits by 818 BD patients, along with annotations of future suicidality and BD symptoms.

This dataset contains severity assessments of suicidality for 866 Reddit users who posted on the r/SuicideWatch subreddit from 2008 to 2015, along with their 79,569 posts uploaded to 37,083 subreddits.

Depression Vlogs (AAAI 2022)

This dataset contains balanced numbers of depression and non-depression vlogs, which helps a model learn depression-related features that are distinct from non-depression features.

These datasets contain 7,403 claims whose veracity was either true or false, as reported in Snopes from 2012 to 2017, and 864 claims reported in Politifact from 2007 to 2017.

This dataset contains information-, relationship-, and discussion-oriented motivations for comment tagging.

DCInside Scraper for K-POP Fandoms (Quality & Quantity 2022)

This script was used for collecting fandom collaboration data in our paper, "Behind the scenes of K-pop fandom: unveiling K-pop fandom collaboration network".

These datasets contain suicide-related and non-suicide-related Korean posts from Naver Cafe, as well as suicide-related dictionary data for generating suicide word embeddings in Chinese, English, and Korean.

VCTube (Interspeech 2020)

VCTUBE is an open-source Python library that can automatically generate paired speech data from a given YouTube video URL.

Mental Health Subreddits (Scientific Reports 2020)

This dataset contains users’ posts from Reddit, a popular social media platform that hosts numerous mental-health-related communities (so-called ‘subreddits’), such as r/depression, r/bipolar, and r/schizophrenia.

This dataset contains 33,935 Instagram influencers classified into nine categories: beauty, family, fashion, fitness, food, interior, pet, travel, and other.

Rumor / Fake News (Scientific Reports 2020)

This dataset contains 125 rumors whose veracity is true, false, or mixed, and 37,417 rumor cascades consisting of 289,202 tweets/retweets written by 176,362 users.

Reddit (ACM COSN 2015)

This dataset contains 695,857 Reddit posts, each with at least one comment, from the top 100 subreddits by number of subscribers, along with their 18,093,422 comments; the posts and comments were written by 1,455,293 users. Each post contains the author id, title, subreddit id, and timestamp, while each comment contains the original post id, user id, comment text, and the parent from which the comment was generated. The parent can be either a comment or a post.
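Because each comment records both its original post and its parent (a post or another comment), discussion threads can be rebuilt as trees. The following is a minimal sketch of that idea, not the dataset's loader: the record layout, the key names (`comment_id`, `post_id`, `parent_id`), and the assumption that post and comment ids never collide are all hypothetical.

```python
# Hypothetical sketch: rebuild reply trees from parent pointers.
# Key names and the sample records are assumptions, not the dataset schema.
from collections import defaultdict


def build_children_index(comments):
    """Map each parent id (post or comment) to the ids of its direct replies."""
    children = defaultdict(list)
    for comment in comments:
        children[comment["parent_id"]].append(comment["comment_id"])
    return children


def thread_depth(post_id, children):
    """Return the length of the longest reply chain under a post.

    Uses an explicit stack rather than recursion so that very deep
    threads do not hit Python's recursion limit.
    """
    max_depth = 0
    stack = [(post_id, 0)]
    while stack:
        node_id, depth = stack.pop()
        max_depth = max(max_depth, depth)
        for child_id in children.get(node_id, []):
            stack.append((child_id, depth + 1))
    return max_depth


sample_comments = [
    {"comment_id": "c1", "post_id": "p1", "parent_id": "p1"},  # reply to the post
    {"comment_id": "c2", "post_id": "p1", "parent_id": "c1"},  # reply to c1
    {"comment_id": "c3", "post_id": "p1", "parent_id": "p1"},  # reply to the post
]
index = build_children_index(sample_comments)
depth = thread_depth("p1", index)
```

In this toy thread the post has two direct replies and the longest chain is two comments deep. With roughly 18 million comments, a single pass that builds the index once and reuses it for every post is preferable to scanning the comment list per post.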

Pinterest (ACM SIGMETRICS 2014, ACM COSN 2015)

This dataset includes 2,974,128 users (i.e., 1,561,374 users found in pin-trees + 1,412,754 users discovered through BFS). The dataset contains 40,800,940 boards, 3,362,100,884 pins, 656,123,740 followers, 302,363,300 followings, 1,392,394 Facebook links, and 183,900 Twitter links. We also obtained the country and gender information of 1,354,132 and 1,392,394 users, respectively.

BitTorrent (ACM SIGMETRICS 2012)