Code & Datasets

Mental Disorder Detection (EMNLP 2024)

The study introduces KoMOS (Korean Mental Health Dataset with Mental Disorder and Symptoms labels). The dataset contains 6,349 Q&A pairs annotated with four disorders (depression, anxiety, sleep, and eating disorders), 28 symptoms, and contextual factors (duration, frequency, age, and effects) extracted from LLMs.

MDD&BD Risk (NAACL 2024)

With the supervision of a psychiatrist, the three trained annotators labeled 1,025 users and their 7,346 anonymized Reddit posts using the open-source text annotation tool Doccano. During annotations, we mainly consider two different label categories: (i) Diagnosis Type (e.g., MDD, BD) and (ii) BD Mood Level with a scale ranging from -3 to 3. If there is any conflict in the annotated labels across the annotators, all the annotators discuss and reach to an agreement under the supervision of the psychiatrists.

COVID-19 FakeNews (WWW 2024)

The dataset consists the non- COVID-19 news claims, which were published before 21 January 2020, the official date of the outbreak of COVID-19, from the two popular fact-checking services, Snopes and Politifact. In addition, we collected the title and description of relevant YouTube videos, uploaded before or after two weeks of the published date based on each claim.

SceneDAPR (WWW 2024)

SceneDAPR is a novel scene-level sketch dataset, which can be used to automatically analyze the drawing test, Draw-A-Person-in-the-Rain (DAPR), a psychological drawing assessment used for identifying stressful experiences and coping behavior.

Video Recommendation Network (ESWA 2024)

This dataset contains YouTube videos with their recommendation relations from YouTube. Using the APIs, we crawled the most popular videos of the 12 categories for a week from Aug. 1, 2020 in the United States.

Bipolar to Suicide (KDD 2023)

The BD dataset was clinically validated by psychiatrists, including 14 years of posts on bipolar-related subreddits by 818 BD patients, along with the annotations of future suicidality and BD symptoms.

User posts (pkl file)

Word-level Suicide Dictionary (CLPsych 2022)

This dataset contains the assessment of the severity of suicidality of 866 Reddit users who had posted on the r/SuicideWatch subreddit from 2008 to 2015 and their 79,569 posts uploaded to 37,083 subreddits

Suicide Dictionary (csv file) : 5.6KB

Depression Vlogs (AAAI 2022)

This dataset contains the balanced numbers of depression and non-depression vlogs thereby helping a model to learn unique depression-related features that are distinct from the non-depression features.

961 vlogs
816 subjects
acoustic features (npy files): 112MB
visual features (npy files): 597MB
depression/non-depression labels (csv file): 70KB

Preventing rumor spread with deep learning (ESWA 2022)

These datasets contain 7,403 claims whose veracities were either true or false, as reported in Snopes from 2012 to 2017 and 864 claims that were reported in Politifact from 2007 to 2017.

Rumors in Snopes (898K)
- The files are written in TSV forms.
- Dataset Schema is written in the first row of the file.
Rumors in Polififact (462K)
- The files are written in TSV forms.
- Dataset Schema is written in the first row of the file.

User Tagging in Instagram (ESWA 2022)

This dataset contains information-, relationship-, and discussion-oriented motivation for comment-tagging.

The user tagging dataset
- 747 comments: information- (n=313), relationship- (n=369), and discussion-oriented motivation (n=65).
- a comment text (with removing tagged username), the corresponding tagging motivation, and the associated the post text.

DCInside Scrapper for K-POP Fandoms (Quality & Quantity 2022)

This script was used for collecting fandom collaboration data in our paper, "Behind the scenes of K-pop fandom: unveiling K-pop fandom collaboration network".

Suicide-oriented Word Embedding & Suicide Dictionary for English, Chinese, and Korean (EMNLP Findings 2020)

These datasets contain the suicide-related and non-suicide-related Korean posts from Naver Cafe, and suicide-related dictionary data for generating suicide word embeddings for Chinese, English, and Korean, respectively.

Suicide Dictionary
Suicide-oriented Word embedding

VCTube (Interspeech 2020)

VCTUBE is open-source Python library, that can automatically generate pair speech data from a given Youtube video URL.

Mental Health Subreddits (Scientific Reports 2020)

This dataset contains users’ posts from Reddit, a popular social media that includes numerous mental-health-related communities (or so-called ‘subreddits’), such as r/depression, r/bipolar, and r/schizophrenia.

Posts of Mental-health-related Subreddits (.csv format)

Instagram Influencer Marketing (WWW 2020)

This dataset contains 33,935 Instagram influencers who are classified into the following nine categories including beauty, family, fashion, fitness, food, interior, pet, travel, and other.

33,935 Instagram influencers (labeled with 9 categories)
10,180,500 Instagram posts
Post metadata (JSON files): ~37 GB
Image (JPEG files): ~189 GB

Rumor / Fake News (Scientific Reports 2020)

This dataset contains 125 rumors whose veracity is true, false, or mixed, and 37,417 rumor cascades consisting of 289,202 tweets/retweets written by 176,362 users.

Rumor Information (29K)
- Each columns are seperated by tab(‘\t’).
- Dataset Schema is written in the first row of the file.

Reddit (ACM COSN 2015)

This dataset contains 695,857 Reddit posts from the top 100 subreddits in terms of the number of subscribers that each have at least one comment, and their 18,093,422 comments; posts and comments are written by 1,455,293 users. Each post contains the author id, title, subreddit id, and timestamp, while each comment contains the original post id, user id, comment text, and a parent from which the comment is generated. The parent can be a comment or a post.

Data Field & Description (3.3KB)
Subreddit Information (341.7KB)
Post Information (250.7MB)
Comment Information (6.3GB)

Pinterest (ACM SIGMETRICS 2014, ACM COSN 2015)

This dataset includes 2,974,128 users (i.e., 1,561,374 users found in pin-trees + 1,412,754 users discovered through BFS). The dataset contains 40,800,940 boards, 3,362,100,884 pins, 656,123,740 followers, 302,363,300 followings, 1,392,394 Facebook links, and 183,900 Twitter links. We also obtained the country and gender information of 1,354,132 and 1,392,394 users, respectively.

Data Field & Description (1.0KB)
Pinterest Profile Information of Users (66.9MB)
Facebook Profile Information of Users (28.3MB)
Twitter Profile Information of Users (5.1MB)
Board Information (1.05GB)
Pin Information (zip, windows) (38.6MB) / Pin Information (tarball, linux) (39.1MB)
Pin-Tree Information (48.9MB)

BitTorrent (ACM SIGMETRICS 2012)

Torrent Dataset
- Our torrent datasets have been collected for 77 days from February 14 to May 1, 2011. The crawling agent fetched torrent data of 120,550 torrents from TPB, which contains 3,163,685 files whose total volume is around 120 TB. Throughout this paper, we investigate the bundling practice of the seven major (91% and 90% in terms of the torrent counts and data volume, respectively) content categories given by TPB: Movie, TV, Porn, Music, Application, Game and E-book.
- [Torrent data 1 (2011.02.14 ~ 2011.03.25) (3.8MB)]
- [Torrent data 2 (2011.03.25 ~ 2011.05.01) (2.9MB)]
- [Bundle Torrent File data (56.1MB)]
- [Single Torrent File data (743KB)]
Swarm Dataset
- For the torrents discovered between March 25 and April 26, we have periodically (once every two hours) captured swarm snapshots, to investigate access pattern of peers participating in the swarms. We restrict the swarm dataset collection and analysis to those of the torrents collected between March 25 and April 26, due to the performance limitations of our monitoring facilities, which consist of 14 desktop PCs (admittedly research-grade).
- [Swarm data (11.48GB)]

Google Sites

Report abuse