Instagram Influencer Dataset
As part of the 'Influencer Marketing Research', I collected data from Instagram and am sharing it for research purposes. There are two separate datasets.
I. Instagram Influencer Dataset: Category Classified
(1) Dataset Download
To download the dataset, please fill out the following form.
33,935 Instagram influencers (labeled with 9 categories)
10,180,500 Instagram posts
Post metadata (JSON files): ~37 GB
Image (JPEG files): ~189 GB
(2) Dataset Description
This dataset contains 33,935 Instagram influencers, each classified into one of the following nine categories: beauty, family, fashion, fitness, food, interior, pet, travel, and other. We collected 300 posts per influencer, for a total of 10,180,500 Instagram posts. The dataset includes two types of files: post metadata and image files. Post metadata files are in JSON format and contain the following information: caption, usertags, hashtags, timestamp, sponsorship, likes, comments, etc. Image files are in JPEG format. The dataset contains 12,933,406 image files because a post can have more than one image. If a post has only one image, the JSON file and the corresponding image file share the same name. If a post has more than one image, the JSON file and the corresponding image files have different names. We therefore also provide a JSON-Image_mapping file that lists the image files corresponding to each post metadata file.
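The naming rule above can be sketched in Python as a small lookup helper. This is only an illustration under stated assumptions: the layout of the JSON-Image_mapping file is not described here, so the sketch assumes it has already been loaded into a dictionary from a JSON file name to a list of image file names. The function name image_files_for_post, the ".jpg" extension, and the example file names are all hypothetical.

```python
from pathlib import PurePath


def image_files_for_post(json_name, mapping=None, image_ext=".jpg"):
    """Return the image file names that belong to one post metadata file.

    ASSUMPTION: `mapping` is a dict {json_file_name: [image_file_name, ...]}
    built from the JSON-Image_mapping file; its real on-disk layout may differ.
    """
    mapping = mapping or {}
    if json_name in mapping:
        # Post listed in the mapping: use the listed image names.
        return list(mapping[json_name])
    # Otherwise fall back to the single-image rule: same name, image extension.
    return [PurePath(json_name).stem + image_ext]


# Hypothetical file names, for illustration only.
example_mapping = {"post_b.json": ["post_b_1.jpg", "post_b_2.jpg"]}
single_image = image_files_for_post("post_a.json", example_mapping)
multi_image = image_files_for_post("post_b.json", example_mapping)
```

Falling back to the shared name keeps the helper usable whether or not single-image posts appear in the mapping, which this description does not state.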
Figure: The categories of influencers
Figure: Overall framework of the proposed model
(3) Influencer Labeling
To automatically classify influencers by their interests, we proposed a multimodal classifier. The details of the model and data collection method are described in our paper "Multimodal Post Attentive Profiling for Influencer Marketing" published in The Web Conference '20.
(4) Citation for the Instagram Influencer Dataset
"Multimodal Post Attentive Profiling for Influencer Marketing," Seungbae Kim, Jyun-Yu Jiang, Masaki Nakada, Jinyoung Han and Wei Wang. In Proceedings of The Web Conference (WWW '20), ACM, 2020.
@inproceedings{kim2020multimodal,
title={Multimodal Post Attentive Profiling for Influencer Marketing},
author={Kim, Seungbae and Jiang, Jyun-Yu and Nakada, Masaki and Han, Jinyoung and Wang, Wei},
booktitle={Proceedings of The Web Conference 2020},
pages={2878--2884},
year={2020}
}
II. Influencer and Brand Dataset: Sponsorship Detection
(1) Dataset Download
To download the dataset, please fill out the following form.
38,113 Instagram influencers
26,910 brands
1,601,074 Instagram posts
Post metadata (JSON files): ~3 GB
Image (JPEG files): ~33 GB
(2) Dataset Description
This dataset contains 1.6 million Instagram posts that mention 26,910 brand names and were published by 38,113 influencers. There are two types of brand mentions in influencer marketing: sponsored and non-sponsored. If an influencer is paid to publish an advertising post that mentions the name of a brand, the post is considered sponsored. In the dataset, we provide the JSON and image files of the posts along with their sponsorship labels.
(3) Sponsorship Labeling
We label a post as 'Sponsored' if the post either uses the branded content tool or contains sponsorship-related hashtags (e.g., #ad, #sponsored, #paidAd). The details of the data collection method and labeling rules are described in our paper "Discovering Undisclosed Paid Partnership on Social Media via Aspect-Attentive Sponsored Post Learning" published in WSDM '21.
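The labeling rule above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the hashtag set holds only the three examples named above (the full rule set is in the paper), case-insensitive matching is an assumption, and the function name is_sponsored and its parameters are hypothetical. The dataset already ships with sponsorship labels, so this sketch only shows the logic.

```python
import re

# Only the example hashtags named in the text, lowercased.
# ASSUMPTION: matching is case-insensitive; the paper has the full rules.
SPONSORSHIP_HASHTAGS = {"ad", "sponsored", "paidad"}
HASHTAG_PATTERN = re.compile(r"#(\w+)")


def is_sponsored(caption, uses_branded_content_tool=False):
    """Return True if a post would be labeled 'Sponsored' under the rule."""
    if uses_branded_content_tool:
        return True
    # Extract whole hashtags so that "#adventure" does not match "#ad".
    tags = {tag.lower() for tag in HASHTAG_PATTERN.findall(caption)}
    return not tags.isdisjoint(SPONSORSHIP_HASHTAGS)
```

Matching on whole hashtags rather than substrings avoids false positives from tags that merely begin with "ad".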
(4) Citation for the Influencer and Brand Dataset
"Discovering Undisclosed Paid Partnership on Social Media via Aspect-Attentive Sponsored Post Learning," Seungbae Kim, Jyun-Yu Jiang, and Wei Wang. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining (WSDM '21), ACM, 2021.
@inproceedings{kim2021discovering,
title={Discovering Undisclosed Paid Partnership on Social Media via Aspect-Attentive Sponsored Post Learning},
author={Kim, Seungbae and Jiang, Jyun-Yu and Wang, Wei},
booktitle={Proceedings of the 14th ACM International Conference on Web Search and Data Mining},
pages={319--327},
year={2021}
}