Goodreads Datasets

NOTE: Our datasets have moved!

Please see our new webpage for how to download these datasets. This Google site, along with the download links to our previous Google Drive, will be deprecated soon.

====================================

The datasets were collected in late 2017 from goodreads.com. We scraped only users' public shelves, i.e. shelves that anyone can view on the web without logging in. User IDs and review IDs are anonymized.

We collected these datasets for academic use only. Please do not redistribute them or use them for commercial purposes.

If you are using our datasets, please cite the following papers:

Mengting Wan, Julian McAuley, "Item Recommendation on Monotonic Behavior Chains", in RecSys'18. [bibtex]
Mengting Wan, Rishabh Misra, Ndapa Nakashole, Julian McAuley, "Fine-Grained Spoiler Detection from Large-Scale Review Corpora", in ACL'19. [bibtex]

If you have any questions or find any bugs regarding these datasets, feel free to contact Mengting Wan (m5wan@ucsd.edu).

Latest Updates

We updated several files in May 2019. We really appreciate those who helped us identify duplicates and bugs in the previous version!

We created a GitHub repo that includes a few Jupyter notebooks showing how to load the datasets and perform some basic data exploration.
[May 2019] Review files are uploaded.
[May 2019] Interaction files are updated: duplicates and mismatches are removed.
[May 2019] Meta-data of books are updated: text descriptions are normalized; popular shelf names with negative counts are removed.

Overview

We collected three groups of datasets: (1) meta-data of the books, (2) user-book interactions (users' public shelves) and (3) users' detailed book reviews. These datasets can be merged by matching book, user and review IDs.
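As a rough illustration of merging by matching IDs, the sketch below pairs records from two groups on a shared key. The function name join_by_id and the "book_id" field used in the comment are assumptions for illustration only; check the sample records on each dataset page for the actual field names.

```python
from collections import defaultdict


def join_by_id(left_records, right_records, key):
    """Pair every left record with the right records that share `key`.

    Both inputs are iterables of dicts. The right side is indexed in
    memory, so pass the smaller collection as `right_records`.
    """
    index = defaultdict(list)
    for record in right_records:
        index[record[key]].append(record)
    for record in left_records:
        # Left records without a match are skipped (inner join).
        for match in index.get(record[key], []):
            yield record, match


# Hypothetical usage, assuming both groups expose a "book_id" field:
# for review, book in join_by_id(reviews, books, "book_id"):
#     ...
```

Indexing one side in a dict keeps the join to a single pass over each collection, which matters at this scale.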

Basic Statistics of the Complete Book Graph:

2,360,655 books (1,521,962 works, 400,390 book series, 829,529 authors)
876,145 users; 228,648,342 user-book interactions in users' shelves (including 112,131,203 reads and 104,551,549 ratings). These counts reflect the May 2019 update, which removed duplicates; the previous version listed 229,154,523 interactions (112,310,716 reads and 104,713,520 ratings).

Note that the complete interaction dataset is very large! We extracted several medium-size subsets by genre, and recommend using these subsets for experimentation first (see "By Genre" for details).

Books

(Meta-Data of Books)

We collected detailed meta-data about 2.36M books. Please see the "Books" page for dataset details and sample records.

Quick links:

Complete book graph: goodreads_books.json.gz
Author information: goodreads_book_authors.json.gz
Work information: goodreads_book_works.json.gz
Book series: goodreads_book_series.json.gz
Fuzzy book genres: gooreads_book_genres_initial.json.gz
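
The .json.gz files above can be read without decompressing them to disk first. The following is a minimal sketch, assuming each file stores one JSON object per line (JSON Lines); that layout and the helper name iter_json_gz are assumptions, so confirm them against the sample records on the "Books" page.

```python
import gzip
import json
from itertools import islice


def iter_json_gz(path, limit=None):
    """Yield one dict per non-empty line of a gzipped JSON Lines file.

    `limit` caps the number of records, so a large file can be
    sampled without reading it in full. Assumes one JSON object
    per line.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        lines = (line for line in handle if line.strip())
        for line in islice(lines, limit):
            yield json.loads(line)


# Hypothetical usage: inspect the field names of the first three books.
# for book in iter_json_gz("goodreads_books.json.gz", limit=3):
#     print(sorted(book))
```

Because the function is a generator, memory use stays flat regardless of file size.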

Shelves

(User-Book Interactions)

We collected about 229M user-book interactions. Please see the "Shelves" page for dataset details and sample records.

Quick links (These files could be very large! Consider using genre-wise datasets if your resources are limited.):

Complete 229M interactions in CSV format (~4.1 GB): goodreads_interactions.csv
User IDs: user_id_map.csv
Book IDs: book_id_map.csv
Contact Mengting Wan (m5wan@ucsd.edu) if you need a detailed version
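
Since the interaction file is several gigabytes, streaming it row by row is usually safer than loading it whole. The sketch below assumes the CSV files have a header row and that the two map files translate the IDs used in goodreads_interactions.csv to the IDs used elsewhere. All column names shown ("book_id", "csv_id", "original_id") are placeholders, not the documented schema; see the "Shelves" page for the real headers.

```python
import csv


def load_id_map(path, key_column, value_column):
    """Read an ID map CSV into a dict of key_column -> value_column."""
    with open(path, newline="", encoding="utf-8") as handle:
        return {
            row[key_column]: row[value_column]
            for row in csv.DictReader(handle)
        }


def iter_interactions(path, book_ids=None, book_column="book_id"):
    """Stream interaction rows as dicts, one at a time.

    If `book_ids` (a set of strings) is given, only rows whose
    `book_column` value is in that set are yielded.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            if book_ids is None or row[book_column] in book_ids:
                yield row


# Hypothetical usage with placeholder column names:
# book_map = load_id_map("book_id_map.csv", "csv_id", "original_id")
# for row in iter_interactions("goodreads_interactions.csv", book_ids={"0"}):
#     original_book_id = book_map[row["book_id"]]
```

Note that csv.DictReader returns every value as a string, so convert ratings or flags to numbers explicitly where needed.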

Reviews

(Book Review Texts)

We further re-scraped more than 15M records with detailed review text. Please see the "Reviews" page for details and sample records.

Quick links:

Complete 15.7M reviews (~5 GB): goodread_reviews_dedup.json.gz
Review subset (~1.38M reviews) with parsed spoiler tags: goodreads_reviews_spoiler.json.gz
Spoiler subset with original review text: goodreads_reviews_spoiler_raw.json.gz

Code Samples

(Operate the Datasets)

We created several Jupyter notebooks to illustrate how to download and read these datasets, and to provide some basic explorations of the data.

Quick links:

README!
Download datasets without GUI: download.ipynb
Display sample records: samples.ipynb
Calculate basic statistics: statistics.ipynb
Explore the interaction data: distributions.ipynb
Explore the review data: reviews.ipynb

By Genre

Interaction density varies across the genre subsets.
Genres can overlap, i.e., one book may belong to multiple genres.
The (similar) book graph for each genre may not be self-contained: each one is just a subset of the nodes of the complete book graph (see the meta-data section), so it may refer to books outside the subset.
Detailed information about authors, works, book series etc. can be found in the meta-data section.