Goodreads Datasets

This dataset was collected in late 2017 from goodreads.com. We scraped only users' public shelves, i.e. shelves that anyone can see on the web without logging in. User IDs and review IDs are anonymized. For each book on each user's shelves, we extracted whether the book had been read and whether it had been rated (along with the rating score). Thus, for each user-book pair, we obtained the following interaction chain: shelved -- read -- rated (with rating score).

We collected this dataset for academic use only. Please do not redistribute it or use it for commercial purposes.

Note: The complete dataset is very large! We also extracted several medium-size subsets by genre for experimentation (see "by genre" sections below).

If you use our dataset, please cite the following paper:

  • Mengting Wan, Julian McAuley, "Item Recommendation on Monotonic Behavior Chains", in Proc. of 2018 ACM Conference on Recommender Systems (RecSys'18), Vancouver, Canada, Oct. 2018. [paper]

If you have any questions or find any bugs in this dataset, please contact Mengting Wan (m5wan@ucsd.edu). Your help would be appreciated.

Basic Statistics & Quick Links:

complete:
    • 2,360,655 books (1,521,962 works, 400,390 book series, 829,529 authors)
    • 876,145 users; 229,154,523 user-book interactions in users' shelves (including 112,310,716 reads and 104,713,520 ratings)
    • interactions per user: 261.55; interactions per book: 97.07
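
The two averages follow directly from the totals listed above. A quick check, using only those numbers (the variable names are ours):

```python
# Totals copied from the "complete" statistics above.
num_books = 2_360_655
num_users = 876_145
num_interactions = 229_154_523
num_reads = 112_310_716
num_ratings = 104_713_520

# Average number of shelved books per user and of shelving users per book.
interactions_per_user = num_interactions / num_users
interactions_per_book = num_interactions / num_books

# Fraction of shelved user-book pairs that reached each later stage.
read_share = num_reads / num_interactions
rated_share = num_ratings / num_interactions

print(f"interactions per user: {interactions_per_user:.2f}")
print(f"interactions per book: {interactions_per_book:.2f}")
print(f"read share: {read_share:.1%}, rated share: {rated_share:.1%}")
```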
by genre:

 - Interaction densities differ across the subsets.
 - The (similar) book graph for each genre may not be self-contained: each one is only a subset of the nodes of the complete book graph (see the meta-data section).
 - Detailed information about authors, works, book series etc. can be found in the meta-data section.

Children:

Comics & Graphic:

Fantasy & Paranormal:

History & Biography:

Mystery, Thriller & Crime:

Poetry:

Romance:

Young Adult:

Meta-Data of Books:

Potential Use Cases:

  • Recommender System (together with user-book interaction data)
  • Knowledge Graph
  • Standard Graph Mining, Classification, Clustering etc.

Links:

Complete:
  • Detailed book graph (size ~2 GB, about 2.3M books): goodreads_books.json.gz
    • Attributes:
      • "title", "description", "#reviews", "#ratings", "average_rating"
      • "isbn", "country_code", "language_code", "num_pages", "publication_day", "publication_month", "publication_year", "url"
      • "series", "authors", "publisher", "work_id"
      • "popular_shelves": top user-generated shelves for a book, which Goodreads uses to define genres
      • "similar_books": a list of books that users who like the current book also like
  • Detailed information of authors: goodreads_book_authors.json.gz
  • Detailed information of works (the abstract version of a book, regardless of any particular edition): goodreads_book_works.json.gz
  • Detailed information of book series: goodreads_book_series.json.gz
      • Note: Unfortunately, the series id included here cannot be used for "URL hack"
  • Extracted fuzzy book genres: goodreads_book_genres_initial.json.gz
      • Note: This is a very fuzzy version of book genres. These tags are extracted from users' popular shelves by a simple keyword matching process.
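
Since the complete files are large, it can help to stream them rather than load them at once. The sketch below is one way to do that, and rests on an assumption not stated above: that each .json.gz file stores one JSON object per line. The helper names `iter_records` and `preview` are hypothetical.

```python
import gzip
import json
from itertools import islice


def iter_records(path):
    """Yield one dict per non-empty line of a gzip-compressed file.

    Assumption: the file holds one JSON object per line.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def preview(path, limit=3):
    """Return the first `limit` records as a list, without reading the rest."""
    return list(islice(iter_records(path), limit))
```

For example, `preview("goodreads_books.json.gz")` would return the first three book records if the file has been downloaded to the working directory and matches the assumed layout.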
By Genre:

Note: Book genres are coarsely defined and extracted from popular shelf names. The same book could belong to more than one genre in this definition.
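
The exact keyword lists and matching rules behind the genre subsets are not given here. The sketch below only illustrates how simple keyword matching over shelf names can place one book in several genres; the keyword table, the function name `fuzzy_genres` and the sample shelf names are all made up for illustration.

```python
# Hypothetical keyword table: genre -> substrings looked for in shelf names.
GENRE_KEYWORDS = {
    "Children": ["children", "kids"],
    "Comics & Graphic": ["comic", "graphic"],
    "Poetry": ["poetry", "poem"],
}


def fuzzy_genres(shelf_names):
    """Return the sorted genres whose keywords occur in any shelf name."""
    matched = set()
    for name in shelf_names:
        lowered = name.lower()
        for genre, keywords in GENRE_KEYWORDS.items():
            if any(keyword in lowered for keyword in keywords):
                matched.add(genre)
    return sorted(matched)


# Illustrative shelf names, not taken from the dataset.
shelves = ["to-read", "graphic-novels", "childrens-books", "favorites"]
print(fuzzy_genres(shelves))  # ['Children', 'Comics & Graphic']
```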

User-Book Interactions:

Potential Use Cases:

  • Recommender System
  • NLP/Text Mining tasks

Links:

Complete:
  • User-Book Interactions (size ~9 GB, about 229M records): goodreads_interactions.json.gz
    • Each record indicates a book is included in a user's shelf, where
      • "isRead": indicates if the book has been read by the user
      • "rating": the user's rating score for the book (ranges from "1" to "5"; "0" indicates "not provided")
    • Please contact m5wan@ucsd.edu to get the complete interaction dataset
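
The shelved -- read -- rated chain can be recovered from the two fields described above. A minimal sketch, assuming each record has been parsed into a dict, that "isRead" is a boolean, and that "rating" may arrive either as an integer or as a string; `chain_stage` is a hypothetical helper name.

```python
def chain_stage(record):
    """Map one interaction record to the furthest stage it reached.

    Assumptions: `record` is a dict carrying the "isRead" and "rating"
    fields described above; a rating of 0 means "not provided".
    """
    rating = int(record.get("rating", 0))
    if rating > 0:
        return "rated"
    if record.get("isRead"):
        return "read"
    return "shelved"  # every record is at least on a shelf


# Illustrative records, not taken from the dataset.
examples = [
    {"isRead": False, "rating": 0},
    {"isRead": True, "rating": 0},
    {"isRead": True, "rating": "4"},
]
print([chain_stage(r) for r in examples])  # ['shelved', 'read', 'rated']
```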
By Genre:

Note: Book genres are coarsely defined and extracted from popular shelf names. The same user-book interaction could belong to more than one genre in this definition.