PHASE I
Background:
As the book industry grows increasingly diverse and competitive, authors and publishers are constantly looking for the next bestseller. The variety of books has expanded, which makes success and popularity harder to foresee. Market trends and competition play a significant role because they typically determine which books will be more profitable or successful in the current market. However, by applying data analysis methods to historical datasets gathered by primary book review and sales sites such as Goodreads, which contain information on specific book features, we can draw predictions about what contributes to a book's overall success.
Why is this important?
The book industry is a complex and competitive market, and there is little publicly available information on which attributes contribute to a book's success. In "The Book Industry's Quest for Data Intelligence", Ellen Harvey emphasizes the need for collecting data on books in order to read the market and know "the audience". Publishers have difficulty understanding what customers want; typically the only people with access to that knowledge are book retailers, who have information on which books are selling fastest and on customers' reading habits. Only recently, by "collecting data from a variety of channels, publishers are gleaning new insights about current and potential customers" (Harvey, "The Book Industry's Quest for Data Intelligence"). With access to this data, publishers and authors can learn which books and which kinds of books are popular in the market, and how these books can maximize profits. "Gathering the right data helps to grasp existing and potential customers' thought patterns and behaviors, which helps to paint a detailed picture, allowing publishers to predict future purchases and forecast sales" (Ribbonfish, "Spotlight: How Data influences the book publishing industry").
Introduction/Overview:
The purpose of this project is to analyze and predict the rating success of books. In my analysis, I use datasets provided by Goodreads, which contain feature information such as authors, publishers/publication companies, books' average ratings, reviews, publication years, book length (in pages), languages, genres, fiction/nonfiction classification, authors' average ratings, and ratings counts for books and authors.
Research Questions:
These features are crucial to my machine learning model, which I use to explore the following questions about what contributes to a book's literary success:
Would a nonfiction or a fiction book prove to be more successful?
Do an author's historical ratings contribute to the success of the next book he/she publishes?
Does the length of a book affect its literary success?
Are books that are part of a series more successful than books that are not?
Phase 1: Introduction
Dataset features:
Goodreads UCSD Bookgraph and Genre dataset:
29 columns and 2,360,655 rows
PHASE II
Exploratory Data Analysis:
Import and combine json files: goodreads_books_json, goodreads_genres_json, and goodreads_authors_json.
Report essential findings, trends, and patterns:
The Goodreads books dataset contains detailed information on each book: title, author's name, edition information, IDs of the other books in the series if the book is part of a series, publishing company, publishing year and date, number of pages, average rating of the book, text reviews count, description of the book, and language.
Extracted the fiction and nonfiction classifications of books, and extracted information on whether the books were part of a series.
We also have information on all of each author's books, the author's average rating, and the author's ratings count (to measure the author's popularity).
Genre information: which category each book is classified under, nonfiction or fiction.
Series information: which books are part of a series and which are not.
A book's average rating does not take the book's ratings count into account, which distorts the true rating of that book.
An author's average rating also does not take the author's ratings count into account, which distorts the author's true rating.
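The fiction/nonfiction and series flags described above are derived rather than read directly from the files. The following is a minimal sketch of one way to derive them, assuming each genre record is a dictionary that maps (possibly comma-separated) genre labels to vote counts and each book record carries a `series` list. The field names, the sample records, and the tie-breaking rule are illustrative assumptions, not the project's actual code.

```python
def is_fiction(genres):
    """Return 1 (fiction), 0 (nonfiction), or None if neither label appears.

    Assumes `genres` maps labels such as "history, historical fiction, biography"
    to vote counts; ties are resolved as nonfiction (an arbitrary choice).
    """
    fiction_votes = 0
    nonfiction_votes = 0
    for label, count in genres.items():
        parts = [part.strip().lower() for part in label.split(",")]
        if any(part in ("non-fiction", "nonfiction") for part in parts):
            nonfiction_votes += count
        elif any(part.endswith("fiction") for part in parts):
            fiction_votes += count
    if fiction_votes == 0 and nonfiction_votes == 0:
        return None
    return 1 if fiction_votes > nonfiction_votes else 0


def in_series(series_ids):
    """Return 1 when the book lists at least one series ID, else 0."""
    return 1 if series_ids else 0


# Hypothetical sample records, each shaped like one parsed JSON line.
books = [
    {"book_id": "1", "series": ["100"]},
    {"book_id": "2", "series": []},
]
genres_by_book = {
    "1": {"fantasy, paranormal": 30, "fiction": 12},
    "2": {"non-fiction": 8, "history, historical fiction, biography": 3},
}

flags = [
    {
        "book_id": book["book_id"],
        "is_fiction": is_fiction(genres_by_book.get(book["book_id"], {})),
        "in_series": in_series(book["series"]),
    }
    for book in books
]
print(flags)
```

Counting votes per label is only one possible rule; a stricter rule could require an exact "fiction" or "non-fiction" label.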
Trends found:
Books that are part of a series appear to have higher ratings (Figure 1)
Top-rated authors are authors who have written book series or who have higher historical popularity (Figure 2)
Authors who have written the most books (Figure 3)
Fiction books tend to be more successful (Figure 4)
Most books are published in October (Figure 5)
Positive linear relationship between a book's log weighted rating and its ratings count (Figure 6): the more people rated the book, the more popular and successful the book was.
Positive linear relationship between a book's log weighted rating and its reviews count (Figure 7): the more people reviewed the book, the more popular and successful the book was.
Bonus features to explore:
d. Do an author's historical success and popularity contribute to the success of the author's next books?
e. Does an author's productivity (in terms of how many books they have written and how many pages each book has) contribute to the books' success?
Github Repository:
https://github.com/amk986/Capstone-Project-Predicting-Book-success
Phase II: EDA
Figure 1: Books' overall ratings, part of a series vs. not part of a series
Figure 2: Highest-rated authors
Figure 3: Authors who have written the most books
Figure 4: Books' overall ratings, fiction vs. nonfiction
Figure 5: Number of books published per month
Figure 6: Books' ratings count vs. books' log weighted rating
Figure 7: Books' reviews count vs. books' log weighted rating
Decision Tree Classification Modeling:
Set up: choosing and refining which features to test
Feature Variables:
authors id - the author's ID numbers, each author has written multiple different books
publisher id - publishing company's ID's, each publishing company could have published multiple different books
fiction/nonfiction classification - binary classification if a book is nonfiction or fiction
series or not classification - binary classification of whether a book is part of a series (0: not part of a series, 1: part of a series)
book titles - title of the books
book description - brief description of books
book length - number of pages per book
total books - the total number of books each author has written
Target variable:
book's log weighted rating - the true rating of the book, which takes into account both the number of ratings collected for the book and the average rating the book received. I used the weighted average rating to capture the true rating of a book: a book could be rated at 5 stars while its ratings count is 1 (only one person rated it), which doesn't necessarily mean the book was successful. Weighted averages assign importance to each number in a set of numbers (Sareen, "Breaking Down Goodreads Dataset using Python").
Note: Sareen's analysis takes into account only the weighted rating; I chose to take the logarithm of the weighted average to compress the high range of values.
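The report does not spell out the weighting formula, so the following is only a sketch of the idea under a stated assumption: the weighted rating is taken to be the average rating multiplied by the ratings count, and `log1p` is used so that a book with zero ratings maps to 0 instead of an undefined value. Both choices, and the function name, are hypothetical.

```python
import math


def log_weighted_rating(average_rating, ratings_count):
    """Log of an assumed weighted rating (average rating times ratings count)."""
    weighted = average_rating * ratings_count  # assumption: count acts as the weight
    return math.log1p(weighted)                # log(1 + x) keeps zero counts defined


# A 5-star book with a single rating vs. a 4.2-star book with 10,000 ratings.
one_rating = log_weighted_rating(5.0, 1)
many_ratings = log_weighted_rating(4.2, 10_000)
print(one_rating, many_ratings)
```

Under this assumption the widely rated book scores higher than the single five-star rating, which is the behavior the target variable is meant to capture.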
Splitting into test and training datasets.
X: Feature variables
y: target variable
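The split and an initial Decision Tree fit can be sketched with scikit-learn as follows. The feature matrix and labels here are random placeholders rather than the Goodreads data, so the printed accuracy is meaningless; the 80/20 split, the tree depth, and the random seeds are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n_books = 500

# Placeholder numeric features standing in for the report's feature variables.
X = np.column_stack([
    rng.integers(0, 2, n_books),     # fiction/nonfiction flag
    rng.integers(0, 2, n_books),     # series-or-not flag
    rng.integers(50, 900, n_books),  # book length in pages
    rng.integers(1, 40, n_books),    # total books by the author
])
y = rng.integers(0, 5, n_books)      # placeholder rating classes 0..4

# Hold out 20% of the rows for testing, keeping class proportions similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

tree = DecisionTreeClassifier(max_depth=5, random_state=42)
tree.fit(X_train, y_train)
accuracy = accuracy_score(y_test, tree.predict(X_test))
print(f"accuracy: {accuracy:.3f}")
```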
Initial Results:
53.5% accuracy result
PHASE III
Application of Machine Learning Algorithms, Final Results, and Conclusions
Final Accuracy Test results
Accuracy result: 63.5%
Accuracy result: 70.6% (Support Vector Machine)
Accuracy result: 54.3%
Methodology
Machine Learning Algorithms selection:
After conducting EDA and looking into which feature variables play a significant role in predicting a book's literary success, I decided to run three different machine learning algorithms to determine which machine learning algorithm would provide the highest accuracy result. Utilizing a classification approach, I chose to run a Decision Tree Classifier, Support Vector Machine, and K-Nearest Neighbor algorithms.
Each machine learning algorithm has its own advantages. Decision Tree classifiers are useful for supervised learning, are easy to interpret, and don't require feature scaling (Galarnyk, "Understanding Decision Trees for Classification in Python"). Classification algorithms are beneficial when datasets contain multiple classification groups. K-Nearest Neighbor is nonparametric, makes no assumptions about the data distribution, and classifies objects mainly by feature similarity (Gahukar, "Classification Algorithms in Machine Learning...").
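Comparing the three algorithms on the same split can be sketched as below. The data is synthetic (generated with `make_classification`), the hyperparameters are scikit-learn defaults or arbitrary picks, and the scaler in front of the SVM and K-Nearest Neighbor models is an assumption based on those models being sensitive to feature scale; none of this is taken from the project's notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic five-class problem standing in for the book features.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

models = {
    "decision_tree": DecisionTreeClassifier(random_state=0),  # no scaling needed
    "svm": make_pipeline(StandardScaler(), SVC()),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

# Fit each model on the training rows and record its test accuracy.
scores = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```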
I divided the target variable into five classes: bad, average, above average, good, and excellent. My categorical feature variables were the series or not classification and the fiction/nonfiction classification. In my initial trial, the Decision Tree Classifier provided the highest accuracy result.
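Turning the continuous log weighted rating into five classes can be sketched with pandas. The report does not state the cut points, so this example assumes quantile-based bins (`pd.qcut`), which give roughly equal class sizes; the sample values are made up.

```python
import pandas as pd

labels = ["bad", "average", "above average", "good", "excellent"]

# Made-up log weighted ratings, sorted only to make the result easy to read.
log_ratings = pd.Series([0.7, 1.9, 2.5, 3.1, 4.4, 5.0, 6.2, 7.3, 8.8, 9.5])

# Assumption: five equal-frequency bins; fixed cut points would use pd.cut instead.
rating_class = pd.qcut(log_ratings, q=5, labels=labels)
class_counts = rating_class.value_counts().to_dict()
print(rating_class.tolist())
```

With fixed thresholds the classes can end up very unbalanced, which affects how an accuracy score should be read.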
Results/Conclusions:
After trial and error in refining my feature variables and in vectorizing and transforming variables into numerical values, the Support Vector Machine model provided the highest accuracy result of 70.58%. The feature variables used in the model that provided the highest result were: book's title, book's description, book's ratings count, book's reviews count, number of pages per book, fiction or nonfiction classification, series or not classification, and total number of books the author had written. These features play a significant role in predicting how successful a book will be.
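Combining vectorized text with numeric and binary features in one model can be sketched with a scikit-learn `ColumnTransformer`. The column names, the six sample rows, the TF-IDF vectorizer, and the linear kernel are all assumptions chosen for illustration; the report does not say which vectorizer or kernel was used.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical miniature dataset; real rows would come from the Goodreads files.
books = pd.DataFrame({
    "title": ["Dragon Rising", "Tax Basics", "Dragon Falling",
              "Garden Guide", "Star Voyage", "Cooking Simply"],
    "description": ["a young dragon rider saves the realm",
                    "a plain guide to filing taxes",
                    "the dragon rider returns to war",
                    "how to grow vegetables at home",
                    "a crew explores a distant star",
                    "simple recipes for busy cooks"],
    "num_pages": [420, 150, 450, 200, 380, 120],
    "ratings_count": [12000, 40, 9000, 75, 6000, 30],
    "text_reviews_count": [900, 5, 700, 10, 450, 4],
    "total_books": [8, 1, 8, 2, 5, 1],
    "is_fiction": [1, 0, 1, 0, 1, 0],
    "in_series": [1, 0, 1, 0, 0, 0],
    "rating_class": ["good", "bad", "good", "bad", "good", "bad"],
})

numeric_columns = ["num_pages", "ratings_count", "text_reviews_count", "total_books"]
preprocess = ColumnTransformer([
    ("title", TfidfVectorizer(), "title"),              # text column passed as a string
    ("description", TfidfVectorizer(), "description"),
    ("numeric", StandardScaler(), numeric_columns),     # scale counts and page lengths
    ("flags", "passthrough", ["is_fiction", "in_series"]),
])

model = Pipeline([("features", preprocess), ("svm", SVC(kernel="linear"))])
features = books.drop(columns="rating_class")
model.fit(features, books["rating_class"])
predictions = model.predict(features)
print(list(predictions))
```

Keeping the preprocessing inside the pipeline means the vectorizers and scaler are fitted only on the rows passed to `fit`, which matters once a separate test set is used.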
Support Vector Machines have the advantage of functioning effectively in high-dimensional spaces, and an SVM "uses a subset of training points in the decision function so it's memory efficient" (Gahukar, "Classification Algorithms in Machine Learning..."). After my initial trials of running SVM, I tried hyperparameter tuning, which searches for the settings that govern the model and are fixed before training, in order to increase the accuracy result. I tried to limit the number of hyperparameter optimization techniques I used. Choosing effective hyperparameters allows "efficient search of parameters in space, and ease to manage a large set of experiments for hyperparameter tuning" (Prabhu, "Understanding Hyperparameter techniques and Optimization Techniques").
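Hyperparameter tuning for an SVM can be sketched with a grid search. The report does not name the search technique or the grid that was tried, so `GridSearchCV`, the candidate values of `C` and `kernel`, the three-fold cross-validation, and the synthetic data are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Small synthetic three-class problem so the search finishes quickly.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

pipeline = make_pipeline(StandardScaler(), SVC())

# make_pipeline names the SVC step "svc", hence the "svc__" prefix.
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__kernel": ["linear", "rbf"],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, round(best_score, 3))
```

The number of model fits is the grid size times the number of folds (here 6 x 3, plus one final refit), which is why a search like this becomes slow on a large dataset.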
Limitations:
Throughout the development of my project there were a number of limitations and obstacles. In the Exploratory Data Analysis portion of the project, I spent a lot of time cleaning the data and extracting the features I wanted to use in the machine learning models. Each file I used from UCSD took a significant amount of time to clean because of nested lists and dictionaries. For example, in the Goodreads genre file, I had to decide whether to focus on extracting books' genres or their fiction/nonfiction classifications, since most books were categorized by a combination of genre and fiction/nonfiction classification, such as "historical fiction" or "young adult fiction". It also took time to clean out the nested dictionaries and extract only the fiction and nonfiction classifications. The book series classification was similar. Initially I wanted to use the Goodreads book series dataset, but it didn't prove to be very useful, so I chose to classify only by the "book series" column of the Goodreads books dataset: if there was a book_id included in that list, then the book was classified as part of a book series. Another limitation was that, because there was a lot of detailed book information and there were 2.3M books in the dataset, there was only so much data I could read in and process in my notebook without it crashing (similarly, there was only so much data I could feed into the machine learning algorithms).
In addition, although the dataset has a lot of specific information about the books, it didn't have the features I wanted to explore and research. I had to create those feature variables from the dataset provided (log_weighted_rating, series or not, fiction or nonfiction).
In terms of running the machine learning algorithms, I could only use a certain amount of data, and even then it took a significant amount of time to split the datasets into training and test sets, fit the models, and score the results. As mentioned, I tried hyperparameter tuning to help improve the accuracy results, but it also took too long to run. I also tried limiting the number of classes to 3 instead of 5 when scoring the ratings, but that didn't improve the accuracy result either. Eventually, after limiting the number of feature variables, which reduced the number of parameters and in turn the running time, I was able to improve the accuracy results on all three algorithms.
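One common way to work within the memory limits described above is to read a JSON-lines file one record at a time and keep only the fields that are needed. The sketch below uses only the standard library; the field names, the `limit` parameter, and the sample lines are assumptions, and it is not the loading code used in the project.

```python
import io
import json
from itertools import islice

KEEP_FIELDS = ("book_id", "average_rating", "ratings_count", "num_pages")


def iter_books(handle, keep=KEEP_FIELDS, limit=None):
    """Yield one trimmed dict per JSON line, reading at most `limit` lines."""
    for line in islice(handle, limit):  # limit=None reads to the end
        line = line.strip()
        if not line:
            continue                    # skip blank lines
        record = json.loads(line)
        yield {field: record.get(field) for field in keep}


# In-memory stand-in for a large file; a real (possibly gzipped) file could be
# opened with gzip.open(path, "rt") and passed in the same way.
sample = io.StringIO(
    '{"book_id": "1", "average_rating": "4.1", "ratings_count": "250", '
    '"num_pages": "310", "description": "long text"}\n'
    '{"book_id": "2", "average_rating": "3.6", "ratings_count": "12", '
    '"num_pages": "95", "description": "long text"}\n'
    '{"book_id": "3", "average_rating": "4.8", "ratings_count": "3", '
    '"num_pages": "540", "description": "long text"}\n'
)

rows = list(iter_books(sample, limit=2))
print(rows)
```

Because the generator never holds more than one parsed line at a time, memory use depends on the number of kept rows and fields rather than on the size of the file.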
Further Research:
A significant amount of research is still needed to predict which features contribute to a book's success. It would be essential to explore the financial side of book success: which books prove to be the most profitable? One study analyzed and tried to predict books' sales prior to publication (Wang, Yucesoy, Varol, "Success in books: predicting book sales before publication"). More data needs to be collected on books' sales to conduct further research.
Books' representation in the media would also be interesting to explore. How often are people searching for books online, and how heavily are books advertised online to encourage readers to buy them? In addition, it would be interesting to see which books were turned into movies and how successful those movies were; historically, books that have had movies or TV shows based on them tend to be successful and popular among readers and the public (for example, Game of Thrones is a popular book series that turned into an even more successful TV series).
As I mentioned, I had considered exploring authors' historical popularity and literary success and how they could affect future publications by the same authors. What kinds of books contributed to an author's success? What books should authors write in order to be more successful in the future?
Lastly, another topic of interest is how time and seasonality contribute to a book's success. Are there certain times of the year when books are more successful? What kinds of books are best published at which times of the year? Are there certain times of the year when books are most profitable? Publishers and authors could use this information to decide when to publish books to maximize success and profit, but to do so we would need more historical data on books' publication times.
References:
Goodreads UCSD Bookgraph:
https://sites.google.com/eng.ucsd.edu/ucsdbookgraph/home?authuser=0
Mengting Wan, Julian McAuley, "Item Recommendation on Monotonic Behavior Chains", in RecSys'18.
Mengting Wan, Rishabh Misra, Ndapa Nakashole, Julian McAuley, "Fine-Grained Spoiler Detection from Large-Scale Review Corpora", in ACL'19.
Sareen, S. (2021). Breaking Down Goodreads Dataset using Python. Retrieved 11 May 2021, from https://towardsdatascience.com/breaking-down-goodreads-dataset-using-python-388e9b9d6352
What is the Weighted Average?. (2021). Retrieved 11 May 2021, from https://learn.robinhood.com/articles/N7yD1p14AbaYIdlXmnSlf/what-is-the-weighted-average/
Takahashi, Y. (2021). How to Combine Textual and Numerical Features for Machine Learning in Python. Retrieved 11 May 2021, from https://towardsdatascience.com/how-to-combine-textual-and-numerical-features-for-machine-learning-in-python-dc1526ca94d9
Galarnyk, M. (2021). Understanding Decision Trees for Classification in Python - KDnuggets. Retrieved 13 May 2021, from https://www.kdnuggets.com/2019/08/understanding-decision-trees-classification-python.html
Comparing Support Vector Machines and Decision Trees for Text Classification. (2021). Retrieved 13 May 2021, from https://www.codementor.io/blog/text-classification-6mmol0q8oj
Mishra, A. (2021). Metrics to Evaluate your Machine Learning Algorithm. Retrieved 13 May 2021, from https://towardsdatascience.com/metrics-to-evaluate-your-machine-learning-algorithm-f10ba6e38234
Gahukar, G. (2021). Classification Algorithms in Machine Learning…. Retrieved 13 May 2021, from https://medium.datadriveninvestor.com/classification-algorithms-in-machine-learning-85c0ab65ff4
Prabhu. (2021). Understanding Hyperparameters and its Optimisation Techniques. Retrieved 13 May 2021, from https://towardsdatascience.com/understanding-hyperparameters-and-its-optimisation-techniques-f0debba07568
Wang, X., Yucesoy, B., Varol, O. et al. Success in books: predicting book sales before publication. EPJ Data Sci. 8, 31 (2019). https://doi.org/10.1140/epjds/s13688-019-0208-6
Harvey, E. (2021). The Book Industry’s Quest for Data Intelligence. Retrieved 13 May 2021, from https://www.bookbusinessmag.com/article/the-book-industry-s-quest-data-intelligence/
How Data Influences The Publishing Industry | Ribbonfish. (2021). Retrieved 13 May 2021, from https://ribbonfish.co.uk/blog/spotlight-data-influences-book-publishing-industry/
Rowe, W. (2021). How to Create a Machine Learning Pipeline. Retrieved 13 May 2021, from https://www.bmc.com/blogs/create-machine-learning-pipeline/
Koen, S. (2021). Architecting a Machine Learning Pipeline. Retrieved 13 May 2021, from https://towardsdatascience.com/architecting-a-machine-learning-pipeline-a847f094d1c7