Stage 1 YouTube Video: https://youtu.be/CeDbHbOpkLI
Stage 2 YouTube Video: https://youtu.be/_9AgIxoxcTo
Stage 3 YouTube Video: https://youtu.be/Rb-Z5kSlBFk
Full Walkthrough of Project: https://youtu.be/K3yZMmrFM4E
Classify books into genres based on their book covers using CNNs and alternatives to CNNs
Classify books into genres based on their book summaries using RNNs and alternatives to RNNs
Compare the models to one another
There has been prior research on predicting a book's genre from both its summary and its cover: RNNs for summary text and CNNs for cover images.
Ernest Ng [5] used Keras LSTM-based RNNs on data from Goodreads.com to predict book genres from their descriptions. Viridiana Romero Martinez [4] first predicted movie genres from poster images and then used CNNs in PyTorch to predict book genres from their covers.
Where my project differs is that there hasn't been research directly comparing the two approaches. I'll be comparing RNNs and CNNs to determine whether summaries or covers are better at predicting a book's genre.
I expect the model that uses the book summary to outperform the model that uses the book cover. Cover art varies from artist to artist, while I believe the summary gives readers more direct clues about a book's genre.
CMU Book Summary Dataset
16,559 Books
Columns: Wikipedia ID, Freebase ID, Book Title, Book Author, Publication Date, Genres
GitHub Dataset
204,906 Unique Books
Amazon ID, Filename, Image URL, Book Title, Book Author, Category ID, Category
The main environment I used for my project is Google Colab. At the beginning of my project I tried to do everything through the hosted runtime and grabbed all of my files through Google Drive. The main problems I had with the hosted runtime were: needing to be proactive to make sure the session didn't time out, random session disconnects, and the amount of time it took to run the code.
Towards the end of my project, when I needed to run code that would take several hours to complete, I decided to do everything on a local Jupyter runtime. Although I had to change how I was pulling in files, in the end this was much better. I could leave my code running overnight without keeping the session active, I did not have to worry about the session disconnecting, and my code ran much faster. This allowed me to run my models over more epochs than I was originally able to.
What do my genres (summaries) and categories (book covers) look like?
I needed to condense my genres and categories so that both datasets had the same number of classes, which makes it easier to compare accuracies between the two models. For genres, I mapped 227 genres down to 7. For categories, I mapped 22 categories down to 7. Below are the mappings that I used, along with the counts for each genre/category. This is an area that could use some improvement: I was doing my best both to place each genre/category into the best-fitting group and to get an even distribution across all of the groups.
Specifically for the book summaries, each book had multiple genres attached to it. I split each book into multiple rows so that each new row had only one genre attached to it. I tried two different data mapping options for the book summary data since I was not happy with the results I was getting from my RNN/GRU. Each book cover only had one category attached to it, so I did not need to do this for the book cover data.
Option 1: I removed any genre that had fewer than 500 books. This left me with 13 genres, which I then mapped to 7 genres.
Option 2: I exported my data frame and did all of the mapping in Excel. I mapped all 227 genres down to 7 genres, which let me keep a little more data than Option 1.
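The row-splitting and mapping steps above can be sketched with pandas; the column names and the tiny mapping dictionary here are hypothetical stand-ins for my notebook's actual ones.

```python
import pandas as pd

# Toy rows: each book carries a list of genres (hypothetical column names)
df = pd.DataFrame({
    "title": ["Book A", "Book B"],
    "genres": [["Fantasy", "High fantasy"], ["Mystery"]],
})

# One genre per row, as in the summary preprocessing
df = df.explode("genres", ignore_index=True)

# Collapse fine-grained genres into the 7 final groups (tiny sample mapping)
genre_map = {"Fantasy": "Fantasy", "High fantasy": "Fantasy", "Mystery": "Mystery"}
df["genre_group"] = df["genres"].map(genre_map)

print(df["genre_group"].value_counts())
```

Note that after `explode`, Book A contributes two rows that share the same summary text, which is exactly the duplication discussed later in the conclusions.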
Summary Genre Mapping - Count
Option 1
Fiction - 4,747: Fiction
Other - 4,354: Children's literature, Young adult literature, Historical novel, Crime Fiction
Speculative fiction - 4,314: Speculative fiction
Mystery - 3,240: Mystery, Suspense, Thriller, Horror
Science Fiction - 2,870: Science Fiction
Novel - 2,463: Novel
Fantasy - 2,413: Fantasy
Option 2
Children's - 2,123: Boys' school stories, Children's literature, English public-school stories
Fantasy - 6,970: Bangsian fantasy, Cabal, Comic fantasy, Contemporary fantasy, Dark fantasy, Fable, Fairy tale, Fairytale fantasy, Fantastique, Fantasy, Fantasy of manners, Heroic fantasy, High fantasy, Historical fantasy, Juvenile fantasy, Low fantasy, Magic realism, Science fantasy, Speculative fiction, Superhero fiction, Urban fantasy, Vampire fiction, Zombie, Zombies in popular culture
Literary Fiction - 6,388: Absurdist fiction, Adventure, Adventure novel, American Gothic Fiction, Anti-nuclear, Anti-war, Bildungsroman, Bit Lit, Campus novel, Catastrophic literature, Children's literature, Collage, Coming of age, Conspiracy, Conspiracy fiction, Cozy, Ergodic literature, Experimental literature, Fiction, Fictional crossover, First-person narrative, Gay novel, Gay Themed, Gothic fiction, Industrial novel, Inspirational, Künstlerroman, LGBT literature, Light novel, Literary fiction, Literary realism, Literary theory, Mashup, Modernism, Naval adventure, New Weird, Parallel novel, Reference, Robinsonade, Roman à clef, School story, Sea story, Social criticism, Social novel, Transgender and transsexual fiction, Urban fiction, Western fiction, Wuxia, Young adult literature, Youth
Mystery - 4,720: Albino bias, Crime Fiction, Detective fiction, Drama, Ghost story, Hardboiled, Historical whodunnit, Horror, Locked room mystery, Mystery, Psychological novel, Serial, Social commentary, Spy fiction, Supernatural, Suspense, Techno-thriller, Thriller, Whodunit
Non-Fiction - 2,419: Alternate history, Anthropology, Autobiographical comics, Autobiographical novel, Autobiography, Biographical novel, Biography, Biopunk, Business, Computer Science, Cookbook, Creative nonfiction, Economics, Education, Field guide, Foreign legion, Future history, Historical fiction, Historical novel, History, Literary criticism, Marketing, Mathematics, Memoir, Military history, Nature, Neuroscience, Non-fiction, Non-fiction novel, Pastiche, Philosophy, Photography, Police procedural, Political philosophy, Politics, Popular culture, Popular science, Post-holocaust, Psychology, Religion, Religious text, Science, Self-help, Social sciences, Sociology, Spirituality, Sports, Transhumanism, Travel, Travel literature, True crime, War novel, Western
Other - 4,003: Anthology, Black comedy, Chick lit, Chivalric romance, Colonial United States romance, Comedy, Comedy of manners, Comic book, Comic novel, Comics, Edisonade, Elizabethan romance, Encyclopedia, Epistolary novel, Erotica, Essay, Farce, Gamebook, Georgian romance, Graphic novel, Historical romance, Humour, Indian chick lit, Medieval romance, Morality play, Music, New York Times Best Seller list, Novel, Novella, Paranormal romance, Parody, Personal journal, Picaresque novel, Picture book, Planetary romance, Play, Poetry, Polemic, Pornography, Prose, Prose poetry, Regency romance, Role-playing game, Romance novel, Romantic comedy, Satire, Scientific romance, Short story, Tragicomedy, Treatise
Science Fiction - 3,381: Alien invasion, Apocalyptic and post-apocalyptic fiction, Comic science fiction, Cyberpunk, Dying Earth subgenre, Dystopia, Epic Science Fiction and Fantasy, Existentialism, Feminist science fiction, Hard science fiction, Human extinction, Invasion literature, Lost World, Metaphysics, Military science fiction, Postcyberpunk, Postmodernism, Science Fiction, Social science fiction, Soft science fiction, Space opera, Space western, Steampunk, Subterranean fiction, Sword and planet, Sword and sorcery, Time travel, Utopian and dystopian fiction, Utopian fiction
Book Cover Category Mapping - Count
Help Books - 34,992: Arts & Photographs, Cookbooks & Food & Wine, Crafts & Hobbies & Home, Education & Teaching, Parenting & Relationships, Self-Help, Test Preparation
Literary Fiction - 26,497: Gay & Lesbian, Literature & Fiction, Mystery & Thriller & Suspense, Romance, Science Fiction & Fantasy, Teen & Young Adult
Non-Fiction - 35,017: Biographies & Memoirs, Business & Money, History, Law, Politics & Social Sciences, Reference
Other - 26,163: Calendars, Children's Books, Comics & Graphic Novels, Humor & Entertainment
Outdoorsy - 36,192: Health & Fitness & Dieting, Sports & Outdoors, Travel
Religious - 16,698: Christian Books & Bibles, Religion & Spirituality
Technical - 32,013: Computers & Technology, Engineering & Transportation, Medical Books, Science & Math
Book Summaries
Removed null values
Converted summaries to lowercase
Removed stop words using the English stopwords from NLTK
Removed punctuation
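The summary cleaning steps above can be sketched in plain Python; a tiny inline stopword set stands in for NLTK's full English list so the example stays self-contained.

```python
import string

# Stand-in for NLTK's English stopword list (assumption: tiny sample only)
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}

def clean_summary(text: str) -> str:
    text = text.lower()                                               # lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # strip punctuation
    words = [w for w in text.split() if w not in STOPWORDS]           # drop stop words
    return " ".join(words)

print(clean_summary("The Detective, a story of Mystery and suspense."))
# → detective story mystery suspense
```

In the real pipeline this function would be applied to the summary column of the data frame, e.g. with `df["summary"].apply(clean_summary)`.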
Book Covers
Removed null values
Took a random sample of 5,000 images from each category to use as the final dataset
Resized all of the images to one size
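The per-category sampling can be sketched with pandas; the column names and counts here are toy stand-ins (20 per category instead of 5,000), and the actual image resizing would use an image library such as PIL, which I omit here.

```python
import pandas as pd

# Hypothetical cover metadata: one row per image
covers = pd.DataFrame({
    "filename": [f"img_{i}.jpg" for i in range(100)],
    "category": ["Technical"] * 60 + ["Religious"] * 40,
})

N_PER_CATEGORY = 20  # stands in for the 5,000 used in the project

# Draw the same number of images from every category (seeded for repeatability)
sample = covers.groupby("category").sample(n=N_PER_CATEGORY, random_state=0)

print(sample["category"].value_counts())
```

Sampling an equal number per category is what gives the cover dataset its even class distribution, which matters for the comparison later.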
Option 1
Common words in each genre:
Fiction: time, life, new, father, family
Fantasy: find, time, world, king, life, city
Speculative Fiction: time, new, back, find, world, life
Mystery: man, death, find, time, police, house
Novel: life, time, story, novel, new, family
Other: father, time, back, find, mother, new
Science Fiction: time, world, planet, earth, ship, human
For two of the genres, Mystery and Science Fiction, you can see a clear distinction in the words that define them. I believe this is because the topics of those genres are very focused, compared to genres like Literary Fiction that are more vague and cover many different topics.
There are also some words common to all of the genres: 'one', 'time', 'two', and 'also'.
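The per-genre common-word counts above can be reproduced with collections.Counter over the cleaned summaries; the toy summaries below are made up for illustration.

```python
from collections import Counter

# Toy cleaned summaries grouped by genre (hypothetical data)
summaries_by_genre = {
    "Mystery": ["police find death house", "man death police time"],
    "Science Fiction": ["ship earth planet time", "world ship human earth"],
}

tops = {}
for genre, docs in summaries_by_genre.items():
    counts = Counter(word for doc in docs for word in doc.split())
    tops[genre] = [w for w, _ in counts.most_common(3)]
    print(genre, tops[genre])
```

The same counts can also feed a word cloud library directly, since most take a frequency dictionary.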
Option 2
Common words in each genre:
Fantasy: world, new, back, find, life
Literary Fiction: family, father, back, life, new
Mystery: finds, house, man, death, police
Other: life, story, time, novel, first
Science Fiction: time, world, life, earth, ship, new
Non-Fiction: time, life, family, two, book
Children's: two, back, find, home, house
Option 1
Simple RNN
I ran my models over 10 epochs. This is because of how long it took for each model to run. If I was able to run the model for longer, then there is a chance that the results would improve.
If you were to randomly guess what genre a book would be, you would have a 1 in 7 chance (~14%). The below model has a validation accuracy of ~19%. This is only marginally better than random selection, which indicates the model did not learn particularly well, if at all.
I tried different variations of the simple RNN to see if I could improve the results; however, each variation produced roughly the same results.
Simple GRU
What is a GRU?: A GRU is a "Gated Recurrent Unit" that "aims to solve the vanishing gradient problem which comes with a standard recurrent neural network". [3] The GRU uses two gates that "can be trained to keep information from long ago, without washing it through time or remove information which is irrelevant to the prediction." [3]
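A single GRU step can be sketched in NumPy to show the two gates at work. The weights here are random stand-ins, biases are omitted, and I follow the standard update/reset-gate formulation (Cho et al.), which may differ in small conventions from the library cell I actually trained.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 4, 3

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Random stand-in weights for one GRU cell
Wz, Uz = rng.normal(size=(n_hid, n_in)), rng.normal(size=(n_hid, n_hid))
Wr, Ur = rng.normal(size=(n_hid, n_in)), rng.normal(size=(n_hid, n_hid))
Wh, Uh = rng.normal(size=(n_hid, n_in)), rng.normal(size=(n_hid, n_hid))

def gru_step(x, h):
    z = sigmoid(Wz @ x + Uz @ h)              # update gate: how much new info to take in
    r = sigmoid(Wr @ x + Ur @ h)              # reset gate: how much old state to forget
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))  # candidate hidden state
    return (1 - z) * h + z * h_tilde          # blend old state with the candidate

h = np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):          # run a length-5 sequence through the cell
    h = gru_step(x, h)
print(h.shape)
```

The blend in the last line is what lets information survive many steps: when z stays near 0, the old state passes through almost unchanged instead of being "washed through time".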
For the GRU, I increased the vocab size by 1 so that I could add a padding value. By determining what the longest sequence is and padding to the right of each item, I was able to make all of the items the same length. This allowed me to stack them all into a single tensor.
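The padding step can be sketched with NumPy; the token ids below are made up, and 0 is the reserved padding value that motivated increasing the vocab size by 1.

```python
import numpy as np

# Toy tokenized summaries of different lengths (ids 1..vocab_size;
# 0 is reserved as the padding value, hence vocab_size + 1 embedding rows)
sequences = [[5, 12, 7], [3, 9], [8, 1, 4, 2, 6]]

max_len = max(len(s) for s in sequences)      # the longest sequence sets the width
padded = np.stack([np.pad(s, (0, max_len - len(s))) for s in sequences])

print(padded.shape)  # every row is now the same length
```

Once all rows share one length, the whole batch stacks into a single tensor that an embedding layer can consume.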
The results were around 30% at the start but progressively got worse. This indicates that this model did not learn particularly well either; it simply beat random selection at the beginning.
Option 2
I ran my models over 10 epochs using the same RNN and GRU architectures as in Option 1. My results did not improve enough for me to consider the models successful; although they performed slightly better than in Option 1, the models still did not learn meaningfully.
Simple RNN
Simple GRU
I ran my models over 20 epochs. I created a simple CNN, and below are its results. Since I'm not very familiar with CNNs, I tried running the model two different ways.
Simple CNN - Option 1
For this CNN, the filter counts went 16-32-64-1024. Once again, if you were to randomly guess what genre a book would be, you would have a 1 in 7 chance (~14%). The validation accuracy for this model was 40%, which is significantly better than guessing.
Simple CNN - Option 2
For this CNN, the filter counts went 1024-64-32-16. The validation accuracy for this model was 26%: still better than guessing, but worse than the option above.
Option 1 performed much better than Option 2, not only in accuracy but also in how long each epoch took to run: about 23 seconds per epoch for Option 1 versus about 7 minutes per epoch for Option 2.
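A rough multiply count explains the speed gap: putting the 1024 filters first means the network's widest layer runs at full image resolution. The settings below (128x128 RGB input, 3x3 kernels, 2x2 pooling after every conv) are assumptions for illustration, not the exact configuration I used.

```python
# Rough multiply counts for the two filter orderings (assumed setup:
# 128x128 RGB input, 3x3 kernels, 2x2 pooling after each conv layer)
def conv_cost(widths, size=128, in_ch=3):
    total = 0
    for out_ch in widths:
        total += size * size * out_ch * in_ch * 3 * 3  # multiplies in one conv layer
        in_ch, size = out_ch, size // 2                # pooling halves the resolution
    return total

option1 = conv_cost([16, 32, 64, 1024])   # narrow layers see the big feature maps
option2 = conv_cost([1024, 64, 32, 16])   # the 1024-filter layer runs at full res

print(f"option 2 costs {option2 / option1:.1f}x more multiplies")
```

Under these assumed settings the wide-first ordering costs roughly an order of magnitude more multiplies, which lines up with the 23-second versus 7-minute epochs.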
Overall, the first CNN model outperformed all of the other models. There are a few reasons why I believe the CNN performed the best.
I was able to run the CNNs over more epochs. This was beneficial because the CNNs continued to learn over time, whereas the RNNs stopped learning and began to overfit.
There was also slightly more data for the book covers than for the book summaries. However, the more important difference is that the book covers had an even distribution across all of the genres: each genre had 5,000 book covers, split between training and testing data. For the book summary data, there was a large discrepancy between the genre with the largest number of books and the genre with the smallest. Without an even distribution across the genres, the Option 1 model could pick up on the fact that a book is almost twice as likely to be Fiction (4,747 books) as Fantasy (2,413 books) without even looking at the summary.
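This imbalance effect can be quantified directly from the Option 1 counts listed above: a model that always predicts the most common genre already beats the 1-in-7 random baseline.

```python
# Option 1 genre counts from the mapping above
counts = {
    "Fiction": 4747, "Other": 4354, "Speculative fiction": 4314,
    "Mystery": 3240, "Science Fiction": 2870, "Novel": 2463, "Fantasy": 2413,
}

total = sum(counts.values())
majority_baseline = max(counts.values()) / total

print(f"always guessing Fiction scores {majority_baseline:.1%}")  # vs ~14% for 1-in-7
```

That ~19.5% prior is close to the ~19% validation accuracy the simple RNN reached, which is consistent with a model leaning on class frequencies rather than the text itself.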
Another thing to note is that in the word clouds for the book summary data, all of the genres shared common words: 'one', 'time', 'two', and 'also'. These were among the top words for every single genre, which shows that the most frequent summary words don't distinguish the genres well. Only further down the word lists do you start to see more of a distinction. This could be another reason why the RNN is not performing well.
There was also a lot more data cleaning that needed to be done on the book summary data compared to the book cover data.
To start, each book summary had multiple genres attached to it, while each book cover had only one category. Each book summary had to be split into multiple rows, with each new row having only one genre. As a result, several rows share the same summary, so the model sees the same summary classified as multiple different genres.
For the mappings, the book summary data started with 227 genres and the book cover data started with 22 categories; both were mapped down to 7. Far less information was lost in the mapping for the book covers, because mapping 22 categories down to 7 discards much less detail than mapping 227 genres down to 7.
After doing more research into why the book covers outperformed the book summaries, I learned about the importance of color from Kelli Horan. Book publishers put thought into "what will make covers stand out to potential readers". [2] What does each color communicate, and which book genres is it most used for? [2]
Red: excitement, passion, fear, and aggression; thriller and horror movies
Blue: trust and mental engagement; political non-fiction and thought-provoking novels
Yellow: optimism, cheerfulness, and joy; will be used for horror and thrillers to catch readers off guard
Green: evokes a feeling of balance and growth; fantasy or supernatural books
Black: mystery, sophistication, and death; common color for books because it is neutral
White: purity, innocence, and simplicity; minimalist book covers and will often rely on small graphic clues
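If color really does carry genre signal, one crude way a model could exploit it is through a cover's average color. This sketch is purely illustrative: a synthetic NumPy array stands in for a real cover image.

```python
import numpy as np

# Synthetic 64x64 "cover": random dark pixels with a boosted red channel
# (stands in for a real image array loaded from disk)
rng = np.random.default_rng(1)
cover = rng.integers(0, 80, size=(64, 64, 3))
cover[..., 0] += 150                      # make red the dominant channel

mean_rgb = cover.reshape(-1, 3).mean(axis=0)   # average color over all pixels
dominant = ["red", "green", "blue"][int(mean_rgb.argmax())]
print(dominant)
```

A CNN learns far richer features than a mean color, but this kind of simple statistic hints at why covers carry usable genre information at all.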
Things that can be improved:
Collect all of the data from one source: I used two completely different sources. Although I trust where I got my data, there could be discrepancies in how the data was collected.
Collect enough data so that there is an even distribution across the genres. This will make sure the model isn't picking up on anything that doesn't have to do with the summaries.
Improved genre mapping:
Either collect enough data where I don't need to perform a genre mapping, or,
Improve how the genre mapping is done.
Make sure there is only one genre per book. A given book summary should appear under only one genre, not multiple ones.
Improve and try different variations of the neural networks. There are many things to test with neural networks and this is an area that could be explored more.
Try other alternatives to RNNs and CNNs. For RNNs I could try average embedding over time, and for CNNs I could try graph neural networks.
[1] Haden, Jeff. “This Study of 160,000 People Reveals the Bigger the Home Library, the Smarter Kids Will Be as Adults.” Inc.com, Inc., 13 Aug. 2020, www.inc.com/jeff-haden/new-research-reveals-power-of-a-large-home-library-even-if-you-dont-read-every-book.html.
[2] Horan, Kelli. “Color Psychology And Book Covers – Why You Choose Certain Books Off The Shelf.” AmReading, 22 Aug. 2016, www.amreading.com/2016/08/21/color-psychology-and-book-covers-why-you-choose-certain-books-off-the-shelf/.
[3] Kostadinov, Simeon. “Understanding GRU Networks.” Medium, Towards Data Science, 10 Nov. 2019, towardsdatascience.com/understanding-gru-networks-2ef37df6c9be.
[4] Martinez, Viridiana Romero. “Does the Cover Tells You Something about a Book?” Medium, DataDrivenInvestor, 24 Feb. 2019, medium.datadriveninvestor.com/does-the-cover-tells-you-something-about-a-book-b6cb8d710d11.
[5] Ng, Ernest. “Keras, Tell Me the Genre of My Book.” Medium, Towards Data Science, 27 May 2020, towardsdatascience.com/keras-tell-me-the-genre-of-my-book-a417d213e3a1.
All of the references that I used in my Jupyter Notebook code are specifically noted in the cell that I ran the code. I am also including all of them here. Note, some of the code is directly from Edward Raff's Modern Practical Deep Learning class that I took in Spring 2020 at UMBC.
Amaratunga, Thimira. “How to Graph Model Training History in Keras.” Codes of Interest | Deep Learning Made Fun, www.codesofinterest.com/2017/03/graph-model-training-history-keras.html.
Asiri, Sidath. “Building a Convolutional Neural Network for Image Classification with Tensorflow.” Towards Data Science, 22 Apr. 2021, towardsdatascience.com/building-a-convolutional-neural-network-for-image-classification-with-tensorflow-f1f2f56bd83b.
“Python Remove Stop Words from Pandas Dataframe.” Stack Overflow, stackoverflow.com/questions/29523254/python-remove-stop-words-from-pandas-dataframe.
“Counting the Frequencies in a List Using Dictionary in Python.” GeeksforGeeks, 18 May 2020, www.geeksforgeeks.org/counting-the-frequencies-in-a-list-using-dictionary-in-python/.
“Remove Punctuations in Pandas.” Stack Overflow, stackoverflow.com/questions/39782418/remove-punctuations-in-pandas.
“How to Create a Vocabulary for NLP Tasks in Python.” KDnuggets, www.kdnuggets.com/2019/11/create-vocabulary-nlp-tasks-python.html.
Iamhungundji. “Book-Summary-Genre-Prediction.” Kaggle, www.kaggle.com/iamhungundji/book-summary-genre-prediction/notebook.
“Split Image Dataset into Train-Test Datasets.” Stack Overflow, stackoverflow.com/questions/57394135/split-image-dataset-into-train-test-datasets.
“Export Pandas to Dictionary by Combining Multiple Row Values.” Data Science Stack Exchange, datascience.stackexchange.com/questions/32328/export-pandas-to-dictionary-by-combining-multiple-row-values.
Uchidalab. “Uchidalab/Book-Dataset.” GitHub, github.com/uchidalab/book-dataset/tree/master/scripts.