Book Recommendation System: Which Data Science book is the best for you?

Palak Goel
7 min read · Apr 15, 2023


A book recommendation system built from scratch using Data Analytics and Machine Learning Concepts…

Want to learn Data Science, but which book is right for you?

Book recommendation systems are now a common application of Machine Learning and Data Analytics. As Wikipedia defines it, a recommender system is a “subclass of information filtering system” that provides suggestions for the items most pertinent to a particular user. These systems come in various types, such as those that follow Collaborative Filtering, Content-Based Filtering, Hybrid Filtering etc. This article gives a brief account of how an intelligent book recommendation system can be built by combining various Data Analysis and Machine Learning techniques like Exploratory Data Analysis (EDA), Clustering, and Web Scraping.

  • Dataset:

— The dataset I used to build the system comes from Kaggle, as a .csv file. The motivation was to build a recommendation system for the Data Science genre (Statistics, ML, Python etc.). This dataset has some very useful features, like individual star ratings broken down into 1, 2, … 5-star counts, URLs to the books, etc. It can be downloaded from here.

— Details: This dataset consists of 18 features, giving information about the publisher, author, star ratings, length of the book, price, language/medium, and other details like the ISBN number. It consists of 948 entries.
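As a quick sanity check, the data can be loaded and inspected with Pandas; the filename here is a placeholder assumption, and the expected shape follows from the details above.

import pandas as pd

#loading the dataset ('books.csv' is a placeholder filename)
df = pd.read_csv('books.csv')
print(df.shape) #expect (948, 18) as per the details above
df.head()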

An Excel View of the Dataset
  • About the project:

This project was built using Python in Jupyter Notebook.

— Use of libraries: An extensive set of libraries was used in the making of this project: popular ones like Pandas, NumPy, and Matplotlib, plus more powerful ones like scikit-learn (for TfidfVectorizer and K-Means) and PyTorch (for the BERT summarizer).

— Division of Project: There were 4 levels of implementation in this project, given as follows:

  1. Exploratory Data Analysis and Visualization using Plotly:
  • The goal was to answer simpler questions that can help inform the user’s choice, like “Do lengthier books have better reviews?” or “Does a pricier book have better reviews?”. A comprehensive set of reviews is obviously the ultimate factor in picking a book, so this EDA tested for differences in reviews across different contexts (a sketch of one such plot follows the figure below). However, the system needed more in order to fine-tune the search results for users with varied needs and interests.
A. Do lengthier books have better reviews? B. Do pricier books have better reviews?
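As an illustration, here is a minimal sketch of how one of these questions could be plotted with Plotly Express, using the df loaded earlier; the column names 'pages', 'avg_reviews', and 'title' are assumptions and may differ in the actual dataset.

import plotly.express as px

#do lengthier books have better reviews?
fig = px.scatter(df, x='pages', y='avg_reviews', hover_name='title',
                 title='Book length vs. average review score')
fig.show()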

2. Unsupervised Learning — Clustering using K-Means Algorithm

  • In simple words, Clustering is the process of dividing data points into distinct groups such that the points within a group are similar to each other and dissimilar to the points of other groups. There are various kinds of clustering algorithms, like K-Means and DBSCAN; the use of K-Means is demonstrated in this project.
  • K-Means is an iterative algorithm that works on the distance between data points (Euclidean, Manhattan, Minkowski distance etc.). This distance is measured between points called centroids and the rest of the data points. In every iteration, each centroid is moved to the mean of the points currently assigned to it. This goes on until the centroids settle close enough within the data groups and, after a point, their positions stop changing.
  • In this project, the book titles were clustered into common themes with the help of Text Vectorization, which allocates a numerical representation to every piece of text. There are various vectorization models, such as frequency-based models that use BOW (Bag of Words), TF-IDF, and N-grams, and statistical models that use Markov and Hidden Markov techniques; more advanced models also take word order and context into account.
  • Text Vectorization in this case has been implemented using the tf-idf vectorizer.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

#instantiating the tf-idf vectorizer with unigrams and bigrams
vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))

#fitting and transforming the vectorizer on the book titles
X = vectorizer.fit_transform(df['title'])

#get_feature_names() was removed in newer scikit-learn; use get_feature_names_out()
vec_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
vec_df
  • After this process, one can proceed to find the optimal value of K (for instance, with the elbow method) and thus implement the K-Means Algorithm to allot clusters to the different titles; a sketch follows the figure below. Once all the titles have been clustered, these clusters can be visualized too.
  • I have used circlify and wordcloud to show these clusters.
From left to right: Circle Packing and WordCloud in different clusters
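Here is a minimal sketch of what this stage might look like, reusing the TF-IDF matrix X from above; the K range, the chosen K of 8, and the word-cloud step are illustrative assumptions rather than the exact settings used in the project.

from sklearn.cluster import KMeans
from wordcloud import WordCloud
import matplotlib.pyplot as plt

#elbow method: plot the inertia for a range of K values
inertias = []
k_values = range(2, 15)
for k in k_values:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X)
    inertias.append(km.inertia_)
plt.plot(k_values, inertias, marker='o')
plt.xlabel('K')
plt.ylabel('Inertia')
plt.show()

#fit the chosen K and attach cluster labels to the titles (8 is illustrative)
kmeans = KMeans(n_clusters=8, random_state=42, n_init=10)
df['cluster'] = kmeans.fit_predict(X)

#word cloud of the titles in one cluster
text = ' '.join(df.loc[df['cluster'] == 0, 'title'])
wc = WordCloud(background_color='white').generate(text)
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()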

3. Web Scraping Amazon for reviews

This turned out to be the most challenging part of the project. In order to summarize written reviews, it was necessary to scrape the Amazon reviews for every book in the dataset. The steps I followed were:

  1. Refer to the respective URL links in the dataset for the product
  2. Derive the respective URL links for the reviews from them
  3. Formulate a loop that iterates through all the URLs
  4. Create review URLs for each book in the dataset by adding a new column
  5. Scrape the first review page and put everything into the dataset
  6. By this level, the dataset is probably a good fit for sentiment analysis, but the aim here is summarization :(
  7. More about summarization after this point!

In order to do this, it was critical to understand which URL could be used, so here is the clear distinction between the two URL formats:

#example URLs
product = "https://www.amazon.com/Becoming-Data-Head-Understand-Statistics/dp/1119741742/"
reviews = "https://www.amazon.com/Becoming-Data-Head-Understand-Statistics/product-reviews/1119741742/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews"

#the 'product-reviews' path segment is what makes the review URL different here

Post this point, a splitting criterion was applied to derive the review URL from the rest of the product URL (a sketch follows below). For those products where no review URL could be formed, a null value was returned. A column called “review_urls” can then be added to the existing dataset, which sets forward the path to summarization!
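A minimal sketch of that conversion, assuming the product URLs live in a column named 'url' (both the column name and the regular expression are assumptions):

import re
import numpy as np

def make_review_url(product_url):
    #product URLs look like .../<slug>/dp/<ISBN>/
    match = re.search(r"amazon\.com/([^/]+)/dp/([A-Za-z0-9]{10})", str(product_url))
    if not match:
        return np.nan #no recognizable product URL, so return a null value
    slug, isbn = match.groups()
    return ("https://www.amazon.com/" + slug + "/product-reviews/" + isbn +
            "/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews")

#adding the new column to the existing dataset
df['review_urls'] = df['url'].apply(make_review_url)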

  • There was a need to automate this scraping process for all the books by defining functions, since doing it one URL at a time is tedious. After digging around on GitHub, the scraper from Jeff James and other documentation helped! This scraper converted all its results into a Pandas data frame.
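For a rough idea of its shape, here is a sketch of scraping a single review page with requests and BeautifulSoup; the header and the data-hook selector are assumptions, and Amazon's markup changes often, so treat this purely as an illustration.

import requests
import pandas as pd
from bs4 import BeautifulSoup

HEADERS = {'User-Agent': 'Mozilla/5.0'} #Amazon tends to reject the default user agent

def scrape_review_page(review_url):
    response = requests.get(review_url, headers=HEADERS)
    soup = BeautifulSoup(response.text, 'html.parser')
    #review bodies have historically lived in spans with data-hook="review-body"
    texts = [span.get_text(strip=True)
             for span in soup.select('span[data-hook="review-body"]')]
    return pd.DataFrame({'review_url': [review_url] * len(texts),
                         'review_text': texts})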

4. Automatic Summarization

— According to Wikipedia, automatic summarization is the process of shortening a set of data computationally to create a subset that represents the most important or relevant information within the original content.

— There are 2 kinds I have learnt of in this context: Extractive and Abstractive. What is the difference?

  • Extractive, as the name suggests, extracts the most significant sentences of the text and builds its summary from them. It doesn’t change the sentences themselves when summarizing.
  • Abstractive, on the other hand, is comparatively more advanced, since it can interpret the context and “power-phrase” the text in newer ways. So the summary is generated by the model, not extracted directly from the original data.

However, for this project, I proceeded with extractive summarization, as this technique would be sufficient for the review data.

  • Now came the final step: introducing the BERT Summarizer, which builds on BERT, a transformer-based Language Model trained by Google and released in 2018.
  • PyTorch must be installed in the virtual environment in which this project is built.
!pip install bert-extractive-summarizer torch
import torch

#instantiating the BERT summarizer (the 'summarizer' package is bert-extractive-summarizer)
from summarizer import Summarizer
bert_model = Summarizer()

#summarizing the reviews of the third book (index 2)
bert_summary = ''.join(bert_model(book_reviews_agg.review_text[2], ratio=0.2))
  • The line book_reviews_agg.review_text[2] selects the third review text from the modified dataset (the one that contains the reviews and review_urls). The ratio of 0.2 means the summary will be up to roughly 20% of the original review’s length, and the join call merges the resulting summary sentences into a single string.
  • Upon running the review_text of a book of choice through the summarizer, a minimized, summarized version of its reviews is thus obtained!

Thus, with the use of multiple procedures, a book recommendation system that best suits the needs of a user is ready. Some of the references that made this project possible are given below.

  • References:

— Leveraging BERT for Extractive Text Summarization on Lectures by Derek Miller

— BERT: State-of-the-Art Summarizing Pre-training for Natural Language Processing

— Graph Visualizations and Hover Labels using Plotly

— Graph Visualizations and Hover Labels using WordCloud

— Graph Visualizations and Hover Labels using Circlify

Feel free to visit the links above, and also see the construction of this project on my GitHub in this Git Repo!

You can connect with me on LinkedIn too

Palak Goel | Woxsen University | Student of Engineering in Data Science and Artificial Intelligence


Palak Goel

A Data Scientist in the making, with avid interest in statistics, writing, social commentary and finance.