Building a Recommendation System? A Beginner's Guide

Implementing Collaborative Filtering to Recommend Anime

Hargurjeet
6 min read · Jul 3, 2021
Photo by Yilin Liu on Unsplash

Building recommender systems today requires specialized expertise in analytics, machine learning and software engineering, and learning new skills and tools is difficult and time-consuming. In this notebook, we start from scratch and cover some fundamental techniques and their implementation in Python. I build the recommendation system using the collaborative filtering technique, which helps users find content they are likely to enjoy.

Before we start building the recommendation system, we need to understand the following concepts, which are used throughout:

Collaborative Filtering

Photo by [wiki](https://unsplash.com/@eviradauscher?utm_source=medium&utm_medium=referral)

Collaborative filtering (CF) is a technique used by recommender systems that seeks to predict the “rating” or “preference” a user would give to an item. It is commonly implemented in one of the following two ways.

User Based Collaborative Filtering

  1. Look for users who share the same rating patterns with the active user (the user whom the prediction is for).
  2. Use the ratings from those like-minded users found in step 1 to calculate a prediction for the active user.

This falls under the category of user-based collaborative filtering. A specific application of this is the user-based Nearest Neighbor algorithm.
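
The two steps above can be sketched in a few lines of NumPy. This is only an illustration on a made-up ratings matrix: the data, the helper names `cosine` and `predict_user_based`, and the choice of a similarity-weighted average are all assumptions, not something taken from the original notebook.

```python
import numpy as np

# Toy ratings matrix (hypothetical data): rows are users, columns are items,
# and 0 means "not rated".
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],   # active user, has not rated item 2
    [5.0, 5.0, 4.0, 1.0],   # similar taste to the active user
    [1.0, 2.0, 1.0, 5.0],   # rather different taste
])


def cosine(u, v):
    """Cosine similarity between two rating vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def predict_user_based(ratings, active, item, k=2):
    """Predict a rating as the similarity-weighted mean of the k most similar raters."""
    # Step 1: score every other user who rated the item by similarity to the active user.
    candidates = [
        (cosine(ratings[active], ratings[u]), u)
        for u in range(len(ratings))
        if u != active and ratings[u, item] > 0
    ]
    neighbours = sorted(candidates, reverse=True)[:k]
    # Step 2: weight each neighbour's rating by its similarity.
    weights = np.array([sim for sim, _ in neighbours])
    values = np.array([ratings[u, item] for _, u in neighbours])
    return float(weights @ values / weights.sum())


prediction = predict_user_based(ratings, active=0, item=2)
print(round(prediction, 2))
```

The predicted value lands closer to the rating given by the like-minded user than to the one given by the dissimilar user, which is the intended effect of the weighting. The sketch does not handle the case where no other user has rated the item.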

Item Based Collaborative Filtering

Item-item collaborative filtering is a type of recommendation system based on the similarity between items, calculated from the ratings users have given to those items. It helps with issues that user-based collaborative filters suffer from, such as a system with many items where each user has rated only a few of them.

We build the recommendation system in this article using this item-based technique: the similarity search later on runs over anime titles, not over users.

Cosine similarity

Cosine similarity is a metric used to measure how similar two documents are irrespective of their size. Mathematically, it measures the cosine of the angle between two vectors projected in a multi-dimensional space. It is advantageous because even if two similar documents are far apart by Euclidean distance (due to their size), they may still be oriented closely together. The smaller the angle, the higher the cosine similarity: cos 0° = 1, cos 45° ≈ 0.7071 and cos 90° = 0.
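
As a small illustration of the point about size, the sketch below compares a short vector with a ten-times-longer vector pointing in the same direction. The function name `cosine_similarity` and the example vectors are made up for this sketch.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


short_doc = np.array([1.0, 2.0, 0.0])
long_doc = 10 * short_doc            # same direction, ten times the magnitude

# Far apart by Euclidean distance, yet the angle between them is zero.
euclidean = float(np.linalg.norm(short_doc - long_doc))
same_direction = cosine_similarity(short_doc, long_doc)

perpendicular = cosine_similarity([1, 0], [0, 1])   # 90 degrees apart
diagonal = cosine_similarity([1, 0], [1, 1])        # 45 degrees apart

print(euclidean, same_direction, perpendicular, diagonal)
```

The three similarities come out at approximately 1, 0 and 0.7071, matching the angles listed above, while the Euclidean distance between the first pair is large.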

Table Of Contents

  1. About the Dataset
  2. Loading and Preprocessing dataset
  3. Exploratory Data Analysis
  4. Data Cleaning and Identifying Relevant Data
  5. Implementing Collaborative filtering
  6. Making Anime Recommendations
  7. Summary
  8. Future Work
  9. References

№1: About the Dataset

This data set contains information on user preference data from 73,516 users on 12,294 anime. Each user is able to add anime to their completed list and give it a rating and this data set is a compilation of those ratings.

Content
Anime.csv
- anime_id — myanimelist.net’s unique id identifying an anime.
- name — full name of anime.
- genre — comma separated list of genres for this anime.
- type — movie, TV, OVA, etc.
- episodes — how many episodes in this show. (1 if movie).
- rating — average rating out of 10 for this anime.
- members — number of community members that are in this anime’s “group”.

Rating.csv
- user_id — non identifiable randomly generated user id.
- anime_id — the anime that this user has rated.
- rating — rating out of 10 this user has assigned (-1 if the user watched it but didn’t assign a rating).

№2: Loading and Preprocessing dataset

First, import all the libraries required to work on this dataset.

Read the datasets using the 🐼 (pandas) library.
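
A minimal sketch of the loading step, assuming the two files are called `anime.csv` and `rating.csv` and have the columns listed in the dataset description. To keep the example runnable without the real files, it reads two tiny in-memory stand-ins with invented rows; the variable names `anime_df` and `rating_df` are also assumptions.

```python
import io

import pandas as pd

# Tiny in-memory stand-ins (hypothetical rows) with the documented columns.
ANIME_CSV = io.StringIO(
    'anime_id,name,genre,type,episodes,rating,members\n'
    '1,Example Show,"Action, Comedy",TV,12,8.1,1000\n'
    '2,Example Movie,Drama,Movie,1,7.4,500\n'
)
RATING_CSV = io.StringIO(
    'user_id,anime_id,rating\n'
    '1,1,9\n'
    '1,2,-1\n'
    '2,1,7\n'
)

# With the real dataset these would be file paths, e.g. pd.read_csv("anime.csv").
anime_df = pd.read_csv(ANIME_CSV)
rating_df = pd.read_csv(RATING_CSV)

print(anime_df.shape, rating_df.shape)
print(anime_df.head())
```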

Before joining the two datasets, let us rename a few columns to avoid confusion and tidy up the formatting.
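
Both files have a column called `rating` (the community average in one, the individual user's score in the other), so renaming before the join avoids a clash. The new names used below (`anime_title`, `avg_rating`, `user_rating`) are hypothetical; the original notebook may have chosen different ones.

```python
import pandas as pd

# Hypothetical rows with the documented column names.
anime_df = pd.DataFrame({
    "anime_id": [1, 2],
    "name": ["Example Show", "Example Movie"],
    "rating": [8.1, 7.4],          # community average
})
rating_df = pd.DataFrame({
    "user_id": [1, 2],
    "anime_id": [1, 1],
    "rating": [9, 7],              # score given by one user
})

# Give the two "rating" columns distinct, self-explanatory names.
anime_df = anime_df.rename(columns={"name": "anime_title", "rating": "avg_rating"})
rating_df = rating_df.rename(columns={"rating": "user_rating"})

print(anime_df.columns.tolist())
print(rating_df.columns.tolist())
```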

№3: Exploratory Data Analysis

Performing EDA to understand the data and explore the insights

Q1: How are anime reviews distributed by ‘type’?

Joining the two datasets brings the data to the same level of granularity, which makes the EDA easier.
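
The join could look roughly like the sketch below, which merges on `anime_id` and then counts reviews per media type. The sample rows and the column names `anime_title` and `user_rating` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample data.
anime_df = pd.DataFrame({
    "anime_id": [1, 2, 3],
    "anime_title": ["Show A", "Movie B", "Show C"],
    "type": ["TV", "Movie", "TV"],
})
rating_df = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "anime_id": [1, 2, 1, 3],
    "user_rating": [9, 7, 8, -1],
})

# One row per (user, anime) review, enriched with the anime's attributes.
merged_df = rating_df.merge(anime_df, on="anime_id", how="inner")

# Number of reviews per media type; a count plot shows these same counts as bars.
reviews_per_type = merged_df["type"].value_counts()
print(reviews_per_type)
```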

I use seaborn to draw the count plot.

Reviews vs. Media Types

The highest number of reviews is received by anime broadcast on TV.

Q2: Which anime has received the highest number of reviews?

Group the dataset by anime name and count the number of user ids.
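
A sketch of that grouping on invented data, assuming the merged DataFrame has `anime_title` and `user_id` columns:

```python
import pandas as pd

# Hypothetical merged data: one row per review.
merged_df = pd.DataFrame({
    "user_id": [1, 2, 3, 1, 2, 3],
    "anime_title": ["Show A", "Show A", "Show A", "Show B", "Show B", "Show C"],
})

# Count reviews per title and list the most-reviewed titles first.
reviews_per_anime = (
    merged_df.groupby("anime_title")["user_id"]
    .count()
    .sort_values(ascending=False)
)
print(reviews_per_anime.head(10))
```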

‘Death Note’ seems to have received the highest number of reviews, from approximately 40,000 users. Hence it is by far the most rated anime in the community.

Q3: How is the average rating spread, based on the ratings given by users?

Calculate the average rating for each anime.

Calculate the number of ratings for each anime.

Bring the number of ratings and the average rating together in a single DataFrame.
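
Both statistics can be computed in one pass with a named aggregation. The sketch below uses invented data and drops the `-1` placeholder ("watched but not rated") before averaging so that it does not drag the mean down; whether the original notebook did this at this stage is not stated, so treat it as an assumption.

```python
import pandas as pd

# Hypothetical merged data: one row per review.
merged_df = pd.DataFrame({
    "anime_title": ["Show A", "Show A", "Show A", "Show B"],
    "user_rating": [8, 6, -1, 10],
})

# Assumption: ignore the -1 placeholder when computing statistics.
rated = merged_df[merged_df["user_rating"] != -1]

# Average rating and number of ratings per title in a single DataFrame.
rating_stats = rated.groupby("anime_title")["user_rating"].agg(
    avg_rating="mean",
    num_ratings="count",
)
print(rating_stats)
```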

Most anime seem to have an average rating between 4 and 8. Also a large amount of

Q4: Which anime series have the most community members?

Death Note seems to have the most community members, followed by Shingeki no Kyojin.

Q5: Medium of streaming?

TV seems to be the primary medium of choice among anime lovers.

Q6: Most common genres among the anime?

Some of the popular genres I can make out are action, comedy, fantasy, sci-fi and adventure.
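
Because `genre` is a comma-separated list, counting genres means splitting each entry first. One way this could be done is sketched below on invented rows; it is not necessarily how the original notebook produced its result.

```python
import pandas as pd

# Hypothetical anime data with comma-separated genres (one entry missing).
anime_df = pd.DataFrame({
    "anime_title": ["Show A", "Show B", "Show C", "Show D"],
    "genre": ["Action, Comedy", "Action, Sci-Fi", None, "Comedy, Action"],
})

genre_counts = (
    anime_df["genre"]
    .dropna()            # skip anime without genre information
    .str.split(",")      # "Action, Comedy" -> ["Action", " Comedy"]
    .explode()           # one row per individual genre
    .str.strip()         # remove the space left after each comma
    .value_counts()
)
print(genre_counts)
```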

№4: Data Cleaning and Identifying Relevant Data

Replace ratings of -1 with zeros.

Identify the nulls and remove those records from the dataset.
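
These two cleaning steps might look like the following on a small invented DataFrame. The column names are assumptions carried over for illustration.

```python
import pandas as pd

# Hypothetical merged data with a -1 placeholder and a missing title.
merged_df = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "anime_title": ["Show A", "Show B", "Show A", None],
    "user_rating": [9, -1, 7, 8],
})

# Inspect how many nulls each column contains.
print(merged_df.isnull().sum())

cleaned_df = merged_df.copy()
# -1 means "watched but not rated"; treat it as 0.
cleaned_df["user_rating"] = cleaned_df["user_rating"].replace(-1, 0)
# Drop every row that still has a missing value.
cleaned_df = cleaned_df.dropna()

print(cleaned_df)
```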

Rename the columns for simplicity.

Due to computational limitations, let us pick only the users who have provided at least 200 reviews across the whole dataset.

Merging the above DataFrame with the complete dataset filters it down to the users who have provided at least 200 reviews.
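
A sketch of this count-then-merge filter. The helper name `keep_active_users` and the column names are hypothetical; the threshold of 200 comes from the text above, and the demo call lowers it to 2 only because the sample data is tiny.

```python
import pandas as pd

MIN_REVIEWS = 200  # threshold used in this article


def keep_active_users(df, min_reviews=MIN_REVIEWS):
    """Keep only the rows belonging to users with at least min_reviews reviews."""
    # Number of reviews per user.
    counts = (
        df.groupby("user_id")["user_rating"]
        .count()
        .rename("review_count")
        .reset_index()
    )
    active = counts[counts["review_count"] >= min_reviews]
    # An inner merge keeps only rows whose user_id appears in `active`.
    return df.merge(active[["user_id"]], on="user_id", how="inner")


# Hypothetical data: user 1 has 3 reviews, user 2 has 1, user 3 has 2.
demo_df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 3, 3],
    "anime_title": ["Show A", "Show B", "Show C", "Show A", "Show B", "Show C"],
    "user_rating": [9, 7, 8, 6, 10, 5],
})
active_df = keep_active_users(demo_df, min_reviews=2)
print(active_df)
```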

№5: Implementing Collaborative filtering

The pivot table has anime titles as rows and user ids as columns. This helps us create a sparse matrix, which is very helpful for computing cosine similarity.
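
Such a pivot could be built as in the sketch below, where missing title and user combinations are filled with 0. The data and column names are invented for illustration.

```python
import pandas as pd

# Hypothetical cleaned data: one row per (user, title) rating.
cleaned_df = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "anime_title": ["Show A", "Show B", "Show A", "Show C"],
    "user_rating": [9, 7, 8, 10],
})

# Rows are titles, columns are user ids, cells are ratings (0 = no rating).
pivot = cleaned_df.pivot_table(
    index="anime_title",
    columns="user_id",
    values="user_rating",
).fillna(0)
print(pivot)

# Most cells are 0, so a sparse format such as
# scipy.sparse.csr_matrix(pivot.to_numpy()) would store the same data compactly.
```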

Next, I use cosine similarity as the distance metric and apply the nearest neighbors machine learning algorithm. While predicting, I pass a neighbour value of 6, which identifies the 6 nearest anime that the user might like. Isn't this 🆒?
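
To show what a cosine-distance nearest-neighbour lookup does under the hood, here is a plain NumPy stand-in on a made-up pivot table. It is not the notebook's own code (which is not shown here), and the function name `nearest_titles` is hypothetical. A library nearest-neighbours implementation configured with a cosine metric would play the same role.

```python
import numpy as np
import pandas as pd

# Hypothetical pivot table: rows are titles, columns are user ids.
pivot = pd.DataFrame(
    [[9, 8, 0, 7],
     [8, 9, 0, 6],
     [0, 1, 10, 0],
     [0, 0, 9, 1]],
    index=["Show A", "Show B", "Show C", "Show D"],
    columns=[1, 2, 3, 4],
    dtype=float,
)


def nearest_titles(pivot, title, n_neighbors=6):
    """Return up to n_neighbors (title, cosine distance) pairs, closest first."""
    matrix = pivot.to_numpy()
    norms = np.linalg.norm(matrix, axis=1)
    norms[norms == 0] = 1.0                 # avoid dividing by zero for unrated titles
    unit = matrix / norms[:, None]          # scale every row to unit length
    query = unit[pivot.index.get_loc(title)]
    distances = 1.0 - unit @ query          # cosine distance = 1 - cosine similarity
    order = np.argsort(distances)[:n_neighbors]
    return [(pivot.index[i], float(distances[i])) for i in order]


neighbours = nearest_titles(pivot, "Show A", n_neighbors=3)
for name, dist in neighbours:
    print(f"{name}: {dist:.3f}")
```

In this sketch the queried title itself comes back first, at a distance of about 0, followed by the titles whose rating patterns point in the most similar direction.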

№6: Making Anime Recommendations

Predicting anime on Row 5

Above you can see the 6 nearest anime along with their distances.

№7: Summary

I summarize my entire notebook as follows:

  • We downloaded the anime dataset from Kaggle.
  • We ran EDA and analyzed the input features.
  • We then performed data cleaning and filtered the data down to the relevant users.
  • We implemented collaborative filtering and made recommendations.

№8: Future Work

  • Implement user-based collaborative filtering and evaluate the results.
  • Implement content-based filtering and compare the results with the current notebook.

№9: References

I really hope you guys learned something from this post. Feel free to clap if you like what you learnt. Let me know if there is anything you need my help with.


Hargurjeet

Data Science Practitioner | Machine Learning | Neural Networks | PyTorch | TensorFlow