{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Recommendation Engine\n",
"## Building a Movie Recommendation Engine using MovieLens dataset \n",
"We will be using a MovieLens dataset. This dataset contains 100004 ratings across 9125 movies for 671 users. All selected users had at least rated 20 movies. \n",
"We are going to build a recommendation engine which will suggest movies for a user which he hasn't watched yet based on the movies which he has already rated. We will be using k-nearest neighbour algorithm which we will implement from scratch."
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Movie file contains information like movie id, title, genre of movies and ratings file contains data like user id, movie id, rating and timestamp in which each line after header row represents one rating of one movie by one user."
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>movieId</th>\n",
" <th>title</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>Toy Story (1995)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>Jumanji (1995)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>Grumpier Old Men (1995)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>Waiting to Exhale (1995)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>Father of the Bride Part II (1995)</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" movieId title\n",
"0 1 Toy Story (1995)\n",
"1 2 Jumanji (1995)\n",
"2 3 Grumpier Old Men (1995)\n",
"3 4 Waiting to Exhale (1995)\n",
"4 5 Father of the Bride Part II (1995)"
]
},
"execution_count": 71,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"movie_file = \"data\\movie_dataset\\movies.csv\"\n",
"movie_data = pd.read_csv(movie_file, usecols = [0, 1])\n",
"movie_data.head()"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>userId</th>\n",
" <th>movieId</th>\n",
" <th>rating</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>31</td>\n",
" <td>2.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>1029</td>\n",
" <td>3.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>1061</td>\n",
" <td>3.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>1129</td>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1</td>\n",
" <td>1172</td>\n",
" <td>4.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" userId movieId rating\n",
"0 1 31 2.5\n",
"1 1 1029 3.0\n",
"2 1 1061 3.0\n",
"3 1 1129 2.0\n",
"4 1 1172 4.0"
]
},
"execution_count": 72,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ratings_file = \"data\\\\movie_dataset\\\\ratings.csv\"\n",
"ratings_info = pd.read_csv(ratings_file, usecols = [0, 1, 2])\n",
"ratings_info.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>movieId</th>\n",
" <th>title</th>\n",
" <th>userId</th>\n",
" <th>rating</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>Toy Story (1995)</td>\n",
" <td>7</td>\n",
" <td>3.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>Toy Story (1995)</td>\n",
" <td>9</td>\n",
" <td>4.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>Toy Story (1995)</td>\n",
" <td>13</td>\n",
" <td>5.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>Toy Story (1995)</td>\n",
" <td>15</td>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1</td>\n",
" <td>Toy Story (1995)</td>\n",
" <td>19</td>\n",
" <td>3.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" movieId title userId rating\n",
"0 1 Toy Story (1995) 7 3.0\n",
"1 1 Toy Story (1995) 9 4.0\n",
"2 1 Toy Story (1995) 13 5.0\n",
"3 1 Toy Story (1995) 15 2.0\n",
"4 1 Toy Story (1995) 19 3.0"
]
},
"execution_count": 76,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"movie_info = pd.merge(movie_data, ratings_info, left_on = 'movieId', right_on = 'movieId')\n",
"movie_info.head()"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>movieId</th>\n",
" <th>title</th>\n",
" <th>userId</th>\n",
" <th>rating</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>Toy Story (1995)</td>\n",
" <td>7</td>\n",
" <td>3.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>Toy Story (1995)</td>\n",
" <td>9</td>\n",
" <td>4.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>Toy Story (1995)</td>\n",
" <td>13</td>\n",
" <td>5.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>Toy Story (1995)</td>\n",
" <td>15</td>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1</td>\n",
" <td>Toy Story (1995)</td>\n",
" <td>19</td>\n",
" <td>3.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" movieId title userId rating\n",
"0 1 Toy Story (1995) 7 3.0\n",
"1 1 Toy Story (1995) 9 4.0\n",
"2 1 Toy Story (1995) 13 5.0\n",
"3 1 Toy Story (1995) 15 2.0\n",
"4 1 Toy Story (1995) 19 3.0"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"movie_info.loc[0:10, ['userId']]\n",
"movie_info[movie_info.title == \"Toy Story (1995)\"].head()"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>movieId</th>\n",
" <th>title</th>\n",
" <th>userId</th>\n",
" <th>rating</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>246</th>\n",
" <td>1</td>\n",
" <td>Toy Story (1995)</td>\n",
" <td>671</td>\n",
" <td>5.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2111</th>\n",
" <td>36</td>\n",
" <td>Dead Man Walking (1995)</td>\n",
" <td>671</td>\n",
" <td>4.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2843</th>\n",
" <td>50</td>\n",
" <td>Usual Suspects, The (1995)</td>\n",
" <td>671</td>\n",
" <td>4.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6715</th>\n",
" <td>230</td>\n",
" <td>Dolores Claiborne (1995)</td>\n",
" <td>671</td>\n",
" <td>4.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7809</th>\n",
" <td>260</td>\n",
" <td>Star Wars: Episode IV - A New Hope (1977)</td>\n",
" <td>671</td>\n",
" <td>5.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" movieId title userId rating\n",
"246 1 Toy Story (1995) 671 5.0\n",
"2111 36 Dead Man Walking (1995) 671 4.0\n",
"2843 50 Usual Suspects, The (1995) 671 4.5\n",
"6715 230 Dolores Claiborne (1995) 671 4.0\n",
"7809 260 Star Wars: Episode IV - A New Hope (1977) 671 5.0"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"movie_info = pd.DataFrame.sort_values(movie_info, ['userId', 'movieId'], ascending = [0, 1])\n",
"movie_info.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us see the number of users and number of movies in our dataset"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"671\n",
"163949\n"
]
}
],
"source": [
"num_users = max(movie_info.userId)\n",
"num_movies = max(movie_info.movieId)\n",
"print(num_users)\n",
"print(num_movies)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"how many movies were rated by each user and the number of users rated each movie"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"547 2391\n",
"564 1868\n",
"624 1735\n",
"15 1700\n",
"73 1610\n",
"Name: userId, dtype: int64"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"movie_per_user = movie_info.userId.value_counts()\n",
"movie_per_user.head()"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Forrest Gump (1994) 341\n",
"Pulp Fiction (1994) 324\n",
"Shawshank Redemption, The (1994) 311\n",
"Silence of the Lambs, The (1991) 304\n",
"Star Wars: Episode IV - A New Hope (1977) 291\n",
"Name: title, dtype: int64"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"users_per_movie = movie_info.title.value_counts()\n",
"users_per_movie.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Function to find top N favourite movies of a user"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['Grease (1978)', 'Little Mermaid, The (1989)', 'Sound of Music, The (1965)']\n"
]
}
],
"source": [
"def fav_movies(current_user, N):\n",
" # get rows corresponding to current user and then sort by rating in descending order\n",
" # pick top N rows of the dataframe\n",
" fav_movies = pd.DataFrame.sort_values(movie_info[movie_info.userId == current_user], \n",
" ['rating'], ascending = [0]) [:N]\n",
" # return list of titles\n",
" return list(fav_movies.title)\n",
"\n",
"print(fav_movies(5, 3))\n",
" \n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lets build recommendation engine now\n",
"\n",
"- We will use a neighbour based collaborative filtering model. \n",
"- The idea is to use k-nearest neighbour algorithm to find neighbours of a user\n",
"- We will use their ratings to predict ratings of a movie not already rated by a current user.\n",
"\n",
"We will represent movies watched by a user in a vector - the vector will have values for all the movies in our dataset.\n",
"If a user hasn't rated a movie, it would be represented as NaN."
]
},
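{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of this representation, the toy table below is pivoted into one row per user and one column per movie. The ratings are made up for illustration (they are not taken from MovieLens), and the names `toy_ratings` and `toy_matrix` are assumptions.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy ratings, invented for illustration only\n",
"toy_ratings = pd.DataFrame({\n",
"    'userId':  [1, 1, 2, 2, 3],\n",
"    'movieId': [10, 20, 10, 30, 20],\n",
"    'rating':  [4.0, 3.0, 5.0, 2.0, 1.0],\n",
"})\n",
"\n",
"# One row per user, one column per movie; unrated pairs become NaN\n",
"toy_matrix = toy_ratings.pivot_table(values='rating', index='userId', columns='movieId')\n",
"print(toy_matrix)\n",
"```\n",
"\n",
"Each row of `toy_matrix` is the rating vector of one user, which is the form the similarity computation works on."
]
},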
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th>movieId</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>3</th>\n",
" <th>4</th>\n",
" <th>5</th>\n",
" <th>6</th>\n",
" <th>7</th>\n",
" <th>8</th>\n",
" <th>9</th>\n",
" <th>10</th>\n",
" <th>...</th>\n",
" <th>161084</th>\n",
" <th>161155</th>\n",
" <th>161594</th>\n",
" <th>161830</th>\n",
" <th>161918</th>\n",
" <th>161944</th>\n",
" <th>162376</th>\n",
" <th>162542</th>\n",
" <th>162672</th>\n",
" <th>163949</th>\n",
" </tr>\n",
" <tr>\n",
" <th>userId</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>4.0</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>4.0</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>4.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 9066 columns</p>\n",
"</div>"
],
"text/plain": [
"movieId 1 2 3 4 5 6 7 8 \\\n",
"userId \n",
"1 NaN NaN NaN NaN NaN NaN NaN NaN \n",
"2 NaN NaN NaN NaN NaN NaN NaN NaN \n",
"3 NaN NaN NaN NaN NaN NaN NaN NaN \n",
"4 NaN NaN NaN NaN NaN NaN NaN NaN \n",
"5 NaN NaN 4.0 NaN NaN NaN NaN NaN \n",
"\n",
"movieId 9 10 ... 161084 161155 161594 161830 161918 \\\n",
"userId ... \n",
"1 NaN NaN ... NaN NaN NaN NaN NaN \n",
"2 NaN 4.0 ... NaN NaN NaN NaN NaN \n",
"3 NaN NaN ... NaN NaN NaN NaN NaN \n",
"4 NaN 4.0 ... NaN NaN NaN NaN NaN \n",
"5 NaN NaN ... NaN NaN NaN NaN NaN \n",
"\n",
"movieId 161944 162376 162542 162672 163949 \n",
"userId \n",
"1 NaN NaN NaN NaN NaN \n",
"2 NaN NaN NaN NaN NaN \n",
"3 NaN NaN NaN NaN NaN \n",
"4 NaN NaN NaN NaN NaN \n",
"5 NaN NaN NaN NaN NaN \n",
"\n",
"[5 rows x 9066 columns]"
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"user_movie_rating_matrix = pd.pivot_table(movie_info, values = 'rating', index=['userId'], columns=['movieId'])\n",
"user_movie_rating_matrix.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we will find the similarity between 2 users by using correlation "
]
},
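{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the idea on made-up rating vectors: keep only the movies both users rated, then take the Pearson correlation of those ratings. The helper name `pearson_on_common` and the three users are hypothetical, and the sketch uses `np.corrcoef` instead of the SciPy function used in this notebook.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def pearson_on_common(a, b):\n",
"    # Hypothetical helper: Pearson correlation over the movies rated by both users\n",
"    a = np.asarray(a, dtype=float)\n",
"    b = np.asarray(b, dtype=float)\n",
"    common = ~np.isnan(a) & ~np.isnan(b)  # True where both users have a rating\n",
"    if common.sum() < 2:\n",
"        return 0.0  # correlation is undefined for fewer than two shared movies\n",
"    a, b = a[common], b[common]\n",
"    if a.std() == 0 or b.std() == 0:\n",
"        return 0.0  # a constant rating vector has no defined correlation\n",
"    return float(np.corrcoef(a, b)[0, 1])\n",
"\n",
"# Made-up ratings for four movies; NaN marks an unrated movie\n",
"alice = [5.0, 3.0, np.nan, 1.0]\n",
"bob = [4.0, 2.0, 5.0, np.nan]\n",
"carol = [1.0, 5.0, 2.0, 4.0]\n",
"\n",
"print(pearson_on_common(alice, bob))\n",
"print(pearson_on_common(alice, carol))\n",
"```\n",
"\n",
"Users who rank their shared movies in the same order get a value close to 1, and users with opposite tastes get a negative value."
]
},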
{
"cell_type": "code",
"execution_count": 62,
"metadata": {},
"outputs": [],
"source": [
"from scipy.spatial.distance import correlation\n",
"import numpy as np\n",
"def similarity(user1, user2):\n",
" # normalizing user1 rating i.e mean rating of user1 for any movie\n",
" # nanmean will return mean of an array after ignore NaN values \n",
" user1 = np.array(user1) - np.nanmean(user1) \n",
" user2 = np.array(user2) - np.nanmean(user2)\n",
" \n",
" # finding the similarity between 2 users\n",
" # finding subset of movies rated by both the users\n",
" common_movie_ids = [i for i in range(len(user1)) if user1[i] > 0 and user2[i] > 0]\n",
" if(len(common_movie_ids) == 0):\n",
" return 0\n",
" else:\n",
" user1 = np.array([user1[i] for i in common_movie_ids])\n",
" user2 = np.array([user2[i] for i in common_movie_ids])\n",
" return correlation(user1, user2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" We will now use the similarity function to find the nearest neighbour of a current user"
]
},
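{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch of the rule that the next cell implements, the predicted rating of the current user $u$ for movie $i$ can be written as below. The symbols are notation assumed here for illustration: $\\bar{r}_u$ is the mean rating of user $u$, $s_{u,v}$ is the similarity between users $u$ and $v$, and $N_K(u)$ is the set of the $K$ nearest neighbours of $u$.\n",
"\n",
"$$\\hat{r}_{u,i} = \\bar{r}_u + \\frac{\\sum_{v \\in N_K(u),\\ v \\text{ rated } i} s_{u,v}\\,(r_{v,i} - \\bar{r}_v)}{\\sum_{v \\in N_K(u)} s_{u,v}}$$\n",
"\n",
"Neighbours who have not rated movie $i$ contribute nothing to the numerator, but the denominator still sums the similarities of all $K$ neighbours."
]
},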
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [],
"source": [
"# nearest_neighbour_ratings function will find the k nearest neighbours of the current user and\n",
"# then use their ratings to predict the current users ratings for other unrated movies \n",
"\n",
"def nearest_neighbour_ratings(current_user, K):\n",
" # Creating an empty matrix whose row index is userId and the value\n",
" # will be the similarity of that user to the current user\n",
" similarity_matrix = pd.DataFrame(index = user_movie_rating_matrix.index, \n",
" columns = ['similarity'])\n",
" for i in user_movie_rating_matrix.index:\n",
" # finding the similarity between user i and the current user and add it to the similarity matrix\n",
" similarity_matrix.loc[i] = similarity(user_movie_rating_matrix.loc[current_user],\n",
" user_movie_rating_matrix.loc[i])\n",
" # Sorting the similarity matrix in descending order\n",
" similarity_matrix = pd.DataFrame.sort_values(similarity_matrix,\n",
" ['similarity'], ascending= [0])\n",
" # now we will pick the top k nearest neighbou\n",
" nearest_neighbours = similarity_matrix[:K]\n",
"\n",
" neighbour_movie_ratings = user_movie_rating_matrix.loc[nearest_neighbours.index]\n",
"\n",
" # This is empty dataframe placeholder for predicting the rating of current user using neighbour movie ratings\n",
" predicted_movie_rating = pd.DataFrame(index = user_movie_rating_matrix.columns, columns = ['rating'])\n",
"\n",
" # Iterating all movies for a current user\n",
" for i in user_movie_rating_matrix.columns:\n",
" # by default, make predicted rating as the average rating of the current user\n",
" predicted_rating = np.nanmean(user_movie_rating_matrix.loc[current_user])\n",
"\n",
" for j in neighbour_movie_ratings.index:\n",
" # if user j has rated the ith movie\n",
" if(user_movie_rating_matrix.loc[j,i] > 0):\n",
" predicted_rating += ((user_movie_rating_matrix.loc[j,i] -np.nanmean(user_movie_rating_matrix.loc[j])) *\n",
" nearest_neighbours.loc[j, 'similarity']) / nearest_neighbours['similarity'].sum()\n",
"\n",
" predicted_movie_rating.loc[i, 'rating'] = predicted_rating\n",
"\n",
" return predicted_movie_rating"
]
},
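{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the prediction step concrete, here is a small worked example for a single movie and three neighbours. All numbers and variable names are made up for illustration; the arithmetic mirrors the update inside the function above.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Made-up numbers for one target movie; all names here are illustrative\n",
"current_user_mean = 3.5\n",
"neighbour_similarity = np.array([0.9, 0.6, 0.3])  # similarity of each neighbour to the current user\n",
"neighbour_mean = np.array([4.0, 3.0, 2.5])        # each neighbour's own average rating\n",
"neighbour_rating = np.array([5.0, np.nan, 2.0])   # NaN: that neighbour has not rated the movie\n",
"\n",
"rated = ~np.isnan(neighbour_rating)\n",
"# Similarity-weighted sum of each rating's deviation from that neighbour's mean\n",
"weighted_deviation = np.sum(neighbour_similarity[rated] * (neighbour_rating[rated] - neighbour_mean[rated]))\n",
"# Normalise by the total similarity of all K neighbours, as the function above does\n",
"predicted_rating = current_user_mean + weighted_deviation / neighbour_similarity.sum()\n",
"print(round(predicted_rating, 3))\n",
"```\n",
"\n",
"The first neighbour rated the movie above their own average and is the most similar, so the prediction ends up above the current user's mean."
]
},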
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Predicting top N recommendations for a current user"
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {},
"outputs": [],
"source": [
"def top_n_recommendations(current_user, N):\n",
" predicted_movie_rating = nearest_neighbour_ratings(current_user, 10)\n",
" movies_already_watched = list(user_movie_rating_matrix.loc[current_user]\n",
" .loc[user_movie_rating_matrix.loc[current_user] > 0].index)\n",
" \n",
" predicted_movie_rating = predicted_movie_rating.drop(movies_already_watched)\n",
" \n",
" top_n_recommendations = pd.DataFrame.sort_values(predicted_movie_rating, ['rating'], ascending=[0])[:N]\n",
" \n",
" top_n_recommendation_titles = movie_data.loc[movie_data.movieId.isin(top_n_recommendations.index)]\n",
"\n",
" return list(top_n_recommendation_titles.title)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"finding out the recommendations for a user"
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\Users\\erchh\\Anaconda3\\envs\\tensorflow\\lib\\site-packages\\scipy\\spatial\\distance.py:644: RuntimeWarning: invalid value encountered in double_scalars\n",
" dist = 1.0 - uv / np.sqrt(uu * vv)\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"User's favorite movies are : ['Shawshank Redemption, The (1994)', 'Father of the Bride Part II (1995)', 'Cast Away (2000)', 'Parent Trap, The (1998)', \"Ocean's Eleven (2001)\"] \n",
"User's top recommendations are: ['Godfather, The (1972)', 'Star Wars: Episode V - The Empire Strikes Back (1980)', 'Godfather: Part II, The (1974)']\n"
]
}
],
"source": [
"current_user = 140\n",
"print(\"User's favorite movies are : \", fav_movies(current_user, 5),\n",
" \"\\nUser's top recommendations are: \", top_n_recommendations(current_user, 3))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusion\n",
"We have built a movie recommendation engine using k-nearest neighbour algorithm implemented from scratch. "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}