Project Description

Can we predict if a movie was excellent, good, or bad based on the year it was made, its genre, director, budget, and duration?

Using multiple linear regression, deep learning, random forest, and SVM to train a model to predict if a movie was excellent, good, or bad. This could be helpful to uncover what’s important in making an excellent movie, do some genres result in more popular movies? What about the director’s talent or the budget? These are all inputs used in four models to predict movie ratings in the IMDb.

IMDb Methodology

Data for IMDb was pulled from Kaggle. ETL was performed to clean the data. Removed movies that were missing information. Filtered for movies only from the United States. Removed movies that were older than 2000. In addition, the removal of commas and dollar signs was needed for the data to be usable in the models.

Rating Categories

Movies are rated on the IMDb from 1 to 10. Individual votes are aggregated based on a user's gender and age. A median vote is generated and available in the database. Median vote is simplified to generate the rating scheme used: excellent, good, or bad.

Excellent movies have a rating of 8 up to 10

Good movies have a rating between 4.1 and 7.9

Bad movies have a rating of 4 or less

Machine Learning Models

Multiple models were used to find which model would generate the most accurate prediction.

Independent Variables: year, genre, duration, director, budget

Dependent Variable: rating

Multiple Linear Regression was not possible due to most of the data being categorical, and the errors resulting after data was pre-processed and scaled to be usable in the model the errors were so large that this model is not useful.

Deep learning was a successful model in predicting the rating of a movie. Deep learning is the addition of layers in a neural network and flow in one direction from input to output.

Random forest was a successful model in predicting the rating of a movie. Random forest is helpful for predicting the classification an observation belongs to, which is what is being done in this prediction. Random forest consists of a large number of individual decision trees that operate together to make a prediction.

Support-vector machines (SVMs) are a set of supervised learning methods used for classification. SVM is very effective when there is a limited training dataset, but this is not a barrier for our dataset. SVM is faster than neural networks (and deep learning).