IMDB Movie Dataset Analysis

For my data science course's final project, I teamed up with two colleagues to analyze a dataset of 5,000 movies with 28 attributes each. We used Python 3 and an array of data science tools and libraries, and we assembled the final report and results in an IPython (Jupyter) notebook.

I parsed the CSV file containing the data and trimmed unnecessary fields. I then scraped CPI values for every year from 1913 to 2017 and adjusted each movie's dollar amounts into real terms with 2017 purchasing power. The Python library Beautiful Soup handled the web scraping.
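
The inflation adjustment itself comes down to one ratio: multiply a nominal amount by the base-year CPI divided by the CPI of the year the amount comes from. Here is a minimal sketch of that step. The function name `adjust_to_base_year` and the CPI numbers are made up for illustration (they are not the scraped values or the notebook's actual code), and the scraping step is left out.

```python
def adjust_to_base_year(amount, year, cpi_by_year, base_year=2017):
    """Convert a nominal dollar amount from `year` into `base_year` dollars."""
    return amount * cpi_by_year[base_year] / cpi_by_year[year]


# Placeholder CPI values for illustration only, NOT real index figures.
cpi_by_year = {1990: 100.0, 2017: 200.0}

# A hypothetical 10 million dollar budget from 1990, restated in 2017 dollars.
nominal_budget = 10_000_000
real_budget = adjust_to_base_year(nominal_budget, 1990, cpi_by_year)
print(real_budget)
```

With the placeholder index doubling between the two years, the adjusted amount is simply twice the nominal one. With real CPI data the same function would apply unchanged to every dollar column.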

We used plotting libraries such as Matplotlib and Seaborn in our exploratory analysis. For instance, the graphs above show a breakdown by movie genre of the average domestic gross, in USD, that each genre earns.
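
The numbers behind a genre chart like that are just a grouped average. The sketch below shows one way to compute them in plain Python. The rows, the field names `genre` and `gross`, and the helper `average_gross_by_genre` are all hypothetical and chosen for illustration, and each movie is assumed to have a single genre.

```python
from collections import defaultdict

# Hypothetical rows for illustration, not taken from the actual dataset.
movies = [
    {"title": "Movie A", "genre": "Action", "gross": 300.0},
    {"title": "Movie B", "genre": "Action", "gross": 100.0},
    {"title": "Movie C", "genre": "Comedy", "gross": 50.0},
]


def average_gross_by_genre(rows):
    """Return a dict mapping each genre to its mean gross."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row["genre"]] += row["gross"]
        counts[row["genre"]] += 1
    return {genre: totals[genre] / counts[genre] for genre in totals}


averages = average_gross_by_genre(movies)
print(averages)
```

The resulting dictionary maps directly onto a bar chart, with genres on one axis and average gross on the other.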

We ran a linear regression using the Ordinary Least Squares method to see whether a film's domestic gross was related to its attributes. Looking at the p-values in our results at a 5% significance level, we saw that a film's IMDb score, budget, and cast Facebook likes were statistically significant predictors of its gross earnings.
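
To make the p-value reasoning concrete, here is a rough sketch of OLS on synthetic data using only NumPy. It is not the notebook's code: the data is randomly generated, the names `ols_with_p_values`, `score`, and `unrelated` are invented for the example, and the p-values use a normal approximation to the t distribution, which is only reasonable for large samples.

```python
import math

import numpy as np


def ols_with_p_values(X, y):
    """Fit y = b0 + X @ b by least squares; return coefficients and p-values."""
    X = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    residuals = y - X @ beta
    sigma2 = residuals @ residuals / (n - k)  # unbiased error variance
    std_err = np.sqrt(sigma2 * np.diag(xtx_inv))
    t_stats = beta / std_err
    # Two-sided p-values via the normal approximation (an assumption here).
    p_values = np.array([math.erfc(abs(t) / math.sqrt(2)) for t in t_stats])
    return beta, p_values


# Synthetic data: gross depends on `score` but not on `unrelated`.
rng = np.random.default_rng(0)
n = 500
score = rng.normal(size=n)
unrelated = rng.normal(size=n)
gross = 2.0 + 3.0 * score + rng.normal(size=n)

beta, p_values = ols_with_p_values(np.column_stack([score, unrelated]), gross)
for name, b, p in zip(["intercept", "score", "unrelated"], beta, p_values):
    print(f"{name}: coef={b:.3f}, p={p:.4f}, significant={p < 0.05}")
```

A variable is flagged as significant when its p-value falls below 0.05, which mirrors the 5% threshold used in the analysis. A statistics library would typically report exact t-distribution p-values instead of this approximation.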
