I went through Udacity's Data Analyst Nanodegree program. I worked on a few projects there, and I will be writing blog posts about them in the coming weeks.

Note: This blog post is the first part of a series in which I walk through the analysis of an entire dataset. The aim is to showcase how simple data analysis can be.

Introduction

Are you interested in what goes into making a successful movie? In this series of blog posts, we'll be using data from The Movie Database (TMDb) to explore the factors that contribute to a film's popularity, ratings, and revenue. Our dataset includes over 10,000 movies, with information on budget, cast, director, keywords, runtime, genres, production companies, release date, and more.

In this first post, we'll take a closer look at the TMDb movie data and consider some of the questions we'll be answering in the coming weeks. For example:

  • How has movie popularity changed over the years?
  • How does revenue vary across different rating and popularity levels?
  • What characteristics are associated with high-popularity movies?
  • How many movies are released each year?
  • What are the keyword trends by generation?

Using tools like NumPy, pandas, and Matplotlib, we'll dive into the data and see what insights we can uncover. But before we get started, let's take a moment to introduce the dataset and talk about what it contains.

LET'S GO!!

First of all, let's import the needed packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter

%matplotlib inline

Data Wrangling

General Properties

Let's look at the general information about the dataset

df.info()

The TMDb movie data includes 10,866 entries and 21 columns, with data types including integers, floats, and strings. A significant number of columns have null values, as the non-null counts per column show. In the next step, we will examine the exact number of null records per column.
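The calls in this post assume a DataFrame named df is already in memory, although the loading step itself is not shown. Here is a minimal sketch of how it might look, assuming the data lives in a CSV file. The file name tmdb-movies.csv is a guess, and the three-row sample is made up so that the snippet runs on its own:

```python
import io
import pandas as pd

# Hypothetical file name; point this at wherever the CSV actually lives.
# df = pd.read_csv("tmdb-movies.csv")

# Made-up three-row stand-in so the sketch runs without the real file.
sample_csv = io.StringIO(
    "id,original_title,budget,revenue,runtime,release_year\n"
    "1,Movie A,1000000,5000000,120,2010\n"
    "2,Movie B,0,0,95,2012\n"
    "3,Movie C,250000,0,0,2014\n"
)
df = pd.read_csv(sample_csv)

# Column names, non-null counts, and dtypes in one summary.
df.info()
```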

list(df.isnull().sum().items())

After examining the null values in the TMDb movie data, we found that several columns contain null records: cast, homepage, director, tagline, keywords, overview, genres, and production_companies. The homepage, tagline, keywords, and production_companies columns in particular have a large number of null records. To move forward with our analysis, we decided to remove the tagline and keywords columns, which had a high number of null values.
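The null count and the column removal described above could be sketched like this. The small DataFrame is a made-up stand-in for the real data, and the names null_counts and df_trimmed are only illustrative:

```python
import pandas as pd

# Made-up stand-in for the movie data; None marks a missing value.
df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C"],
    "tagline": [None, None, "A tagline"],
    "keywords": [None, "space|robot", None],
    "genres": ["Drama", None, "Comedy"],
})

# Null count per column, largest first.
null_counts = df.isnull().sum().sort_values(ascending=False)
print(null_counts)

# Drop the two columns the post decided to remove.
df_trimmed = df.drop(columns=["tagline", "keywords"])
print(df_trimmed.columns.tolist())
```

Sorting the counts makes the worst columns easy to spot when there are 21 of them.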

Next, we will try to gather more descriptive information from the dataset.

df.describe()

After examining the popularity column in the TMDB movie data, we observed some outliers that appear to be valid data points. Therefore, we decided to retain the original data rather than remove these outliers.

We also noticed that the budget, revenue, and runtime columns contain many zero values. At first we wondered whether these movies had never been released, but the release_year column has no null values and its minimum (1996) is a valid year. This suggests that the movies were indeed released and that the zeros stand for missing budget, revenue, or runtime data. To find out more, we will take a closer look at these records, starting with the budget.
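As a rough illustration of the checks just described, the sketch below counts the zeros in each of the three columns and confirms that release_year has a sensible minimum and no nulls. The data is made up, and numeric_cols and zero_counts are illustrative names:

```python
import pandas as pd

# Made-up stand-in for the movie data.
df = pd.DataFrame({
    "budget": [0, 1000000, 0, 250000],
    "revenue": [0, 5000000, 0, 0],
    "runtime": [95, 120, 0, 101],
    "release_year": [2012, 2010, 2014, 1996],
})

# Comparing to 0 gives a boolean frame; summing counts the True cells per column.
numeric_cols = ["budget", "revenue", "runtime"]
zero_counts = (df[numeric_cols] == 0).sum()
print(zero_counts)

# Earliest release year and number of missing years.
print(df["release_year"].min(), df["release_year"].isnull().sum())
```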

df_budget_zero = df[df['budget'] == 0]  # assumed definition: rows with a zero budget
df_budget_zero.head(3)

Then for the revenue

df_revenue_zero = df[df['revenue'] == 0]  # assumed definition: rows with zero revenue
df_revenue_zero.head(3)

After investigating the records with zero values for budget and revenue, we concluded that the zeros most likely represent missing data rather than unreleased movies. Some of these records also had other inconsistencies or gaps that could affect the results of our analysis. Our first instinct was to drop these records rather than impute the missing values or leave them at zero.

Before doing so, we will count the zero values in each column to decide whether dropping them is reasonable or whether they should be set to null instead.

First for the budget zero values

df_budget_0count.head(2)

As the results show, there are more zero values than non-zero values. Dropping them would distort the results, so I will set them to null instead.
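The post does not show how the count tables are built. One way such a count could be produced is sketched below on made-up data; the groupby approach and the names budget_counts and zero_share are my assumptions, not necessarily what was used originally:

```python
import pandas as pd

# Made-up stand-in: three zero budgets out of five rows.
df = pd.DataFrame({"budget": [0, 0, 0, 1000000, 250000]})

# Rows per distinct budget value; groupby sorts the keys, so 0 comes first.
budget_counts = df.groupby("budget").size()
print(budget_counts.head(2))

# Share of rows whose budget is zero.
zero_share = (df["budget"] == 0).mean()
print(f"{zero_share:.0%} of rows have a zero budget")
```

The same two steps work for revenue by swapping the column name.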

Then for the revenue zero values

df_revenue_0count.head(2)

The situation is the same here, so these zeros are set to null as well.
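Setting the zeros to null in both columns might look like the sketch below. The data is made up, and using replace with np.nan is one reasonable way to do it rather than the only one:

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the movie data.
df = pd.DataFrame({
    "budget": [0, 1000000, 0],
    "revenue": [0, 5000000, 250000],
})

# Turn zeros into NaN so pandas treats them as missing, not as real amounts.
cols = ["budget", "revenue"]
df[cols] = df[cols].replace(0, np.nan)

print(df.isnull().sum())
print(df["budget"].mean())
```

The benefit over leaving the zeros in place is that pandas aggregations such as mean skip NaN by default, so the missing amounts no longer drag the statistics down.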

Finally for the runtime

The number of zeros is negligible, so these rows can be dropped.
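Dropping the rows with a zero runtime can be done with a simple boolean filter. A minimal sketch on made-up data, where df_clean is an illustrative name:

```python
import pandas as pd

# Made-up stand-in for the movie data.
df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C"],
    "runtime": [120, 0, 95],
})

# Keep only rows with a non-zero runtime and renumber the index.
df_clean = df[df["runtime"] != 0].reset_index(drop=True)
print(len(df), "->", len(df_clean))
```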

Summary

In this first part of the series, we focused on preparing the TMDb movie data for analysis. We removed some columns that had a lot of null values or were not necessary for answering our research questions, and we also dropped duplicate rows. We then removed null values in certain columns and replaced zero values with null values in the budget and revenue columns. Finally, we dropped any rows with a runtime of zero.


Thank you for reading! In the next part of the series, we will continue with data cleaning and begin to explore the data in more depth. Stay tuned!

Ciao 👋🏾