I recently went through Udacity's Data Analyst Nanodegree program. I worked on several projects there and will be writing blog posts about them in the coming weeks.

Note: This blog post is the first part of a series in which I walk through the analysis of a complete dataset. The aim is to showcase how simple data analysis can be.

Introduction

About the dataset

The dataset is called TMDB movie data. Downloaded from this page, its original version was removed by Kaggle and replaced with a similar set of movies and data fields from The Movie Database (TMDb). It contains more than 5000 movies with their basic information, including user ratings and revenue data.

A movie's success is usually evaluated by its popularity, its vote average score (rating), and its revenue. Several factors can affect that success: for example, the budget, cast, director, tagline, keywords, runtime, genres, production companies, release date, and vote average.

Looking at the data in the dataset, various questions can be asked. For example:

  • How has movie popularity changed over the years?
  • Considering the five most recent years, how is revenue distributed across different rating levels?
  • How is revenue distributed across different popularity levels?
  • What properties are associated with movies that have high popularity?
  • What properties are associated with movies that have high voting scores?
  • How many movies are released each year?
  • What are the keyword trends by generation?

In this series of blog posts, we are going to answer the questions above using the TMDB movie data, NumPy, pandas, and Matplotlib.

For this blog post, we will focus on general comments about the data.

First of all, let's import the needed packages:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter

%matplotlib inline

Data Wrangling

General Properties

Let's load the dataset and display its basic info (here I assume the CSV is saved locally as tmdb-movies.csv):

df = pd.read_csv('tmdb-movies.csv')
df.info()


Judging from the info above, the dataset has 10866 entries and 21 columns. The types used are int, float, and object (string). From the total number of entries and the non-null count per column, we can tell that several columns contain null values. Let's check the exact number of null records per column.

list(df.isnull().sum().items())

Looking at the result above, we see that the columns with null values are cast, homepage, director, tagline, keywords, overview, genres, and production_companies. Among them, homepage, tagline, keywords, and production_companies have a lot of null records. I decided to get rid of tagline and keywords since they have too many null values.

Let's try to get more descriptive information from the dataset

df.describe()

If we look at the popularity column, we can spot some outliers. Since popularity has no upper bound, it is better to just retain the original data. We can also see many zero values in the budget, revenue, and runtime columns. A first guess might be that these movies were never released, but the release_year column shows that its minimum value is a valid year and has no null values, so those movies were released. The zeroes more likely mean missing data. To decide on that, let's take a closer look at those records.
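As a quick sanity check, the zero counts for all three columns can be computed in a single expression. The snippet below runs it on a tiny stand-in frame for illustration; on the real df only the last expression is needed:

```python
import pandas as pd

# Tiny stand-in frame; on the real TMDb df the expression below is identical.
df = pd.DataFrame({
    'budget': [0, 100, 0, 30],
    'revenue': [0, 50, 20, 0],
    'runtime': [90, 0, 110, 105],
})

# Count the zero entries in each column of interest
zero_counts = (df[['budget', 'revenue', 'runtime']] == 0).sum()
print(zero_counts)
```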

First for the budget

df_budget_zero = df.query('budget == 0')
df_budget_zero.head(3)


Then for the revenue

df_revenue_zero = df.query('revenue == 0')
df_revenue_zero.head(3)


After checking the film Solace by director Afonso Poyart on Wikipedia, I noticed that the film was actually a success. This means it had a proper release, which in turn means it had a budget. Therefore, the zero values are missing data. My first instinct was to drop those records, since keeping them might distort the statistics of my analysis.

Before doing so, let's count the zero values to decide whether they should be set to null or dropped entirely.

First for the budget zero values

df_budget_0count = (df['budget'] == 0).value_counts()
df_budget_0count.head(2)


As the results suggest, there are far more zero values than non-zero values. Dropping them would bias the results, so I will set them to null instead.

Then for the revenue zero values

df_revenue_0count = (df['revenue'] == 0).value_counts()
df_revenue_0count.head(2)


Same situation: set them to null.

Finally for the runtime

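The exact snippet behind that output isn't shown; a minimal equivalent count, sketched here on a stand-in frame (on the real df only the last two lines are needed), would be:

```python
import pandas as pd

# Stand-in frame for illustration; replace with the real TMDb df.
df = pd.DataFrame({'runtime': [120, 0, 95, 88, 0]})

# Number of records with a zero runtime
runtime_zeros = (df['runtime'] == 0).sum()
print(runtime_zeros)
```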

The number of zeroes here is negligible, so those records can simply be dropped.

Summary

  • Remove columns with many null values that are unnecessary for answering the questions: homepage, tagline, imdb_id, overview, budget_adj, revenue_adj.
  • Remove duplicated records.
  • Remove the remaining null values in the columns that have them.
  • Replace zero values with null values in the budget and revenue columns.
  • Drop the records with runtime == 0.
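The cleaning steps above can be sketched in a few lines of pandas. This is a minimal sketch on a small stand-in frame (the real df comes from the TMDb CSV, and the full cleaning is done in part two):

```python
import numpy as np
import pandas as pd

# Stand-in frame with a few of the columns the post discusses;
# rows 0 and 1 are duplicates, and some budget/revenue/runtime values are zero.
df = pd.DataFrame({
    'id': [1, 1, 2, 3],
    'budget': [0, 0, 100, 50],
    'revenue': [200, 200, 0, 75],
    'runtime': [120, 120, 0, 95],
    'homepage': [None, None, 'http://example.com', None],
})

df = df.drop(columns=['homepage'])  # drop columns with many null values
df = df.drop_duplicates()           # remove duplicated records
# replace zero budget/revenue values with null (NaN)
df[['budget', 'revenue']] = df[['budget', 'revenue']].replace(0, np.nan)
df = df[df['runtime'] != 0]         # drop records with runtime == 0
print(len(df))
```

On this toy frame, the duplicate row and the zero-runtime row are removed, leaving two records with the zero budget turned into NaN.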


The first part ends here. If you enjoyed reading it, kindly check out the second part, which is about data cleaning.

Thank you for reading