You are currently viewing A step by step approach to perform data analysis with python

A step by step approach to perform data analysis with python

In the given tutorials, we will learn about how to perform data analysis with python by taking a real-life example. One of the most important skills of a data scientist is to explore data properly. As an economist says that “if you torture data long enough, it will confess to anything you had like”. While doing data analysis one thing that you have to keep in mind about what its result is going to be because you must have a clear objective.

data analysis in python tutorial

The goal is to turn data into information, and information into insight –Carley Fiorina

Some other blog post that you may want to read is

Let’s get started

What is data analysis

According to Wikipedia data analysis is a process of inspecting, cleaning, transforming and modeling data to discover useful insights from it. Python is the best programming language for doing data analysis. According to Cambridge research more than 70% of the data scientist using python as a favorite tool for doing their task. Not only it’s easy syntax but it also has a large repository of libraries. From data science to computer networking everybody is using python.

Steps to perform data analysis in python are

  • Importing the packages or libraries
  • Loading the data
  • Exploratory data analysis in python
  • Data preprocessing
  • Data Visualization

What are the packages that we are going to use for data analysis

First and foremost the most important package is pandas. It is used for doing data analysis. The second package that we are going to use is numpy that is used to perform the mathematical operation. And the third package that we going to use is Matplotlib that is used to perform data visualization.

Lets us understand this thing by taking real-life examples

Problem definition

The problem is related to the film industry. As we all know that the film industry is the major source of entertainment from Netflix to Hotstar, everybody loves watching shows and movies. In 2018 the film industry has made over $41.7 billion in revenue. The question arises is that what movies make the most money at the box office is it a thriller movie or a sci-fi movie, who is playing the lead roles in the movie all that factor are responsible for making the most money.

We aim to perform data analysis on the movie dataset and answered these questions. Our dataset consists of 7000 films with 23 columns.

The columns are id, belongs_to_collection, budget of a movie, genres, homepage, imdb_id, original_language of that film, original_title, overview, popularity, poster_path, production_companies, production_countries, release_date, runtime, spoken_languages, status, tagline, title, Keywords, cast and crew.

The dataset is available on the Kaggle platform. You can download the dataset at the following link.

let us start the coding part

Step 1- The first step is to load all the required libraries.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 
import warnings
from tqdm import tqdm
from datetime import datetime
import json
from collections import Counter
from wordcloud import WordCloud
from sklearn.preprocessing import LabelEncoder

Step 2- Load the dataset

train = pd.read_csv('train.csv')

Step 3- Showing the information about the data

train.info()

Exploratory data analysis in python

There is a total of 23 columns out of which two are of float type, id is an integer type and rests all of them are object types.

Step 3- The third step is to check the first five rows of the dataset by suing head function and find out the summary of the dataset. Describe function is used for finding count, mean, standard deviation, interquartile range, minimum and maximum.

train.head()
train.describe() 

Step 4- If there are any date columns present in the dataset then we should split the timestamp column it into the month, day and year by using pandas function. In our case, it is the release date column.

train[['release_month','release_day','release_year']]=train['release_date'].str.split('/',expand=True).replace(np.nan, -1).astype(int)

train.loc[ (train['release_year'] <= 19) & (train['release_year'] < 100), "release_year"] += 2000

train.loc[ (train['release_year'] > 19)  & (train['release_year'] < 100), "release_year"] += 1900

releaseDate = pd.to_datetime(train['release_date']) 
train['release_dayofweek'] = releaseDate.dt.dayofweek
train['release_quarter'] = releaseDate.dt.quarter

If you are applying this thing on your dataset you can also split the time stamp column into weekday, hour, a minute or second as per your need.

After that, we plot the no of movies release in a year, month, day and weekday by using the Matplotlib and Seaborn function.

Data Visualization- data analysis in python

Step 6- The next step is to find out the null values in the dataset by using isna function.

missing=train.isna().sum().sort_values(ascending=False)
sns.barplot(missing[:8],missing[:8].index)
plt.show()

And then plot the top 8 columns that is missing. As we can see that belongs to the collection and the homepage column has the most missing values around 2054 so we drop that column.
NAN value in dataset

Step 7- The next step is to convert the object type column into a dictionary type by using the ast package and count the value of each column and visualize it.

Now we count the belongs_to_collection column.

As we can see there are only 604 films belong to some collections rest all are 0.

Now we plot the top 15 collections of movies.

import ast
dict_columns = ['belongs_to_collection', 'genres', 'production_companies',
                'production_countries', 'spoken_languages', 'Keywords', 'cast', 'crew']

def text_to_dict(df):
    for column in dict_columns:
        df[column] = df[column].apply(lambda x: {} if pd.isna(x) else ast.literal_eval(x) )
    return df
        
dfx = text_to_dict(train)
for col in dict_columns:
       train[col]=dfx[col]
       
train['belongs_to_collection'].apply(lambda x:len(x) if x!= {} else 0).value_counts()

collections=train['belongs_to_collection'].apply(lambda x : x[0]['name'] if x!= {} else '?').value_counts()[1:15]
sns.barplot(collections,collections.index)
plt.show()

Data Visualization in pythonData Visualization

Now we do the same thing for all other columns like tagline and keyword column and then plot them in a word cloud.

from wordcloud import WordCloud
train['tagline'].apply(lambda x:1 if x is not np.nan else 0).value_counts()
plt.figure(figsize=(10,10))
taglines=' '.join(df_train['tagline'].apply(lambda x:x if x is not np.nan else ''))
wordcloud = WordCloud(max_font_size=None, background_color='white', collocations=False,
width=1200, height=1000).generate(taglines)
plt.imshow(wordcloud)
plt.title('Taglines')
plt.axis("off")
plt.show()


keywords=train['Keywords'].apply(lambda x: ' '.join(i['name'] for i in x) if x != {} else '')
plt.figure(figsize=(10,10))
data=' '.join(words for words in keywords)
wordcloud = WordCloud(max_font_size=None, background_color='white', collocations=False,
width=1200, height=1000).generate(data)
plt.imshow(wordcloud)
plt.title('Keywords')
plt.axis("off")
plt.show()

how to make word cloud in pythonword cloud in python

Now we find out the 20 most common production countries, 5 most commonly spoken languages and the top 10 most common genres.

# 20 most common production countries
countries=train['production_countries'].apply(lambda x: [i['name'] for i in x] if x!={} else []).values
count=Counter([j for i in countries for j in i]).most_common(10)
sns.barplot([val[1] for val in count],[val[0] for val in count])

# count of different different languages
train['spoken_languages'].apply(lambda x:len(x) if x !={} else 0).value_counts()

# 5 most common spoken languages
lang=train['spoken_languages'].apply(lambda x: [i['name'] for i in x] if x != {} else [])
count=Counter([i for j in lang for i in j]).most_common(5)
sns.barplot([val[1] for val in count],[val[0] for val in count])

# 10 most common genres
genre=train['genres'].apply(lambda x: [i['name'] for i in x] if x != {} else [])
count=Counter([i for j in genre for i in j]).most_common(10)
sns.barplot([val[1] for val in count],[val[0] for val in count])

\genre plotting

From there we can find out that most movies are produced in the USA and the most common language of the movie is English. And the people most liked only those movies which contain drama and comedy.

Now we plot the movie revenue every single year to get an overview of how much much that a producer making every year.

train['meanRevenueByYear'] = train.groupby("release_year")["revenue"].aggregate('mean')
train['meanRevenueByYear'].plot(figsize=(15,10),color="g")
plt.xticks(np.arange(1920,2018,4))
plt.xlabel("Release Year")
plt.ylabel("Revenue")
plt.title("Movie Mean Revenue By Year",fontsize=20)
plt.show()

Step 8- Now we deal with our target variable which is revenue. As you can see that in diagram revenue column is left-skewed so we apply log transformation to make it a normal curve.

plot distributionplot

Step 9- After that, we plot what is the revenue of the movie and how much budget

revenue vs budget

Now we move onto the last part of data analysis which is feature engineering

Feature engineering in python

It is a process of extracting the features from raw data using data mining techniques.

In this section, we are going to prepare the data by filling the NaN value with zeros or any suitable no, and then we perform label encoding on the categorical columns like collection_name and after that, we are going to scale the data and remove the columns that are not necessary.

For all this thing we are going to define a function prepare_data which will perform all the above task. And then we apply get_json to convert the columns into the dictionary. And then we apply the function on the dataset.

def  prepare_data(df):
    df['_budget_runtime_ratio'] = (df['budget']/df['runtime']).replace([np.inf,-np.inf,np.nan],0)
    df['_budget_popularity_ratio'] = df['budget']/df['popularity']
    df['_budget_year_ratio'] = df['budget'].fillna(0)/(df['release_year']*df['release_year'])
    df['_releaseYear_popularity_ratio'] = df['release_year']/df['popularity']
    df['_releaseYear_popularity_ratio2'] = df['popularity']/df['release_year']
    df['budget']=np.log1p(df['budget'])
    
    df['collection_name']=df['belongs_to_collection'].apply(lambda x: x[0]['name'] if x != {} else 0)
    df['has_homepage']=0
    df.loc[(pd.isnull(df['homepage'])),'has_homepage']=1
    
    le=LabelEncoder()
    le.fit(list(df['collection_name'].fillna('')))
    df['collection_name']=le.transform(df['collection_name'].fillna('').astype(str))
    
    le=LabelEncoder()
    le.fit(list(df['original_language'].fillna('')))
    df['original_language']=le.transform(df['original_language'].fillna('').astype(str))
    
    df['_num_Keywords'] = df['Keywords'].apply(lambda x: len(x) if x != {} else 0)
    df['_num_cast'] = df['cast'].apply(lambda x: len(x) if x != {} else 0)
    
    df['isbelongto_coll']=0
    df.loc[pd.isna(df['belongs_to_collection']),'isbelongto_coll']=1
    
    df['isTaglineNA'] = 0
    df.loc[df['tagline'] == 0 ,"isTaglineNA"] = 1 

    df['isOriginalLanguageEng'] = 0 
    df.loc[ df['original_language'].astype(str) == "en" ,"isOriginalLanguageEng"] = 1
    
    df['ismovie_released']=1
    df.loc[(df['status']!='Released'),'ismovie_released']=0
    
    df['no_spoken_languages']=df['spoken_languages'].apply(lambda x: len(x))
    df['original_title_letter_count'] = df['original_title'].str.len() 
    df['original_title_word_count'] = df['original_title'].str.split().str.len() 


    df['title_word_count'] = df['title'].str.split().str.len()
    df['overview_word_count'] = df['overview'].str.split().str.len()
    df['tagline_word_count'] = df['tagline'].str.split().str.len()
    
    
    df['collection_id'] = df['belongs_to_collection'].apply(lambda x : np.nan if len(x)==0 else x[0]['id'])
    df['production_countries_count'] = df['production_countries'].apply(lambda x : len(x))
    df['production_companies_count'] = df['production_companies'].apply(lambda x : len(x))
    df['cast_count'] = df['cast'].apply(lambda x : len(x))
    df['crew_count'] = df['crew'].apply(lambda x : len(x))

    df['genders_0_crew'] = df['crew'].apply(lambda x: sum([1 for i in x if i['gender'] == 0]))
    df['genders_1_crew'] = df['crew'].apply(lambda x: sum([1 for i in x if i['gender'] == 1]))
    df['genders_2_crew'] = df['crew'].apply(lambda x: sum([1 for i in x if i['gender'] == 2]))

    for col in  ['genres', 'production_countries', 'spoken_languages', 'production_companies'] :
        df[col] = df[col].map(lambda x: sorted(list(set([n if n in train_dict[col] else col+'_etc' for n in [d['name'] for d in x]])))).map(lambda x: ','.join(map(str, x)))
        temp = df[col].str.get_dummies(sep=',')
        df = pd.concat([df, temp], axis=1, sort=False)
    df.drop(['genres_etc'], axis = 1, inplace = True)
    
    cols_to_normalize=['runtime','popularity','budget','_budget_runtime_ratio','_budget_year_ratio','_budget_popularity_ratio','_releaseYear_popularity_ratio',
    '_releaseYear_popularity_ratio2','_num_Keywords','_num_cast','no_spoken_languages','original_title_letter_count','original_title_word_count',
    'title_word_count','overview_word_count','tagline_word_count','production_countries_count','production_companies_count','cast_count','crew_count',
    'genders_0_crew','genders_1_crew','genders_2_crew']
    for col in cols_to_normalize:
        print(col)
        x_array=[]
        x_array=np.array(df[col].fillna(0))
        X_norm=normalize([x_array])[0]
        df[col]=X_norm
    
    df = df.drop(['belongs_to_collection','genres','homepage','imdb_id','overview','id'
    ,'poster_path','production_companies','production_countries','release_date','spoken_languages'
    ,'status','title','Keywords','cast','crew','original_language','original_title','tagline', 'collection_id'
    ],axis=1)
    
    df.fillna(value=0.0, inplace = True) 

    return df
    
    
def get_json(df):
    global dict_columns
    result=dict()
    for col in dict_columns:
        d=dict()
        rows=df[col].values
        for row in rows:
            if row is None: continue
            for i in row:
                if i['name'] not in d:
                    d[i['name']]=0
                else:
                    d[i['name']]+=1
            result[col]=d
    return result
    
    

    
train_dict=get_json(train)
test_dict=get_json(dftest)

for col in dict_columns :
    
    remove = []
    train_id = set(list(train_dict[col].keys()))
    test_id = set(list(test_dict[col].keys()))   
    
    remove += list(train_id - test_id) + list(test_id - train_id)
    for i in train_id.union(test_id) - set(remove) :
        if train_dict[col][i] < 10 or i == '' :
            remove += [i]
    for i in remove :
        if i in train_dict[col] :
            del train_dict[col][i]
        if i in test_dict[col] :
            del test_dict[col][i]
            
            
df_test['revenue']=np.nan
all_data=prepare_data((pd.concat([df_train,df_test]))).reset_index(drop=True)
train=all_data.loc[:df_train.shape[0]-1,:]
test=all_data.loc[df_train.shape[0]:,:]
print(train.shape)

Conclusion

In this tutorial, we have learned about how to perform data analysis in python. In specific we have learned about exploratory data analysis, data preprocessing, Data Visualization and feature engineering.

The dataset code is available on GitHub as usual.

So if you like this blog post, please like it and subscribe to our data spoof community to get real-time updates. You can follow our Facebook page to get notification whenever we upload any post so you can never miss any update from us.

This Post Has 3 Comments

  1. bursa escort

    Thanks to my father who shared with me about this weblog, this blog is in fact amazing. Noelyn Lars Marthe Hedvige Eugen Peadar

  2. erotik

    Hi there, just wanted to say, I loved this article. Lyda Filmer Farlay

  3. erotik

    I love reading through a post that will make people think. Also, thanks for permitting me to comment!| Myrta Aubrey Catharine

Comments are closed.