A step-by-step approach to performing data analysis with Python
In this tutorial, we will learn how to perform data analysis with Python by working through a real-life example. One of the most important skills of a data scientist is exploring data properly. As an economist once said, "if you torture the data long enough, it will confess to anything you like". So while doing data analysis, keep in mind what result you are after: you must have a clear objective.
The goal is to turn data into information, and information into insight. -- Carly Fiorina
Let's get started
What is data analysis
According to Wikipedia, data analysis is a process of inspecting, cleaning, transforming and modeling data to discover useful insights from it. Python is one of the best programming languages for data analysis. According to Cambridge research, more than 70% of data scientists use Python as their favorite tool for the job. It not only has an easy syntax but also a large repository of libraries. From data science to computer networking, Python is used everywhere.
Steps to perform data analysis in python are
Importing the packages or libraries
Loading the data
Exploratory data analysis in python
What are the packages that we are going to use for data analysis
First and foremost, the most important package is pandas, which is used for data analysis. The second package is NumPy, which is used to perform mathematical operations. The third package is Matplotlib, which is used for data visualization.
Let us understand this by taking a real-life example.
The problem is related to the film industry. As we all know, the film industry is a major source of entertainment: from Netflix to Hotstar, everybody loves watching shows and movies. In 2018, the film industry made over $41.7 billion in revenue. The question is which movies make the most money at the box office. Is it a thriller or a sci-fi movie? Who plays the lead roles? All of these factors affect how much money a movie makes.
We aim to perform data analysis on the movie dataset and answer these questions. Our dataset consists of 7000 films with 23 columns.
The columns are id, belongs_to_collection, budget, genres, homepage, imdb_id, original_language, original_title, overview, popularity, poster_path, production_companies, production_countries, release_date, runtime, spoken_languages, status, tagline, title, Keywords, cast, crew and revenue, which is the target variable.
The dataset is available on the Kaggle platform. You can download the dataset at the following link.
Let us start the coding part.
Step 1- The first step is to load all the required libraries.
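A minimal sketch of this import step, assuming the conventional aliases pd, np, plt and sns. Seaborn is included because the plotting code later in this post calls it through the sns alias.

```python
import numpy as np                 # numerical operations
import pandas as pd                # loading and analysing tabular data
import matplotlib.pyplot as plt    # base plotting library
import seaborn as sns              # statistical plots built on Matplotlib
```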
Step 2- Load the dataset
train = pd.read_csv('train.csv')
Step 3- Showing the information about the data
Exploratory data analysis in python
There are 23 columns in total: two are of float type, id is of integer type, and all the rest are of object type.
Step 3 (continued)- Check the first five rows of the dataset using the head function, and get a summary of the dataset using the describe function. describe reports the count, mean, standard deviation, minimum, quartiles (25%, 50%, 75%) and maximum of each numeric column.
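The inspection calls from this step can be sketched on a small stand-in DataFrame. The rows below are invented for illustration; on the real data the same three calls would be made on the train DataFrame.

```python
import pandas as pd

# Small stand-in for the movie data; the values are invented for illustration.
sample = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "budget": [14000000, 40000000, 3300000, 1200000, 0, 8000000],
    "runtime": [93.0, 113.0, 105.0, 122.0, 118.0, 83.0],
    "original_language": ["en", "en", "en", "hi", "ko", "en"],
})

sample.info()                  # column names, dtypes and non-null counts
first_rows = sample.head()     # first five rows by default
summary = sample.describe()    # count, mean, std, min, quartiles, max (numeric columns only)

print(first_rows)
print(summary)
```

Note that describe skips the object-type column here; passing include="all" would add it to the summary.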
Step 4- If the dataset contains any date columns, split them into year, month and day using pandas functions. In our case, this is the release_date column.
If you are applying this to your own dataset, you can also split a timestamp column into weekday, hour, minute or second as needed.
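A minimal sketch of the date split, assuming the dates are stored as ISO-style strings. The rows and the new column names (release_year and so on) are made up for illustration, and the real release_date column may use a different format that needs its own format string.

```python
import pandas as pd

# Invented rows; the date format of the real release_date column may differ.
films = pd.DataFrame({"release_date": ["2015-02-20", "2004-08-06", "2014-10-10"]})

# Parse the strings into datetimes, then pull out the individual parts.
dates = pd.to_datetime(films["release_date"], format="%Y-%m-%d")
films["release_year"] = dates.dt.year
films["release_month"] = dates.dt.month
films["release_day"] = dates.dt.day
films["release_weekday"] = dates.dt.dayofweek   # Monday=0 ... Sunday=6

print(films)
```

For timestamps that carry a time of day, dates.dt.hour, dates.dt.minute and dates.dt.second work the same way.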
After that, we plot the number of movies released per year, month, day and weekday using Matplotlib and Seaborn functions.
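One way such a count plot could be drawn is sketched below with plain Matplotlib. The release years are invented, and release_year is a hypothetical column name for the year extracted from the release date.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented release years standing in for a release_year column.
release_year = pd.Series([2012, 2013, 2013, 2014, 2014, 2014, 2015], name="release_year")

# Count films per year and keep the years in chronological order.
films_per_year = release_year.value_counts().sort_index()

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(films_per_year.index.astype(str), films_per_year.values)
ax.set_xlabel("Release year")
ax.set_ylabel("Number of films")
ax.set_title("Films released per year")
fig.tight_layout()
```

Calling plt.show() afterwards displays the figure. The same pattern applies to the month, day and weekday counts.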
Step 6- The next step is to find the null values in the dataset using the isna function.
missing = train.isna().sum().sort_values(ascending=False)
sns.barplot(x=missing[:8].values, y=missing[:8].index)  # x and y passed by keyword, as recent seaborn releases require
plt.show()
We then plot the eight columns with the most missing values. As we can see, the belongs_to_collection and homepage columns have the most missing values, around 2054, so we drop those columns.
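Counting the missing values and dropping the sparse columns can be sketched as follows. The rows are invented; only the column names come from the dataset description above.

```python
import numpy as np
import pandas as pd

# Invented rows; only the column names are taken from the dataset description.
films = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "homepage": [np.nan, "http://example.com", np.nan, np.nan],
    "belongs_to_collection": [np.nan, np.nan, np.nan, "[{'id': 1, 'name': 'X Collection'}]"],
    "budget": [100, 200, 300, 400],
})

# Missing values per column, largest first.
missing = films.isna().sum().sort_values(ascending=False)
print(missing)

# Drop the two sparsely populated columns; drop() returns a new DataFrame.
trimmed = films.drop(columns=["belongs_to_collection", "homepage"])
print(trimmed.columns.tolist())
```

Because drop returns a copy, the original DataFrame keeps its columns in case they are needed for later steps.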
Step 7- The next step is to convert the object-type columns into dictionary type using the ast package, then count the values of each column and visualize them.
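A minimal sketch of this conversion, assuming each cell holds a list of dictionaries written out as a string. The helper name parse_cell and the sample rows are hypothetical, not taken from the original code.

```python
import ast

import pandas as pd


def parse_cell(value):
    """Hypothetical helper: turn a stringified list of dicts into a real list."""
    if pd.isna(value):
        return []                      # treat missing cells as "no entries"
    return ast.literal_eval(value)     # evaluates Python literals only, unlike eval


# Invented rows in the assumed storage format.
genres = pd.Series([
    "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}]",
    None,
    "[{'id': 18, 'name': 'Drama'}]",
])

parsed = genres.apply(parse_cell)
print(parsed.apply(len).tolist())
```

ast.literal_eval is used rather than eval because it refuses anything that is not a plain literal, which matters when the strings come from an external file.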
Now we count the entries in the belongs_to_collection column.
As we can see, only 604 films belong to a collection; for all the rest the count is 0.
Now we plot the top 15 movie collections.
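The counting described above might look like the sketch below once the column has been parsed into lists of dictionaries. The rows are invented, and an empty list is assumed to mean "no collection".

```python
import pandas as pd

# Invented rows, already parsed into lists of dicts (empty list = no collection).
collections = pd.Series([
    [{"id": 1, "name": "Alpha Collection"}],
    [],
    [{"id": 1, "name": "Alpha Collection"}],
    [{"id": 2, "name": "Beta Collection"}],
    [],
])

# How many collection entries does each film have? (0 or 1 in this sample)
entries_per_film = collections.apply(len).value_counts()
print(entries_per_film)

# Collection name for films that have one, then the most frequent names.
collection_name = collections.apply(lambda items: items[0]["name"] if items else None)
top_collections = collection_name.value_counts().head(15)
print(top_collections)
```

top_collections is a Series, so top_collections.plot(kind="barh") would turn it into a bar chart.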
Now we do the same for the other columns, such as tagline and Keywords, and plot them as word clouds.
Now we find the 20 most common production countries, the 5 most commonly spoken languages and the 10 most common genres.
From this we find that most movies are produced in the USA and that the most common language is English. The most common genres are drama and comedy.
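A "most common values" count over a parsed column can be sketched with collections.Counter. The helper most_common_names and the rows are hypothetical; the same helper would work for production countries or spoken languages, assuming those columns also carry a name field.

```python
from collections import Counter

import pandas as pd

# Invented rows, already parsed into lists of dicts.
genres = pd.Series([
    [{"id": 18, "name": "Drama"}, {"id": 35, "name": "Comedy"}],
    [{"id": 18, "name": "Drama"}],
    [{"id": 53, "name": "Thriller"}, {"id": 18, "name": "Drama"}],
    [{"id": 35, "name": "Comedy"}],
])


def most_common_names(column, n):
    """Hypothetical helper: count the 'name' field across every row's list."""
    counts = Counter(item["name"] for items in column for item in items)
    return counts.most_common(n)


print(most_common_names(genres, 10))
```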
Now we plot movie revenue for every year to get an overview of how much money producers make each year.
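The yearly figures behind such a plot can be computed with a groupby, sketched here on invented numbers. release_year is a hypothetical column holding the year extracted from the release date.

```python
import pandas as pd

# Invented figures; release_year is a hypothetical column derived from release_date.
films = pd.DataFrame({
    "release_year": [2013, 2013, 2014, 2014, 2014, 2015],
    "revenue": [10_000_000, 30_000_000, 5_000_000, 15_000_000, 40_000_000, 25_000_000],
})

# Total and average revenue for each release year.
revenue_by_year = films.groupby("release_year")["revenue"].agg(["sum", "mean"])
print(revenue_by_year)
```

Calling revenue_by_year["sum"].plot() would then draw the totals as a line over the years.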
Step 8- Now we deal with our target variable, revenue. As the diagram shows, the revenue column is right-skewed (most values are small, with a long tail of high earners), so we apply a log transformation to bring it closer to a normal curve.
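The effect of a log transformation on skewed data can be sketched with synthetic values that have a long tail of large numbers. The data below is randomly generated rather than taken from the movie dataset, and np.log1p is one common choice of transform, not necessarily the one used in the original code.

```python
import numpy as np
import pandas as pd

# Synthetic, heavily skewed values standing in for the revenue column.
rng = np.random.default_rng(0)
revenue = pd.Series(rng.lognormal(mean=15, sigma=2, size=1000), name="revenue")

# log1p computes log(1 + x), so a revenue of 0 maps to 0 instead of -inf.
log_revenue = np.log1p(revenue)

print("skewness before:", revenue.skew())
print("skewness after: ", log_revenue.skew())
```

np.expm1 reverses the transform, which is useful when values on the log scale have to be converted back to actual revenue.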
Step 9- After that, we plot the revenue of each movie against its budget.
Now we move on to the last part of data analysis, which is feature engineering.
Feature engineering in python
It is a process of extracting the features from raw data using data mining techniques.
In this section, we prepare the data by filling NaN values with zeros or another suitable number, performing label encoding on categorical columns such as collection_name, scaling the data, and removing the columns that are not needed.
To do all this, we define a function prepare_data that performs all of the above tasks. We apply get_json to convert the columns into dictionaries, and then apply prepare_data to the dataset.
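The four preparation tasks can be sketched in a single function. This is an illustrative stand-in, not the original prepare_data: the name prepare_data_sketch, the sample rows, the use of pandas category codes for label encoding, and min-max scaling are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def prepare_data_sketch(df):
    """Illustrative stand-in for prepare_data; not the original implementation."""
    out = df.copy()

    # 1. Fill missing numeric values with zero.
    numeric_cols = ["budget", "runtime"]
    out[numeric_cols] = out[numeric_cols].fillna(0)

    # 2. Label-encode the categorical column: each distinct name gets an integer.
    out["collection_name"] = out["collection_name"].fillna("none")
    out["collection_name"] = out["collection_name"].astype("category").cat.codes

    # 3. Min-max scale the numeric columns to the range [0, 1].
    for col in numeric_cols:
        low, high = out[col].min(), out[col].max()
        out[col] = (out[col] - low) / (high - low) if high > low else 0.0

    # 4. Remove columns that are not needed for modelling.
    return out.drop(columns=["homepage"])


# Invented rows for demonstration.
films = pd.DataFrame({
    "budget": [100.0, np.nan, 300.0],
    "runtime": [90.0, 120.0, np.nan],
    "collection_name": ["Alpha Collection", None, "Beta Collection"],
    "homepage": [None, "http://example.com", None],
})

prepared = prepare_data_sketch(films)
print(prepared)
```

Working on a copy keeps the input DataFrame unchanged, so the same function can be applied to separate training and test sets.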
In this tutorial, we have learned how to perform data analysis in Python. Specifically, we covered exploratory data analysis, data preprocessing, data visualization and feature engineering.
The dataset code is available on GitHub as usual.