Time Series Analysis in Python - Part 2
In the previous blog post, we learned about the theory of time series analysis. Now we move on to the second part, which is coding. If you haven't read the first part, read it here.
In this blog post, we will solve a real-world problem using time series analysis in Python.
Some other cool problems that you can solve using time series analysis are:
1- Stock market forecasting
2- Predicting the revenue of a retail store
3- Anomaly detection
4- Census analysis
And many more.
Step 1- Acquire the Data
This is the Boston crime dataset, which is publicly available on the Kaggle platform. You can download the dataset directly from Kaggle. Here is the link. The dataset consists of 2 files:
1- crimes.csv, which contains the crime records from June 14, 2015 to September 3, 2018. It has 17 columns: INCIDENT_NUMBER, OFFENSE_CODE, OFFENSE_CODE_GROUP, OFFENSE_DESCRIPTION, DISTRICT, REPORTING_AREA, SHOOTING, OCCURRED_ON_DATE, YEAR, MONTH, DAY_OF_WEEK, HOUR, UCR_PART, STREET, Lat, Long, Location.
2- offense code.csv, which maps each offense code to its description. For example, offense code 3111 is LICENSE PREMISE VIOLATION.
Step 2- Understand the Problem
In this problem, we have a dataset of crimes recorded in Boston.
If you are unable to understand the cause of a problem, it is impossible to solve it. -- Naoto Kan
We have the following objectives. First, we have to find the most common types of crimes that happen in the city. Second, we have to forecast the different types of crimes that are most likely to occur in the city. Now that we have understood the problem, let us start coding and look at things in detail.
Step 3- Load all the required libraries
We load all the packages needed to solve this problem. NumPy is used for mathematical operations, pandas is used for data processing, and seaborn and Matplotlib are used for plotting the graphs. We import the ARIMA model from the statsmodels package, along with the ADF test for checking stationarity and the two plotting functions for the autocorrelation function and the partial autocorrelation function. From the sklearn library, we use MinMaxScaler to normalize the data, i.e. to rescale the values into the range [0, 1].
Step 4- Exploratory data analysis
In this part, we are going to derive some meaningful insights from the data.
Now we import the dataset and then perform EDA on it.
Line 3: We drop the INCIDENT_NUMBER column from our dataset, since it is only a record identifier.
Line 4: After that, we apply date formatting to the OCCURRED_ON_DATE column.
Line 7: We check the description of the dataset, i.e. the count, mean, standard deviation, minimum, maximum and quantile values (the median is the 50% quantile).
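As a rough sketch of these three steps (not the exact code from the repository), the snippet below uses a tiny in-memory stand-in for crimes.csv. The rows are made up and only a few of the 17 columns are included; `load_crimes` is a hypothetical helper name. With the real file you would pass its path instead, and if `read_csv` raises a UnicodeDecodeError, passing an explicit `encoding` argument may help.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for crimes.csv (made-up rows, real column names).
SAMPLE_CSV = io.StringIO(
    "INCIDENT_NUMBER,OFFENSE_CODE_GROUP,OCCURRED_ON_DATE,HOUR\n"
    "I0001,Larceny,2015-06-15 10:30:00,10\n"
    "I0002,Vandalism,2015-06-15 22:05:00,22\n"
    "I0003,Larceny,2015-06-16 08:45:00,8\n"
)

def load_crimes(source):
    """Read the crime records, drop the ID column and parse the timestamp."""
    df = pd.read_csv(source)
    df = df.drop(columns=["INCIDENT_NUMBER"])                        # identifier only
    df["OCCURRED_ON_DATE"] = pd.to_datetime(df["OCCURRED_ON_DATE"])  # text -> datetime
    return df

crimes = load_crimes(SAMPLE_CSV)
print(crimes.dtypes)
print(crimes.describe())   # count, mean, std, min, quantiles, max
```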
Lines 9-38: Now we count the crimes that happen in each period and plot the counts. After that, we calculate the skewness and kurtosis of the counts: the skewness is -0.23642018110351762 and the kurtosis is 0.6916671057308545.
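Here is a minimal sketch of counting crimes per period and measuring the shape of the resulting distribution. It assumes a daily period and uses randomly generated timestamps in place of the real OCCURRED_ON_DATE column, so the numbers it prints will not match the values above.

```python
import numpy as np
import pandas as pd

# Made-up event timestamps standing in for the OCCURRED_ON_DATE column.
rng = np.random.default_rng(0)
start = pd.Timestamp("2015-06-15")
offsets = rng.integers(0, 60 * 24 * 90, size=5000)        # minutes within 90 days
timestamps = start + pd.to_timedelta(offsets, unit="min")

# One row per crime, then count the rows that fall on each calendar day.
events = pd.Series(1, index=timestamps).sort_index()
daily_counts = events.resample("D").sum()

print("skewness:", daily_counts.skew())
print("kurtosis:", daily_counts.kurt())   # excess kurtosis, 0 for a normal distribution
```

A skewness near 0 means the distribution is roughly symmetric, and pandas reports excess kurtosis, so a value near 0 means tails similar to those of a normal distribution.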
Next, we run two normality tests, the Shapiro-Wilk test and the Kolmogorov-Smirnov test, and look at their p-values. In this case, the p-values are very small, i.e. less than 5%. So we reject the hypothesis of normality at the 5% significance level: the distribution is significantly different from a normal distribution.
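Both tests are available in SciPy. The sketch below runs them on a made-up, clearly skewed sample rather than on the crime counts. One caveat: standardizing with the sample's own mean and standard deviation makes the Kolmogorov-Smirnov p-value only approximate, so treat it as a rough check.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.exponential(scale=10.0, size=500)   # made-up, clearly skewed data

# Shapiro-Wilk: the null hypothesis is that the sample comes from a normal distribution.
shapiro_stat, shapiro_p = stats.shapiro(sample)

# Kolmogorov-Smirnov: standardize first, then compare with a standard normal.
z = (sample - sample.mean()) / sample.std(ddof=1)
ks_stat, ks_p = stats.kstest(z, "norm")

alpha = 0.05
for name, p in [("Shapiro-Wilk", shapiro_p), ("Kolmogorov-Smirnov", ks_p)]:
    verdict = "reject normality" if p < alpha else "cannot reject normality"
    print(f"{name}: p = {p:.3g} -> {verdict}")
```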
After that, we plot the distribution of crimes over time. In this chart, we can see many peaks and troughs, which look like a sine function.
Lines 46-71: Now we look at the autocorrelation and partial autocorrelation (lags = 200 and lags = 15). From this we can conclude that the correlation is positive from lag 1 to lag 100 and negative from lag 100 to lag 200.
Lines 1-24: Now we perform a seasonal decomposition to get a clear view of the pattern in the distribution of crimes.
Lines 26-34: After that, we check the stationarity of the data using a visual plot test, the Augmented Dickey-Fuller (ADF) test or the KPSS test. If you don't know about these, I recommend going through the previous blog post, where I explain them in great detail.
Lines 37-44: Now we know that the data is not stationary, so we have to make it stationary, either by subtracting a simple moving average or by differencing the series with a lag of 1.
Lines 47-51: We plot the autocorrelation and partial autocorrelation plots.
Step 5- Implement the model
In this section, we implement the ARIMA model with an alpha value of 0.05, which corresponds to a 95% confidence interval around the forecast. We forecast one year ahead and plot the results.
Looking at the 1-year forecast, the yellow line is the prediction of daily crimes. It looks like the predictions are consistently underestimated.
Now we have addressed the objectives above.
Ans 1- The frequency of crimes looks like a normal distribution, but it doesn't pass the Shapiro-Wilk test or the Kolmogorov-Smirnov test, so it is significantly different from a normal distribution.
Ans 2- It's possible to forecast the daily frequency of crimes using the ARIMA model, but because of the seasonality, the forecasting model is not perfect.
Wrap up the Session
Now we have gone through the code step by step in great detail. Your task now is to implement this time series code in your own model.
The code is available on the GitHub repository as usual.
If you liked this blog post, please like it and subscribe to our Data Spoof community to get real-time updates. You can also follow our Facebook page to get a notification whenever we upload a new post, so you never miss an update from us.