You are currently viewing Time Series analysis in python- part 2

Time Series analysis in python- part 2

In the previous blog post, we learn about the theory of time series analysis. Now we move on to the second part of this blog which is coding. If you haven’t read the first part read it here.

In this blog post, we will solve a real-world problem using time series analysis in python.

There are some other cool problems that you can solve using time series analysis are:-

1- stock market forecasting

2- predicting the revenues for any retail store.

3- anomaly detection

4- Census Analysis

And many more

Some other blog post that you may want to read is


What is time series analysis

According to Wikipedia Time series analysis is a statistical technique that is used to deal with time-series data i.e data is in the series of a time interval or periods.

Coding of time series analysis

Step 1- Acquire the Data

This is a Boston crime dataset that is publicly available on the Kaggle platform. You can download the dataset directly from Kaggle. Here is the link. There are a total of 2 files.

1- crimes.csv which contains the criminal record from June 14, 2015, and continues to September 3, 2018. There is a total of 17 columns like INCIDENT_NUMBER, OFFENSE_CODE, OFFENSE_CODE_GROUP, OFFENSE_DESCRIPTION, DISTRICT, REPORTING_AREA, SHOOTING, OCCURRED_ON_DATE, YEAR, MONTH, DAY_OF_WEEK, HOUR, UCR_PART, STREET, Lat, Long, Location.

2- offense code.csv which contains all corresponding values that map to offense code. Like offense code 3111 is LICENSE PREMISE VIOLATION. 

Step 2- Understand the Problem

In the given problem, we have a dataset of crimes records that are happening in Boston.

If you’re unable to know the explanation for a problem it’s impossible to unravel it.

Naoto Kan

Now we have the following objectives.

First- we have to find what are the most common types of crimes that are happening in the city.

Second- we have to forecast the different types of crimes that are most likely to occur in that city.

Now we have understood the problem let us start coding and understand the things in great detail.

Step 3- Load all the required libraries

#Step 3- Importing the libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import warnings
import itertools
import sklearn
import scipy
import seaborn as sns
from matplotlib import pyplot as plt
import squarify
import matplotlib.ticker as ticker
import statsmodels.api as sm
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima_model import ARMA
from statsmodels.tsa.stattools import adfuller
import statsmodels.api as sm
import sys
from sklearn.preprocessing import MinMaxScaler
import math

We load all the necessary packages to solve this problem

numpy is used for mathematical operations, pandas are used for data processing, seaborn and Matplotlib are used for plotting the graphs.

We import our ARIMA model from the stats model package, we also import the ADF test module for checking the stationarity and we also load the two plots are auto-correlation function and partial autocorrelation function.

From the sklearn library, we use min-max scaler function to perform data normalization i.e value lies in range(0,1).

Step 4- Exploratory data analysis

In this part, we are going to derive some meaningful insights from the data.

Now we import the dataset and then perform EDA on it.

#Import the dataset
df = pd.read_csv('../input/crimes-in-boston/crime.csv',encoding='latin-1')

df.drop("INCIDENT_NUMBER",axis=1, inplace=True) 

df[["DATE","TIME"]]=df['OCCURRED_ON_DATE'].str.split(" ",expand=True)


print(df.describe())

# plot line chart
def lineplt(x,y,xlabel,ylabel,title,size,tick_spacing):
    fig,ax=plt.subplots(figsize = size)
    plt.plot(x,y)
    ax.xaxis.set_major_locator(ticker.MultipleLocator(tick_spacing))
    plt.xlabel(xlabel,fontsize = 15)
    plt.ylabel(ylabel,fontsize = 15)
    plt.title(title,fontsize = 20)
    plt.show()

# Create 2 columes DateFrame
def createdf(c1,d1,c2,d2):
    dic = {c1:d1,c2:d2}
    df = pd.DataFrame(dic)
    return df

# Plot histogram
def plthis(d,bin, title):
    plt.figure(figsize=(10,8))
    plt.hist(d, bins=bin)
    plt.title(title, fontsize = 20)
    plt.show()
    
# Put Date and Count into a new Dataframe
c = createdf("Date",df["DATE"].value_counts().index,"Count",df["DATE"].value_counts())

# c is the total number of crimes per day
c.head(5)

plthis(c["Count"],50, "Crimes Count Distribution")



c=c.sort_values(by="Date",ascending = True)
lineplt(c["Date"],c["Count"],"Date","Count","Crimes by Time",(20,15),80)


fig = plt.figure(figsize=(16,16))
ax1 = fig.add_subplot(411)

fig = plot_acf(c["Count"],lags=200,ax=ax1)
plt.title('Autocorrelation Lag=200')
ax2 = fig.add_subplot(412)

fig = plot_pacf(c["Count"],lags=200,ax=ax2)
plt.title('Partial Autocorrelation Lag=200')
ax3 = fig.add_subplot(413)

fig = plot_acf(c["Count"],lags=15,ax=ax3)
plt.title('Autocorrelation Lag=15')
ax4 = fig.add_subplot(414)

fig = plot_pacf(c["Count"],lags=15,ax=ax4)
plt.title('Partial Autocorrelation Lag=15')
plt.subplots_adjust(left=None, bottom=None, right=None, top=None,
                wspace=None, hspace=0.5)
plt.show()

Line 23– Now we drop the target column i.e Incident number from our dataset.

Line 25– After that, we apply date formatting on the “Occurred Date” column.

Line 28– we are going to check the description of the dataset i.e mean, median, count, standard deviation, minimum, maximum and quantile values.

description of columns
distribution of crimes in boston city

Line 30-59 Now we find out the count of crimes that are happening in a period. And plot them.

And after that, we calculate the
value of skewness and kurtosis

skewness is -0.23642018110351762 kurtosis is 0.6916671057308545

Now we have to do two tests that are the Shapiro-Wilk test and the Kolmogorov-Smirnov test to find out the p-value. In this case, the p-value is very small means less than 5%.
So we can conclude that the distribution is significantly different from normal distribution under 95% confidence.

distribution of crimes over time

After that, we plot the distribution of crimes over time.

In this chart, we can see that there are many peaks and troughs which seem like sin function.

Line 63-86 Now we look at the autocorrelation and partial autocorrelation
(lag = 200 & lag = 15). From this we can conclude that from lag 1 to lag 100, the correlation is positive and from lag 100 to lag 200, the correlation is negative.

autocorrelation and partial autocorrelation plots

Line 87-113 Now we perform seasonal decomposition to get a clear view of the pattern of the distribution of crimes.

res = sm.tsa.seasonal_decompose(c['Count'],freq=12,model="additive")
# # original = res
trend = res.trend
seasonal = res.seasonal
residual = res.resid

fig,ax=plt.subplots(figsize = (20,15))
ax1 = fig.add_subplot(411)
ax1.xaxis.set_major_locator(ticker.MultipleLocator(80))
ax1.plot(c['Count'], label='Original')
ax1.legend(loc='best')

ax2 = fig.add_subplot(412)
ax2.xaxis.set_major_locator(ticker.MultipleLocator(80))
ax2.plot(trend, label='Trend')
ax2.legend(loc='best')

ax3 = fig.add_subplot(413)
ax3.xaxis.set_major_locator(ticker.MultipleLocator(10))
ax3.plot(seasonal[:100],label='Seasonality')
ax3.legend(loc='best')

ax4 = fig.add_subplot(414)
ax4.xaxis.set_major_locator(ticker.MultipleLocator(80))
ax4.plot(residual, label='Residuals')
ax4.legend(loc='best')
plt.tight_layout()

def test_stationarity(series,mlag = 365, lag = None,):
    print('ADF Test Result')
    res = adfuller(series, maxlag = mlag, autolag = lag)
    output = pd.Series(res[0:4],index = ['Test Statistic', 'p value', 'used lag', 'Number of observations used'])
    for key, value in res[4].items():
        output['Critical Value ' + key] = value
    print(output)
    
test_stationarity(c['Count'],lag = 'AIC')


d1 = c.copy()
d1['Count'] = d1['Count'].diff(1)
d1 = d1.dropna()
lineplt(d1["Date"],d1["Count"],"Date","Count","Crimes by Time",(20,15),80)
print('Average= '+str(d1['Count'].mean()))
print('Std= ' + str(d1['Count'].std()))
print('SE= ' + str(d1['Count'].std()/math.sqrt(len(d1))))
print(test_stationarity(d1['Count'],lag = 'AIC'))


fig_2 = plt.figure(figsize=(16,8))
ax1_2 = fig_2.add_subplot(211)

fig_2 = plot_acf(d1["Count"],lags=15,ax=ax1_2)
ax2_2 = fig_2.add_subplot(212)

fig_2 = plot_pacf(d1["Count"],lags=15,ax=ax2_2)

Line 115-123 After that we check for the stationarity of the data by using a visual plot test or Augmented Dickey-Fuller test or KPSS test. If you don’t know about these I will recommend you go through the previous blog post where I explain the things in great detail.

Line 126-133 Now we know that the data is not stationary, so we have to make the data stationary by using a simple moving average or by differencing the date column by 1.

Line 136-142 we plot the auto-correlation plot and partial auto-correlation plot.

autocorrelation and partial autocorrelation plot

Step 5- Implement the model

In this section, we are going to implement the ARIMA model with alpha value .05. We are forecasting 1-year results and plot them in the graph.

timeseries = c['Count']
p,d,q = (4,1,2)
arma_mod = ARMA(timeseries,(p,d,q)).fit()
summary = (arma_mod.summary2(alpha=.05, float_format="%.8f"))
print(summary)

predict_data = arma_mod.predict(start='2016-07-01', end='2017-07-01', dynamic = False)
timeseries.index = pd.DatetimeIndex(timeseries.index)
fig, ax = plt.subplots(figsize=(20, 15))
ax = timeseries.plot(ax=ax)
predict_data.plot(ax=ax)
plt.show()
Summary of dataset
final result

By looking at the prediction of 1-year data, the yellow line is the prediction of daily crimes. It looks the predictions are always underestimated.

Now we have solved the above objectives

Ans1-The frequency of crimes looks like a normal shape distribution. But it doesn’t pass the Shapiro-Wilk test and Kolmogorov-Smirnov test, so it’s Significantly Different from a normal distribution.

Ans2– It’s possible to forecast the daily frequency of crimes using the ARIMA model, but due to the seasonality, the forecasting model is not perfect.

Wrap up the Session

Now we have understood the code step by step in great detail. Now the task for you 

guys to implement this time series code in your model.

The code is available on the GitHub repository as usual.

So if you like this blog post, please like it and subscribe to our data spoof community to get real-time updates. You can follow our, Facebook page to get notifications whenever we upload any post so you can never miss any update from us.