In this article, we’ll introduce the key concepts related to time series.
Start by importing the following packages:

```python
### General imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
import statsmodels.api as sm

### Time Series
from statsmodels.tsa.ar_model import AR
from sklearn.metrics import mean_squared_error
from pandas.plotting import autocorrelation_plot
from statsmodels.tsa.arima_model import ARIMA
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller
# from statsmodels.tsa.statespace.sarimax import SARIMAX

### LSTM Time Series
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout
from sklearn.preprocessing import MinMaxScaler
```
Then, load the data:

```python
df = pd.read_csv('opsd_germany_daily.csv', index_col=0)
df.head(10)
```
Then, make sure to transform the dates into datetime format in pandas:

```python
df.index = pd.to_datetime(df.index)
```
I. Key concepts and definitions
1. Auto-correlation

The auto-correlation is defined as the correlation of the series with itself over time, i.e. how much the value at time $t$ depends on the value at time $t-j$ for all $j$.

- The auto-correlation of order 1 is:

$$\rho_1 = \frac{\operatorname{Cov}(X_t, X_{t-1})}{\sqrt{\operatorname{Var}(X_t)\operatorname{Var}(X_{t-1})}}$$

- The auto-correlation of order j is:

$$\rho_j = \frac{\operatorname{Cov}(X_t, X_{t-j})}{\sqrt{\operatorname{Var}(X_t)\operatorname{Var}(X_{t-j})}}$$

- The auto-covariance of order 1 is:

$$\gamma_1 = \operatorname{Cov}(X_t, X_{t-1})$$

- The auto-covariance of order j is:

$$\gamma_j = \operatorname{Cov}(X_t, X_{t-j})$$

Empirically, the auto-correlation can be estimated by the sample auto-correlation:

$$\hat{\rho}_j = \frac{\sum_{t=j+1}^{T}(X_t - \bar{X})(X_{t-j} - \bar{X})}{\sum_{t=1}^{T}(X_t - \bar{X})^2}$$
To plot the auto-correlation and the partial auto-correlation, we can use the statsmodels package:

```python
fig, axes = plt.subplots(1, 2, figsize=(15, 8))
sm.graphics.tsa.plot_acf(df['Consumption'], lags=400, ax=axes[0])
sm.graphics.tsa.plot_pacf(df['Consumption'], lags=400, ax=axes[1])
```
We observe a clear trend. The value of consumption at time $t$ is negatively correlated with the values 180 days earlier, and positively correlated with the values 360 days earlier.
2. Partial Auto-correlation
The partial autocorrelation function (PACF) gives the partial correlation of a stationary time series with its own lagged values, regressed on the values of the time series at all shorter lags. It is a regression of the series against its past lags.
How can we correct auto-correlation? Take for example a regression whose error term follows an AR(1) process:

$$y_t = \beta x_t + u_t, \qquad u_t = \rho u_{t-1} + e_t$$

The lagged equation is $y_{t-1} = \beta x_{t-1} + u_{t-1}$. Therefore, if you multiply the lagged equation by a coefficient equal to the auto-correlation and subtract it from the first:

$$y_t - \rho y_{t-1} = \beta (x_t - \rho x_{t-1}) + e_t$$

Therefore, if we want to make a regression without auto-correlation, we regress $y_t - \rho y_{t-1}$ on $x_t - \rho x_{t-1}$, whose error term $e_t$ is no longer auto-correlated.
Why would we want to remove the auto-correlation?
- to derive the usual properties of the OLS estimator of the parameters, for example
- because there is a bias otherwise, since $u_t$ would depend on $u_{t-1}$
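This correction (sometimes called quasi-differencing, as in the Cochrane-Orcutt procedure) can be sketched on simulated data; the series, the coefficients, and the assumption that $\rho$ is known are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate y_t = beta * x_t + u_t with AR(1) errors u_t = rho * u_{t-1} + e_t
n, beta, rho = 500, 2.0, 0.8
x = rng.normal(size=n)
e = rng.normal(size=n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = rho * u[t - 1] + e[t]
y = beta * x + u

# Quasi-differencing: y_t - rho*y_{t-1} = beta*(x_t - rho*x_{t-1}) + e_t
y_star = y[1:] - rho * y[:-1]
x_star = x[1:] - rho * x[:-1]

# OLS on the transformed series now has serially uncorrelated errors
beta_hat = (x_star @ y_star) / (x_star @ x_star)
```

In practice $\rho$ is unknown and is itself estimated from the OLS residuals before transforming the series.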
3. Stationarity

Stationarity of a time series is a desired property, reached when the joint distribution of $(X_t, \ldots, X_{t+k})$ does not depend on $t$. In other words, the future and the present should be quite similar. Stationary time series therefore have no underlying trend or seasonal effects.
What kind of events make a series non-stationary?

- a trend, e.g. increasing sales over time
- a seasonality, e.g. more sales during the summertime than the wintertime
We usually want our series to be stationary before applying any predictive model!
How can we test if a time series is stationary?
- look at the plots (as above)
- look at summary statistics and box plots as in the previous article. A simple trick is to cut the data set in 2, look at mean and variance for each split, and plot the distribution of values for both splits.
- perform statistical tests, using the (Augmented) Dickey-Fuller test
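The split trick mentioned above can be sketched as follows, here on a simulated trending series rather than the consumption data (the names and numbers are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# A series with a linear trend is non-stationary
series = pd.Series(0.01 * np.arange(1000) + rng.normal(size=1000))

# Cut the data set in two and compare mean and variance of each split
half = len(series) // 2
first, second = series[:half], series[half:]
print(first.mean(), second.mean())  # the means differ markedly -> non-stationary
print(first.var(), second.var())
```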
Let’s cover the Dickey-Fuller test in more detail. To do so, we need to introduce the notion of unit root. A unit root is a stochastic trend in a time series, sometimes called a random walk with drift. If a series has a unit root, it shows a systematic, yet unpredictable, pattern.
Let’s consider an autoregressive process (we’ll dive deeper into this later):

$$X_t = a_1 X_{t-1} + a_2 X_{t-2} + \cdots + a_p X_{t-p} + \varepsilon_t$$

We define the characteristic equation as:

$$m^p - a_1 m^{p-1} - a_2 m^{p-2} - \cdots - a_p = 0$$

If $m = 1$ is a root of this equation, then the process is said to have a unit root. Equivalently, the process is said to be integrated of order 1: $X_t \sim I(1)$.

In other words, there is a unit root if the previous values keep having a 1:1 impact on the current value. If we consider a simple autoregressive model AR(1): $X_t = a_1 X_{t-1} + \varepsilon_t$, the process has a unit root when $a_1 = 1$.
If a process has a unit root, then it is non-stationary, i.e. the moments of the process depend on $t$.
A process is a weakly dependent process, also called integrated of order 0 ($I(0)$), if it is already stationary. If taking the first difference of the series is enough to make it stationary, the series is integrated of order 1 and its first difference is $I(0)$:

$$\Delta X_t = X_t - X_{t-1} \sim I(0)$$
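For instance, a random walk is $I(1)$: differencing it once with pandas recovers the stationary white-noise innovations (a simulated sketch):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Random walk: X_t = X_{t-1} + e_t, which is I(1)
walk = pd.Series(np.cumsum(rng.normal(size=1000)))

# First difference: Delta X_t = e_t, which is I(0)
diff = walk.diff().dropna()
```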
The Dickey-Fuller test is used to assess whether a unit root is present in an autoregressive process:

- $H_0$ : there is a unit root and the process is not stationary.
- $H_1$ : there is no unit root and the process is stationary.
For example, in an AR(1) model $X_t = a_1 X_{t-1} + \varepsilon_t$, the hypotheses are:

- $H_0 : a_1 = 1$
- $H_1 : a_1 < 1$

The hypothesis $a_1 > 1$ would mean an explosive process, and is therefore not considered. When $a_1 < 1$, then $X_t \sim I(0)$.
In practice, we consider the following equation, obtained by subtracting $X_{t-1}$ from both sides:

$$\Delta X_t = \delta X_{t-1} + \varepsilon_t$$

We have $\delta = a_1 - 1$ and test $H_0 : \delta = 0$ against $H_1 : \delta < 0$.
Augmented Dickey-Fuller Test
The Augmented Dickey-Fuller Test (ADF) is an augmented version of the Dickey-Fuller test, in the sense that it can test a more complex set of time series models. For example, consider an ADF on an AR(p) process:

$$\Delta X_t = \delta X_{t-1} + \sum_{i=1}^{p-1} \beta_i \Delta X_{t-i} + \varepsilon_t$$

And the null hypothesis: $H_0 : \delta = 0$.
4. Ergodicity

Ergodicity is the process by which we forget the initial conditions. This is reached when the auto-correlation of order $j$ tends to $0$ as $j$ tends to $\infty$.
According to the ergodic theorem, when a time series is strictly stationary and ergodic, and when $E[|X_t|] < \infty$, then:

$$\frac{1}{T} \sum_{t=1}^{T} X_t \xrightarrow{a.s.} E[X_t]$$
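We can illustrate this convergence on a simulated stationary, ergodic AR(1) process (a sketch; the coefficient and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stationary AR(1) with mean E[X_t] = 0
n = 100_000
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.5 * x[t - 1] + rng.normal()

sample_mean = x.mean()  # converges to E[X_t] = 0 as n grows
```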
5. Exogeneity

Exogeneity describes the relation between the residuals and the explanatory variables. The exogeneity is said to be strict if:

$$E[\varepsilon_t \mid X_s] = 0 \quad \text{for all } s \text{ and for all } t$$
The exogeneity is said to be contemporaneous when:

$$E[\varepsilon_t \mid X_t] = 0$$

which is a weaker assumption, but is sufficient for consistency of the estimator.
6. Long term effect
Let’s consider again a model with lagged effects, for example the distributed-lag model $y_t = \beta_0 x_t + \beta_1 x_{t-1} + \beta_2 x_{t-2} + u_t$. In that case, we can estimate the long term effect as:

$$\beta_0 + \beta_1 + \beta_2$$
7. Granger causality

We can test the Granger causality of $x$ on $y$ using a Fisher test of the null hypothesis:

$$H_0 : \gamma_1 = \gamma_2 = \cdots = \gamma_p = 0$$

in the regression $y_t = \alpha + \sum_{i=1}^{p} \beta_i y_{t-i} + \sum_{i=1}^{p} \gamma_i x_{t-i} + u_t$. Under this hypothesis, no past value of $x$ would help predict $y$.
Conclusion: I hope you found this article useful. Don’t hesitate to drop a comment if you have a question.