Before starting this series of articles on Machine Learning, I thought it might be a good idea to review some statistics first. This first article is an introduction to more detailed articles on statistics. I will illustrate some concepts using Python code.
Linear Regression in 1 dimension
The most basic statistical foundation of machine learning models is the following: $y_i = f(x_i) + \epsilon_i$. This simple equation states the following:
- we suppose we have $n$ observations in a dataset, and we pick one of them, observation $i$
- $y_i$ is the output of this observation, called the target
- $x_i$ is called a feature and is an independent variable we observe
- $f$ is the real model that states the link between the features and the output
- $\epsilon_i$ is the noise of the model. The data we observe usually do not stand on a straight line, because there are variations of the measure in real life.
The framework we consider here is pretty simple. We only have one feature per observation. In real life, we usually have several features per observation. An example of this might be the following: you work as a data scientist in an insurance company. You would typically have several pieces of information on each customer: name, address, age, family, car, salary…
Suppose that we have $n$ observations, and for each observation $i$, we have one feature $x_i$.
Our role is to identify the link between $x_i$ and $y_i$ in order to be able to predict $y_i$. To build our prediction, we need to identify:
- a model, which is the link between $y_i$ and the features, represented here by the function $f$ in $y_i = f(x_i) + \epsilon_i$. In real life, this model is unknown and is estimated by $\hat{f}$, where the hat denotes an estimation. We try to estimate an $\hat{f}$ that corresponds to the real-life $f$.
- parameters, which will help our prediction get as close as possible to the real data. The parameters are usually set in order to minimize the distance between our model and the data, i.e. they are set so that the derivative of a loss function (for example $\sum_{i=1}^n (y_i - \hat{y}_i)^2$) is equal to 0.
The true model and the true parameters in real life are unknown. For this reason, we make a selection among several models (linear or non-linear). The model selection is usually based on an exploratory data analysis. A later article will develop this question further.
From now on, we will focus on the most basic model called the unidimensional linear regression. This model states that there is a linear link between the features and the dependent variable.
The true model we expect is the following: $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, where:
- $\beta_0$ is called the intercept; it is a constant term
- $\beta_1$ is the coefficient associated with $x_i$. It describes the weight of $x_i$ on the final output.
- $\epsilon_i$ is called the residual. It is a white noise term that explains the variability in real-life data.
The hypotheses on $\epsilon$ are:
- $E[\epsilon_i] = 0$, i.e. a white noise condition
- $Var(\epsilon_i) = \sigma^2$ for all $i = 1, \dots, n$, i.e. a homoskedasticity condition
However, the true parameters remain unknown, as these are the ones we want to estimate. Therefore, the model we estimate is the following: $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$.
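These assumptions can be made concrete with a short simulation. The sketch below generates data from a hypothetical true model (the values $\beta_0 = 2$, $\beta_1 = 3$ and $\sigma = 0.5$ are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
beta0_true, beta1_true = 2.0, 3.0      # hypothetical true parameters
x = rng.uniform(0, 10, size=n)         # one feature per observation
eps = rng.normal(0, 0.5, size=n)       # white noise: zero mean, constant variance
y = beta0_true + beta1_true * x + eps  # the true model y_i = b0 + b1 * x_i + eps_i

print(x.shape, y.shape)  # (100,) (100,)
```

Because of the noise term, the simulated points scatter around the line instead of lying exactly on it, which is precisely the situation the estimation procedure has to handle.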
A classic example would be to try to predict the average housing market price in the economy based on the GDP. We would expect a pretty linear relationship between the two. You can find the file I will be using here
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn.linear_model as lm
import math
from scipy import stats

df = pd.read_csv('housing.txt', sep=" ")
df
```
```python
# Scatter plot
plt.figure(figsize=(7,5))
plt.scatter(df['gdp'], df['house'])
# Add axis titles
plt.xlabel("Gross Domestic Product", fontsize=12)
plt.ylabel("Housing price", fontsize=12)
plt.title("Relation between GDP and Average housing prices")
plt.show()
```
Our intuition was right. The relationship looks linear between the GDP and the average housing price. How do we move on from here?
Now that we know the relationship looks linear, the next step is to estimate the coefficients in order to draw a line that fits our data. In linear regression, estimating the parameters means identifying the Betas $\hat{\beta}_0, \hat{\beta}_1$ that minimize the distance to the real data:
$(\hat{\beta}_0, \hat{\beta}_1) = \underset{\beta_0, \beta_1}{\arg\min} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2$
This approach is known as Ordinary Least Squares (OLS). We minimize the square of the distance between each data point and our estimate. Why the square? Because it makes the derivative easy to compute, and it takes both positive and negative deviations into account.
The solution of the OLS problem in 1 dimension is the following:
$\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
We can compute those results in Python :
```python
# Slope Beta 1
x = df['gdp']
x_bar = x.mean()
y = df['house']
y_bar = y.mean()
beta1 = ((x - x_bar) * (y - y_bar)).sum() / ((x - x_bar)**2).sum()
print("Beta_1 coefficient estimate : " + str(round(beta1, 4)))
```
```python
# Intercept Beta 0
beta0 = y_bar - beta1 * x_bar
print("Beta_0 coefficient estimate : " + str(round(beta0, 4)))
```
We can plot the estimated linear regression to assess whether it looks good or not :
```python
# Scatter plot with the fitted line
plt.figure(figsize=(7,5))
plt.scatter(df['gdp'], df['house'])
plt.plot(df['gdp'], beta1 * df['gdp'] + beta0, c='r')
# Add axis titles
plt.xlabel("Gross Domestic Product", fontsize=12)
plt.ylabel("Housing price", fontsize=12)
plt.title("Relation between GDP and Average housing prices")
plt.show()
```
It looks great! There is a pretty efficient way to do this using Scikit-Learn built-in functions.
```python
x_1 = df[['gdp']]
skl_linmod = lm.LinearRegression(fit_intercept=True).fit(x_1, y)
beta1_sk = skl_linmod.coef_
beta0_sk = skl_linmod.intercept_
```
This yields the same results. Alright, so far we are satisfied with how our estimate looks. But how do we assess how good it is?
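As another sanity check, the closed-form estimates can be compared with `np.polyfit`, which fits the same least-squares line. This sketch uses simulated data (the slope 3 and intercept 2 are hypothetical values chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 50)
y = 2.0 + 3.0 * x + rng.normal(0, 0.5, 50)  # hypothetical true line

# Closed-form OLS in 1 dimension
x_bar, y_bar = x.mean(), y.mean()
beta1 = ((x - x_bar) * (y - y_bar)).sum() / ((x - x_bar)**2).sum()
beta0 = y_bar - beta1 * x_bar

# np.polyfit(x, y, 1) returns [slope, intercept] of the least-squares line
slope, intercept = np.polyfit(x, y, 1)
print(np.isclose(beta1, slope), np.isclose(beta0, intercept))
```

Both approaches solve the same minimization problem, so they agree up to numerical precision.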
Assessing the accuracy of the estimate using $R^2$
A common metric used to assess the overall fit of our model is the $R^2$ metric (pronounced R squared). The $R^2$ measures the percentage of the variance of the data our model manages to explain:
$R^2 = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2}$
```python
r_carre = 1 - (((y - (beta0 + beta1 * x))**2).sum()) / ((y - y_bar)**2).sum()
```
The $R^2$ takes values between 0 and 1. The closer we are to 1, the more variance we explain. However, adding variables will always make the $R^2$ increase, even if the variable we added is not related to the output at all. I invite you to check the adjusted $R^2$ if you would like to know more about how to handle this issue.
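For reference, a minimal sketch of the adjusted $R^2$, which penalizes the number of features ($n$ is the number of observations, $p$ the number of features):

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p features."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Same raw R^2, but more features -> lower adjusted R^2
print(round(adjusted_r2(0.90, 50, 1), 4))   # 0.8979
print(round(adjusted_r2(0.90, 50, 10), 4))  # 0.8744
```

Unlike the raw $R^2$, this quantity can decrease when a useless variable is added, which makes it a better tool for comparing models of different sizes.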
Are coefficients significant?
Alright, we have now fitted our regression line and estimated the goodness of the fit. But all the variables included in the model might not be worth including. What if, for example, $\beta_1$ is not significantly different from 0?
Both parameters can be described by three metrics :
- an expectation
- a bias
- a variance
The expectation of the parameter corresponds… to its expected value, $E[\hat{\beta}]$. Nothing really new here.
The bias corresponds to how far we are from the actual value. It is given by: $Bias(\hat{\beta}) = E[\hat{\beta}] - \beta$. The true bias is typically unknown, since $\beta$ is what we try to estimate. If the bias is 0, we say that the estimator is unbiased.
The variance describes the stability of our estimator with respect to the observations. Indeed, the features might be highly spread, which would mean a pretty big variance.
It can be shown that the variance of $\hat{\beta}_1$ is given by:
$Var(\hat{\beta}_1) = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2}$
And the one of $\hat{\beta}_0$ by:
$Var(\hat{\beta}_0) = \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n (x_i - \bar{x})^2} \right)$
Where the estimated variance is defined by: $\hat{\sigma}^2 = \frac{1}{n-2} \sum_{i=1}^n (y_i - \hat{y}_i)^2$. This is an unbiased estimator of the variance.
Now, we do have all the elements required to compute the standard errors of our parameters :
```python
# Unbiased estimate of the residual variance (n - 2 degrees of freedom)
sigma2 = ((y - (beta0 + beta1 * x))**2).sum() / (len(y) - 2)
sigma_beta1 = math.sqrt(sigma2 / ((x - x_bar)**2).sum())
sigma_beta0 = math.sqrt(sigma2 * (1/len(y) + (x_bar**2) / ((x - x_bar)**2).sum()))
```
Graphically, bias and variance can be represented this way:
Why are we spending time on these metrics? Because they allow us to compute what we call hypothesis tests.
For each parameter, we want to test whether the parameter in question has a real impact on the output or not, to avoid adding dimensions that bring no significant information. In the linear regression $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, it would mean testing whether the Betas are significantly different from 0 or not.
To do so, we perform a statistical test. If our aim is to state whether the parameter $\beta_1$ is significantly different from 0, we run a test with the null hypothesis $H_0 : \beta_1 = 0$ and the alternative hypothesis $H_1 : \beta_1 \neq 0$.
Some further theory is needed here. Recall the Central Limit Theorem: $\frac{\hat{\beta}_1 - \beta_1}{\sigma_{\hat{\beta}_1}}$ converges to $\mathcal{N}(0, 1)$ as $n$ tends to infinity if $\sigma$ is known.
In case $\sigma$ is unknown, Slutsky's Lemma states that $\frac{\hat{\beta}_1 - \beta_1}{\hat{\sigma}_{\hat{\beta}_1}}$ converges to $\mathcal{N}(0, 1)$ if $\hat{\sigma}$ converges to $\sigma$.
Most of the time, $\sigma$ is unknown. From this point, it can be shown that, under $H_0$: $\frac{\hat{\beta}_1 - 0}{\hat{\sigma}_{\hat{\beta}_1}} \sim t_{n-p-1}$, where $t$ denotes the Student distribution and $n - p - 1$ is the degrees of freedom ($p$ is the dimension, equal to 1 here).
This metric is called the T-Stat, and it allows us to perform a hypothesis test. The 0 in the numerator can actually be replaced by any value we would like to test.
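For instance, testing $\beta_1$ against a hypothetical reference value $b$ only changes the numerator of the T-Stat. A sketch on simulated data (the true slope is set to 3, an arbitrary choice for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 80
x = rng.uniform(0, 10, n)
y = 2.0 + 3.0 * x + rng.normal(0, 0.5, n)  # hypothetical true slope: 3

# Closed-form OLS and standard error of beta1
x_bar, y_bar = x.mean(), y.mean()
beta1 = ((x - x_bar) * (y - y_bar)).sum() / ((x - x_bar)**2).sum()
beta0 = y_bar - beta1 * x_bar
sigma2 = ((y - beta0 - beta1 * x)**2).sum() / (n - 2)
se_beta1 = np.sqrt(sigma2 / ((x - x_bar)**2).sum())

b = 3.0  # hypothetical reference value: H0 is beta1 = b instead of beta1 = 0
t_stat = (beta1 - b) / se_beta1
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), n - 2))
print("T-Stat :", round(t_stat, 3), "| p-value :", round(p_value, 3))
```

Since the data were generated with a slope equal to the tested value, the T-Stat should typically be small in magnitude here.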
How to interpret the T-Stat?
The T-Stat should be compared with the critical value. The critical value is the quantile of the corresponding Student distribution at a given level $\alpha$. If a coefficient is significant at a level $\alpha$, this means that the T-Stat is above the $1 - \alpha/2$ quantile or under the $\alpha/2$ quantile of the Student distribution.
Another interpretation uses the probability of observing, under $H_0$, a T-Stat at least as extreme as the one we computed. This probability is called the p-value and is defined by: $p = P(|T| > |t_{obs}|)$, where $T$ follows the Student distribution with $n - 2$ degrees of freedom and $t_{obs}$ is the observed T-Stat.
Thus, in a null test, we reject $H_0$ at a level $\alpha$ (typically 5%) if and only if the p-value is smaller than $\alpha$, i.e. such an extreme T-Stat would be really unlikely under $H_0$.
```python
t_test = beta1 / sigma_beta1
print("The T-Stat is : " + str(round(t_test, 4)))
p = (1 - stats.t.cdf(abs(t_test), len(df) - 2)) * 2
print("The p-value is : " + str(round(p, 10)))
```
The T-Stat is : 38.3198
The p-value is : 0.0
In our example, the T-Stat is pretty big, bigger than the corresponding 5% quantile of the Student distribution, and the p-value is by far smaller than 5%. Therefore, we reject the null hypothesis $H_0 : \beta_1 = 0$ and conclude that the parameter is not null.
Confidence Interval for the parameters
Finally, a notion one should get familiar with is the notion of confidence interval. Using the CLT, one can set a confidence interval around an estimate of a parameter.
The lower and upper bounds are determined by the critical value of the Student distribution at a level $\alpha$, and by the standard deviation of the parameter:
$\left[ \hat{\beta}_1 - t_{1-\alpha/2} \, \hat{\sigma}_{\hat{\beta}_1} \; ; \; \hat{\beta}_1 + t_{1-\alpha/2} \, \hat{\sigma}_{\hat{\beta}_1} \right]$
```python
t_crit = stats.t.ppf(0.975, df=len(y) - 2)
c0 = beta1 - t_crit * sigma_beta1
c1 = beta1 + t_crit * sigma_beta1
```
The same process can be applied to the parameter $\beta_0$.
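Sketched on simulated data (the true intercept 2.0 is an assumption of the simulation, not an estimate), the confidence interval for $\beta_0$ reads:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 60
x = rng.uniform(0, 10, n)
y = 2.0 + 3.0 * x + rng.normal(0, 0.5, n)  # hypothetical true intercept: 2.0

x_bar, y_bar = x.mean(), y.mean()
beta1 = ((x - x_bar) * (y - y_bar)).sum() / ((x - x_bar)**2).sum()
beta0 = y_bar - beta1 * x_bar
sigma2 = ((y - beta0 - beta1 * x)**2).sum() / (n - 2)
# Standard error of beta0 uses the extra x_bar^2 term
se_beta0 = np.sqrt(sigma2 * (1/n + x_bar**2 / ((x - x_bar)**2).sum()))

t_crit = stats.t.ppf(0.975, df=n - 2)
lower, upper = beta0 - t_crit * se_beta0, beta0 + t_crit * se_beta0
print("95% CI for beta0 :", round(lower, 3), "-", round(upper, 3))
```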
Type I and Type II risks
Rejecting or not a hypothesis comes at a cost: one could misclassify a parameter.
The Type I risk is the probability of rejecting $H_0$ when it is true. The Type II risk is the probability of not rejecting $H_0$ when it is false.
The confidence level of a test is 1 − Type I risk, and represents the probability of not rejecting $H_0$ when $H_0$ is true.
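To see the Type I risk in action, here is a small Monte Carlo sketch on simulated data (all parameter values are hypothetical): when $H_0 : \beta_1 = 0$ is true, a 5% test should wrongly reject roughly 5% of the time.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, n_sim = 50, 2000
rejections = 0
for _ in range(n_sim):
    x = rng.uniform(0, 10, n)
    y = 2.0 + rng.normal(0, 1.0, n)  # H0 is true: the slope is exactly 0
    x_bar, y_bar = x.mean(), y.mean()
    beta1 = ((x - x_bar) * (y - y_bar)).sum() / ((x - x_bar)**2).sum()
    beta0 = y_bar - beta1 * x_bar
    sigma2 = ((y - beta0 - beta1 * x)**2).sum() / (n - 2)
    se = np.sqrt(sigma2 / ((x - x_bar)**2).sum())
    p = 2 * (1 - stats.t.cdf(abs(beta1 / se), n - 2))
    rejections += p < 0.05
rate = rejections / n_sim
print("Empirical Type I risk :", rate)  # close to the nominal 5% level
```

The empirical rejection rate matches the nominal level: the test is "wrong" by design about 5% of the time when $H_0$ holds.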
Wikipedia summarizes this concept pretty well :
Confidence Interval for the model
We can define two types of confidence intervals for the model. The first one is the estimation error confidence interval. We need to take into account the error around $\hat{\beta}_0$ and around $\hat{\beta}_1$:
$\left[ \hat{y}_i \pm t_{1-\alpha/2} \, \hat{\sigma} \sqrt{\frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^n (x_j - \bar{x})^2}} \right]$
Moreover, the prediction risk is different. The prediction risk corresponds to the risk associated with the extension of our model to a new prediction $\hat{y}_{n+1}$. We have a new source of volatility: the noise $\epsilon_{n+1}$ of the new observation itself. This leads to a modified confidence interval:
$\left[ \hat{y}_{n+1} \pm t_{1-\alpha/2} \, \hat{\sigma} \sqrt{1 + \frac{1}{n} + \frac{(x_{n+1} - \bar{x})^2}{\sum_{j=1}^n (x_j - \bar{x})^2}} \right]$
These two confidence intervals can be plotted on the previous graph :
```python
plt.figure()
y = df['house']
plt.scatter(df['gdp'], df['house'], label="Data", color="green")
plt.plot(df['gdp'], beta0 + beta1 * x, label="Regression", color="black")
# Prediction interval: extra "1 +" term for the noise of a new observation
plt.fill_between(df['gdp'],
    beta0 + beta1 * x - t_crit * np.sqrt(sigma2 * (1 + 1/len(y) + (x - x_bar)**2 / ((x - x_bar)**2).sum())),
    beta0 + beta1 * x + t_crit * np.sqrt(sigma2 * (1 + 1/len(y) + (x - x_bar)**2 / ((x - x_bar)**2).sum())),
    color='blue', alpha=0.1, label='95% PI')
# Confidence interval for the estimated regression line
plt.fill_between(df['gdp'],
    beta0 + beta1 * x - t_crit * np.sqrt(sigma2 * (1/len(y) + (x - x_bar)**2 / ((x - x_bar)**2).sum())),
    beta0 + beta1 * x + t_crit * np.sqrt(sigma2 * (1/len(y) + (x - x_bar)**2 / ((x - x_bar)**2).sum())),
    color='red', alpha=0.4, label='95% CI')
plt.legend()
plt.xlabel("GDP", fontsize=12)
plt.ylabel("Housing price", fontsize=12)
plt.title("Relation between GDP and Average housing prices")
plt.xlim(min(x), max(x))
plt.show()
```
Linear Regression in 2 dimensions
So far, we have covered the unidimensional linear regression framework. But as you might expect, this is only a simple version of the linear regression model. Back to our housing price problem. So far, we only included the GDP variable. But as you may know, interest rates are also a major lever on the housing market.
The more general framework can be described as: $Y = X\beta + \epsilon$. Notice that we are now in matrix form. Indeed:
- we suppose we have $n$ observations in a dataset and $p$ features per observation
- $Y$ is the column vector of the $n$ outputs
- $\beta$ is the column vector that contains the true parameters
- $X$ is now a matrix of dimension $(n \times p)$
Otherwise, the conditions on $\epsilon$ remain the same:
- $E[\epsilon_i] = 0$, i.e. a white noise condition
- $Var(\epsilon_i) = \sigma^2$ for all $i = 1, \dots, n$, i.e. a homoskedasticity condition
In the 2-dimension problem introduced above to predict housing prices, our task is now to estimate the following model: $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i,1} + \hat{\beta}_2 x_{i,2}$, where $x_{i,1}$ is the GDP and $x_{i,2}$ the interest rate.
The dataset I’m using can be found here.
```python
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt

df = pd.read_csv('https://maelfabien.github.io/myblog/files/housing.txt', sep=" ")
df
```
```python
fig, axs = plt.subplots(2, figsize=(8,8))
axs[0].scatter(df['gdp'], df['house'])
axs[0].set_xlabel("Gross Domestic Product", fontsize=12)
axs[0].set_ylabel("Housing Price", fontsize=12)
axs[0].set_title("Relation between GDP and Average Housing Prices")
axs[1].scatter(df['interests'], df['house'], c='r')
axs[1].set_xlabel("Interest rates", fontsize=12)
axs[1].set_ylabel("Housing Price", fontsize=12)
axs[1].set_title("Relation between interest rates and Average Housing Prices")
plt.tight_layout()
plt.show()
```
The linear relationship with the interest rate seems to hold again! Let’s dive into the model.
Recall that to estimate the parameters, we want to minimize the sum of squared residuals. In higher dimension, the minimization problem is:
$\hat{\beta} = \underset{\beta}{\arg\min} \; \lVert Y - X\beta \rVert^2$
In our specific example, the design matrix $X$ stacks a column of ones (for the intercept), the GDP column and the interest-rate column.
The general solution to OLS is: $\hat{\beta} = (X^T X)^{-1} X^T Y$. Simply remember that this solution can only be derived if the Gram matrix, defined by $X^T X$, is invertible. Equivalently, $rank(X) = p$. This condition guarantees the uniqueness of the OLS estimator.
```python
from numpy.linalg import inv

# Design matrix: intercept column + the two features
y = df['house']
x = np.hstack([np.ones(shape=(len(y), 1)), df[['gdp', 'interests']]]).astype(float)
xt = x.transpose()
gram = np.matmul(xt, x)
i_gram = inv(gram)
```
We can use the Gram matrix to compute the estimator :
```python
Beta = np.matmul(i_gram, np.matmul(xt, y))
# Display coefficients Beta0, Beta1 and Beta2
print("Estimator of Beta0 : " + str(round(Beta[0], 4)))
print("Estimator of Beta1 : " + str(round(Beta[1], 4)))
print("Estimator of Beta2 : " + str(round(Beta[2], 4)))
```
Estimator of Beta0 : -1114.2614
Estimator of Beta1 : -0.0079
Estimator of Beta2 : 21252.7946
We will discuss cases in which this condition is not met in the next article.
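In the meantime, a quick practical check of the condition is `np.linalg.matrix_rank`. The sketch below uses a hypothetical design matrix and a deliberately duplicated column to show a failure case:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 30
# Hypothetical design matrix: intercept column + two random features
x_ok = np.hstack([np.ones((n, 1)), rng.uniform(0, 10, (n, 2))])
print(np.linalg.matrix_rank(x_ok))   # 3: full column rank, X^T X is invertible

# Duplicating a column makes X rank-deficient: X^T X is singular
x_bad = np.hstack([x_ok, x_ok[:, [1]]])
print(np.linalg.matrix_rank(x_bad))  # still 3 < 4 columns: OLS is not unique
```

Perfectly collinear features (here, an exact duplicate) are the typical way this condition breaks in practice.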
Note that $\hat{Y} = HY$ where $H = X (X^T X)^{-1} X^T$. We say that $H$ is an orthogonal projector since it meets two conditions:
- $H$ is symmetric: $H^T = H$
- $H$ is idempotent: $H^2 = H$
Graphically, this projection states that we project $Y$ onto the subspace spanned by the columns of $X$.
This allows us to redefine the vector of residuals: $\hat{\epsilon} = (I_n - H) Y$, where $I_n$ is the identity matrix.
$I_n - H$ is an orthogonal projector onto the orthogonal complement of the column space of $X$.
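Both projector properties can be checked numerically. The sketch below builds a hypothetical design matrix and verifies symmetry and idempotence of $H$:

```python
import numpy as np
from numpy.linalg import inv

rng = np.random.default_rng(6)
n = 20
# Hypothetical design matrix: intercept + two random features
X = np.hstack([np.ones((n, 1)), rng.uniform(0, 10, (n, 2))])
H = X @ inv(X.T @ X) @ X.T  # the "hat" matrix

print(np.allclose(H, H.T))                   # symmetric
print(np.allclose(H @ H, H))                 # idempotent
print(np.allclose((np.eye(n) - H) @ X, 0))   # I - H annihilates the columns of X
```

The last check illustrates the projection picture: anything already in the column space of $X$ has zero residual.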
Bias and Variance of the parameters
The bias of the parameter is defined by: $E[\hat{\beta}] - \beta = 0$. The estimate of the parameter is unbiased.
The variance of the parameter is given by: $Var(\hat{\beta}) = \sigma^2 (X^T X)^{-1}$. Therefore, for each $j$: $Var(\hat{\beta}_j) = \sigma^2 \left[ (X^T X)^{-1} \right]_{jj}$.
```python
# Standard errors of the coefficients
# Residual variance with n - 3 degrees of freedom (3 estimated parameters)
resid = y - np.matmul(x, Beta)
sigma2 = (resid**2).sum() / (len(y) - 3)
se_beta0 = math.sqrt(sigma2 * i_gram[0, 0])
se_beta1 = math.sqrt(sigma2 * i_gram[1, 1])
se_beta2 = math.sqrt(sigma2 * i_gram[2, 2])
print("Estimator of the standard error of Beta0 : " + str(round(se_beta0, 3)))
print("Estimator of the standard error of Beta1 : " + str(round(se_beta1, 3)))
print("Estimator of the standard error of Beta2 : " + str(round(se_beta2, 3)))
```
Estimator of the standard error of Beta0 : 261.185
Estimator of the standard error of Beta1 : 0.033
Estimator of the standard error of Beta2 : 6192.519
The significance of each variable is assessed using a hypothesis test:
```python
# Student T-Stats
t0 = Beta[0] / se_beta0
t1 = Beta[1] / se_beta1
t2 = Beta[2] / se_beta2
print("T-Stat Beta0 : " + str(round(t0, 3)))
print("T-Stat Beta1 : " + str(round(t1, 3)))
print("T-Stat Beta2 : " + str(round(t2, 3)))
```
T-Stat Beta0 : -4.266
T-Stat Beta1 : -0.242
T-Stat Beta2 : 3.432
Something pretty interesting happens here: by adding a new variable to the model, the coefficient of GDP changed, quite logically, and became non-significant. We now expect the model to explain the data much better.
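As in one dimension, the closed-form Gram-matrix solution can be cross-checked against Scikit-Learn. This sketch uses simulated data with two hypothetical features (the true coefficients 1, 2 and −3 are arbitrary):

```python
import numpy as np
from numpy.linalg import inv
import sklearn.linear_model as lm

rng = np.random.default_rng(7)
n = 100
X_feat = rng.uniform(0, 10, (n, 2))  # two hypothetical features
y = 1.0 + 2.0 * X_feat[:, 0] - 3.0 * X_feat[:, 1] + rng.normal(0, 0.5, n)

# Closed-form OLS with an explicit intercept column
X = np.hstack([np.ones((n, 1)), X_feat])
Beta = inv(X.T @ X) @ X.T @ y

skl = lm.LinearRegression(fit_intercept=True).fit(X_feat, y)
print(np.allclose(Beta[0], skl.intercept_), np.allclose(Beta[1:], skl.coef_))
```

Both implementations solve the same least-squares problem, so the coefficients agree up to numerical precision.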
The quadratic risk is: $E\left[ \lVert \hat{\beta} - \beta \rVert^2 \right] = \sigma^2 \sum_{j=1}^p \frac{1}{\lambda_j}$, where the $\lambda_j$ are the eigenvalues of the Gram matrix.
The prediction risk is: $E\left[ \lVert X\hat{\beta} - X\beta \rVert^2 \right] = p \, \sigma^2$.
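The eigenvalue form of the quadratic risk follows from the identity $\mathrm{tr}\left((X^T X)^{-1}\right) = \sum_j 1/\lambda_j$, which can be verified numerically (hypothetical design matrix and noise level $\sigma^2 = 0.25$):

```python
import numpy as np
from numpy.linalg import inv, eigvalsh

rng = np.random.default_rng(8)
n = 50
X = np.hstack([np.ones((n, 1)), rng.uniform(0, 10, (n, 2))])
sigma2 = 0.25  # hypothetical noise variance

gram = X.T @ X
# Two equivalent expressions: trace of the inverse Gram matrix,
# and the sum of the inverse eigenvalues
risk_trace = sigma2 * np.trace(inv(gram))
risk_eig = sigma2 * (1 / eigvalsh(gram)).sum()
print(np.isclose(risk_trace, risk_eig))  # True
```

Small eigenvalues of the Gram matrix (near-collinear features) therefore blow up the quadratic risk, which foreshadows the invertibility issues mentioned above.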
The Github repository of this article can be found here.
Like it? Buy me a coffee