So far, with the linear model, we have seen how to predict continuous variables. What happens when you want to classify with a linear model?
Linear Probability Model
Suppose that our aim is to do binary classification: \(y_i \in \{0,1\}\). Let’s consider the model:
\[y = \beta_0 + \beta_1 X_1 + ... + \beta_k X_k + u\]where \(E(u \mid X_1, ..., X_k) = 0\). How can we perform binary classification with this model? Let’s start with a dataset of binary observations and fit a linear regression on top of it.
Statistically speaking, the model above is incorrect:
- we would need to define a threshold under which we classify as 0, and above which we classify as 1
- what if the predicted values are greater than 1? Or smaller than 0?
- …
The linear probability model has, however, one main advantage: the coefficients remain easily interpretable!
\[\Delta P(Y=1 \mid X) = \beta_j \Delta X_j\]In other words, a one-unit change in \(X_j\) changes the probability that \(Y=1\) by \(\beta_j\), i.e. by \(\beta_j \times 100\) percentage points. Overall, this model needs to be adjusted/transformed to constrain the predicted values between 0 and 1. This is the main idea of logistic regression!
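To make the issue concrete, here is a minimal sketch (on synthetic data, not part of the original example) that fits an ordinary least squares line to binary labels and inspects the range of its fitted values:
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic binary data: the label tends to be 1 when x is large
rng = np.random.RandomState(0)
x = rng.uniform(-3, 3, size=(200, 1))
y = (x[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Fit the linear probability model by ordinary least squares
lpm = LinearRegression().fit(x, y)
preds = lpm.predict(x)

# Some predicted "probabilities" fall outside the [0, 1] interval
print(preds.min(), preds.max())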
Logistic Regression
We are now interested in \(P(Y=1 \mid X) = P(Y=1 \mid X_1, X_2, ..., X_k) = G(\beta_0 + \beta_1 X_1 + ... + \beta_k X_k)\). As you might guess, the way we define \(G\) determines the mapping we obtain.
- If \(G\) is the identity (linear), we recover the linear probability model above
- If \(G\) is a sigmoid : \(G(z) = \frac {1} {1 + e^{-z}}\), then the model is a logistic regression
- If \(G\) is a normal transformation \(G(z) = \Phi(z)\), then the model is a probit regression
In this article, we’ll focus on logistic regression.
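As a quick side-by-side (not part of the original code), here is a minimal sketch comparing the logistic and probit links, using scipy.stats.norm for \(\Phi\):
import numpy as np
from scipy.stats import norm

z = np.linspace(-4, 4, 9)

# Logistic link: sigmoid
logistic_link = 1 / (1 + np.exp(-z))

# Probit link: standard normal CDF
probit_link = norm.cdf(z)

# Both map the real line into (0, 1), with slightly different shapes
for zi, s, p in zip(z, logistic_link, probit_link):
    print(f"z={zi:+.1f}  sigmoid={s:.3f}  probit={p:.3f}")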
Sigmoid and Logit transformations
The sigmoid transformation maps any real value into the range 0 to 1:
\[Sig(z) = \frac {1} {1 + e^{-z}}\]To understand precisely what this does, let’s implement it in Python:
import numpy as np
import matplotlib.pyplot as plt
import math

def sigmoid(x):
    a = []
    for item in x:
        a.append(1 / (1 + math.exp(-item)))
    return a
Then, plot it for a range of values of \(X\):
x = np.arange(-3., 3., 0.2)
sig = sigmoid(x)
plt.figure(figsize=(12,8))
plt.plot(x,sig, label='sigmoid')
plt.plot(x,x, label='input')
plt.title("Sigmoid Function")
plt.legend()
plt.show()
The inverse transformation is called the logit transform. It takes values in the range 0 to 1 and maps them back to the real line, i.e. to the scale of the linear predictor.
def logit(x):
    a = []
    for item in x:
        a.append(math.log(item / (1 - item)))
    return a
plt.figure(figsize=(12,8))
plt.plot(x,sig, label='sigmoid')
plt.plot(x,logit(sig), label='logit transform')
plt.title("Sigmoid - Logit Function")
plt.legend()
plt.show()
The logistic regression model
Partial effect
In the logistic regression model:
\[P(Y=1) = \frac {1} {1 + e^{-(\beta_0 + \beta_1 X_1 + ... + \beta_k X_k)}}\]How can we interpret the partial effect of \(X_1\) on \(Y\), for example? The weights in a logistic regression cannot be interpreted as in linear regression. We need to use the logit transform:
\[\log \left( \frac {P(y=1)} {1-P(y=1)} \right) = \log \left( \frac {P(y=1)} {P(y=0)} \right) = \beta_0 + \beta_1 X_1 + ... + \beta_k X_k\]The ratio \( \frac {P(y=1)} {P(y=0)} \) is called the “odds”, and its logarithm is the log-odds. Therefore, to estimate the impact of \(X_j\) increasing by 1 unit, we can compute:
\[\frac {odds_{X_j + 1}} {odds_{X_j}} = \frac {e^{\beta_0 + \beta_1 X_1 + ... + \beta_j (X_j + 1) + ... + \beta_k X_k}} {e^{\beta_0 + \beta_1 X_1 + ... + \beta_j X_j + ... + \beta_k X_k}} = e^{\beta_j (X_j + 1) - \beta_j X_j} = e^{\beta_j}\]A change in \(X_j\) by one unit multiplies the odds by \(e^{\beta_j}\), i.e. it increases the log-odds by \(\beta_j\), the corresponding weight.
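As a small numerical illustration (the coefficient value below is made up, not estimated from any dataset), the multiplicative effect on the odds is simply the exponential of the weight:
import numpy as np

# Hypothetical weight for feature X_j (illustrative value only)
beta_j = 0.7

# Increasing X_j by one unit multiplies the odds P(y=1)/P(y=0) by exp(beta_j)
odds_ratio = np.exp(beta_j)
print(odds_ratio)  # ~2.01: the odds roughly double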
Hypothesis Testing
To test a single coefficient, we apply, as previously, a Student test:
\[t_{stat} = \frac {\beta} {\sigma(\beta)}\]For multiple hypotheses, we use the Likelihood Ratio test. The estimated coefficients are asymptotically normally distributed, so the Likelihood Ratio statistic follows a \(\chi^2\) (Chi-Squared) distribution, with as many degrees of freedom as restrictions tested.
The Likelihood Ratio test is implemented in most stats packages in Python, R, and Matlab, and is defined by:
\[LR = 2(L_{ur} - L_r)\]where \(L_{ur}\) and \(L_r\) are the maximized log-likelihoods of the unrestricted and restricted models. We reject the null hypothesis if \(LR > Crit_{val}\).
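As a minimal sketch of the decision rule, assuming you already have the maximized log-likelihoods of the unrestricted and restricted models (the numbers below are purely illustrative):
from scipy.stats import chi2

# Hypothetical maximized log-likelihoods (illustrative values only)
L_ur = -102.3   # unrestricted model (all k coefficients)
L_r = -110.8    # restricted model (q coefficients set to zero)
q = 3           # number of restrictions tested

# Likelihood Ratio statistic and chi-squared critical value at the 5% level
LR = 2 * (L_ur - L_r)
crit_val = chi2.ppf(0.95, df=q)

print(LR, crit_val)   # 17.0 vs ~7.81
print(LR > crit_val)  # True: we reject the null hypothesis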
Important parameters
In logistic regression, the single most important hyperparameter is the regularization factor. It is essential to choose the type and strength of regularization properly (usually by cross-validation).
Implementation in Python
We’ll use the Scikit-Learn implementation of logistic regression for binary classification, on the Breast Cancer Wisconsin dataset.
# Imports
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
data = load_breast_cancer()
We then split the data into train and test :
X_train, X_test, y_train, y_test = train_test_split(data['data'], data['target'], test_size = 0.25)
By default, L2-regularization is applied. Using L1-regularization, we achieve:
lr = LogisticRegression(penalty='l1', solver='liblinear')  # the L1 penalty requires the 'liblinear' or 'saga' solver
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
print(accuracy_score(y_pred, y_test))
print(f1_score(y_pred, y_test))
0.958041958041958
0.9655172413793104
If we move on to L2-Regularization :
lr = LogisticRegression(penalty='l2', solver='lbfgs')
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
print(accuracy_score(y_pred, y_test))
print(f1_score(y_pred, y_test))
0.9440559440559441
0.9540229885057472
We notice the importance of the choice of:
- the solver
- the regularization
We can now illustrate the impact of the parameter C, the inverse of the regularization strength. The larger C is, the weaker the regularization.
from sklearn.model_selection import GridSearchCV
parameters = {'C':[0.1, 0.5, 1, 2, 5, 10, 100]}
lr = LogisticRegression(penalty='l2', max_iter = 5000, solver='lbfgs')
clf = GridSearchCV(lr, parameters, cv=5)
clf.fit(X_train, y_train)
We fetch the best parameters using :
clf.best_params_
And find :
{'C': 10}
Using this classifier, we achieve the following results:
y_pred = clf.predict(X_test)
print(accuracy_score(y_pred, y_test))
print(f1_score(y_pred, y_test))
0.958041958041958
0.9655172413793104
We get the same results as with the L1 penalty, for a rather large value of C. This illustrates well the importance of choosing these parameters wisely, since a 2% difference in accuracy or F1-score on a breast cancer detection algorithm can make a big difference.
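As a side note, scikit-learn also provides LogisticRegressionCV, which performs this search over C internally with cross-validation; here is a minimal sketch on the same split (the selected C and scores will depend on the random train/test split):
from sklearn.linear_model import LogisticRegressionCV

# Cross-validated logistic regression: searches over a grid of C values internally
lr_cv = LogisticRegressionCV(Cs=10, cv=5, penalty='l2', solver='lbfgs', max_iter=5000)
lr_cv.fit(X_train, y_train)

print(lr_cv.C_)  # selected value(s) of C
print(accuracy_score(lr_cv.predict(X_test), y_test))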
I hope you enjoyed this article. Don’t hesitate to comment if you have any questions.