So far, with the linear model, we have seen how to predict continuous variables. What happens when you want to classify with a linear model?

Linear Probability Model

Suppose that our aim is to do binary classification : . Let’s consider the model :

Where . How can we perform binary classification with this model? Let’s start with a dataset in which you have binary observations and you decide to fit a linear regression on top of it.

image

Statistically speaking, the model above is incorrect :

  • we would need to define a threshold under which we classify as 0, and above which we classify 1
  • what if the values are greater than 1? Or smaller than 0?

Linear probability model has however one main advantage: the coefficients remain easily interpretable!

In other words, the impact of a coefficient can be measured as a contribution percentage to the final classification. Overall, this model needs to be adjusted/transformed to throw the predicted between values between 0 and 1. This is the main idea of logistic regression!

Logistic Regression

We are now interested in . As you might guess, the way we define will define the way we make our mapping.

  • If is linear, this is obviously the linear regression
  • If is a sigmoid : , then the model is a logistic regression
  • If is a normal transformation , then the model is a probit regression

In this article, we’ll focus on logistic regression.

Sigmoid and Logit transformations

The sigmoid transformation is used to map values between 0 and 1 :

To understand precisely what this does, let’s implement it in Python :

import numpy as np
import matplotlib.pyplot as plt
import math

def sigmoid(x):
    a = []
    for item in x:
        a.append(1/(1+math.exp(-item)))
    return a

Then, plot if for a range of values of :

x = np.arange(-3., 3., 0.2)
sig = sigmoid(x)

plt.figure(figsize=(12,8))
plt.plot(x,sig, label='sigmoid')
plt.plot(x,x, label='input')
plt.title("Sigmoid Function")
plt.legend()
plt.show()

image

The inverse transform is called the logit transform. It takes values that are in the range 0 to 1 and maps them to a linear form.

def logit(x):
    a = []
    for item in x:
        a.append(math.log(item/(1-item)))
    return a
plt.figure(figsize=(12,8))
plt.plot(x,sig, label='sigmoid')
plt.plot(x,logit(sig), label='logit tranform')
plt.title("Sigmoid - Logit Function")
plt.legend()
plt.show()

image

The logistic regression model

Partial effect

In the logistic regression model :

How can we interpret the partial effect of on for example ? Well, the weights in the logistic regression cannot be interpreted as for linear regression. We need to use the logit transform :

We define the this ratio as the “odds”. Therefore, to estimate the impact of increasing by 1 unit, we can compute it this way :

A change in by one unit increases the log odds ratio by the value of the corresponding weight.

Test Hypothesis

To test for a single coefficient, we apply, as previously, a Student test :

For multiple hypotheses, we choose the Likelihood Ratio tests. The coefficients are now normally distributed, so the sum of several coefficients follows a (Chi-Squared) distribution.

The Likelihood ratio test is implemented in most stats packages in Python, R, and Matlab, and is defined by :

We reject the null hypothesis if .

Important parameters

In the Logistic Regression, the single most important parameter is the regularization factor. It is essential to choose properly the type of regularization to apply (usually by Cross-Validation).

Implementation in Python

We’ll use Scikit-Learn version of the Logistic Regression, for binary classification purposes. We’ll be using the Breast Cancer database.

# Imports
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

data = load_breast_cancer()

We then split the data into train and test :

X_train, X_test, y_train, y_test = train_test_split(data['data'], data['target'], test_size = 0.25)

By default, L2-Regularization is implemented. Using L1-Regularization, we achieve :

 lr = LogisticRegression(penalty='l1')
 lr.fit(X_train, y_train)
 
 y_pred = lr.predict(X_test)
 print(accuracy_score(y_pred, y_test))
 print(f1_score(y_pred, y_test))
 0.958041958041958
 0.9655172413793104

If we move on to L2-Regularization :

 lr = LogisticRegression(penalty='l2', solver='lbfgs')
 lr.fit(X_train, y_train)
 
 y_pred = lr.predict(X_test)
 print(accuracy_score(y_pred, y_test))
 print(f1_score(y_pred, y_test))
 0.9440559440559441
 0.9540229885057472

We notice the importance of the choice of :

  • the solver
  • the regularization

We can now illustrate the impact of the tolerance factor C. The larger C is, the less restrictive is the regularization.

 from sklearn.model_selection import GridSearchCV
 
 parameters = {'C':[0.1, 0.5, 1, 2, 5, 10, 100]}
 lr = LogisticRegression(penalty='l2', max_iter = 5000, solver='lbfgs')
 
 clf = GridSearchCV(lr, parameters, cv=5)
 clf.fit(X_train, y_train)

We fetch the best parameters using :

 clf.best_params_

And find :

 {'C': 10}

Using this classifier, we acheive the following results :

 y_pred = clf.predict(X_test)
 print(accuracy_score(y_pred, y_test))
 print(f1_score(y_pred, y_test))
0.958041958041958
0.9655172413793104

We get the same results as with the L1-Penalty, for a rather large value of C. This illustrates well the importance of wisely choosing those parameters, since a 2% accuracy or F1-Score different on a Breast Cancer detection algorithm can make a big difference.

I hope you enjoyed this article. Don’t hesitate to comment if you have any question.

Sources :


Like it? Buy me a coffeeLike it? Buy me a coffee

Leave a comment