Large volumes of data are nearly impossible to process using **batch learning**. Batch learning is the classical process by which we train a model of a given batch of the training data. With tens of millions of data points, Random Forest algorithms, for example, become nearly impossible to compute in a reasonable time.

Therefore, 2 options exist :

- use better hardware: more powerful computers, or parallelized computation. This is however costly, and not always possible.
- build online learning algorithms

# What is Online learning?

## Concept

Online learning is the process by which we :

- make an initial guess model
- pick one observation from the training data and recalibrate the weights on each input parameter
- we make one pass over the data, sequentially, as they come in from the stream

Thanks to Online learning, we are allowed to :

- make one pass on the input data, which is time-efficient
- not store all points in the learning procedure, which is memory efficient

However, when using Online learning :

- the features we previously built might not be relevant anymore
- if the data changes over time, the model is not relevant anymore
- if the network connection is down, the model cannot produce an output anymore
- it is hard to state whether an algorithm performs well since there is no test set typically

## Illustration

Let’s create a small example for classification purposes :

- say that we have 10 voters, i.e people who are going to vote on the classification task
- we initialize the weight of each person to 1

We build a model based on those features :

- if a person is right when predicting the output, we keep its weight constant
- if a person is wrong when predicting the output, we divide its weight by a factor > 1

This way, we are making sure that the weight of the voters that tend to predict the wrong classes is small, whereas the weight of the voters that predict correctly the output remains high

## When to use Online Learning

We should use Online Learning when :

- there is too much data to fit in memory, or the training time of a batch learning algorithm takes too long for our application
- the application is itself inherently online

# Mathematical motivation

Suppose that we tell someone to randomly say 0 or 1, and try to teach an algorithm to predict what the user is about to say. Using batch learning, the accuracy would typically not exceed 50%. There is however an inherent nonrandomness in the sequences that humans can generate.

That being said, we could be able to develop an algorithm that predicts the right sequence with an accuracy slightly higher than 50%.

## Deterministic Strategy

Suppose that we try to predict the person’s random sequence with a deterministic strategy : \(A = (\hat{y_1}, ... \hat{y_n})\) where \(y = (y_1, ..., y_n) \in \{-1,1\}\) is the sequence generated by the user and \(\hat{y} = \hat{y}(y_1, ..., y_n)\) is the corresponding predicted sequence.

We make an average number of mistakes equal to :

\[\frac{1}{n} \sum_t 1{\hat{y_t} ≠ y_t}\]If we systematically predict the wrong output, the average loss will be 1.

## Randomized Strategy

Using a randomized strategy, our algorithm is denoted by : \(A=(q_1, ..., q_n)\) with \(q_t = q_t(y_1, ..., y_{t-1}) \in [-1,1]\)

The average number of mistakes we’ll make over a given sequence using this randomized algorithm is now :

\[l(A; y_1, ..., y_n) = E(\frac{1}{n} \sum_t 1{\hat{y_t} ≠ y_t})\]An algorithm guessing randomly will systematically reach an average loss of \(\frac {1}{2}\), so there is no longer a “bad sequence” that would lead to an average loss of 1.

We would be winning if you could fin an algorithm for which \(l(A; y_1, ..., y_n)\) is always smaller than 50%.

If we consider Randemacher random variables (unbiaised coin flips) \(\epsilon_1, ... ,\epsilon_n\), then any algorithm will necessarly make more mistakes on some sequences, and less mistakes on other, with an average of 50% :

\[\frac{1}{2^n} \sum_{y_1, ..., y_n} l(A; y_1, ..., y_n) = E_{\epsilon_1, ... ,\epsilon_n}l(A; \epsilon_1, ... ,\epsilon_n) = \frac{1}{2}\]## Sequence patterns

In practice, asking a human to produce random sequences brings a bias, since we might find some patterns within the generated sequences. If we care about predicting correctly some sequence, typically the ones generated by humans, while accepting that we’ll perform poorly on others, we might find an algorithm \(A\) that suits our needs.

We have some idea of the algorithms we’ll encounter in practice. But how do we build an algorithm that takes this into account?

We denote \(\phi(y_1, ..., y_n)\) as the target expected value we’d like to achieve on a sequence \(y_1, ..., y_n\). In other words, it’s the next value of the sequence. We compute the values of \(\phi\) on all such sequences.

In a sense, \(\phi\) is a function on the binary hypercube \(\{-1,1\}^n\). We define \(\phi : \{±1\}^n → R\). Under the uniform distribution on \(\{±1\}^n\), \(E_{\phi} = \frac{1}{2}\)

Lemma: If \(E_{\phi(\epsilon_1, ... ,\epsilon_n)} = \frac{1}{2}\), there is a randomized algorithm such that \(l(A; y_1, ..., y_n) = \phi(y_1, ..., y_n)\) for all sequences.

Corollary: For any \(\phi : \{±1\}^n\) and \(E_{\phi} ≥ \frac{1}{2}\), then, there exists a randomized algorithm \(A\) making, on average over its randomization, no more than \(\phi(y_1, ..., y_n)\) number of mistakes for any sequence.

If we are predicting node labels, for example, we might tilt \(\phi\) towards smooth labeling to respect the manifold property and respect an expected homogeneity among a graph structure.

## Constructing \(\phi\)

One of the first method to construct a good \(\phi\) in to combine several good predictors into a meta-predictor. After that, we may define a collection of \(\phi\)’s that predict human-generated sequences according to some finite state machines and create a meta-predictor that does as well as the best of them on the given sequence.

Performing as well as a finite collection of predictors is called **experts setting**. Assume that we have access to the predictions of each of the k expert, say \(P_1, ..., P_k\). On round \(t\), they produce a number, called **advice**, \(P_j(y_1, ..., y_{t-1}) \in \{-1, 1\}\).

For each \(j \in k\), define :

\(\phi^j(y_1, ..., y_n) = \frac{1}{n} \sum_t 1 \{ P_j(y_1, ..., y_{t-1}) ≠ y_t \}\), the performance of expert \(j\), which is known only at the end. Our aim is to do as well as the best expert, say :

\[min_{j \in k} \phi^j(y_1, ..., y_n)\]Under uniform distribution, the expectation of this function is less than \({1}{2}\). Recall the first corollary, in which we stated that in order to have a decent accuracy on human generated sequences, we need to have \(E_{\phi} ≥ \frac{1}{2}\). We will add a vanishing term to the expression defined above :

\[\phi(y_1, ..., y_n) = min_{j \in k} \phi^j(y_1, ..., y_n) + \sqrt{ \frac {c \log(k)}{n}}\]## Example

Suppose that we have information that human tend to have an imbalance between the 0’s and the 1’s in the sequences they generate. Then, there is a simple algorithm that will make at most :

\[min(\bar{p}, 1-\bar{p}) + O(1/\sqrt{n})\]The method does not need to know the proportion. If the sequence consists of 80% of 1’s, the algorithm will roughly make 20% of errors if n is large. To conclude this introduction, we can easily build a prediction method that will win over any unbalanced sequence entered by a human.

Conclusion: That’s it ! I hope this introduction to Online Learning was clear. Don’t hesitate to drop a comment if you have any question.