“An anomaly is an observation that deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism.”, Hawkins (1980)

Anomaly detection is used in :

- network intrusions
- creit card fraud detection
- insurance
- finance
- surveillance
- rental services
- …

There are 3 types of anomaly detection :

- supervised : we have labels for both normal data and anomalies
- semi-supervised : only normal data is available, no outliers are present
- unsupervised : no labels, we suppose that anomalies are rare events

The key steps in anomaly detection are the following :

- learn a profile of a normal behavior, e.g. patterns, summary statistics…
- use that normal profile to build a decision function
- detect anomalies among new observations

# Unsupervised Anomaly Detection

In unsupervised anomaly detection, we make the assumption that anomalies are rare events. The underlying data are unlabeled (no normal/abnormal label), hence the denomination. Anomalies are events supposed to be locates in the tail of the distribution.

## Minimum Volume Set

We usually estimate the region where the data is the most concentrated as the minimal region containing \(\alpha %\) of the values. This approach is called **Minimum Volume Set**. Samples that do not belong to the Minimum Volume Set are anomalies. For a small value of \(\alpha\) on the other hand, we tend to recover the modes.

We can define the MV Set as :

\[Q( \alpha ) = argmin_{c \in C} \{ \lambda(C), P(X \in C) ≥ \alpha \}\]Where :

- \(\alpha\) is a factor close to 1 that reflects the percentage of values in the MV Set.
- \(C\) classes of measurable sets
- \(\lambda\) a Lebesgue measure
- \(\mu(dx)\) the unknown probability measure of the observations

The goal is to learn a minimum volume set \(Q(\alpha)\) for \(X_1, \cdots, X_n\)

There are theoretical guarantees that there exists a unique MV set at level \(\alpha\).

Let’s now make the assumption that \(\mu\) has a density \(h(X)\) that :

- is bounded
- has no plateau

## Algorithms for anomaly detection

### Plug-in technique

This technique consists in seeing MV sets as density level sets :

\[G_{\alpha}* = \{ x \in R^d : h(x) ≥ t_{\alpha} \}\]Using a naive approach (the 2-split trick), we can :

- compute a density estimator using parametric models or local averaging \(\hat{h}(x)\) based on \(X_1, \cdots, X_n\)
- based on a second sample \(X_1*, \cdots, X_n*\), compute the empirical quantile corresponding to the \(\alpha\) percentile

The output is :

\[\hat{G_{\alpha}} * = \{ x: \hat{h_n(x)} ≥ \hat{h_n} ( X_{n \alpha}' ) \}\]### Unsupervised as binary classification

We can turn the unsupervised anomaly detection problem as a binary classification.

- Suppose that our data lie into \([0,1]^d\), or that we scaled the data before. We have \(n\) points.
- We then generate a new sample from the uniform distribution \(U([0,1]^d)\). We generate \(m\) points.
- We assign a negative label to the generated sample.
- We assign a positive label to the true sample.

We have :

\(\frac{n}{n+m} = p\) approximately.

The solution of this binary classification mimics the Bayes Classifier where we predict \(1\) on the set \(\{ x \in [0,1]^d : h(x) ≥ \frac{1-p}{p} \}\), and \(-1\) otherwise.

This solution provides a MV set at level \(\alpha = P(h(X) ≥ \frac{1}{p-1})\). We select \(\alpha\) manually, and we tune \(p\) using a grid search.

### Histograms

In histograms, we suppose that we have a compact feature space, and that we build partitions \(C_1, \cdots, C_K\) formed of measurable subsets of same volume :

\[\lambda(C_1) = \cdots = \lambda(C_K)\]The class \(G_P\) is the ensemble composed of unions of \(C_K\). We want to get the minimal number of partitions to keep in order to isolate the normal data and exclude the anomalies. More formally, we are looking for the solution of :

\[\min_{G \in P : \hat{\mu_n}(G) ≥ \alpha - \phi} \lambda(G)\]There is a simple and fast procedure to do this :

- For each class \(k = 1 \cdots K\), we compute \(\hat{\mu_n}(C_k)\), i.e the proportion of data in each class
- Sort the cells by decresing order of \(\hat{\mu_n}(C_k)\)
- Find the minimal value of k such that : \(\hat{k} = argmin \{ k : \sum_i \hat{\mu_n}(C_i) ≥ \alpha - \phi \}\). In other words, we exclude \(\alpha\) percent of the values based on this histogram.
- The output is \(\hat{G_{\alpha}}* = U_i^{\hat{k}} C_{(i)}\)

Where \(\phi\) is a tolerance level. However, such technique is usually not flexible enough.

### Decision Trees

The partition should be determined based on the **data**. We adopt a top-down strategy, and start from the root node \(C_{0,0}\). The volume of any cell \(C_{j,k}\) can be computed in a recursive manner.

Hyperrectangless with the label 1 are subsets of the MV set estimate. At each node \(C_{j,k}\), the split leading to siblings \(C_{j+1,2k}\) and \(C_{j+1,2k+1}\) minimizes recursively the volume of the current ddecision set G under the following constraint :

\[\hat{\mu_n}(G) ≥ \alpha(-\phi)\]To avoid overfitting, we generally prune the tree after growing it to avoid overfitting.

### Isolation Forest

In insolation forests, we build a forest of isolation trees under the assumption that anomalies are more susceptible to isolation under random partitioning.

The main idea behind an **isolation tree** is to recursively (top-down) build a tree that :

- chooses randomly a splitting variable
- splits at value \(t\)
- splits the cells if the obsevation is larger or lower than \(t\)

This process stops when a depth limit is reached usually :

# Supervised Anomaly Detection

In the supervised Setup, we have pairs of data \((X,Y)\) values in \(R^d \times \{-1,+1\}\). If a data point has a positive label (Y=1), it is an anomaly. This is a classic binary classification task.

We also define a randking among anomalies, with the first data points the ones that have the highest probability to be anomalies. This is called *Bipartite Ranking*. It is a different approach from classification, since classification remains a local task, whereas ranking is global.

We rank and score a set of instances through a scoring function \(s(X)\) in which large number instances with label 1 appear on top of the list with high probabilities.

How do we measure the quality of a ranking?

A natural estimate of the ranking performance :

\[U(s) = P \{ (s(X) -s(X'))(Y-Y')>0\}\]is the U-Statistic :

\[\hat{U}_n(s) = \frac{2}{n(n-1)} \sum_{1≤i<j≤n}1 \{ (s(X_i) - s(X_j))(Y_i - Y_j) > 0 \}\]This measure is also called the **rate of concording pairs**, or **Kendall’s association coefficient**. One of the issues is that the computation of the gradients typically require to average over \(O(n^2)\) pairs.

The criterion we want to maximize is :

\[L(s) = P \{s(X^{(1)}) < \cdots < s(X^{(k)}) \mid Y^{(1)} = 1, \cdots, Y^{(k)} = K \}\]The empirical counterpart of \(L(s)\) is really prohibitive to compute, and the maximization is even unfeasible. To overcome such issues, generalized U-Statistics using Kernels or incomplete U-Statistics using only a sample of terms can be used.