“An anomaly is an observation that deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism.”, Hawkins (1980)

Anomaly detection is used in:

• network intrusion detection
• credit card fraud detection
• insurance
• finance
• surveillance
• rental services

There are 3 types of anomaly detection:

• supervised : we have labels for both normal data and anomalies
• semi-supervised : only normal data is available, no outliers are present
• unsupervised : no labels, we suppose that anomalies are rare events

The key steps in anomaly detection are the following:

• learn a profile of a normal behavior, e.g. patterns, summary statistics…
• use that normal profile to build a decision function
• detect anomalies among new observations
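As an illustration, the three steps above can be sketched with a toy Gaussian profile. The data, the z-score rule, and the threshold of 3 standard deviations are illustrative assumptions, not part of the notes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: learn a profile of normal behavior (here: mean and std of 1-D data).
normal_data = rng.normal(loc=10.0, scale=2.0, size=1000)
mu, sigma = normal_data.mean(), normal_data.std()

# Step 2: build a decision function from the profile (here: a z-score rule;
# the 3-standard-deviation threshold is an arbitrary illustrative choice).
def is_anomaly(x, threshold=3.0):
    return abs(x - mu) / sigma > threshold

# Step 3: detect anomalies among new observations.
new_points = np.array([9.5, 10.8, 25.0])
flags = [is_anomaly(x) for x in new_points]  # only 25.0 should be flagged
```

Any richer profile (patterns, densities, summary statistics) slots into the same three-step template.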

# Unsupervised Anomaly Detection

In unsupervised anomaly detection, we assume that anomalies are rare events. The underlying data are unlabeled (no normal/abnormal label), hence the name. Anomalies are supposed to be located in the tail of the distribution.

## Minimum Volume Set

We usually estimate the region where the data is most concentrated as the minimal-volume region containing $\alpha \%$ of the values. This approach is called the Minimum Volume Set. Samples that do not belong to the Minimum Volume Set are anomalies. For a small value of $\alpha$, on the other hand, we tend to recover the modes of the distribution.

We can define the MV Set as:

$$Q(\alpha) = \underset{G \in \mathcal{C}}{\arg\min} \left\{ \lambda(G) \; : \; \mu(G) \geq \alpha \right\}$$

Where:

• $\alpha$ is a factor close to 1 that reflects the percentage of values in the MV Set
• $\mathcal{C}$ is a class of measurable sets
• $\lambda$ is the Lebesgue measure
• $\mu(dx)$ is the unknown probability measure of the observations

The goal is to learn a minimum volume set $Q(\alpha)$ from the sample $X_1, \cdots, X_n$.

There are theoretical guarantees that there exists a unique MV set at level $\alpha$.

Let’s now assume that $\mu$ has a density $h(x)$ that:

• is bounded
• has no plateau

## Algorithms for anomaly detection

### Plug-in technique

This technique consists in seeing MV sets as density level sets:

$$Q(\alpha) = \left\{ x : h(x) \geq t_{\alpha} \right\}$$

where the threshold $t_{\alpha}$ is chosen such that $\mu(Q(\alpha)) = \alpha$.

Using a naive approach (the 2-split trick), we can:

• compute a density estimator $\hat{h}(x)$ based on $X_1, \cdots, X_n$, using parametric models or local averaging
• based on a second sample $X_1^*, \cdots, X_n^*$, compute the empirical quantile of $\hat{h}$ corresponding to the $\alpha$ percentile

The output is:

$$\hat{G} = \left\{ x : \hat{h}(x) \geq \hat{t}_{\alpha} \right\}$$

where $\hat{t}_{\alpha}$ is the empirical quantile computed on the second sample.
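A minimal numpy sketch of the 2-split plug-in technique, assuming 1-D Gaussian data and a hand-rolled Gaussian kernel estimator (the bandwidth, sample sizes, and distribution are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.95

# The 2-split trick: two independent samples from the same distribution.
X1 = rng.normal(0.0, 1.0, size=500)  # fits the density estimator
X2 = rng.normal(0.0, 1.0, size=500)  # calibrates the threshold

# Density estimator h_hat by local averaging (Gaussian kernel, arbitrary bandwidth).
bandwidth = 0.3
def h_hat(x):
    return np.mean(np.exp(-0.5 * ((x - X1) / bandwidth) ** 2)) / (bandwidth * np.sqrt(2 * np.pi))

# Empirical quantile: threshold t_alpha such that a fraction alpha of the
# second sample satisfies h_hat(x) >= t_alpha.
scores = np.array([h_hat(x) for x in X2])
t_alpha = np.quantile(scores, 1 - alpha)

# Output: the estimated MV set {x : h_hat(x) >= t_alpha}.
def in_mv_set(x):
    return h_hat(x) >= t_alpha
```

Points far in the tails fall below the calibrated threshold and are flagged as anomalies.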

### Unsupervised as binary classification

We can turn the unsupervised anomaly detection problem into a binary classification problem.

• Suppose that our data lie in $[0,1]^d$, or that we scale the data beforehand. We have $n$ points.
• We then generate $m$ new points from the uniform distribution $U([0,1]^d)$.
• We assign a negative label to the generated sample.
• We assign a positive label to the true sample.

We have:

$$p \approx \frac{n}{n+m}$$

The solution of this binary classification problem mimics the Bayes classifier, which predicts $1$ on the set $\{ x \in [0,1]^d : h(x) \geq \frac{1-p}{p} \}$, and $-1$ otherwise.

This solution provides an MV set at level $\alpha = P(h(X) \geq \frac{1-p}{p})$. We select $\alpha$ manually, and we tune $p$ using a grid search.
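A sketch of this construction, assuming 2-D data concentrated around the center of $[0,1]^2$; the hand-rolled k-nearest-neighbours vote merely stands in for whatever binary classifier one would actually train:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, m = 2, 300, 300

# True sample, assumed to lie in [0, 1]^d (label +1).
X_pos = np.clip(rng.normal(0.5, 0.1, size=(n, d)), 0.0, 1.0)

# Artificial sample drawn from U([0, 1]^d) (label -1).
X_neg = rng.uniform(0.0, 1.0, size=(m, d))

X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(n), -np.ones(m)])
p = n / (n + m)  # proportion of true data among all points

# Any off-the-shelf classifier could be trained on (X, y); a simple
# k-nearest-neighbours majority vote stands in for it here.
def predict(x, k=15):
    dists = np.linalg.norm(X - x, axis=1)
    votes = y[np.argsort(dists)[:k]]
    return 1 if votes.sum() > 0 else -1
```

Dense regions of the true sample are classified $+1$ (normal), while regions populated only by the uniform background come out $-1$ (anomalous).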

### Histograms

In histograms, we suppose that the feature space is compact, and we build a partition $C_1, \cdots, C_K$ formed of measurable subsets of the same volume.

The class $G_P$ is the collection of all unions of cells $C_k$. We want to keep the minimal number of cells needed to isolate the normal data and exclude the anomalies. More formally, we are looking for the solution of:

$$\min_{G \in G_P} \lambda(G) \quad \text{subject to} \quad \hat{\mu}_n(G) \geq \alpha - \phi$$

There is a simple and fast procedure to do this:

• For each cell $k = 1, \cdots, K$, compute $\hat{\mu}_n(C_k)$, i.e. the proportion of data falling in the cell
• Sort the cells by decreasing order of $\hat{\mu}_n(C_k)$, yielding $C_{(1)}, \cdots, C_{(K)}$
• Find the minimal number of cells to keep: $\hat{k} = \arg\min \{ k : \sum_{i=1}^{k} \hat{\mu}_n(C_{(i)}) \geq \alpha - \phi \}$. In other words, we keep the densest cells until they contain at least an $\alpha - \phi$ fraction of the values
• The output is $\hat{G}_{\alpha}^* = \bigcup_{i=1}^{\hat{k}} C_{(i)}$

Where $\phi$ is a tolerance level. However, such a technique is usually not flexible enough.
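The procedure can be sketched in numpy for 1-D data on $[0,1]$ (the cell count $K$, the data distribution, and $\phi = 0$ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, phi = 0.95, 0.0
K = 20  # number of cells of equal volume partitioning [0, 1]

# 1-D data on the compact feature space [0, 1].
X = np.clip(rng.normal(0.5, 0.1, size=1000), 0.0, 1.0)

# Proportion of data falling in each cell C_k.
counts, _ = np.histogram(X, bins=K, range=(0.0, 1.0))
mu_hat = counts / counts.sum()

# Sort the cells by decreasing mass, then keep the smallest prefix whose
# cumulative mass reaches alpha - phi.
order = np.argsort(mu_hat)[::-1]
cum = np.cumsum(mu_hat[order])
k_hat = int(np.searchsorted(cum, alpha - phi) + 1)
kept_cells = set(order[:k_hat].tolist())  # indices of the retained cells

def in_estimate(x):
    cell = min(int(x * K), K - 1)
    return cell in kept_cells
```

Points falling in low-mass cells, i.e. outside the union of kept cells, are declared anomalous.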

### Decision Trees

Rather than being fixed in advance, the partition should be determined from the data. We adopt a top-down strategy and start from the root node $C_{0,0}$. The volume of any cell $C_{j,k}$ can be computed recursively.

Hyperrectangles with label $1$ are subsets of the MV set estimate. At each node $C_{j,k}$, the split leading to the children $C_{j+1,2k}$ and $C_{j+1,2k+1}$ recursively minimizes the volume of the current decision set $G$ under the following constraint:

$$\hat{\mu}_n(G) \geq \alpha - \phi$$

To avoid overfitting, we generally prune the tree after growing it.
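The recursive volume computation can be illustrated on axis-aligned hyperrectangles (a toy sketch; the cell representation is an assumption, not the notes' implementation):

```python
# A cell is a hyperrectangle stored as a list of per-axis (low, high) bounds.
def volume(cell):
    v = 1.0
    for low, high in cell:
        v *= high - low
    return v

# Splitting C_{j,k} on one axis at value t yields the children C_{j+1,2k}
# and C_{j+1,2k+1}; their volumes always sum to the parent's volume.
def split(cell, axis, t):
    left, right = list(cell), list(cell)
    low, high = cell[axis]
    left[axis] = (low, t)
    right[axis] = (t, high)
    return left, right

root = [(0.0, 1.0), (0.0, 1.0)]           # C_{0,0} = [0, 1]^2, volume 1
left, right = split(root, axis=0, t=0.3)  # volumes 0.3 and 0.7
```

Since each child's volume follows directly from its parent's bounds, the volume of the current decision set $G$ is cheap to maintain during tree growth.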

### Isolation Forest

In isolation forests, we build a forest of isolation trees under the assumption that anomalies are more susceptible to isolation under random partitioning.

The main idea behind an isolation tree is to recursively (top-down) build a tree that:

• randomly chooses a splitting variable
• randomly chooses a split value $t$ between the minimum and maximum of that variable
• sends each observation to one of the two child cells depending on whether it is lower or larger than $t$

This process usually stops when a depth limit is reached, or when a single observation is left in the cell. Anomalies, being easier to isolate, end up with a shorter average path length from the root.
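A toy 1-D isolation tree illustrating the idea (a real isolation forest also picks a random splitting variable at each node and aggregates many subsampled trees; this sketch is not the full algorithm, and all data are synthetic):

```python
import random

random.seed(4)

def path_length(x, data, depth=0, max_depth=10):
    """Depth at which the 1-D point x is isolated by random splits of data."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    lo, hi = min(data), max(data)
    if lo == hi:
        return depth
    t = random.uniform(lo, hi)  # random split value between min and max
    side = [v for v in data if v < t] if x < t else [v for v in data if v >= t]
    return path_length(x, side, depth + 1, max_depth)

# Bulk of the data is Gaussian; 8.0 is an obvious outlier.
data = [random.gauss(0.0, 1.0) for _ in range(200)] + [8.0]

# Averaging over many random trees: the outlier isolates in fewer splits.
def avg_path(x, trees=200):
    return sum(path_length(x, data) for _ in range(trees)) / trees
```

The outlier at $8.0$ is typically separated after a couple of random splits, while points in the bulk keep hitting the depth limit, which is exactly the signal the forest averages into an anomaly score.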

# Supervised Anomaly Detection

In the supervised setup, we have pairs of data $(X, Y)$ with values in $\mathbb{R}^d \times \{-1,+1\}$. If a data point has a positive label ($Y = 1$), it is an anomaly. This is a classic binary classification task.

We also define a ranking among anomalies, where the first data points are the ones with the highest probability of being anomalies. This is called Bipartite Ranking. It is a different approach from classification, since classification remains a local task, whereas ranking is a global one.

We rank and score a set of instances through a scoring function $s(X)$, such that instances with label $1$ appear at the top of the list with high probability.

How do we measure the quality of a ranking?

A natural estimate of the ranking performance is the U-statistic:

$$U_n(s) = \frac{2}{n(n-1)} \sum_{i < j} \mathbb{1}\left\{ (s(X_i) - s(X_j))(Y_i - Y_j) > 0 \right\}$$

This measure is also called the rate of concordant pairs, or Kendall’s association coefficient. One of the issues is that computing the gradients typically requires averaging over $O(n^2)$ pairs.

The criterion we want to maximize is:

$$L(s) = P\left\{ (s(X) - s(X'))(Y - Y') > 0 \right\}$$

where $(X', Y')$ is an independent copy of $(X, Y)$. The empirical counterpart of $L(s)$ is really prohibitive to compute, and its maximization is even unfeasible. To overcome such issues, generalized U-statistics using kernels, or incomplete U-statistics using only a sample of the terms, can be used.
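The $O(n^2)$ pairwise computation can be sketched as follows, on synthetic scores and labels (all names and data are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200

# Synthetic labels and informative but noisy scores: anomalies (y = +1)
# tend to receive higher scores.
y = rng.choice([-1, 1], size=n)
s = y * 1.0 + rng.normal(0.0, 0.5, size=n)

# Rate of concordant pairs over all O(n^2) pairs (i, j), i < j.
def concordance(s, y):
    total, concordant = 0, 0
    for i in range(len(s)):
        for j in range(i + 1, len(s)):
            total += 1
            if (s[i] - s[j]) * (y[i] - y[j]) > 0:
                concordant += 1
    return concordant / total

u = concordance(s, y)
```

The double loop makes the quadratic cost explicit; this is precisely what incomplete U-statistics avoid by summing over a random subsample of the pairs.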
