“An anomaly is an observation that deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism.” (Hawkins, 1980)

Anomaly detection is used in:

- network intrusions
- credit card fraud detection
- insurance
- finance
- surveillance
- rental services
- …

There are 3 types of anomaly detection:

- supervised : we have labels for both normal data and anomalies
- semi-supervised : only normal data is available, no outliers are present
- unsupervised : no labels, we suppose that anomalies are rare events

The key steps in anomaly detection are the following:

- learn a profile of a normal behavior, e.g. patterns, summary statistics…
- use that normal profile to build a decision function
- detect anomalies among new observations
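These three steps can be sketched with a deliberately simple profile: a Gaussian fit of the normal data, with the squared Mahalanobis distance as decision function. The Gaussian model and the 99% threshold are illustrative assumptions, not part of the notes.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)

# 1) Learn a profile of normal behaviour: here, mean and covariance
normal_data = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))
mu = normal_data.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(normal_data, rowvar=False))

# 2) Build a decision function from the profile: squared Mahalanobis
# distance, thresholded at the 99% quantile of its chi-squared law
def decision_function(x):
    d = x - mu
    return np.einsum("ij,jk,ik->i", d, cov_inv, d)

threshold = chi2.ppf(0.99, df=2)

# 3) Detect anomalies among new observations
new_obs = np.array([[0.1, -0.2], [8.0, 8.0]])
is_anomaly = decision_function(new_obs) > threshold
```

Any profile that yields a scalar "abnormality" score fits this template; only the decision function changes between methods.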

# Unsupervised Anomaly Detection

In unsupervised anomaly detection, we make the assumption that anomalies are rare events. The underlying data are unlabeled (no normal/abnormal label), hence the denomination. Anomalies are events supposed to be located in the tail of the distribution.

## Minimum Volume Set

We usually estimate the region where the data is most concentrated as the minimal region containing a fraction $\alpha$ of the values. This approach is called the **Minimum Volume Set**. Samples that do not belong to the Minimum Volume Set are anomalies. For a small value of $\alpha$, on the other hand, we tend to recover the modes of the distribution.

We can define the MV set as:

$$\Omega^*_\alpha \in \operatorname*{arg\,min}_{\Omega \in \mathcal{G}} \left\{ \lambda(\Omega) \ : \ P(\Omega) \geq \alpha \right\}$$

Where:

- $\alpha$ is a factor close to 1 that reflects the percentage of values in the MV set
- $\mathcal{G}$ is a class of measurable sets
- $\lambda$ is a Lebesgue measure
- $P$ is the unknown probability measure of the observations

The goal is to learn a minimum volume set $\hat{\Omega}_\alpha$ for $\alpha$ close to 1.

There are theoretical guarantees that there exists a unique MV set at level $\alpha$.

Let’s now make the assumption that $P$ has a density $f$ that:

- is bounded
- has no plateau

## Algorithms for anomaly detection

### Plug-in technique

This technique consists in seeing MV sets as density level sets:

$$\Omega^*_\alpha = \{x : f(x) \geq t_\alpha\}, \quad \text{where } t_\alpha \text{ is such that } P(f(X) \geq t_\alpha) = \alpha$$

Using a naive approach (the 2-split trick), we can:

- compute a density estimator $\hat{f}$, using parametric models or local averaging, based on a first sample $X_1, \ldots, X_n$
- based on a second sample $X'_1, \ldots, X'_n$, compute the empirical quantile $\hat{t}_\alpha$ corresponding to the $(1 - \alpha)$ percentile of $\hat{f}(X'_1), \ldots, \hat{f}(X'_n)$

The output is:

$$\hat{\Omega}_\alpha = \{x : \hat{f}(x) \geq \hat{t}_\alpha\}$$
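A minimal sketch of this plug-in procedure, assuming a Gaussian kernel density estimator (`scipy.stats.gaussian_kde`) and 1-D standard-normal data:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
alpha = 0.95

# 2-split trick: one half to fit the density, one half to calibrate
sample = rng.normal(size=2000)
fit_half, calib_half = sample[:1000], sample[1000:]

# Density estimator f_hat fitted on the first half
f_hat = gaussian_kde(fit_half)

# Empirical (1 - alpha) quantile of f_hat over the second half
t_hat = np.quantile(f_hat(calib_half), 1 - alpha)

# Decision function: x belongs to the estimated MV set iff f_hat(x) >= t_hat
def in_mv_set(x):
    return f_hat(np.atleast_1d(x)) >= t_hat
```

Points near the mode (e.g. `0.0`) fall inside the estimated MV set, while far-tail points (e.g. `10.0`) fall outside and are flagged as anomalies.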

### Unsupervised as binary classification

We can turn the unsupervised anomaly detection problem into a binary classification problem.

- Suppose that our data lie in $[0, 1]^d$, or that we scaled the data beforehand. We have $n$ points.
- We then generate a new sample from the uniform distribution $\mathcal{U}([0, 1]^d)$. We generate $m$ points.
- We assign a negative label ($Y = 0$) to the generated sample.
- We assign a positive label ($Y = 1$) to the true sample.

We have, approximately:

$$\eta(x) = P(Y = 1 \mid X = x) \approx \frac{n f(x)}{n f(x) + m}$$

The solution of this binary classification problem mimics the Bayes classifier, where we predict $1$ on the set $\{x : \eta(x) \geq t\}$, and $0$ otherwise. Since $\eta$ is increasing in $f$, this set is a density level set.

This solution provides an MV set at level $\alpha$. We select $m$ manually, and we tune the threshold $t$ using a grid search.
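A sketch of this real-vs-uniform reduction; the choice of a random forest as classifier, and the sample sizes, are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
d, n, m = 2, 2000, 2000

# True sample (positive label), already living in [0, 1]^d
X_pos = rng.normal(loc=0.5, scale=0.1, size=(n, d)).clip(0, 1)
# Generated sample from the uniform distribution on [0, 1]^d (negative label)
X_neg = rng.uniform(size=(m, d))

X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(n), np.zeros(m)])

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# The classifier's probability estimates play the role of eta(x);
# thresholding them at t recovers a density level set, i.e. an MV-set
# estimate (in practice t is tuned by grid search)
t = 0.5
scores = clf.predict_proba(np.array([[0.5, 0.5], [0.02, 0.98]]))[:, 1]
is_normal = scores > t
```

A point near the data cloud (`[0.5, 0.5]`) scores high, while a point far from it (`[0.02, 0.98]`) scores near zero and is flagged as an anomaly.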

### Histograms

In histograms, we suppose that we have a compact feature space $\mathcal{X}$, and that we build a partition formed of measurable subsets of the same volume:

$$\mathcal{X} = \bigcup_{k=1}^{K} C_k, \quad \lambda(C_k) = \frac{\lambda(\mathcal{X})}{K}$$

The class $\mathcal{G}$ is the ensemble composed of unions of cells $C_k$. We want to keep the minimal number of cells in order to isolate the normal data and exclude the anomalies. More formally, we are looking for the solution of:

$$\min_{\Omega \in \mathcal{G}} \lambda(\Omega) \quad \text{s.t.} \quad \hat{P}_n(\Omega) \geq \alpha - \phi$$

There is a simple and fast procedure to do this:

- For each cell $C_k$, we compute $\hat{p}_k = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{X_i \in C_k\}$, i.e. the proportion of data in each cell
- Sort the cells by decreasing order of $\hat{p}_k$
- Find the minimal value of $k$ such that $\sum_{j=1}^{k} \hat{p}_{(j)} \geq \alpha - \phi$. In other words, we exclude $(1 - \alpha)$ percent of the values based on this histogram.
- The output is $\hat{\Omega}_\alpha = \bigcup_{j=1}^{k} C_{(j)}$

Where $\phi$ is a tolerance level. However, such a technique is usually not flexible enough.
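The procedure above can be sketched on 1-D data; the Beta-distributed sample and the number of bins are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, phi = 0.95, 0.0
n_bins = 20

# 1-D data on the compact feature space [0, 1]
x = rng.beta(a=5, b=5, size=5000)

# Cells of equal volume; p[k] is the empirical mass of cell C_k
counts, edges = np.histogram(x, bins=n_bins, range=(0.0, 1.0))
p = counts / counts.sum()

# Sort cells by decreasing mass, then keep the minimal number of
# cells whose cumulative mass reaches alpha - phi
order = np.argsort(p)[::-1]
k = int(np.searchsorted(np.cumsum(p[order]), alpha - phi) + 1)
kept_cells = order[:k]

# MV-set estimate: union of the k kept cells; its volume is k / n_bins
volume = k / n_bins
```

Because the mass of this sample concentrates around 0.5, far fewer than `n_bins` cells are needed to capture 95% of the data, so the estimated MV set has volume well below 1.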

### Decision Trees

The partition should be determined based on the **data**. We adopt a top-down strategy, and start from the root node, which represents the whole feature space $\mathcal{X}$. The volume of any cell can be computed in a recursive manner.

Hyperrectangles with the label 1 are subsets of the MV set estimate. At each node $C$, the split leading to the siblings $C_L$ and $C_R$ recursively minimizes the volume of the current decision set $G$ under the following constraint:

$$\hat{P}_n(G) \geq \alpha$$

To avoid overfitting, we generally prune the tree after growing it.

### Isolation Forest

In isolation forests, we build a forest of isolation trees under the assumption that anomalies are more susceptible to isolation under random partitioning.

The main idea behind an **isolation tree** is to recursively (top-down) build a tree that:

- chooses a splitting variable $q$ uniformly at random
- chooses a split value $p$ uniformly between the minimum and the maximum of $q$
- sends each observation to the left or right child depending on whether its value is larger or lower than $p$

This process stops when a depth limit is reached, usually $\lceil \log_2 \psi \rceil$, where $\psi$ is the size of the (sub)sample used to grow the tree.
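A minimal usage sketch with scikit-learn's `IsolationForest` implementation; the contamination rate and the toy data are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# A normal Gaussian cluster plus a single far-away point
X = np.vstack([rng.normal(size=(500, 2)), [[10.0, 10.0]]])

# contamination sets the expected proportion of anomalies,
# which fixes the decision threshold on the isolation scores
forest = IsolationForest(n_estimators=100, contamination=0.01,
                         random_state=0).fit(X)

# predict returns -1 for anomalies and +1 for normal points
labels = forest.predict(X)
```

The far-away point is isolated after very few random splits, so it receives a short average path length across the trees and is flagged with `-1`.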

# Supervised Anomaly Detection

In the supervised setup, we have pairs of data values $(X, Y)$ in $\mathcal{X} \times \{0, 1\}$. If a data point has a positive label ($Y = 1$), it is an anomaly. This is a classic binary classification task.

We also define a ranking among anomalies, with the first data points being the ones that have the highest probability of being anomalies. This is called *bipartite ranking*. It is a different approach from classification, since classification remains a local task, whereas ranking is global.

We rank and score a set of instances through a scoring function $s : \mathcal{X} \to \mathbb{R}$ such that instances with label 1 appear at the top of the list with high probability.

How do we measure the quality of a ranking?

A natural measure of the ranking performance:

$$U(s) = P\{(s(X) - s(X'))(Y - Y') > 0\},$$

where $(X', Y')$ is an independent copy of $(X, Y)$, is estimated by the U-statistic:

$$\hat{U}_n(s) = \frac{2}{n(n-1)} \sum_{i < j} \mathbb{1}\{(s(X_i) - s(X_j))(Y_i - Y_j) > 0\}$$

This measure is also called the **rate of concordant pairs**, or **Kendall’s association coefficient**. One of the issues is that the computation of the gradients typically requires averaging over $O(n^2)$ pairs.

The criterion we want to maximize is:

$$\mathrm{AUC}(s) = P\{s(X) > s(X') \mid Y = 1, Y' = 0\}$$

The empirical counterpart of the AUC is really prohibitive to compute, and its maximization is even infeasible. To overcome such issues, generalized U-statistics using kernels, or incomplete U-statistics using only a sample of the terms, can be used.
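The complete pair statistic and its incomplete, subsampled counterpart can be sketched as follows; the rate of concordant pairs is normalized here over pairs with different labels (the empirical AUC), and the scores and labels are simulated:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated labels and informative but noisy scores
y = rng.integers(0, 2, size=n)
s = y + rng.normal(scale=0.8, size=n)

# Complete statistic: rate of concordant pairs over all n(n-1)/2 pairs,
# normalized by the number of pairs with different labels
i, j = np.triu_indices(n, k=1)
concordant = (s[i] - s[j]) * (y[i] - y[j]) > 0
comparable = y[i] != y[j]
u_complete = concordant.sum() / comparable.sum()

# Incomplete U-statistic: average over a random sample of B pairs,
# trading a little variance for O(B) instead of O(n^2) cost
B = 2000
bi, bj = rng.integers(0, n, size=(2, B))
mask = y[bi] != y[bj]
u_incomplete = ((s[bi] - s[bj]) * (y[bi] - y[bj]) > 0)[mask].sum() / mask.sum()
```

The incomplete estimate stays close to the complete one while touching only a small fraction of the pairs, which is the whole point of the subsampling trick.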
