27/04/2020

## Seed used in these slides

set.seed(1024)

## Libraries used in these slides

library(fpc)
library(dplyr)
library(ggplot2)
library(DMwR2)

## Anomaly Detection

• Has clear ties with clustering
  • Clustering: find and group similar items
  • Anomaly detection: find items which do not belong to any group
• Types of outliers
  • Point outliers: a single point that is out of the normal
  • Contextual outliers: a point that is out of the normal only in a specific context
    • e.g. a heart rate of 80 bpm is normal in itself, but can be anomalous in a particular context
  • Collective outliers: many similar points occurring together where only a few would be normal

## Univariate Outlier Detection

• The boxplot rule: a value is an outlier if it falls outside the interval

$$[Q_1-1.5\times IQR, Q_3+1.5\times IQR]$$

• Grubbs' test

$$\displaystyle z=\frac{|x-\bar x|}{s_x}$$

$$\tau = t^2_{\alpha/(2N),N-2}$$

$$\displaystyle z\geq \frac{N-1}{\sqrt N} \sqrt{\frac {\tau} {N-2+\tau}}$$

A case is an outlier if this inequality holds.

• Implemented in package outliers as grubbs.test()
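A minimal base-R sketch of both rules above. The toy vector `x` and the helper names `boxplot_outliers` and `grubbs_flags` are illustrative assumptions, not part of the slides. Formal Grubbs' testing targets the most extreme value; this sketch only evaluates the inequality for every case.

```r
# Hypothetical toy data: nine ordinary readings and one suspicious value.
x <- c(10, 11, 9, 10, 12, 10, 11, 9, 10, 30)

# Boxplot rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
boxplot_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  which(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)
}

# Grubbs' inequality, evaluated for each case.
grubbs_flags <- function(x, alpha = 0.05) {
  n <- length(x)
  z <- abs(x - mean(x)) / sd(x)
  tau <- qt(alpha / (2 * n), df = n - 2)^2   # squared t quantile
  threshold <- (n - 1) / sqrt(n) * sqrt(tau / (n - 2 + tau))
  which(z >= threshold)
}

boxplot_outliers(x)   # position of the value 30
grubbs_flags(x)       # position of the value 30
```

Both helpers return positions, so the flagged values can be read back with `x[boxplot_outliers(x)]`.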

## Univariate Outlier Detection

• For categorical variables there is no simple formula
• We need expert knowledge to compare the distribution of values
• Then, we can label anomalies
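One way to prepare the distribution of values for an expert is a relative frequency table. This is only a sketch: the `payment` variable and the 5% cut-off are made-up assumptions, and the final call on what counts as an anomaly stays with the expert.

```r
# Hypothetical categorical variable (made-up payment types).
payment <- c(rep("card", 60), rep("cash", 37), rep("cheque", 2), "barter")

# Relative frequency of each level, rarest first.
freq <- sort(prop.table(table(payment)))
freq

# Candidate anomalies: levels rarer than an arbitrary 5% threshold.
rare_levels <- names(freq)[as.vector(freq) < 0.05]
rare_levels
```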

## Multi-Variate Outlier Detection

• Types of detection
• Supervised
• Unsupervised
• Semi-supervised

## Multi-Variate Outlier Detection

• Unsupervised
• DBSCAN (covered last week)
dbscan.outliers <- function(data, ...) {
  require(fpc, quietly=TRUE)
  cl <- dbscan(data, ...)
  posOuts <- which(cl$cluster == 0)
  list(positions = posOuts,
       outliers = data[posOuts,],
       dbscanResults = cl)
}

## Unsupervised

load("house.data") # loads houseData from file
names(houseData)
## [1] "MustakilMi" "OrijinalAlan" "BanyoSayisi" "OdaSayisi" "SalonSayisi"
## [6] "ToplamKat" "GercekYas" "FiyatTL"
outs <- dbscan.outliers(houseData, eps = 3, scale=TRUE)
outs$positions
## [1]  24  65 100 174 190 271
houseData$outlier = 0
houseData$outlier[outs$positions] = 1
ggplot(houseData, aes(
x = GercekYas,
y = FiyatTL,
color = as.factor(outlier))) +
geom_point() +
theme(legend.position="bottom")

## Unsupervised

• Another method is $$OR_h$$ by Torgo, 2007.
• Uses the merge process of agglomerative hierarchical clustering technique

houseData$outlier = NULL
outs <- outliers.ranking(scale(houseData))
outs$rank.outliers[1:10]
## [1] 2 46 56 133 180 198 241 251 45 204
houseData$outlier <- 0
houseData$outlier[ outs$rank.outliers[1:10]] <- 1
ggplot(houseData, aes(
x = GercekYas,
y = FiyatTL,
color = as.factor(outlier))) +
geom_point() +
theme(legend.position="bottom")

## Unsupervised

• Another method is LOF by Breunig et al., 2000
• It is implemented as lofactor() in the book package (DMwR2)
houseData$outlier = NULL
out.scores <- lofactor(scale(houseData), 15)
top_outliers <- order(out.scores, decreasing = T)[1:10]
top_outliers
## [1] 243 24 100 132 266 174 190 65 248 57
houseData$outlier <- 0
houseData$outlier[top_outliers] <- 1
ggplot(houseData, aes(
x = GercekYas,
y = FiyatTL,
color = as.factor(outlier))) +
geom_point() +
theme(legend.position="bottom")

## Supervised

• Training data with manually labeled outliers is required
• Train a classification model with outliers being the target variable
• Use the model for detecting outliers in new, unseen data

Major problem: imbalance!

• Outliers are rare by definition, so they will be outnumbered
• This imbalance creates problems for learning algorithms
• If outliers are 2% of the set, labeling everything as normal has an accuracy of 98%!
• Models usually ignore outliers: they are designed to detect regularities, not irregularities
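The 98% figure can be checked with a few lines of base R. The labels below are made up to match the 2% / 98% split; the "model" simply predicts normal for every case.

```r
# Hypothetical labels: 2 outliers and 98 normal cases.
actual <- factor(c(rep("outlier", 2), rep("normal", 98)),
                 levels = c("outlier", "normal"))

# A "model" that ignores outliers and predicts normal for everything.
predicted <- factor(rep("normal", 100), levels = levels(actual))

accuracy <- mean(predicted == actual)   # share of correct predictions
outliers_found <- sum(predicted == "outlier" & actual == "outlier")

accuracy         # 0.98
outliers_found   # 0
```

High accuracy, yet not a single outlier is detected, which is why accuracy alone is a poor guide here.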

## Supervised

• To fix imbalance
  • oversample outliers
  • undersample regulars (normal cases)
  • if supported by the ML method, use biased cost matrices
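A rough base-R sketch of the two resampling options, assuming a made-up data frame `d` with a `label` column. This is plain random resampling, not any specific package routine.

```r
set.seed(1024)

# Hypothetical imbalanced data: 10 outliers among 500 cases.
d <- data.frame(x = rnorm(500),
                label = c(rep("outlier", 10), rep("normal", 490)))
out_idx  <- which(d$label == "outlier")
norm_idx <- which(d$label == "normal")

# Oversampling: draw outliers with replacement until they match the normals.
over <- d[c(norm_idx, sample(out_idx, length(norm_idx), replace = TRUE)), ]

# Undersampling: keep only as many normals as there are outliers.
under <- d[c(sample(norm_idx, length(out_idx)), out_idx), ]

table(over$label)
table(under$label)
```

Resampling should be applied to the training set only, so that the test set keeps the real class proportions.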

## Predictive Analysis

Using the data at hand, build a model which can be used to predict the value of a response variable based on the values of input variables.

• Almost all ML models are basically curve fitting algorithms
• If you fit a curve to the existing data points, you can use this curve to compute unknown/unobserved points

## Predictive Analysis

Mainly two types:

• Classification: nominal target variable
• Regression: numeric target variable

Ordinal target variables can be treated as either nominal or numeric, so they may fall into either category.

## Predictive Analysis

Mostly, predictive analysis is curve fitting.

$$f(X_1, X_2, ..., X_k) \rightarrow Y$$

Overall approach:

1. First assume the shape of $$f$$ (the type of model)
• linear, logical, probabilistic, complex, ensemble
2. Based on the data, optimize $$f$$
3. Evaluate results
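The three steps can be sketched with a linear model on synthetic data. Everything here (the data, the 70/30 split, the choice of `lm`) is an assumption made for illustration.

```r
set.seed(1024)

# Hypothetical data: y depends linearly on x, plus noise.
x <- runif(100, 0, 10)
d <- data.frame(x = x, y = 3 + 2 * x + rnorm(100, sd = 0.5))
train <- d[1:70, ]
test  <- d[71:100, ]

# 1. Assume the shape of f: a straight line.
# 2. Optimize f on the training data (least squares).
fit <- lm(y ~ x, data = train)

# 3. Evaluate the fitted curve on held-out cases.
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((pred - test$y)^2))

coef(fit)   # should be close to the true intercept 3 and slope 2
rmse
```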

## Predictive Analysis

Why choose one model over another?

• Speed / Complexity
• Accuracy / Success of prediction

## Classification

• Confusion matrix
• A matrix of counts for each combination of predicted class and actual class (ground truth)
• The predictions are the columns and the actual values are the rows

| Actual / Predicted | $$c_1$$ | $$c_2$$ | $$c_3$$ |
|---|---|---|---|
| $$c_1$$ | a | b | c |
| $$c_2$$ | d | e | f |
| $$c_3$$ | g | h | i |

• a: the actual value is $$c_1$$ and the prediction is $$c_1$$
• b: the actual value is $$c_1$$ but the prediction is $$c_2$$
• d: the actual value is $$c_2$$ but the prediction is $$c_1$$

## Classification

• Error rate (a.k.a. the 0/1 loss)

$$L_{0/1} = \frac{1}{N_{test}} \sum_{i=1}^{N_{test}}{I(\hat h(x_i) \neq y_i)}$$

where,

• $$N_{test}$$ is the number of test cases.
• $$I(x)$$ is an indicator function:
• x is false $$\rightarrow I(x) = 0$$
• x is true $$\rightarrow I(x) = 1$$
• $$\hat h(x_i)$$ is the prediction for $$x_i$$
• $$y_i$$ is the actual target value for observation i
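Since R treats `TRUE` as 1 and `FALSE` as 0, the indicator sum reduces to a mean over a logical vector. The two vectors are made-up examples.

```r
# Hypothetical test set of five cases.
actual    <- c("a", "b", "a", "c", "b")
predicted <- c("a", "b", "c", "c", "a")

# I(prediction != actual) averaged over the test cases.
loss01 <- mean(predicted != actual)
loss01   # 2 errors out of 5 = 0.4
```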

## Classification

• Accuracy

$$Acc = 1-L_{0/1}$$

| Actual / Predicted | $$c_1$$ | $$c_2$$ | $$c_3$$ |
|---|---|---|---|
| $$c_1$$ | a | b | c |
| $$c_2$$ | d | e | f |
| $$c_3$$ | g | h | i |

$$\displaystyle Acc = \frac{a+e+i}{N_{test}}$$
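With the confusion matrix stored as an R matrix, accuracy is the diagonal sum over the total. The counts are invented for illustration.

```r
# Hypothetical 3-class confusion matrix (rows: actual, columns: predicted).
lab <- c("c1", "c2", "c3")
cm <- matrix(c(5, 1, 0,
               2, 6, 1,
               0, 1, 4),
             nrow = 3, byrow = TRUE,
             dimnames = list(actual = lab, predicted = lab))

acc  <- sum(diag(cm)) / sum(cm)        # (a + e + i) / N_test
loss <- 1 - acc                        # the 0/1 loss

acc    # 15 / 20 = 0.75
loss   # 0.25
```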

## Classification

• Cost/benefit matrix

| Actual / Predicted | $$c_1$$ | $$c_2$$ | $$c_3$$ |
|---|---|---|---|
| $$c_1$$ | $$B_{1,1}$$ | $$C_{1,2}$$ | $$C_{1,3}$$ |
| $$c_2$$ | $$C_{2,1}$$ | $$B_{2,2}$$ | $$C_{2,3}$$ |
| $$c_3$$ | $$C_{3,1}$$ | $$C_{3,2}$$ | $$B_{3,3}$$ |

• Provides flexible cost and benefit values for each type of prediction
• Especially useful in imbalanced datasets
• Also useful for fraud detection, outlier detection, etc.

## Utility

• Utility is computed as $$U = \sum^{n_c}_{i=1}{\sum^{n_c}_{k=1}{CM_{i,k}\times CB_{i,k}}}$$, where $$n_c$$ is the number of classes

• CM: Confusion matrix

• CB: Cost/benefit matrix

## Classification

Standard CB matrix:

| Actual / Predicted | outlier | normal |
|---|---|---|
| outlier | 1 | 0 |
| normal | 0 | 1 |

An example CB matrix for outlier detection:

| Actual / Predicted | outlier | normal |
|---|---|---|
| outlier | 5 | -5 |
| normal | -1 | 0.1 |

• Consider 100 cases: 98% regular, 2% outlier
• If we mark everything as normal
  • standard utility: 98 × 1 = 98
  • modified utility: 2 × (-5) + 98 × 0.1 = -10 + 9.8 = -0.2
• You can normalize by the maximum possible utility (every case predicted correctly)
  • standard utility: 98 / 100 = 0.98
  • modified utility: -0.2 / (2 × 5 + 98 × 0.1) = -0.2 / 19.8 = -0.0101
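The numbers above can be reproduced as an element-wise product of two matrices. The helper names `utility` and `max_utility` are illustrative; `max_utility` assumes the best outcome for each case is a correct prediction, which holds for both CB matrices here.

```r
# Confusion matrix for 100 cases (2 outliers, 98 normal) when everything
# is predicted as normal. Rows: actual, columns: predicted.
lab <- c("outlier", "normal")
cm <- matrix(c(0,  2,
               0, 98), nrow = 2, byrow = TRUE, dimnames = list(lab, lab))

cb_standard <- matrix(c(1, 0,
                        0, 1), nrow = 2, byrow = TRUE,
                      dimnames = list(lab, lab))
cb_modified <- matrix(c( 5, -5,
                        -1, 0.1), nrow = 2, byrow = TRUE,
                      dimnames = list(lab, lab))

# Utility: element-wise product of CM and CB, summed over all cells.
utility <- function(cm, cb) sum(cm * cb)

# Best case: every case lands on the diagonal of the confusion matrix.
max_utility <- function(cm, cb) sum(rowSums(cm) * diag(cb))

utility(cm, cb_standard)                                  # 98
utility(cm, cb_modified)                                  # -0.2
utility(cm, cb_standard) / max_utility(cm, cb_standard)   # 0.98
utility(cm, cb_modified) / max_utility(cm, cb_modified)   # about -0.0101
```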

## Classification

• When you have a binary classification

| Actual / Predicted | T | F |
|---|---|---|
| T | TP | FN |
| F | FP | TN |

• Precision: ratio of correctly identified positives to all cases predicted as positive.

Prec = $$\frac{TP}{TP+FP}$$

• Recall: ratio of correctly identified positives to all actual positives.

Rec = $$\frac{TP}{TP+FN}$$
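A quick numeric check with made-up counts:

```r
# Hypothetical binary confusion matrix counts.
TP <- 8; FN <- 2; FP <- 4; TN <- 86

precision <- TP / (TP + FP)   # 8 / 12
recall    <- TP / (TP + FN)   # 8 / 10

precision
recall
```

Note that TN appears in neither formula, so both metrics stay informative when the negative class dominates.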

## Classification

• You can aggregate precision and recall into one metric, the F-measure:

$$F_\beta = \frac{(\beta^2+1)\times Prec \times Rec}{\beta^2\times Prec + Rec}$$
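The formula translates directly into a small function (the name `f_measure` is an illustrative choice). With $$\beta = 1$$ it is the harmonic mean of precision and recall; $$\beta > 1$$ gives more weight to recall.

```r
# F-measure for a given precision, recall and beta.
f_measure <- function(prec, rec, beta = 1) {
  (beta^2 + 1) * prec * rec / (beta^2 * prec + rec)
}

f1 <- f_measure(0.5, 1)             # 1 / 1.5, about 0.667
f2 <- f_measure(0.5, 1, beta = 2)   # 2.5 / 3, about 0.833
f1
f2
```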

## Regression

• For numeric target variables, one frequently used metric is the mean squared error:

$$\displaystyle MSE = \frac{1}{N_{test}}{\sum^{N_{test}}_{i=1}{(\hat y_i - y_i)^2}}$$

• Or, to keep the error in the same units as the target variable, use the root mean squared error:

$$\displaystyle RMSE = \sqrt{MSE}$$

• Or, alternatively use mean absolute error:

$$\displaystyle MAE = \frac{1}{N_{test}}{\sum^{N_{test}}_{i=1}{|\hat y_i - y_i|}}$$
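All three metrics are one-liners in base R. The four test cases are made up.

```r
# Hypothetical actual values and predictions.
actual    <- c(3, 5, 2, 7)
predicted <- c(2, 5, 4, 8)

err  <- predicted - actual   # -1, 0, 2, 1
mse  <- mean(err^2)          # (1 + 0 + 4 + 1) / 4 = 1.5
rmse <- sqrt(mse)            # same units as the target
mae  <- mean(abs(err))       # (1 + 0 + 2 + 1) / 4 = 1
```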

## Regression

• You can use a baseline method to produce relative error metrics.
• A baseline method is something naive, such as the mean $$y$$ value
• Normalized mean squared error:

$$\displaystyle NMSE = \frac{\sum^{N_{test}}_{i=1}{(\hat y_i - y_i)^2}}{\sum^{N_{test}}_{i=1}{(\bar y - y_i)^2}}$$

• A good model has an NMSE close to 0. A value of 1 means the model performs no better than the baseline, and a value above 1 means it performs worse.
• Also, the normalized mean absolute error:

$$\displaystyle NMAE = \frac{\sum^{N_{test}}_{i=1}{|\hat y_i - y_i|}}{\sum^{N_{test}}_{i=1}{|\bar y - y_i|}}$$
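A small sketch of both normalized metrics. The data are made up, and `y_bar` stands for the baseline prediction, assumed here to be the mean of the training targets.

```r
# Hypothetical actual values and predictions on the test set.
actual    <- c(3, 5, 2, 7)
predicted <- c(2, 5, 4, 8)
y_bar     <- 4   # assumed mean of the training targets (the baseline)

# Model error relative to the error of always predicting y_bar.
nmse <- sum((predicted - actual)^2) / sum((y_bar - actual)^2)     # 6 / 15
nmae <- sum(abs(predicted - actual)) / sum(abs(y_bar - actual))   # 4 / 7

nmse   # 0.4
nmae   # about 0.571
```

Both values are below 1, so this hypothetical model beats the baseline.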

## Implementations

• There are many implementations of these metrics
• function mmetric in package rminer (Cortez, 2015)
• functions classificationMetrics and regressionMetrics in package performanceEstimation (Torgo, 2014a)
• function performance in package ROCR (Sing et al., 2009)
• function performance in package mlr (Bischl et al., 2016)
• And, you can always compute them on the fly.