Lecture 8

2024-04-30

Seed used in these slides

set.seed(1024)

Libraries used in these slides

library(fpc)
library(dplyr)
library(ggplot2)
library(DMwR2)

Anomaly Detection

Has clear ties with clustering
- Clustering: find and group similar items
- Anomaly Detection: find items which do not belong to any groups
Types of outliers
- Point outliers: a single point that deviates from the rest of the data
- Contextual outliers: a point that is abnormal only in a specific context
  - It is normal to have a heart rate of 80bpm
  - …unless you are dead.
- Collective outliers: a group of points that is anomalous together, although a few of them alone would be fine
  - Multiple failed login attempts

Univariate Outlier Detection

The boxplot rule: a value is an outlier if it falls outside the interval

\([Q_1-1.5\times IQR, Q_3+1.5\times IQR]\)
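
A minimal sketch of the boxplot rule in base R, assuming a made-up numeric vector x (the values are illustrative and are not taken from house.data). The quartiles come from quantile() with its default settings; other quantile definitions can shift the fences slightly.

```r
# Hypothetical sample: nine ordinary values and one extreme value
x <- c(12, 14, 15, 15, 16, 17, 18, 19, 20, 95)

# Quartiles and interquartile range (default quantile definition)
q <- unname(quantile(x, probs = c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Fences of the boxplot rule
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

# Values outside [lower, upper] are flagged as outliers
boxplot_outliers <- x[x < lower | x > upper]
print(boxplot_outliers)
```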

Grubbs' test

\(\displaystyle z=\frac{|x-\bar x|}{s_x}\)

\(\tau = t^2_{\alpha/(2N),N-2}\)

\(\displaystyle z\geq \frac{N-1}{\sqrt N} \sqrt{\frac {\tau} {N-2+\tau}}\)

A case is an outlier if this inequality holds.

implemented in package outliers as grubbs.test()
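
The formulas above can be sketched in base R as follows. The helper name grubbs_check and the sample vector are hypothetical, and the sketch checks only the most extreme value at a significance level alpha; grubbs.test() from the outliers package remains the ready-made option.

```r
# Sketch of the test statistic and threshold given above
grubbs_check <- function(x, alpha = 0.05) {
  n <- length(x)
  # z score of every case
  z <- abs(x - mean(x)) / sd(x)
  # Squared critical value of the t distribution with n - 2 degrees of freedom
  tau <- qt(alpha / (2 * n), df = n - 2)^2
  threshold <- (n - 1) / sqrt(n) * sqrt(tau / (n - 2 + tau))
  # Only the most extreme case is tested
  suspect <- which.max(z)
  list(suspect = suspect,
       z = z[suspect],
       threshold = threshold,
       is_outlier = z[suspect] >= threshold)
}

# Hypothetical sample with one extreme value in position 10
x <- c(12, 14, 15, 15, 16, 17, 18, 19, 20, 95)
res <- grubbs_check(x)
print(res)
```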

Univariate Outlier Detection

For categorical variables there is no simple formula
We need expert knowledge to compare the distribution of values
- Then, we can label anomalies

Multi-Variate Outlier Detection

Types of detection
- Supervised
- Unsupervised
- Semi-supervised

Multi-Variate Outlier Detection

Unsupervised
- DBSCAN (covered last week)

dbscan.outliers <- function(data, ...) {
  require(fpc, quietly = TRUE)
  # fpc::dbscan assigns cluster id 0 to noise points
  cl <- fpc::dbscan(data, ...)
  # Noise points are treated as outliers
  posOuts <- which(cl$cluster == 0)
  list(positions = posOuts,
       outliers = data[posOuts, ],
       dbscanResults = cl)
}

Unsupervised

house.data

load("house.data")   # loads houseData from file
names(houseData)

## [1] "MustakilMi"   "OrijinalAlan" "BanyoSayisi"  "OdaSayisi"    "SalonSayisi" 
## [6] "ToplamKat"    "GercekYas"    "FiyatTL"

outs <- dbscan.outliers(houseData, 
                        eps = 3, 
                        scale=TRUE)
outs$positions

## [1]  24  65 100 174 190 271

houseData$outlier = 0
houseData$outlier[outs$positions] = 1

ggplot(houseData, aes(
  x = GercekYas, 
  y = FiyatTL, 
  color = as.factor(outlier))) + 
  geom_point() + 
  theme(legend.position="bottom")

Unsupervised

Another method is \(OR_h\) by Torgo, 2007.
- Uses the merge process of the agglomerative hierarchical clustering technique

houseData$outlier = NULL
outs <- outliers.ranking(scale(houseData))
outs$rank.outliers[1:10]

##  [1]   2  46  56 133 180 198 241 251  45 204

houseData$outlier <- 0
houseData$outlier[
  outs$rank.outliers[1:10]] <- 1

ggplot(houseData, aes(
  x = GercekYas, 
  y = FiyatTL, 
  color = as.factor(outlier))) + 
  geom_point() + 
  theme(legend.position="bottom")

Unsupervised

Another method is LOF by Breunig et al., 2000
It is implemented as lofactor in the book package (DMwR2)

houseData$outlier = NULL
out.scores <- lofactor(scale(houseData), 15)
top_outliers <- order(out.scores, decreasing = TRUE)[1:10]
top_outliers

##  [1] 243  24 100 132 266 174 190  65 248  57

houseData$outlier <- 0
houseData$outlier[top_outliers] <- 1

ggplot(houseData, aes(
  x = GercekYas, 
  y = FiyatTL, 
  color = as.factor(outlier))) + 
  geom_point() + 
  theme(legend.position="bottom")

Supervised

Training data with manually labeled outliers is required
Train a classification model with outliers being the target variable
Use the model for detecting outliers in new, unseen data

Major problem: Imbalance!

Outliers are rare by definition, so they will be outnumbered
This imbalance creates problems for learning algorithms
- If outliers are 2% of the set, labeling everything as normal has an accuracy of 98%!
- Models usually ignore outliers: they are designed to detect regularities, not irregularities

Supervised

To fix imbalance
- oversample outliers
- undersample normal cases
- if supported by the ML method, use biased cost matrices
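
The two resampling options can be sketched in base R. The data frame toy, its column names, and the 98/2 split are made up for illustration; this is plain random resampling and only one of several possible strategies.

```r
set.seed(1024)

# Hypothetical imbalanced data: 98 normal cases and 2 outliers
toy <- data.frame(
  x = rnorm(100),
  label = factor(c(rep("normal", 98), rep("outlier", 2)))
)
out_idx <- which(toy$label == "outlier")
norm_idx <- which(toy$label == "normal")

# Oversampling: draw outliers with replacement until they match the normal cases
over_idx <- c(norm_idx, sample(out_idx, length(norm_idx), replace = TRUE))
oversampled <- toy[over_idx, ]

# Undersampling: keep only as many normal cases as there are outliers
under_idx <- c(sample(norm_idx, length(out_idx)), out_idx)
undersampled <- toy[under_idx, ]

print(table(oversampled$label))
print(table(undersampled$label))
```

Oversampling keeps all information but repeats the same few outliers, while undersampling discards most of the normal cases.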

Predictive Analysis

Using the data at hand, build a model which can be used to predict the value of a response variable based on the values of input variables.

Almost all ML models are basically curve fitting algorithms
If you fit a curve to the existing data points, you can use this curve to compute unknown/unobserved points

Predictive Analysis

Mainly two types:

Classification: nominal target variable
Regression: numeric target variable

Ordinal target variables can be handled as either type, depending on the problem.

Predictive Analysis

Mostly, predictive analysis is curve fitting.

\(f(X_1, X_2, ..., X_k) \rightarrow Y\)

Overall approach:

First assume the shape of \(f\) (the type of model)
- linear, logical, probabilistic, complex, ensemble
Based on the data, optimize \(f\)
Evaluate results
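
The three steps above can be sketched with a linear model in base R. The data are synthetic (generated from a line with noise) and the variable names are hypothetical; the linear shape is an assumption made for the sake of the example.

```r
set.seed(1024)

# Synthetic data: y depends linearly on x, plus noise
toy <- data.frame(x = 1:20)
toy$y <- 3 + 2 * toy$x + rnorm(20, sd = 0.5)

# Hold out 5 cases for evaluation
train_idx <- sample(nrow(toy), 15)
train <- toy[train_idx, ]
test <- toy[-train_idx, ]

# Step 1: assume the shape of f (here: linear)
# Step 2: optimize f on the training data
fit <- lm(y ~ x, data = train)

# Step 3: evaluate on cases the model has not seen
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((pred - test$y)^2))

print(coef(fit))
print(rmse)
```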

Predictive Analysis

Why choose one model over another?

Understandability / Readability
Speed / Complexity
Accuracy / Success of prediction

Evaluation Metrics

Classification

Confusion matrix
- A matrix displaying the number of observations for each combination of predicted class and actual class (ground truth)
- The predictions are the columns and the actual values are the rows

	\(c_1\)	\(c_2\)	\(c_3\)
\(c_1\)	a	b	c
\(c_2\)	d	e	f
\(c_3\)	g	h	i

a: the actual value is \(c_1\) and the prediction is \(c_1\)
b: the actual value is \(c_1\) but the prediction is \(c_2\)
d: the actual value is \(c_2\) but the prediction is \(c_1\)
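
A confusion matrix with this layout can be built with table() in base R. The label vectors below are made up for illustration.

```r
classes <- c("c1", "c2", "c3")

# Hypothetical ground truth and predictions for 10 test cases
actual <- factor(c("c1", "c1", "c1", "c2", "c2", "c3", "c3", "c3", "c3", "c2"),
                 levels = classes)
predicted <- factor(c("c1", "c2", "c1", "c2", "c2", "c3", "c1", "c3", "c3", "c1"),
                    levels = classes)

# Rows are the actual values, columns are the predictions
cm <- table(actual = actual, predicted = predicted)
print(cm)
```

Fixing the factor levels keeps the matrix square even when a class is never predicted.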

Classification

Error rate (a.k.a. the 0/1 loss)

\[L_{0/1} = \frac{1}{N_{test}} \sum_{i=1}^{N_{test}}{I(\hat h(x_i) \neq y_i)}\]

where,

\(N_{test}\) is the number of test cases.
\(I(x)\) is an indicator function:
- x is false \(\rightarrow I(x) = 0\)
- x is true \(\rightarrow I(x) = 1\)
\(\hat h(x_i)\) is the prediction for \(x_i\)
\(y_i\) is the actual target value for observation i

Classification

Accuracy

\[Acc = 1-L_{0/1}\]

	\(c_1\)	\(c_2\)	\(c_3\)
\(c_1\)	a	b	c
\(c_2\)	d	e	f
\(c_3\)	g	h	i

\(\displaystyle Acc = \frac{a+e+i}{N_{test}}\)
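
As a quick check in base R, the error rate and the accuracy can be computed directly from the label vectors or from the diagonal of the confusion matrix. The labels are made up for illustration.

```r
classes <- c("c1", "c2", "c3")

# Hypothetical ground truth and predictions for 10 test cases
actual <- factor(c("c1", "c1", "c1", "c2", "c2", "c3", "c3", "c3", "c3", "c2"),
                 levels = classes)
predicted <- factor(c("c1", "c2", "c1", "c2", "c2", "c3", "c1", "c3", "c3", "c1"),
                    levels = classes)

# 0/1 loss: share of test cases where the prediction differs from the truth
error_rate <- mean(predicted != actual)
accuracy <- 1 - error_rate

# Same accuracy from the confusion matrix: diagonal over total
cm <- table(actual = actual, predicted = predicted)
accuracy_cm <- sum(diag(cm)) / sum(cm)

print(c(error_rate = error_rate, accuracy = accuracy, accuracy_cm = accuracy_cm))
```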

Classification

Cost/benefit matrix

	\(c_1\)	\(c_2\)	\(c_3\)
\(c_1\)	\(B_{1,1}\)	\(C_{1,2}\)	\(C_{1,3}\)
\(c_2\)	\(C_{2,1}\)	\(B_{2,2}\)	\(C_{2,3}\)
\(c_3\)	\(C_{3,1}\)	\(C_{3,2}\)	\(B_{3,3}\)

Provides flexible cost and benefit values for each type of prediction
- Especially useful in imbalanced datasets
- Also, fraud detection, outlier detection, etc.

Utility

Utility is computed as \[U = \sum^{n_c}_{i=1}{\sum^{n_c}_{k=1}{CM_{i,k}\times CB_{i,k}}}\]
CM: Confusion matrix
CB: Cost/benefit matrix

Classification

Standard CB matrix:

	outlier	normal
outlier	1	0
normal	0	1

An example CB matrix for outlier detection:

	outlier	normal
outlier	5	-5
normal	-1	0.1

Consider 100 cases: 98 normal and 2 outliers
- If we mark everything as normal
  - standard utility : 98
  - modified utility : 2 × (-5) + 98 × 0.1 = -10 + 9.8 = -0.2
You can normalize by the maximum utility possible (every case predicted correctly)
- standard utility : 98 / 100 = 0.98
- modified utility : -0.2 / 19.8 = -0.0101
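
The numbers above can be reproduced in base R by multiplying the confusion matrix and the cost/benefit matrix element by element. The helper name utility is hypothetical, and the maximum utility is taken from a perfect classifier, which assumes that a correct prediction is the best outcome for every class.

```r
classes <- c("outlier", "normal")
dn <- list(actual = classes, predicted = classes)

# Confusion matrix when all 100 cases are predicted as normal
cm <- matrix(c(0, 2,
               0, 98), nrow = 2, byrow = TRUE, dimnames = dn)

# Standard and modified cost/benefit matrices from the slides
cb_standard <- matrix(c(1, 0,
                        0, 1), nrow = 2, byrow = TRUE, dimnames = dn)
cb_modified <- matrix(c(5, -5,
                        -1, 0.1), nrow = 2, byrow = TRUE, dimnames = dn)

# Utility: sum over all cells of count times cost/benefit
utility <- function(cm, cb) sum(cm * cb)

u_standard <- utility(cm, cb_standard)
u_modified <- utility(cm, cb_modified)

# Perfect classifier: every case on the diagonal
cm_perfect <- diag(rowSums(cm))
norm_standard <- u_standard / utility(cm_perfect, cb_standard)
norm_modified <- u_modified / utility(cm_perfect, cb_modified)

print(c(u_standard, u_modified, norm_standard, norm_modified))
```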

Classification

When you have a binary classification

	T	F
T	TP	FN
F	FP	TN

Precision: ratio of correctly identified trues to all cases predicted as true.

Prec = \(\frac{TP}{TP+FP}\)

Recall: ratio of correctly identified trues to all actual trues.

Rec = \(\frac{TP}{TP+FN}\)

Classification

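You can aggregate precision and recall into one metric, the F-measure, where \(\beta\) sets the weight of recall relative to precision (\(\beta = 1\) gives their harmonic mean):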

\(F_\beta = \frac{(\beta^2+1)\times Prec \times Rec}{\beta^2\times Prec + Rec}\)
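
A small worked example in base R, assuming made-up counts for the four cells of the binary confusion matrix; f_measure is a hypothetical helper name.

```r
# Hypothetical counts from a binary confusion matrix
tp <- 8   # actual true, predicted true
fn <- 2   # actual true, predicted false
fp <- 4   # actual false, predicted true
tn <- 86  # actual false, predicted false

precision <- tp / (tp + fp)
recall <- tp / (tp + fn)

# F-measure for a given beta
f_measure <- function(prec, rec, beta = 1) {
  (beta^2 + 1) * prec * rec / (beta^2 * prec + rec)
}

f1 <- f_measure(precision, recall)
f2 <- f_measure(precision, recall, beta = 2)

print(c(precision = precision, recall = recall, f1 = f1, f2 = f2))
```

Here recall is higher than precision, so the recall-weighted \(F_2\) comes out higher than \(F_1\).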

Regression

For numeric target variables, one frequently used metric is the mean squared error:

\(\displaystyle MSE = \frac{1}{N_{test}}{\sum^{N_{test}}_{i=1}{(\hat y_i - y_i)^2}}\)

Or, to express the error in the same unit as the target variable, use root mean squared error:

\(\displaystyle RMSE = \sqrt{MSE}\)

Or, alternatively use mean absolute error:

\(\displaystyle MAE = \frac{1}{N_{test}}{\sum^{N_{test}}_{i=1}{|\hat y_i - y_i|}}\)
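
The three metrics take one line each in base R. The target and prediction vectors are made up for illustration.

```r
# Hypothetical actual values and predictions for 4 test cases
y <- c(3, 5, 7, 9)
y_hat <- c(2.5, 5, 8, 9.5)

mse <- mean((y_hat - y)^2)
rmse <- sqrt(mse)
mae <- mean(abs(y_hat - y))

print(c(mse = mse, rmse = rmse, mae = mae))
```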

Regression

You can use a baseline method to produce relative error metrics.
A baseline method is something naive, such as always predicting the mean \(y\) value, \(\bar y\)
Normalized mean squared error:

\(\displaystyle NMSE = \frac{\sum^{N_{test}}_{i=1}{(\hat y_i - y_i)^2}}{\sum^{N_{test}}_{i=1}{(\bar y - y_i)^2}}\)

We expect NMSE to be close to 0. A value of 1 means the model performs only as well as the baseline, and a value above 1 means it performs worse.
Also, normalized mean absolute error:

\(\displaystyle NMAE = \frac{\sum^{N_{test}}_{i=1}{|\hat y_i - y_i|}}{\sum^{N_{test}}_{i=1}{|\bar y - y_i|}}\)
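
A short sketch of both normalized metrics in base R, with made-up values. As an assumption for this example, the baseline is the mean of the same toy targets; taking the mean from the training targets is another common choice.

```r
# Hypothetical actual values and predictions for 4 test cases
y <- c(3, 5, 7, 9)
y_hat <- c(2.5, 5, 8, 9.5)

# Naive baseline: always predict the mean (assumption: mean of these targets)
baseline <- mean(y)

nmse <- sum((y_hat - y)^2) / sum((baseline - y)^2)
nmae <- sum(abs(y_hat - y)) / sum(abs(baseline - y))

print(c(nmse = nmse, nmae = nmae))
```

Both values are well below 1, so these predictions beat the naive baseline.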

Implementations

There are many implementations of these metrics
- function mmetric in package rminer (Cortez, 2015)
- functions classificationMetrics and regressionMetrics in package performanceEstimation (Torgo, 2014a)
- function performance in package ROCR (Sing et al., 2009)
- function performance in package mlr (Bischl et al., 2016)
And, you can always compute them on the fly.

In-class At-home Activity

Load house.data into R
Apply clustering to the data
- How many clusters seem to be optimal?
Apply anomaly detection to the data
- Do you catch a few or many anomalies?