27/04/2020

set.seed(1024)

library(fpc)
library(dplyr)
library(ggplot2)
library(DMwR2)

- Anomaly detection has clear ties with clustering
- Clustering: find and group similar items
- Anomaly detection: find items that do not belong to any group

- Types of outliers
  - Point outliers: a single point out of the normal range
  - Contextual outliers: a point that is abnormal only in a specific context
    - It is normal to have a heart rate of 80 bpm
    - …unless you are dead.
  - Collective outliers: a group of points that is anomalous together, even though each point alone may be normal
    - Multiple failed login attempts

- The boxplot rule

Values outside \([Q_1-1.5\times IQR,\ Q_3+1.5\times IQR]\) are flagged as outliers, where \(IQR = Q_3 - Q_1\).
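As a quick illustration, a minimal sketch of the boxplot rule in base R (the vector `x` is made up for the example):

x <- c(rnorm(100), 8)            # made-up sample with one planted outlier
q <- quantile(x, c(0.25, 0.75))  # Q1 and Q3
iqr <- q[2] - q[1]               # interquartile range
which(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)  # positions of flagged values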

- Grubbs’ test

\(\displaystyle z=\frac{|x-\bar x|}{s_x}\)

\(\tau = t^2_{\alpha/(2N),N-2}\)

\(\displaystyle z\geq \frac{N-1}{\sqrt N} \sqrt{\frac {\tau} {N-2+\tau}}\)

A case is flagged as an outlier if this inequality holds.

- Implemented in package `outliers` as `grubbs.test()`
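A minimal usage sketch, reusing the made-up vector `x` from the boxplot example above:

library(outliers)
grubbs.test(x)  # tests whether the most extreme value of x is an outlier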

- For categorical variables there is no simple formula
- We need expert knowledge to compare the distribution of values
- Then, we can label anomalies
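One simple, hypothetical sketch of this idea: compute the relative frequency of each category and hand the rare ones to an expert for review (the variable `x` and the 1% threshold are made up for the example):

x <- factor(c(rep("red", 600), rep("blue", 399), "vermilion"))
freqs <- table(x) / length(x)  # relative frequency of each category
names(freqs)[freqs < 0.01]     # candidate anomalies: "vermilion"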

- Types of detection
- Supervised
- Unsupervised
- Semi-supervised

- Unsupervised
  - DBSCAN (covered last week)

dbscan.outliers <- function(data, ...) {
  require(fpc, quietly = TRUE)
  cl <- dbscan(data, ...)
  posOuts <- which(cl$cluster == 0)  # points in cluster 0 are noise
  list(positions = posOuts,
       outliers = data[posOuts, ],
       dbscanResults = cl)
}

load("house.data") # loads houseData from file names(houseData)

## [1] "MustakilMi" "OrijinalAlan" "BanyoSayisi" "OdaSayisi" "SalonSayisi" ## [6] "ToplamKat" "GercekYas" "FiyatTL"

outs <- dbscan.outliers(houseData, eps = 3, scale = TRUE)
outs$positions

## [1] 24 65 100 174 190 271

houseData$outlier <- 0
houseData$outlier[outs$positions] <- 1

ggplot(houseData,
       aes(x = GercekYas, y = FiyatTL, color = as.factor(outlier))) +
  geom_point() +
  theme(legend.position = "bottom")

- Another method is \(OR_h\) (Torgo, 2007)
- It uses the merging process of the agglomerative hierarchical clustering technique

houseData$outlier <- NULL
outs <- outliers.ranking(scale(houseData))
outs$rank.outliers[1:10]

## [1] 2 46 56 133 180 198 241 251 45 204

houseData$outlier <- 0
houseData$outlier[outs$rank.outliers[1:10]] <- 1

- Another method is LOF (Breunig et al., 2000)
- It is implemented as `lofactor()` in the book package `DMwR2`

houseData$outlier <- NULL
out.scores <- lofactor(scale(houseData), 15)
top_outliers <- order(out.scores, decreasing = TRUE)[1:10]
top_outliers

## [1] 243 24 100 132 266 174 190 65 248 57

houseData$outlier <- 0
houseData$outlier[top_outliers] <- 1

- Training data with manually labeled outliers is required
- Train a classification model with the outlier label as the target variable
- Use the model for detecting outliers in *new* data

Major problem: **Imbalance**!

- Outliers are **outliers**, so they will be **outnumbered**
- This imbalance creates problems for learning algorithms
- If outliers are 2% of the set, labeling everything as normal has an accuracy of 98%!
- Models usually ignore outliers: they are designed to detect regularities, not irregularities

- To fix imbalance
  - oversample outliers (see the sketch after this list)
  - undersample regulars
  - if supported by the ML method, use biased cost matrices
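A minimal sketch of random oversampling in base R (the data frame `d` and its `outlier` column are made up for the example):

d <- data.frame(x = rnorm(100), outlier = rep(c(1, 0), c(2, 98)))
outs <- which(d$outlier == 1)
# resample the minority class with replacement until the classes are balanced
extra <- sample(outs, sum(d$outlier == 0) - length(outs), replace = TRUE)
d.balanced <- rbind(d, d[extra, ])
table(d.balanced$outlier)  # 98 of each class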

Using the data at hand, build a model which can be used to predict the value of a response variable based on the values of input variables.

- Almost all ML models are basically curve fitting algorithms
- If you fit a curve to the existing data points, you can use this curve to compute unknown/unobserved points

Mainly two types:

- Classification: nominal target variable
- Regression: numeric target variable

Ordinals may go into one of these categories.

Mostly, predictive analysis is **curve fitting**.

\(f(X_1, X_2, ..., X_k) \rightarrow Y\)

Overall approach:

- First assume the shape of \(f\) (the type of model)
- linear, logical, probabilistic, complex, ensemble

- Based on the data, optimize \(f\)
- Evaluate results
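As a tiny illustration of this workflow, assuming a linear shape for \(f\) (data made up for the example):

d <- data.frame(x = 1:20)
d$y <- 2 * d$x + rnorm(20)                     # points around a linear trend
fit <- lm(y ~ x, data = d)                     # optimize f under the linear assumption
predict(fit, newdata = data.frame(x = 21:23))  # use the fitted curve on unseen inputs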

Why choose one model over another?

- **Understandability** / Readability
- **Speed** / Complexity
- **Accuracy** / Success of prediction

**Confusion matrix**

- A matrix displaying frequencies of observations for each combination of predictions and *ground truth*
- The predictions are the columns and the actual values are the rows

|         | \(c_1\) | \(c_2\) | \(c_3\) |
|---------|---------|---------|---------|
| \(c_1\) | a | b | c |
| \(c_2\) | d | e | f |
| \(c_3\) | g | h | i |

- a: the actual value is \(c_1\) and the prediction is \(c_1\)
- b: the actual value is \(c_1\) but the prediction is \(c_2\)
- d: the actual value is \(c_2\) but the prediction is \(c_1\)

**Error rate** (aka the 0/1 loss)

\[L_{0/1} = \frac{1}{N_{test}} \sum_{i=1}^{N_{test}}{I(\hat h(x_i) \neq y_i)}\]

where,

- \(N_{test}\) is the number of test cases
- \(I(x)\) is an indicator function:
  - \(x\) is false \(\rightarrow I(x) = 0\)
  - \(x\) is true \(\rightarrow I(x) = 1\)
- \(\hat h(x_i)\) is the prediction for \(x_i\)
- \(y_i\) is the actual target value for observation \(i\)

**Accuracy**

\[Acc = 1-L_{0/1}\]

|         | \(c_1\) | \(c_2\) | \(c_3\) |
|---------|---------|---------|---------|
| \(c_1\) | a |   |   |
| \(c_2\) |   | e |   |
| \(c_3\) |   |   | i |

\(\displaystyle Acc = \frac{a+e+i}{N_{test}}\)
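Computed on the fly in R (prediction and truth vectors made up for the example):

y     <- c("c1", "c2", "c1", "c3", "c2")    # ground truth
y.hat <- c("c1", "c2", "c3", "c3", "c2")    # predictions
cm <- table(actual = y, predicted = y.hat)  # confusion matrix, actuals in rows
sum(diag(cm)) / sum(cm)                     # accuracy: 0.8
mean(y.hat != y)                            # 0/1 loss: 0.2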

**Cost/benefit matrix**

|         | \(c_1\) | \(c_2\) | \(c_3\) |
|---------|---------|---------|---------|
| \(c_1\) | \(B_{1,1}\) | \(C_{1,2}\) | \(C_{1,3}\) |
| \(c_2\) | \(C_{2,1}\) | \(B_{2,2}\) | \(C_{2,3}\) |
| \(c_3\) | \(C_{3,1}\) | \(C_{3,2}\) | \(B_{3,3}\) |

- Provides **flexible cost and benefit** values for each type of prediction
- Especially useful in **imbalanced** datasets
- Also in **fraud detection**, **outlier detection**, etc.

Utility is computed as \[U = \sum^{n_c}_{i=1}{\sum^{n_c}_{k=1}{CM_{i,k}\times CB_{i,k}}}\]

CM: Confusion matrix

CB: Cost/benefit matrix

Standard CB matrix:

|         | outlier | normal |
|---------|---------|--------|
| outlier | 1 | 0 |
| normal  | 0 | 1 |

An example CB matrix for outlier detection:

|         | outlier | normal |
|---------|---------|--------|
| outlier | 5 | -5 |
| normal  | -1 | 0.1 |

- Consider 98% regular, 2% outlier (out of 100 cases)
- If we mark everything as normal
  - standard utility: 98
  - modified utility: -10 + 9.8 = -0.2
- You can normalize by the maximum possible utility
  - standard utility: 98 / 100 = 0.98
  - modified utility: -0.2 / 19.8 = -0.0101
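A quick check of these numbers in R, using the confusion matrix for "everything predicted as normal" and the example CB matrix above (rows are actual, columns are predicted):

CM <- matrix(c(0, 2,
               0, 98), nrow = 2, byrow = TRUE)
CB <- matrix(c(5, -5,
               -1, 0.1), nrow = 2, byrow = TRUE)
sum(CM * CB)                      # modified utility: -10 + 9.8 = -0.2
best <- sum(diag(c(2, 98)) * CB)  # maximum possible utility: 19.8
sum(CM * CB) / best               # normalized: about -0.0101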

- When you have a binary classification

|   | T | F |
|---|---|---|
| T | TP | FN |
| F | FP | TN |

- Precision: ratio of correctly identified trues to all cases predicted as true.

Prec = \(\frac{TP}{TP+FP}\)

- Recall: ratio of correctly identified trues to all actual trues.

Rec = \(\frac{TP}{TP+FN}\)

- You can aggregate precision and recall into one metric, the F-measure:

\(F_\beta = \frac{(\beta^2+1)\times Prec \times Rec}{\beta^2\times Prec + Rec}\)

- \(\beta = 1\) gives the widely used \(F_1\) score, the harmonic mean of precision and recall.
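For instance, with made-up counts:

TP <- 40; FP <- 10; FN <- 20
prec <- TP / (TP + FP)                       # 0.8
rec  <- TP / (TP + FN)                       # about 0.667
(1^2 + 1) * prec * rec / (1^2 * prec + rec)  # F1, about 0.727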

- For numeric target variables, one frequently used metric is the
*mean squared error*:

\(\displaystyle MSE = \frac{1}{N_{test}}{\sum^{N_{test}}_{i=1}{(\hat y_i - y_i)^2}}\)

- Or, to keep the error in the same units as the target variable, use
*root mean squared error*:

\(\displaystyle RMSE = \sqrt{MSE}\)

- Or, alternatively use
*mean absolute error*:

\(\displaystyle MAE = \frac{1}{N_{test}}{\sum^{N_{test}}_{i=1}{|\hat y_i - y_i|}}\)
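In R, with made-up prediction and truth vectors:

y     <- c(3.0, 5.0, 2.5, 7.0)  # actual values
y.hat <- c(2.8, 5.5, 2.0, 6.5)  # predictions
mse  <- mean((y.hat - y)^2)     # 0.1975
rmse <- sqrt(mse)               # about 0.44
mae  <- mean(abs(y.hat - y))    # 0.425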

- You can use a baseline method to produce relative error metrics.
- A baseline method is something naive, such as always predicting the mean \(y\) value.

*Normalized mean squared error*:

\(\displaystyle NMSE = \frac{\sum^{N_{test}}_{i=1}{(\hat y_i - y_i)^2}}{\sum^{N_{test}}_{i=1}{(\bar y - y_i)^2}}\)

- We expect NMSE to be close to 0. A value of 1 means performance as bad as the baseline.
- Also, *normalized mean absolute error*:

\(\displaystyle NMAE = \frac{\sum^{N_{test}}_{i=1}{|\hat y_i - y_i|}}{\sum^{N_{test}}_{i=1}{|\bar y - y_i|}}\)
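And the normalized variants, continuing with `y` and `y.hat` from the example above:

sum((y.hat - y)^2) / sum((mean(y) - y)^2)    # NMSE, well below 1
sum(abs(y.hat - y)) / sum(abs(mean(y) - y))  # NMAE, well below 1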

- There are many implementations of these metrics
  - function `mmetric` in package `rminer` (Cortez, 2015)
  - functions `classificationMetrics` and `regressionMetrics` in package `performanceEstimation` (Torgo, 2014a)
  - function `performance` in package `ROCR` (Sing et al., 2009)
  - function `performance` in package `mlr` (Bischl et al., 2016)
- And, you can always compute them on the fly.

- Load house.data into R
- Apply clustering to the data
  - How many clusters seem to be optimal?
- Apply anomaly detection to the data
  - Do you catch a few or many anomalies?