27/04/2020
set.seed(1024)
library(fpc)
library(dplyr)
library(ggplot2)
library(DMwR2)
Boxplot rule: a value is an outlier candidate if it falls outside \([Q_1-1.5\times IQR, Q_3+1.5\times IQR]\)
Standardized distance from the mean (z-score): \(\displaystyle z=\frac{|x-\bar x|}{s_x}\)
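A minimal base-R sketch of these two univariate rules, assuming a numeric vector `x` (the column choice and the z-score cut-off of 3 are illustrative assumptions, not from the original; `houseData` is the data frame loaded further below):

```r
# Flag univariate outlier candidates in a numeric vector x
x <- houseData$FiyatTL                          # illustrative: any numeric column

# Boxplot (IQR) rule
q   <- quantile(x, c(0.25, 0.75))
iqr <- q[2] - q[1]
box_out <- x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr

# z-score rule (flagging |z| > 3 is a common, but arbitrary, choice)
z <- abs(x - mean(x)) / sd(x)
z_out <- z > 3

which(box_out)
which(z_out)
```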
Grubbs test: let \(\tau = t^2_{\alpha/(2N),N-2}\). A case is an outlier if

\(\displaystyle z\geq \frac{N-1}{\sqrt N} \sqrt{\frac {\tau} {N-2+\tau}}\)
Available in the R package outliers as grubbs.test().
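A small usage sketch with the outliers package (the choice of column and the default single-outlier test are illustrative assumptions):

```r
# Grubbs test for a single extreme value in a numeric vector
library(outliers)
x <- houseData$FiyatTL        # illustrative choice of column
gt <- grubbs.test(x)          # default: test the value furthest from the mean
gt                            # a small p-value suggests that value is an outlier
```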
dbscan.outliers <- function(data, ...) {
  require(fpc, quietly = TRUE)
  cl <- dbscan(data, ...)                  # density-based clustering
  posOuts <- which(cl$cluster == 0)        # cluster 0 holds the unassigned (noise) points
  list(positions = posOuts,
       outliers = data[posOuts, ],
       dbscanResults = cl)
}
load("house.data")   # loads houseData from file
names(houseData)
## [1] "MustakilMi"   "OrijinalAlan" "BanyoSayisi"  "OdaSayisi"    "SalonSayisi"
## [6] "ToplamKat"    "GercekYas"    "FiyatTL"
outs <- dbscan.outliers(houseData, eps = 3, scale = TRUE)
outs$positions
## [1] 24 65 100 174 190 271
houseData$outlier <- 0
houseData$outlier[outs$positions] <- 1
ggplot(houseData,
       aes(x = GercekYas, y = FiyatTL, color = as.factor(outlier))) +
  geom_point() +
  theme(legend.position = "bottom")
houseData$outlier <- NULL
outs <- outliers.ranking(scale(houseData))
outs$rank.outliers[1:10]
## [1] 2 46 56 133 180 198 241 251 45 204
houseData$outlier <- 0
houseData$outlier[outs$rank.outliers[1:10]] <- 1
ggplot(houseData,
       aes(x = GercekYas, y = FiyatTL, color = as.factor(outlier))) +
  geom_point() +
  theme(legend.position = "bottom")
lofactor() in the book package (DMwR2):

houseData$outlier <- NULL
out.scores <- lofactor(scale(houseData), 15)                 # LOF scores with k = 15 neighbours
top_outliers <- order(out.scores, decreasing = TRUE)[1:10]
top_outliers
## [1] 243 24 100 132 266 174 190 65 248 57
houseData$outlier <- 0
houseData$outlier[top_outliers] <- 1
ggplot(houseData,
       aes(x = GercekYas, y = FiyatTL, color = as.factor(outlier))) +
  geom_point() +
  theme(legend.position = "bottom")
Major problem: imbalance!
Using the data at hand, build a model which can be used to predict the value of a response variable based on the values of input variables.
Mainly two types:

- Classification: the target variable is nominal (a set of class labels).
- Regression: the target variable is numeric.

Ordinal targets may go into either of these categories.
Mostly, predictive analysis is curve fitting.
\(f(X_1, X_2, ..., X_k) \rightarrow Y\)
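As a minimal illustration of learning such a mapping (the choice of a linear model and of FiyatTL as the response is an assumption made for the example, not part of the original):

```r
# One possible f(X1, ..., Xk) -> Y: a linear model for house prices
houseData$outlier <- NULL                  # drop the helper column added above
fit <- lm(FiyatTL ~ ., data = houseData)   # FiyatTL modelled from all other columns
head(predict(fit))                         # fitted values of Y for the first cases
```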
Overall approach: fit candidate models to the available data and estimate how well each one generalises to unseen (test) cases.
Why choose one model over another?
|          | \(c_1\) | \(c_2\) | \(c_3\) |
|----------|---------|---------|---------|
| \(c_1\)  | a       | b       | c       |
| \(c_2\)  | d       | e       | f       |
| \(c_3\)  | g       | h       | i       |
\[L_{0/1} = \frac{1}{N_{test}} \sum_{i=1}^{N_{test}}{I(\hat h(x_i) \neq y_i)}\]
where \(I(\cdot)\) is the indicator function: 1 when its argument is true and 0 otherwise.
\[Acc = 1-L_{0/1}\]
|          | \(c_1\) | \(c_2\) | \(c_3\) |
|----------|---------|---------|---------|
| \(c_1\)  | a       |         |         |
| \(c_2\)  |         | e       |         |
| \(c_3\)  |         |         | i       |
\(\displaystyle Acc = \frac{a+e+i}{N_{test}}\)
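A base-R sketch of these quantities, assuming vectors `trues` and `preds` of test-set class labels (the names and toy values are illustrative):

```r
# Confusion matrix, accuracy and 0/1 loss with table()
lv    <- c("c1", "c2", "c3")
trues <- factor(c("c1", "c2", "c3", "c1", "c2"), levels = lv)   # toy true labels
preds <- factor(c("c1", "c3", "c3", "c1", "c2"), levels = lv)   # toy predictions
cm  <- table(trues, preds)          # rows: true class, columns: predicted class
acc <- sum(diag(cm)) / sum(cm)      # (a + e + i) / N_test
c(accuracy = acc, error = 1 - acc)  # error = L_0/1
```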
|          | \(c_1\)     | \(c_2\)     | \(c_3\)     |
|----------|-------------|-------------|-------------|
| \(c_1\)  | \(B_{1,1}\) | \(C_{1,2}\) | \(C_{1,3}\) |
| \(c_2\)  | \(C_{2,1}\) | \(B_{2,2}\) | \(C_{2,3}\) |
| \(c_3\)  | \(C_{3,1}\) | \(C_{3,2}\) | \(B_{3,3}\) |
Utility is computed as \[U = \sum^{n_c}_{i=1}{\sum^{n_c}_{k=1}{CM_{i,k}\times CB_{i,k}}}\]
CM: Confusion matrix
CB: Cost/benefit matrix
Standard CB matrix:
|         | outlier | normal |
|---------|---------|--------|
| outlier | 1       | 0      |
| normal  | 0       | 1      |
An example CB matrix for outlier detection:
|         | outlier | normal |
|---------|---------|--------|
| outlier | 5       | -5     |
| normal  | -1      | 0.1    |
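A short sketch of the utility computation with this example CB matrix (the confusion-matrix counts below are made up for illustration):

```r
# Utility U = sum over cells of CM * CB
CB <- matrix(c( 5, -5,
               -1, 0.1),
             nrow = 2, byrow = TRUE,
             dimnames = list(true = c("outlier", "normal"),
                             pred = c("outlier", "normal")))
CM <- matrix(c(10,   4,     # toy counts: 10 outliers caught, 4 missed,
                7, 179),    #             7 false alarms, 179 correct normals
             nrow = 2, byrow = TRUE, dimnames = dimnames(CB))
sum(CM * CB)                # element-wise product, then sum
```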
|   | T  | F  |
|---|----|----|
| T | TP | FN |
| F | FP | TN |
Prec = \(\frac{TP}{TP+FP}\)
Rec = \(\frac{TP}{TP+FN}\)
\(F_\beta = \frac{(\beta^2+1)\times Prec \times Rec}{\beta^2\times Prec + Rec}\)
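A base-R sketch of these three measures from confusion-matrix counts (the toy counts are illustrative):

```r
# Precision, recall and the F measure from TP / FP / FN counts
TP <- 10; FP <- 7; FN <- 4            # toy counts in the layout above
prec <- TP / (TP + FP)
rec  <- TP / (TP + FN)
beta <- 1                             # beta = 1 weighs precision and recall equally (F1)
f    <- (beta^2 + 1) * prec * rec / (beta^2 * prec + rec)
c(precision = prec, recall = rec, F1 = f)
```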
\(\displaystyle MSE = \frac{1}{N_{test}}{\sum^{N_{test}}_{i=1}{(\hat y_i - y_i)^2}}\)
\(\displaystyle RMSE = \sqrt{MSE}\)
\(\displaystyle MAE = \frac{1}{N_{test}}{\sum^{N_{test}}_{i=1}{|\hat y_i - y_i|}}\)
\(\displaystyle NMSE = \frac{\sum^{N_{test}}_{i=1}{(\hat y_i - y_i)^2}}{\sum^{N_{test}}_{i=1}{(\bar y - y_i)^2}}\)

\(\displaystyle NMAE = \frac{\sum^{N_{test}}_{i=1}{|\hat y_i - y_i|}}{\sum^{N_{test}}_{i=1}{|\bar y - y_i|}}\)
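A base-R sketch of these error measures, assuming numeric vectors `trues` and `preds` for the test set (names and values are illustrative):

```r
# MSE, RMSE, MAE, NMSE and NMAE for a toy regression test set
trues <- c(200, 150, 320, 270, 180)     # toy observed values
preds <- c(210, 140, 300, 300, 170)     # toy predictions
mse  <- mean((preds - trues)^2)
rmse <- sqrt(mse)
mae  <- mean(abs(preds - trues))
nmse <- sum((preds - trues)^2) / sum((mean(trues) - trues)^2)
nmae <- sum(abs(preds - trues)) / sum(abs(mean(trues) - trues))
c(MSE = mse, RMSE = rmse, MAE = mae, NMSE = nmse, NMAE = nmae)
```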
- mmetric in package rminer (Cortez, 2015)
- classificationMetrics and regressionMetrics in package performanceEstimation (Torgo, 2014a)
- performance in package ROCR (Sing et al., 2009)
- performance in package mlr (Bischl et al., 2016)
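As one usage sketch, ROCR builds an evaluation object from predicted scores and true labels (the score and label vectors below are illustrative):

```r
# Precision/recall trade-off across thresholds with ROCR
library(ROCR)
scores <- c(0.9, 0.8, 0.6, 0.4, 0.3, 0.1)   # toy predicted scores for the positive class
labels <- c(1, 1, 0, 1, 0, 0)               # toy true labels
pred <- prediction(scores, labels)
perf <- performance(pred, measure = "prec", x.measure = "rec")
plot(perf)                                  # precision vs. recall
```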