
## Seed used in these slides

set.seed(1024)

## Libraries used in these slides

library(adabag)
library(mlbench)
library(randomForest)

## Model Ensembles

• Philosophy:
• No matter how wise, one person cannot know everything!
• Ensembles of less wise people produce better wisdom.
• Why?

## Model Ensembles

One superwise person’s decisions vs. one hundred barely wise persons’ decisions

SuperWise <- 0.9    # probability that the superwise person decides correctly
BarelyWise <- 0.6   # probability that a barely wise person decides correctly
# 100 decisions taken by the single superwise person
x <- rbinom(100, 1, SuperWise)
# 100 decisions, each taken by a majority vote of 100 barely wise persons
y <- rbinom(100, 100, BarelyWise) / 100
y <- ifelse(y > 0.5, 1, 0)
table(x)
## x
##  0  1
## 10 90
table(y)
## y
##  0  1
##  4 96
cat(sum(x), sum(y))
## 90 96
• Majority voting amplifies weak competence: 100 independent voters who are each right only 60% of the time produce a majority decision that is right about 96% of the time here, beating the single 90% expert

## Bootstrap Aggregating - Bagging

Idea:

• Sample the original dataset uniformly and with replacement to obtain k training sets
• These samples are called bootstraps (see the sketch after this list)
• Train a model on each bootstrap
• Usually decision trees are used
• Take the average for regression, or a majority vote for classification
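
• For intuition, a minimal sketch of drawing one bootstrap (the data and names here are illustrative):
n <- 150
d <- data.frame(x = rnorm(n), y = rnorm(n))
idx <- sample(n, replace = TRUE)   # n row indices, drawn with replacement
boot <- d[idx, ]                   # one bootstrap training set
mean(1:n %in% idx)                 # ~63.2% of the original rows appear, on average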

## Bootstrap Aggregating - Bagging

• implemented in package adabag
# BreastCancer data from mlbench package
data(BreastCancer, package = "mlbench")
# use only the complete cases and remove the ID column
bc <- BreastCancer[complete.cases(BreastCancer), -1]
# Obtain a 70-30 split for training and testing
rndSample <- sample(1:nrow(bc), nrow(bc) * 0.70)
tr <- bc[rndSample, ]
ts <- bc[-rndSample, ]
# Build the model (mfinal = number of trees)
m <- bagging(Class ~ ., tr, mfinal = 20,
             control = rpart.control(maxdepth = 1))
ps <- predict(m, ts)
names(ps)
## [1] "formula"   "votes"     "prob"      "class"     "confusion" "error"
ps$confusion
##                Observed Class
## Predicted Class benign malignant
##       benign       124         7
##       malignant     14        60

## Bagging

• Why use trees with maxdepth = 1?
# Build the model (mfinal = number of trees)
m <- bagging(Class ~ ., tr, mfinal = 20,
             control = rpart.control(maxdepth = 3))
ps <- predict(m, ts)
names(ps)
## [1] "formula"   "votes"     "prob"      "class"     "confusion" "error"
ps$confusion
##                Observed Class
## Predicted Class benign malignant
##       benign       131         7
##       malignant      7        60
• Shallow trees (stumps) are much faster to build
• Complex trees would be too similar to each other, reducing the diversity bagging relies on (a hand-rolled sketch follows below)
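
• To make the mechanics concrete, a hand-rolled sketch of bagging with rpart stumps and a majority vote (illustration only; adabag::bagging does this with proper vote bookkeeping):
library(rpart)
k <- 20
preds <- sapply(1:k, function(i) {
  boot <- tr[sample(nrow(tr), replace = TRUE), ]   # bootstrap sample
  t <- rpart(Class ~ ., boot, control = rpart.control(maxdepth = 1))
  as.character(predict(t, ts, type = "class"))
})
# majority vote across the k stumps, one row per test case
vote <- apply(preds, 1, function(p) names(which.max(table(p))))
table(vote, ts$Class)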

## Random Forest

• Improved version of bagging
• Each tree is grown with a subset of the variables
• Actually, the subset is randomly re-selected at each split
• This yields a very diverse set of trees
• Each tree is also built very quickly
• Each split considers only a few variables (illustrated below)
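
• As a tiny illustration of the per-split subsetting (using the bc data defined earlier; the draw is just for show):
# at each split, only a random subset of mtry predictors is considered
predictors <- setdiff(names(bc), "Class")
sample(predictors, 3)   # candidate variables for one split when mtry = 3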

## Random Forest

• implemented in package randomForest
m <- randomForest(Class ~ ., tr, ntree = 100, mtry = 3)
ps <- predict(m, ts)
(cm <- table(ps, ts$Class))
##
## ps          benign malignant
##   benign       132         0
##   malignant      6        67
• parameter mtry controls the size of the feature subset
• if not provided, it is set automatically
• for classification: the square root of the number of variables
• for regression: one third of the number of variables
• Much faster than bagging
• Due to the simpler split decisions

## Random Forest

• How many trees are optimal?
error <- numeric()
nmodels <- 20
for (i in 1:nmodels) {
  m <- randomForest(Class ~ ., tr, ntree = i, mtry = 3)
  ps <- predict(m, ts)
  cm <- table(ps, ts$Class)
  error[i] <- (cm[1, 2] + cm[2, 1]) / nrow(ts)
}
par(mar=c(2,4,1,2))
plot(1:nmodels, error, type = "l")
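
• Note: randomForest already records the out-of-bag error after each tree is added, so a similar curve can be obtained from a single fit; a sketch using the fitted object's err.rate component:
m <- randomForest(Class ~ ., tr, ntree = 100, mtry = 3)
head(m$err.rate)   # OOB error (plus per-class errors), one row per tree
plot(m)            # plots these error curves against the number of trees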

## Dependent vs Independent Ensembles

• Both bagging and random forests are independent ensembles
• The individual models are built completely independently and are unaware of each other
• There are also dependent (coordinated) ensembles
• Where each member depends on the others
• Each new model tries to improve on the previous ones
• Boosting is a famous example

## Boosting

• Can many weak learners improve each other to form a strong learner?
• At each iteration we add a new model to the ensemble
• The new model is trained to focus on the observations that the previous models found hard to predict
• This is done by assigning (and iteratively updating) weights on the observations, as sketched below
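
• Concretely, a sketch of the classic two-class AdaBoost update (coefficient conventions vary; adabag's coeflearn argument selects among them): if the $k$-th weak model has weighted training error $\epsilon_k$, it receives the weight

$w_k = \ln\frac{1 - \epsilon_k}{\epsilon_k}$

and the weights of the observations it misclassified are multiplied by $e^{w_k}$ (then renormalised), so the next model concentrates on them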

## AdaBoost

• Most well-known boosting algorithm
• An additive system of models

$H(x_i) = \sum_k w_k h_k(x_i)$

• implemented in adabag as boosting()

m <- boosting(Class ~ ., tr, mfinal = 20)
ps <- predict(m, ts)
ps$confusion
##                Observed Class
## Predicted Class benign malignant
##       benign       131         0
##       malignant      7        67
• add the parameter coeflearn = "Zhu" to run the SAMME algorithm

## AdaBoost

• How many trees are optimal?
error <- numeric()
nmodels <- 20
for (i in 1:nmodels) {
  m <- boosting(Class ~ ., tr, mfinal = i)
  ps <- predict(m, ts)
  error[i] <- ps$error
}
plot(1:nmodels, error, type = "l")
• Boosting is also implemented in package gbm (as gradient boosting)
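
• A minimal, hedged sketch of the same task with gbm (parameter values are illustrative, not tuned):
library(gbm)
# gbm's "bernoulli" loss expects a 0/1 numeric response, not a factor
tr2 <- transform(tr, Class = as.integer(Class == "malignant"))
m <- gbm(Class ~ ., data = tr2, distribution = "bernoulli",
         n.trees = 100, interaction.depth = 3, shrinkage = 0.1)
p <- predict(m, ts, n.trees = 100, type = "response")   # P(malignant)
table(ifelse(p > 0.5, "malignant", "benign"), ts$Class)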