18/05/2020

Seed used in these slides

set.seed(1024)

Libraries used in these slides

library(adabag)
library(mlbench)
library(randomForest)

Model Ensembles

Model Ensembles

  • Philosophy:
    • No matter how wise, one person cannot know everything!
    • An ensemble of less wise people can collectively produce better decisions.
  • Why?

Model Ensembles

One superwise person’s decisions vs. one hundred barely wise persons’ decisions

SuperWise <- 0.9     # probability that the single superwise person decides correctly
BarelyWise <- 0.6    # probability that each barely wise person decides correctly
# 100 decisions taken by the superwise person alone
x <- rbinom(100, 1, SuperWise)
# 100 decisions, each taken by majority vote of 100 barely wise persons
y <- rbinom(100, 100, BarelyWise) / 100
y <- ifelse(y > 0.5, 1, 0)
table(x)
## x
##  0  1 
## 10 90
table(y)
## y
##  0  1 
##  4 96
cat(sum(x), sum(y))  # correct decisions out of 100: superwise person vs. the crowd
## 90 96

Bootstrap Aggregating - Bagging

Idea:

  • Sample the original dataset uniformly and with replacement to obtain k training sets
    • These samples are called bootstraps
  • Train a model on each bootstrap
    • Decision trees are the usual choice of base model
  • Average the predictions for regression, or take a majority vote for classification (see the sketch below)
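
A minimal from-scratch sketch of the idea, assuming rpart trees as base models and the cleaned BreastCancer data used later in these slides (the adabag package shown next does all of this for you):

library(rpart)
data(BreastCancer, package = "mlbench")
bc <- BreastCancer[complete.cases(BreastCancer), -1]
idx <- sample(1:nrow(bc), nrow(bc) * 0.7)
tr <- bc[idx, ]
ts <- bc[-idx, ]
k <- 20
models <- lapply(1:k, function(i) {
  # bootstrap: resample the training set uniformly with replacement
  boot <- tr[sample(1:nrow(tr), nrow(tr), replace = TRUE), ]
  rpart(Class ~ ., boot)
})
# each tree votes on the test set; the majority class wins
votes <- sapply(models, function(m) as.character(predict(m, ts, type = "class")))
preds <- apply(votes, 1, function(v) names(which.max(table(v))))
table(preds, ts$Class)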

Bootstrap Aggregating - Bagging

  • implemented in package adabag
# BreastCancer data from mlbench package
data(BreastCancer, package = "mlbench")
# use only the complete cases and remove the ID column
bc <- BreastCancer[complete.cases(BreastCancer), -1]
# Obtain a 70-30 split for training and testing
rndSample <- sample(1:nrow(bc), nrow(bc) * 0.70)
tr <- bc[rndSample, ]
ts <- bc[-rndSample, ]
# Build the model (mfinal = number of trees)
m <- bagging(Class ~ ., tr, mfinal = 20,
             control = rpart.control(maxdepth=1))
ps <- predict(m, ts)
names(ps)
## [1] "formula"   "votes"     "prob"      "class"     "confusion" "error"
ps$confusion
##                Observed Class
## Predicted Class benign malignant
##       benign       124         7
##       malignant     14        60

Bagging

  • Why use trees with maxdepth = 1?
# Build the model (mfinal = number of trees)
m <- bagging(Class ~ ., tr, mfinal = 20,
             control = rpart.control(maxdepth=3))
ps <- predict(m, ts)
names(ps)
## [1] "formula"   "votes"     "prob"      "class"     "confusion" "error"
ps$confusion
##                Observed Class
## Predicted Class benign malignant
##       benign       131         7
##       malignant      7        60
  • Shallow trees are much faster to build
  • Deep trees grown on different bootstraps tend to be very similar, which reduces the diversity the ensemble relies on

Random Forest

  • Improved version of bagging
  • Each tree is grown using only a random subset of the variables
    • In fact, a new random subset is drawn at each split
  • This yields a very diverse set of trees
  • Each tree is also built very quickly
    • Each split only has to consider a few variables

Random Forest

  • implemented in package randomForest
m <- randomForest(Class ~ ., tr, ntree = 100, mtry = 3)
ps <- predict(m, ts)
(cm <- table(ps, ts$Class))
##            
## ps          benign malignant
##   benign       132         0
##   malignant      6        67
  • parameter mtry controls the size of the variable subset tried at each split
    • if not provided, it is computed automatically (see the sketch below)
    • for classification: the square root of the number of variables
    • for regression: one third of the number of variables
  • Much faster than bagging
    • Each split decision considers only a few variables
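
A quick check of those defaults on this dataset (the computation below is only illustrative; randomForest works it out internally):

p <- ncol(tr) - 1        # number of predictor variables (9 here)
floor(sqrt(p))           # default mtry for classification
## [1] 3
max(floor(p / 3), 1)     # default mtry for regression
## [1] 3

Both happen to be 3 for this data, which is the value passed explicitly as mtry = 3 above.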

Random Forest

  • How many trees are optimal?
error <- numeric()
nmodels <- 20
for (i in 1:nmodels)
{
  m <- randomForest(Class ~ ., tr, ntree = i, mtry = 3)
  ps <- predict(m, ts)
  cm <- table(ps, ts$Class)
  # off-diagonal cells of the confusion matrix = misclassified test cases
  error[i] <- (cm[1, 2] + cm[2, 1]) / nrow(ts)
}
par(mar=c(2,4,1,2))
plot(1:nmodels, error, type = "l")
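
Refitting a forest for every value of ntree is wasteful. A cheaper alternative (a sketch relying on the err.rate matrix that randomForest stores for classification models) is to fit a single forest and inspect its out-of-bag error as trees are added:

m <- randomForest(Class ~ ., tr, ntree = nmodels, mtry = 3)
# out-of-bag error after 1, 2, ..., ntree trees
plot(1:nmodels, m$err.rate[, "OOB"], type = "l")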

Dependent vs Independent Ensembles

  • Bagging and random forests are independent ensembles
    • The individual models are built independently and are unaware of each other
  • There are also dependent (coordinated) ensembles
    • Each member depends on the others
    • Each new model tries to improve on the previous ones
    • Boosting is the most famous example

Boosting

  • Can many weak learners improve on each other to form a strong learner?
  • At each iteration a new model is added to the ensemble
    • The new model is trained to focus on the observations that the previous models found hard to predict
    • This is achieved by giving larger weights to those observations

AdaBoost

  • Most well-known boosting algorithm
  • An additive system of models (a rough from-scratch sketch is given below)

\[H(x_i) = \sum_k w_k h_k(x_i)\]

  • implemented in adabag as boosting()
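
To make the formula concrete, here is a rough sketch of the AdaBoost.M1 reweighting idea on the tr/ts split above, using rpart stumps as the weak learners h_k and their vote weights as the w_k (an illustration only, not the adabag implementation; a complete version would also stop when the weighted error reaches 0 or 0.5):

library(rpart)
n <- nrow(tr)
w <- rep(1 / n, n)                   # start with uniform observation weights
K <- 20
alpha <- numeric(K)                  # the model weights w_k of the formula
h <- vector("list", K)               # the weak models h_k
for (k in 1:K) {
  h[[k]] <- rpart(Class ~ ., tr, weights = w,
                  control = rpart.control(maxdepth = 1))
  miss <- predict(h[[k]], tr, type = "class") != tr$Class
  err <- sum(w[miss]) / sum(w)       # weighted training error
  alpha[k] <- log((1 - err) / err)   # vote weight of this model
  w <- w * exp(alpha[k] * miss)      # up-weight the hard observations
  w <- w / sum(w)
}
# weighted vote of the K models on the test set
score <- rowSums(sapply(1:K, function(k)
  alpha[k] * (predict(h[[k]], ts, type = "class") == "malignant")))
ps <- ifelse(score > sum(alpha) / 2, "malignant", "benign")
table(ps, ts$Class)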

AdaBoost

  • AdaBoost.M1 algorithm
m <- boosting(Class ~ ., tr, mfinal = 20)
ps <- predict(m, ts)
ps$confusion
##                Observed Class
## Predicted Class benign malignant
##       benign       131         0
##       malignant      7        67
  • add the parameter coeflearn = "Zhu" to run the SAMME algorithm instead

AdaBoost

  • How many trees are optimal?
error <- numeric()
nmodels <- 20
for (i in 1:nmodels)
{
  m <- boosting(Class ~ ., tr, mfinal = i)
  ps <- predict(m, ts)
  error[i] <- ps$error   # predict() for adabag models already reports the test error
}
par(mar=c(2,4,1,2))
plot(1:nmodels, error, type = "l")
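
As with the random forest above, refitting the whole ensemble for each value of mfinal is expensive. A cheaper sketch, assuming adabag's errorevol() helper (which evaluates an already fitted ensemble as members are added one by one):

m <- boosting(Class ~ ., tr, mfinal = nmodels)
ev <- errorevol(m, ts)            # test error after 1, 2, ..., mfinal models
plot(1:nmodels, ev$error, type = "l")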

Gradient Boosting Machine

  • Yet another boosting implementation
  • This time the ensemble is built by gradient descent on a loss function: each new tree is fitted to the pseudo-residuals (the negative gradient of the loss) of the current ensemble
  • implemented in package gbm
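
A minimal sketch of how a gbm call might look on the same data, assuming the bernoulli loss (which expects a 0/1 response); the parameter values are illustrative only:

library(gbm)
# recode the class as 0/1 for the bernoulli loss
trg <- tr
trg$Class <- as.numeric(trg$Class == "malignant")
m <- gbm(Class ~ ., data = trg, distribution = "bernoulli",
         n.trees = 100, interaction.depth = 3, shrinkage = 0.1)
# predicted probability of the "malignant" class
p <- predict(m, ts, n.trees = 100, type = "response")
ps <- ifelse(p > 0.5, "malignant", "benign")
table(ps, ts$Class)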