18/05/2020

Seed used in these slides

set.seed(1024)

Libraries used in these slides

library(adabag)
library(mlbench)
library(randomForest)

Model Ensembles

Model Ensembles

  • Philosophy:
    • No matter how wise, one person cannot know everything!
    • An ensemble of less wise people can collectively produce better decisions.
  • Why?

Model Ensembles

One superwise person’s decisions vs. one hundred barely wise persons’ decisions

SuperWise <- 0.9     # probability that the single superwise person decides correctly
BarelyWise <- 0.6    # probability that each barely wise person decides correctly
# 100 decisions taken by the superwise person alone
x <- rbinom(100, 1, SuperWise)
# 100 decisions, each taken by majority vote of 100 barely wise persons
y <- rbinom(100, 100, BarelyWise) / 100
y <- ifelse(y > 0.5, 1, 0)
table(x)
## x
##  0  1 
## 10 90
table(y)
## y
##  0  1 
##  4 96
cat(sum(x), sum(y))  # correct decisions out of 100: superwise person vs. the crowd
## 90 96

Bootstrap Aggregating - Bagging

Idea:

  • Sample the original dataset uniformly and with replacement to obtain k training sets
    • These samples are called bootstraps
  • Train a model on each bootstrap
    • Decision trees are the usual choice of base model
  • Average the predictions for regression, or take a majority vote for classification (see the sketch below)
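
A minimal from-scratch sketch of the idea, assuming rpart trees as base models and the cleaned BreastCancer data used later in these slides (the adabag package shown next does all of this for you):

library(rpart)
data(BreastCancer, package = "mlbench")
bc <- BreastCancer[complete.cases(BreastCancer), -1]
idx <- sample(1:nrow(bc), nrow(bc) * 0.7)
tr <- bc[idx, ]
ts <- bc[-idx, ]
k <- 20
models <- lapply(1:k, function(i) {
  # bootstrap: resample the training set uniformly with replacement
  boot <- tr[sample(1:nrow(tr), nrow(tr), replace = TRUE), ]
  rpart(Class ~ ., boot)
})
# each tree votes on the test set; the majority class wins
votes <- sapply(models, function(m) as.character(predict(m, ts, type = "class")))
preds <- apply(votes, 1, function(v) names(which.max(table(v))))
table(preds, ts$Class)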

Bootstrap Aggregating - Bagging

  • implemented in package adabag
# BreastCancer data from mlbench package
data(BreastCancer, package = "mlbench")
# use only the complete cases and remove the ID column
bc <- BreastCancer[complete.cases(BreastCancer), -1]
# Obtain a 70-30 split for training and testing
rndSample <- sample(1:nrow(bc), nrow(bc) * 0.70)
tr <- bc[rndSample, ]
ts <- bc[-rndSample, ]
# Build the model (mfinal = number of trees)
m <- bagging(Class ~ ., tr, mfinal = 20,
             control = rpart.control(maxdepth=1))
ps <- predict(m, ts)
names(ps)
## [1] "formula"   "votes"     "prob"      "class"     "confusion" "error"
ps$confusion
##                Observed Class
## Predicted Class benign malignant
##       benign       124         7
##       malignant     14        60

Bagging

  • Why use trees with maxdepth = 1?
# Build the model (mfinal = number of trees)
m <- bagging(Class ~ ., tr, mfinal = 20,
             control = rpart.control(maxdepth=3))
ps <- predict(m, ts)
names(ps)
## [1] "formula"   "votes"     "prob"      "class"     "confusion" "error"
ps$confusion
##                Observed Class
## Predicted Class benign malignant
##       benign       131         7
##       malignant      7        60
  • Shallow trees are much faster to build
  • Deep trees grown on different bootstraps tend to be very similar, which reduces the diversity the ensemble relies on

Random Forest

  • Improved version of bagging
  • Each tree is grown using only a random subset of the variables
    • In fact, a new random subset is drawn at each split
  • This yields a very diverse set of trees
  • Each tree is also built very quickly
    • Each split only has to consider a few variables

Random Forest

  • implemented in package randomForest
m <- randomForest(Class ~ ., tr, ntree = 100, mtry = 3)
ps <- predict(m, ts)
(cm <- table(ps, ts$Class))
##            
## ps          benign malignant
##   benign       132         0
##   malignant      6        67
  • parameter mtry controls the size of the variable subset tried at each split
    • if not provided, it is computed automatically (see the sketch below)
    • for classification: the square root of the number of variables
    • for regression: one third of the number of variables
  • Much faster than bagging
    • Each split decision considers only a few variables
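
A quick check of those defaults on this dataset (the computation below is only illustrative; randomForest works it out internally):

p <- ncol(tr) - 1        # number of predictor variables (9 here)
floor(sqrt(p))           # default mtry for classification
## [1] 3
max(floor(p / 3), 1)     # default mtry for regression
## [1] 3

Both happen to be 3 for this data, which is the value passed explicitly as mtry = 3 above.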

Random Forest

  • How many trees are optimal?
error <- numeric()
nmodels <- 20
for (i in 1:nmodels)
{
  m <- randomForest(Class ~ ., tr, ntree = i, mtry = 3)
  ps <- predict(m, ts)
  cm <- table(ps, ts$Class)
  # off-diagonal cells of the confusion matrix = misclassified test cases
  error[i] <- (cm[1, 2] + cm[2, 1]) / nrow(ts)
}
par(mar=c(2,4,1,2))
plot(1:nmodels, error, type = "l")
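
Refitting a forest for every value of ntree is wasteful. A cheaper alternative (a sketch relying on the err.rate matrix that randomForest stores for classification models) is to fit a single forest and inspect its out-of-bag error as trees are added:

m <- randomForest(Class ~ ., tr, ntree = nmodels, mtry = 3)
# out-of-bag error after 1, 2, ..., ntree trees
plot(1:nmodels, m$err.rate[, "OOB"], type = "l")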

Dependent vs Independent Ensembles

  • Bagging and random forests are independent ensembles
    • The individual models are built independently and are unaware of each other
  • There are also dependent (coordinated) ensembles
    • Each member depends on the others
    • Each new model tries to improve on the previous ones
    • Boosting is the most famous example

Boosting

  • Can many weak learners improve on each other to form a strong learner?
  • At each iteration a new model is added to the ensemble
    • The new model is trained to focus on the observations that the previous models found hard to predict
    • This is achieved by giving larger weights to those observations

AdaBoost

  • Most well-known boosting algorithm
  • An additive system of models (a rough from-scratch sketch is given below)

\[H(x_i) = \sum_k w_k h_k(x_i)\]

  • implemented in adabag as boosting()
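
To make the formula concrete, here is a rough sketch of the AdaBoost.M1 reweighting idea on the tr/ts split above, using rpart stumps as the weak learners h_k and their vote weights as the w_k (an illustration only, not the adabag implementation; a complete version would also stop when the weighted error reaches 0 or 0.5):

library(rpart)
n <- nrow(tr)
w <- rep(1 / n, n)                   # start with uniform observation weights
K <- 20
alpha <- numeric(K)                  # the model weights w_k of the formula
h <- vector("list", K)               # the weak models h_k
for (k in 1:K) {
  h[[k]] <- rpart(Class ~ ., tr, weights = w,
                  control = rpart.control(maxdepth = 1))
  miss <- predict(h[[k]], tr, type = "class") != tr$Class
  err <- sum(w[miss]) / sum(w)       # weighted training error
  alpha[k] <- log((1 - err) / err)   # vote weight of this model
  w <- w * exp(alpha[k] * miss)      # up-weight the hard observations
  w <- w / sum(w)
}
# weighted vote of the K models on the test set
score <- rowSums(sapply(1:K, function(k)
  alpha[k] * (predict(h[[k]], ts, type = "class") == "malignant")))
ps <- ifelse(score > sum(alpha) / 2, "malignant", "benign")
table(ps, ts$Class)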

AdaBoost

  • AdaBoost.M1 algorithm
m <- boosting(Class ~ ., tr, mfinal = 20)
ps <- predict(m, ts)
ps$confusion
##                Observed Class
## Predicted Class benign malignant
##       benign       131         0
##       malignant      7        67
  • add the parameter coeflearn = "Zhu" to run the SAMME algorithm instead

AdaBoost

  • How many trees are optimal?
error <- numeric()
nmodels <- 20
for (i in 1:nmodels)
{
  m <- boosting(Class ~ ., tr, mfinal = i)
  ps <- predict(m, ts)
  error[i] <- ps$error   # predict() for adabag models already reports the test error
}
par(mar=c(2,4,1,2))
plot(1:nmodels, error, type = "l")
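
As with the random forest above, refitting the whole ensemble for each value of mfinal is expensive. A cheaper sketch, assuming adabag's errorevol() helper (which evaluates an already fitted ensemble as members are added one by one):

m <- boosting(Class ~ ., tr, mfinal = nmodels)
ev <- errorevol(m, ts)            # test error after 1, 2, ..., mfinal models
plot(1:nmodels, ev$error, type = "l")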

Gradient Boosting Machine

  • Yet another boosting implementation
  • This time the ensemble is built by gradient descent on a loss function: each new tree is fitted to the pseudo-residuals (the negative gradient of the loss) of the current ensemble
  • implemented in package gbm
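
A minimal sketch of how a gbm call might look on the same data, assuming the bernoulli loss (which expects a 0/1 response); the parameter values are illustrative only:

library(gbm)
# recode the class as 0/1 for the bernoulli loss
trg <- tr
trg$Class <- as.numeric(trg$Class == "malignant")
m <- gbm(Class ~ ., data = trg, distribution = "bernoulli",
         n.trees = 100, interaction.depth = 3, shrinkage = 0.1)
# predicted probability of the "malignant" class
p <- predict(m, ts, n.trees = 100, type = "response")
ps <- ifelse(p > 0.5, "malignant", "benign")
table(ps, ts$Class)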