
11/05/2020

```r
set.seed(1024)

library(rpart)
library(rpart.plot)
library(mlbench)
library(DMwR2)
library(e1071)
```

- **Interpretable** results
- Reasonable accuracy
- Applicable for both **classification and regression** tasks
- Works with both **numeric and categorical** variables
- **Can handle NAs**
- No assumption of the shape of the function
- Not top prediction performance
  - Ensembles of trees have much better performance

- A **hierarchy of logical tests** on variables
  - Is X > 5?
  - Is color = green?
  - Is birthplace in {Ankara, Istanbul, İzmir}?

- Each branch, including the root, splits the data at hand into two
  - choosing the test that decreases the total error rate

- The leaves contain results / predictions
- The path to a leaf is a conjunction of logical tests
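As a quick illustration (a sketch, not part of the original examples), printing a fitted `rpart` tree shows exactly this structure: each inner node is a logical test, and each leaf holds a prediction.

```r
library(rpart)

# Print a fitted tree: inner nodes show logical tests, leaves show predictions
fit <- rpart(Species ~ ., data = iris)
print(fit)
```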

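The variable listing below is presumably the output of loading the `BreastCancer` data from `mlbench` and printing its column names:

```r
# Assumed source of the output below
data(BreastCancer, package = "mlbench")
names(BreastCancer)
```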
## [1] "Id" "Cl.thickness" "Cell.size" "Cell.shape" ## [5] "Marg.adhesion" "Epith.c.size" "Bare.nuclei" "Bl.cromatin" ## [9] "Normal.nucleoli" "Mitoses" "Class"

The Gini index of a dataset D, where each example belongs to one of C classes:

\(\displaystyle Gini(D) = 1 - \sum_{i=1}^{C}{p_i^2}\)

- \(p_i\) is the observed relative frequency (proportion) of class \(i\) in D.

- Consider a binary case where the two classes are A and B: the Gini index is 0 when the node is pure and maximal (0.5) when the classes are evenly mixed.

If D is split by a logical test s, then

\(\displaystyle Gini_s(D) = \frac{|D_s|}{|D|}Gini(D_s) + \frac{|D_{\neg s}|}{|D|}Gini(D_{\neg s})\)

Then, the reduction in impurity is given by

\(\Delta Gini_s(D) = Gini(D) - Gini_s(D)\)
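A small sketch (my own illustration, not from the original text) that computes the Gini index and the impurity reduction for a toy binary split:

```r
# Gini index of a vector of class labels
gini <- function(y) {
  p <- table(y) / length(y)   # observed class proportions
  1 - sum(p^2)
}

# Toy data: 3 cases of class A, 5 of class B, and a numeric variable x
y <- c("A", "A", "A", "B", "B", "B", "B", "B")
x <- 1:8

s <- x <= 3                   # candidate logical test: is x <= 3?
gini_D  <- gini(y)                                        # 0.46875
gini_sD <- mean(s) * gini(y[s]) + mean(!s) * gini(y[!s])  # 0 (both sides pure)
gini_D - gini_sD              # reduction in impurity: 0.46875
```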

- Information gain based on entropy is also frequently used

- For regression, least squares (LS) is frequently used to measure error

\(\displaystyle Err(D) = \frac{1}{|D|} \sum_{ \langle x_i,y_i \rangle \in D}{(y_i - k_D)^2}\)

where \(k_D\) is the constant used to represent (predict) the value of the cases in D.

It can be shown that \(k_D = mean(y_i)\) minimizes the LS error.

If D is split by a logical test s, then

\(\displaystyle Err_s(D) = \frac{|D_s|}{|D|}Err(D_s) + \frac{|D_{\neg s}|}{|D|}Err(D_{\neg s})\)

Then, the reduction in impurity is given by

\(\Delta Err_s(D) = Err(D) - Err_s(D)\)
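A corresponding sketch (again my own illustration, not from the original text) for the regression case, showing that the mean minimizes the LS error and how a split reduces it:

```r
# LS error of representing y by a constant k
ls_err <- function(y, k) mean((y - k)^2)

y <- c(1, 2, 3, 10, 20)

ls_err(y, mean(y))            # 50.96 -- the mean minimizes the LS error
ls_err(y, median(y))          # 68.60 -- any other constant does worse

s <- y < 5                    # candidate logical test
err_D  <- ls_err(y, mean(y))
err_sD <- mean(s)  * ls_err(y[s],  mean(y[s])) +
          mean(!s) * ls_err(y[!s], mean(y[!s]))
err_D - err_sD                # reduction in error: 40.56
```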

- When to stop?
  - Too deep -> over-fitting, variance error
  - Too shallow -> over-simplified, bias error

- Control with parameters (see the sketch after this list)
  - leaf size
  - split size
  - depth
  - complexity

- Grow a very large tree, then prune
  - according to some statistical information
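In `rpart` these controls correspond to arguments of `rpart.control()`; a minimal sketch that grows a deliberately large tree (the parameter values are illustrative):

```r
library(rpart)

big <- rpart(Species ~ ., data = iris,
             control = rpart.control(minsplit = 2,   # min cases needed to try a split
                                     minbucket = 1,  # min cases allowed in a leaf
                                     maxdepth = 30,  # maximum depth of the tree
                                     cp = 0))        # complexity: 0 disables pre-pruning
printcp(big)   # cross-validated error of each subtree, used for pruning
```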

- Implemented in `rpart` and `party`
- We will use `rpart`
  - Functions `rpart()` and `prune.rpart()`
- The book package, `DMwR2`, contains `rpartXse()`, which combines `rpart()` and `prune.rpart()`
  - Applies post-pruning with the X-SE rule (see the sketch below)
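What `rpartXse()` automates can be sketched manually with `rpart()` and `prune()`; here a 0-SE selection (picking the cp with minimum cross-validated error) for simplicity:

```r
library(rpart)

fit <- rpart(Species ~ ., data = iris, control = rpart.control(cp = 0))
fit$cptable                    # per-subtree CV error (xerror) and its SE (xstd)

# 0-SE rule: prune at the cp value with the lowest cross-validated error.
# The X-SE rule instead picks the smallest subtree whose xerror is within
# X standard errors of that minimum.
best <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned <- prune(fit, cp = best)
```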

- A formula in R is written in the following form:

\(Y \sim X_1 + X_2 + X_3 + X_4 + \dots\)

- This means that the value of Y depends on the values of the Xs

\(Y \sim .\)

- means Y depends on all the other variables in the data (see the example below)
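For example, these two calls specify the same model on `iris`:

```r
library(rpart)

# Equivalent formulas on the iris data:
rpart(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, iris)
rpart(Species ~ ., iris)   # '.' stands for all other variables in the data
```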

Due to certain randomized parts of the algorithm, it is possible to obtain slightly different trees between different runs.

Hence, always set a seed.

The `rpart.plot` package allows nice drawings of DTs using `prp()`.

```r
data(iris)
ct1 <- rpartXse(Species ~ ., iris, model = TRUE)
ct2 <- rpartXse(Species ~ ., iris, se = 0, model = TRUE)
```

`se = 0` results in less aggressive pruning: the subtree with the lowest cross-validated error is kept, instead of a smaller one within X standard errors of it.

```r
par(mfrow = c(1, 2))
prp(ct1, type = 0, extra = 101)
prp(ct2, type = 0, extra = 101)
```

```r
samp <- sample(1:nrow(iris), 120)
tr_set <- iris[samp, ]
tst_set <- iris[-samp, ]
model <- rpartXse(Species ~ ., tr_set, se = 0.5)
predicted <- predict(model, tst_set, type = "class")
head(predicted)
```

```
##     12     15     35     37     40     43
## setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
```

```r
table(tst_set$Species, predicted)
```

```
##             predicted
##              setosa versicolor virginica
##   setosa          8          0         0
##   versicolor      0         10         1
##   virginica       0          0        11
```

```r
errorRate <- sum(predicted != tst_set$Species) / nrow(tst_set)
errorRate
```

```
## [1] 0.03333333
```

- Linearly separable sets
- Linearly non-separable sets
  - Lift to a higher dimension using a non-linear function (see the sketch below)
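Since `e1071` is loaded above, a minimal sketch (an assumed example, not from the original text) of this idea: a radial (non-linear) kernel implicitly lifts the data to a higher-dimensional space where the classes become separable.

```r
library(e1071)
set.seed(1024)

# iris classes versicolor and virginica are not linearly separable;
# a radial (non-linear) kernel handles the curved boundary.
sv <- svm(Species ~ ., data = iris, kernel = "radial")
table(iris$Species, predict(sv, iris))
```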