2024-04-30

Seed used in these slides

set.seed(1024)

Libraries used in these slides

library(rpart)
library(rpart.plot)
library(mlbench)
library(DMwR2)
library(e1071)

Tree Based Models

Properties

  • Interpretable results
  • Reasonable accuracy
  • Applicable to both classification and regression tasks
  • Work with both numeric and categorical variables
  • Can handle NAs
  • No assumptions about the shape of the function
  • Not the best predictive performance on their own
    • Ensembles of trees perform much better

Shape

  • A hierarchy of logical tests on variables
    • Is X > 5?
    • Is color = green?
    • Is birthplace in {Ankara, Istanbul, İzmir}?
  • Each branch, including the root, splits the data at hand into two
    • Decreasing the total error rate
  • The leaves contain results / predictions
  • The path to a leaf is a conjunction of logical tests

Example

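The column names below are those of the BreastCancer dataset from the mlbench package; a plausible reconstruction of the (hidden) chunk that produced this output is:

data(BreastCancer, package = "mlbench")
colnames(BreastCancer)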
##  [1] "Id"              "Cl.thickness"    "Cell.size"       "Cell.shape"     
##  [5] "Marg.adhesion"   "Epith.c.size"    "Bare.nuclei"     "Bl.cromatin"    
##  [9] "Normal.nucleoli" "Mitoses"         "Class"

Algorithm

Find Best Split: Gini index

The Gini index of a dataset D, where each example belongs to one of C classes:

\(\displaystyle Gini(D) = 1 - \sum_{i=1}^{C}{p_i^2}\)

  • \(p_i\) is the observed frequency of class i.
  • Consider a binary case with two classes A and B: Gini is 0 when the node is pure and maximal (0.5) when the two classes are evenly mixed

Gini index

If D is split by a logical test s, then

\(\displaystyle Gini_s(D) = \frac{|D_s|}{|D|}Gini(D_s) + \frac{|D_{\neg s}|}{|D|}Gini(D_{\neg s})\)

Then, the reduction in impurity is given by

\(\Delta Gini_s(D) = Gini(D) - Gini_s(D)\)

  • Information gain based on entropy is also frequently used
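To make these definitions concrete, here is a minimal R sketch of \(Gini(D)\), \(Gini_s(D)\), and the impurity reduction for one candidate test (the function names and the Petal.Length test are illustrative, not part of rpart):

# Gini(D): 1 minus the sum of squared class frequencies
gini <- function(classes) {
  p <- table(classes) / length(classes)
  1 - sum(p^2)
}
# Gini_s(D): weighted Gini of the two subsets induced by logical test s
gini_split <- function(classes, s) {
  w <- mean(s)
  w * gini(classes[s]) + (1 - w) * gini(classes[!s])
}
data(iris)
s <- iris$Petal.Length < 2.45                     # a candidate test
gini(iris$Species) - gini_split(iris$Species, s)  # Delta Gini_s(D)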

Least Squares

  • For regression, LS is frequently used to measure error

\(\displaystyle Err(D) = \frac{1}{|D|} \sum_{ \langle x_i,y_i \rangle \in D}{(y_i - k_D)^2}\)

where \(k_D\) is a constant representing the target values of the cases in D.

  • It can be shown that \(k_D = mean(y_i)\) minimizes this error.

  • If D is split by a logical test s, then

\(\displaystyle Err_s(D) = \frac{|D_s|}{|D|}Err(D_s) + \frac{|D_{\neg s}|}{|D|}Err(D_{\neg s})\)

Then, the reduction in impurity is given by

\(\Delta Err_s(D) = Err(D) - Err_s(D)\)
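Analogously, a minimal R sketch of the least-squares criterion for a regression split (illustrative only; rpart implements this internally):

# Err(D): mean squared deviation from the node mean (k_D = mean(y))
ls_err <- function(y) mean((y - mean(y))^2)
# Err_s(D): weighted error of the two subsets induced by logical test s
ls_split <- function(y, s) {
  w <- mean(s)
  w * ls_err(y[s]) + (1 - w) * ls_err(y[!s])
}
data(iris)
y <- iris$Petal.Width
s <- iris$Petal.Length < 2.45
ls_err(y) - ls_split(y, s)   # Delta Err_s(D)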

Termination

  • When to stop?
    • Too deep -> over-fitting, variance error
    • Too shallow -> over-simplified, bias error
  • Control with parameters (see the rpart.control sketch after this list)
    • leaf size
    • split size
    • depth
    • complexity
  • Grow a very large tree, then prune
    • According to some statistical information
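These controls map onto arguments of rpart.control; a sketch using rpart's own argument names (the values shown are its usual defaults):

# minbucket ~ leaf size, minsplit ~ split size, maxdepth ~ depth, cp ~ complexity
data(iris)
ct <- rpart(Species ~ ., iris,
            control = rpart.control(minsplit = 20, minbucket = 7,
                                    maxdepth = 30, cp = 0.01))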

Implementation

  • Implemented in rpart and party
    • We will use rpart
  • Functions
    • rpart() and prune.rpart()
  • The book package (DMwR2) contains
    • rpartXse(), which combines rpart() and prune.rpart()
    • it applies post-pruning with the X-SE rule

Formula

  • A formula in R is provided in the following form

\(Y \sim X_1 + X_2 + X_3 + X_4...\)

  • This means the value of Y depends on the values of Xs

\(Y \sim .\)

  • means Y modeled as a function of all other variables (see the sketch below)
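A small illustration with iris (the choice of variables is arbitrary):

f1 <- Species ~ Petal.Length + Petal.Width   # Species modeled from these two Xs
f2 <- Species ~ .                            # the "." expands to all other columns
m  <- rpartXse(f2, iris)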

Randomness

  • Due to randomized parts of the algorithm (e.g., the cross-validation used for pruning), it is possible to obtain slightly different trees in different runs.

  • Hence, always use a seed

  • The rpart.plot package produces nice drawings of decision trees via prp()

Example

data(iris)
ct1 <- rpartXse(Species ~ ., iris, model = TRUE)
ct2 <- rpartXse(Species ~ ., iris, se = 0, model = TRUE)
  • se = 0 results in less aggressive pruning

Example

par(mfrow=c(1,2))
prp(ct1, type = 0, extra = 101)
prp(ct2, type = 0, extra = 101)

Example

samp <- sample(1:nrow(iris), 120)
tr_set <- iris[samp, ]
tst_set <- iris[-samp, ]
model <- rpartXse(Species ~ ., tr_set, se = 0.5)
predicted <- predict(model, tst_set, type = "class")
head(predicted)
##     12     15     35     37     40     43 
## setosa setosa setosa setosa setosa setosa 
## Levels: setosa versicolor virginica
table(tst_set$Species, predicted)
##             predicted
##              setosa versicolor virginica
##   setosa          8          0         0
##   versicolor      0         10         1
##   virginica       0          0        11
errorRate <- sum(predicted != tst_set$Species) / nrow(tst_set)
errorRate
## [1] 0.03333333

Support Vector Machines

Support Vector Machines

  • Linearly separable sets

Support Vector Machines

  • Linearly non-separable sets
    • Lift to a higher dimension using a non-linear function

Support Vector Machines

  • Questions
    • Which function to use?
    • Which hyperplane to choose?
      • The one that maximizes the separating margin

Support Vector Machines

  • Choosing the optimal hyperplane
    • Involves linear algebra and quadratic optimization
      • Lagrangian relaxation
      • Dual problem
      • Karush-Kuhn-Tucker conditions
    • Core operation is computing the dot product of two points (vectors)
      • Which can be very expensive after dimension expansion
    • We need to do this faster

Kernel Trick

  • Kernel trick

    • Consider two points \(x:\langle x_1, x_2 \rangle\) and \(z:\langle z_1, z_2 \rangle\)
    • Let \(\phi(x)\) be a nonlinear mapping of x to a higher dimension
    • We want to compute \(\phi(x)\cdot \phi(z)\)
    • Consider the following kernel function: \(K(x_i,x_j)=(x_i\cdot x_j)^2\)
    • Then

    \(K(x,z)=(\langle x_1, x_2 \rangle \cdot \langle z_1, z_2 \rangle)^2\)

    \(=(x_1z_1+x_2z_2)^2 = x_1^2z_1^2 + x_2^2z_2^2 + 2x_1x_2z_1z_2\)

    \(=\langle x_1^2, x_2^2, \sqrt{2}x_1x_2 \rangle \cdot \langle z_1^2, z_2^2, \sqrt{2}z_1z_2 \rangle\)

    • So, for \(\phi(\langle x_1,x_2 \rangle)=\langle x_1^2, x_2^2, \sqrt{2}x_1x_2 \rangle\) we have

    \(K(x,z)=\phi(x)\cdot \phi(z)\)
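A quick numeric check of this identity in R (the values are arbitrary):

x <- c(1, 2); z <- c(3, 4)
phi <- function(v) c(v[1]^2, v[2]^2, sqrt(2) * v[1] * v[2])
(sum(x * z))^2          # K(x,z) computed in the original 2-D space
sum(phi(x) * phi(z))    # dot product after the explicit mapping: same value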

Kernel Function Families

  • This means that if we find such kernel functions, we can compute dot products in the expanded space without ever constructing it explicitly, which is much faster.

  • Indeed, there are many such kernel function families (mapped to e1071's svm arguments in the sketch after this list):

    • Gaussian kernel

    \(K(x_i, x_j) = e^{-\frac{||x_i-x_j||^2}{2\sigma^2}}\)

    • Polynomial

    \(K(x_i, x_j) = (x_i\cdot x_j)^d\)

    • Radial kernel

    \(K(x_i,x_j) = e^{-\gamma||x_i - x_j||^2}\)
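In e1071's svm() these families correspond to the kernel-related arguments; a brief sketch (parameter values are arbitrary, and the Gaussian kernel is the radial kernel with \(\gamma = 1/(2\sigma^2)\)):

data(iris)
m_poly <- svm(Species ~ ., iris, kernel = "polynomial", degree = 2, coef0 = 0)
m_rad  <- svm(Species ~ ., iris, kernel = "radial", gamma = 0.5)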

Support Vector Machines

  • How does SVM handle non-binary classification?
    • By solving multiple binary classification problems
  • Regression?
    • \(\epsilon\)-SV approach finds an optimal hyperplane where each data point lies within \(\epsilon\) distance of the hyperplane.
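A minimal eps-regression sketch with e1071 (epsilon sets the tube width; the choice of variables is illustrative):

data(iris)
r <- svm(Petal.Width ~ Petal.Length, iris, type = "eps-regression", epsilon = 0.1)
head(predict(r, iris))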

Implementation

  • Implemented in packages e1071 and kernlab.
    • They are quite similar; kernlab is more flexible, while e1071 is simpler to use.

Example

data(iris)
rndSample <- sample(1:nrow(iris), 100)
tr <- iris[rndSample, ]
ts <- iris[-rndSample, ]
s <- svm(Species ~ ., tr)
ps <- predict(s, ts)
(cm <- table(ps, ts$Species))
##             
## ps           setosa versicolor virginica
##   setosa         24          0         0
##   versicolor      0         14         0
##   virginica       0          1        11

Example

s2 <- svm(Species ~ ., tr, cost=10, kernel="polynomial", degree=3)
ps2 <- predict(s2, ts)
(cm2 <- table(ps2, ts$Species))
##             
## ps2          setosa versicolor virginica
##   setosa         24          0         0
##   versicolor      0         15         3
##   virginica       0          0         8