
11/05/2020

```r
set.seed(1024)

library(rpart)
library(rpart.plot)
library(mlbench)
library(DMwR2)
library(e1071)
```

- **Interpretable** results
- Reasonable accuracy
- Applicable for both **classification and regression** tasks
- Works with both **numeric and categorical** variables
- **Can handle NAs**
- No assumption of the shape of the function
- Not top prediction performance
  - Ensembles of trees have much better performance

- A **hierarchy of logical tests** on variables
  - Is X > 5?
  - Is color = green?
  - Is birthplace in {Ankara, Istanbul, İzmir}?

- Each branch, including the root, splits the data at hand into two
  - choosing the test that decreases the total error rate

- The leaves contain results / predictions
- The path to a leaf is a conjunction of logical tests
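As a quick illustration (a sketch, not part of the original examples), printing a fitted `rpart` tree shows exactly this structure: each inner node is a logical test, and each leaf holds a prediction.

```r
library(rpart)

# Print a fitted tree: inner nodes show logical tests, leaves show predictions
fit <- rpart(Species ~ ., data = iris)
print(fit)
```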

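The variable listing below is presumably the output of loading the `BreastCancer` data from `mlbench` and printing its column names:

```r
# Assumed source of the output below
data(BreastCancer, package = "mlbench")
names(BreastCancer)
```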
## [1] "Id" "Cl.thickness" "Cell.size" "Cell.shape" ## [5] "Marg.adhesion" "Epith.c.size" "Bare.nuclei" "Bl.cromatin" ## [9] "Normal.nucleoli" "Mitoses" "Class"

The Gini index of a dataset D, where each example belongs to one of C classes:

\(\displaystyle Gini(D) = 1 - \sum_{i=1}^{C}{p_i^2}\)

- \(p_i\) is the observed relative frequency (proportion) of class \(i\) in D.

- Consider a binary case where the two classes are A and B: the Gini index is 0 when the node is pure and maximal (0.5) when the classes are evenly mixed.

If D is split by a logical test s, then

\(\displaystyle Gini_s(D) = \frac{|D_s|}{|D|}Gini(D_s) + \frac{|D_{\neg s}|}{|D|}Gini(D_{\neg s})\)

Then, the reduction in impurity is given by

\(\Delta Gini_s(D) = Gini(D) - Gini_s(D)\)
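A small sketch (my own illustration, not from the original text) that computes the Gini index and the impurity reduction for a toy binary split:

```r
# Gini index of a vector of class labels
gini <- function(y) {
  p <- table(y) / length(y)   # observed class proportions
  1 - sum(p^2)
}

# Toy data: 3 cases of class A, 5 of class B, and a numeric variable x
y <- c("A", "A", "A", "B", "B", "B", "B", "B")
x <- 1:8

s <- x <= 3                   # candidate logical test: is x <= 3?
gini_D  <- gini(y)                                        # 0.46875
gini_sD <- mean(s) * gini(y[s]) + mean(!s) * gini(y[!s])  # 0 (both sides pure)
gini_D - gini_sD              # reduction in impurity: 0.46875
```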

- Information gain based on entropy is also frequently used

- For regression, least squares (LS) is frequently used to measure error

\(\displaystyle Err(D) = \frac{1}{|D|} \sum_{ \langle x_i,y_i \rangle \in D}{(y_i - k_D)^2}\)

where \(k_D\) is the constant used to represent (predict) the value of the cases in D.

It can be shown that \(k_D = mean(y_i)\) minimizes the LS error.

If D is split by a logical test s, then

\(\displaystyle Err_s(D) = \frac{|D_s|}{|D|}Err(D_s) + \frac{|D_{\neg s}|}{|D|}Err(D_{\neg s})\)

Then, the reduction in impurity is given by

\(\Delta Err_s(D) = Err(D) - Err_s(D)\)
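A corresponding sketch (again my own illustration, not from the original text) for the regression case, showing that the mean minimizes the LS error and how a split reduces it:

```r
# LS error of representing y by a constant k
ls_err <- function(y, k) mean((y - k)^2)

y <- c(1, 2, 3, 10, 20)

ls_err(y, mean(y))            # 50.96 -- the mean minimizes the LS error
ls_err(y, median(y))          # 68.60 -- any other constant does worse

s <- y < 5                    # candidate logical test
err_D  <- ls_err(y, mean(y))
err_sD <- mean(s)  * ls_err(y[s],  mean(y[s])) +
          mean(!s) * ls_err(y[!s], mean(y[!s]))
err_D - err_sD                # reduction in error: 40.56
```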

- When to stop?
  - Too deep -> over-fitting, variance error
  - Too shallow -> over-simplified, bias error

- Control with parameters (see the sketch after this list)
  - leaf size
  - split size
  - depth
  - complexity

- Grow a very large tree, then prune
  - according to some statistical information
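In `rpart` these controls correspond to arguments of `rpart.control()`; a minimal sketch that grows a deliberately large tree (the parameter values are illustrative):

```r
library(rpart)

big <- rpart(Species ~ ., data = iris,
             control = rpart.control(minsplit = 2,   # min cases needed to try a split
                                     minbucket = 1,  # min cases allowed in a leaf
                                     maxdepth = 30,  # maximum depth of the tree
                                     cp = 0))        # complexity: 0 disables pre-pruning
printcp(big)   # cross-validated error of each subtree, used for pruning
```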

- Implemented in `rpart` and `party`
- We will use `rpart`
  - Functions `rpart()` and `prune.rpart()`
- The book package, `DMwR2`, contains `rpartXse()`, which combines `rpart()` and `prune.rpart()`
  - Applies post-pruning with the X-SE rule (see the sketch below)
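What `rpartXse()` automates can be sketched manually with `rpart()` and `prune()`; here a 0-SE selection (picking the cp with minimum cross-validated error) for simplicity:

```r
library(rpart)

fit <- rpart(Species ~ ., data = iris, control = rpart.control(cp = 0))
fit$cptable                    # per-subtree CV error (xerror) and its SE (xstd)

# 0-SE rule: prune at the cp value with the lowest cross-validated error.
# The X-SE rule instead picks the smallest subtree whose xerror is within
# X standard errors of that minimum.
best <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned <- prune(fit, cp = best)
```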

- A formula in R is written in the following form:

\(Y \sim X_1 + X_2 + X_3 + X_4 + \dots\)

- This means that the value of Y depends on the values of the Xs

\(Y \sim .\)

- means Y depends on all the other variables in the data (see the example below)
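For example, these two calls specify the same model on `iris`:

```r
library(rpart)

# Equivalent formulas on the iris data:
rpart(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, iris)
rpart(Species ~ ., iris)   # '.' stands for all other variables in the data
```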

Due to certain randomized parts of the algorithm, it is possible to obtain slightly different trees between different runs.

Hence, always set a seed.

The `rpart.plot` package allows nice drawings of DTs using `prp()`.

```r
data(iris)
ct1 <- rpartXse(Species ~ ., iris, model = TRUE)
ct2 <- rpartXse(Species ~ ., iris, se = 0, model = TRUE)
```

`se = 0` results in less aggressive pruning: the subtree with the lowest cross-validated error is kept, instead of a smaller one within X standard errors of it.

```r
par(mfrow = c(1, 2))
prp(ct1, type = 0, extra = 101)
prp(ct2, type = 0, extra = 101)
```

```r
samp <- sample(1:nrow(iris), 120)
tr_set <- iris[samp, ]
tst_set <- iris[-samp, ]
model <- rpartXse(Species ~ ., tr_set, se = 0.5)
predicted <- predict(model, tst_set, type = "class")
head(predicted)
```

```
##     12     15     35     37     40     43
## setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
```

```r
table(tst_set$Species, predicted)
```

```
##             predicted
##              setosa versicolor virginica
##   setosa          8          0         0
##   versicolor      0         10         1
##   virginica       0          0        11
```

```r
errorRate <- sum(predicted != tst_set$Species) / nrow(tst_set)
errorRate
```

```
## [1] 0.03333333
```

- Linearly separable sets
- Linearly non-separable sets
  - Lift to a higher dimension using a non-linear function (see the sketch below)
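Since `e1071` is loaded above, a minimal sketch (an assumed example, not from the original text) of this idea: a radial (non-linear) kernel implicitly lifts the data to a higher-dimensional space where the classes become separable.

```r
library(e1071)
set.seed(1024)

# iris classes versicolor and virginica are not linearly separable;
# a radial (non-linear) kernel handles the curved boundary.
sv <- svm(Species ~ ., data = iris, kernel = "radial")
table(iris$Species, predict(sv, iris))
```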