11/05/2020
set.seed(1024)
library(rpart)
library(rpart.plot)
library(mlbench)
library(DMwR2)
library(e1071)
##  [1] "Id"              "Cl.thickness"    "Cell.size"       "Cell.shape"
##  [5] "Marg.adhesion"   "Epith.c.size"    "Bare.nuclei"     "Bl.cromatin"
##  [9] "Normal.nucleoli" "Mitoses"         "Class"
The recursive partitioning algorithm
The Gini index of a dataset D, where each example belongs to one of C classes:
\(\displaystyle Gini(D) = 1 - \sum_{i=1}^{C}{p_i^2}\)
If D is split by a logical test s, then
\(\displaystyle Gini_s(D) = \frac{|D_s|}{|D|}Gini(D_s) + \frac{|D_{\neg s}|}{|D|}Gini(D_{\neg s})\)
Then, the reduction in impurity is given by
\(\Delta Gini_s(D) = Gini(D) - Gini_s(D)\)
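The three Gini formulas above can be sketched in a few lines of base R. This is an illustration only: the helper names `gini()` and `gini_split()` and the candidate test `Petal.Length < 2.45` are assumptions made for the example, and both sides of the split are assumed to be non-empty.

```r
# Gini index of a vector of class labels: 1 - sum(p_i^2)
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Weighted Gini after splitting by a logical test s (one TRUE/FALSE per example)
gini_split <- function(labels, s) {
  n <- length(labels)
  sum(s) / n * gini(labels[s]) + sum(!s) / n * gini(labels[!s])
}

data(iris)
s <- iris$Petal.Length < 2.45    # hypothetical candidate test

gini_D  <- gini(iris$Species)            # impurity before the split
gini_sD <- gini_split(iris$Species, s)   # impurity after the split
delta   <- gini_D - gini_sD              # reduction in impurity

round(c(Gini = gini_D, Gini_s = gini_sD, Delta = delta), 4)
```

On iris the three classes are balanced, so Gini(D) is 2/3. The test isolates the 50 setosa examples, leaving a pure subset and a 50/50 subset, so the weighted Gini drops to 1/3.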
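For regression trees, the error of a dataset D is measured by least squares (LS):
\(\displaystyle Err(D) = \frac{1}{|D|} \sum_{ \langle x_i,y_i \rangle \in D}{(y_i - k_D)^2}\)
where \(k_D\) is the constant value used to represent D (the prediction for every example in D).
It can be shown that the mean of the \(y_i\) in D is the value of \(k_D\) that minimizes the LS error.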
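A short sketch of why the mean is the minimizer (standard calculus, not taken from the source): set the derivative of the sum of squares with respect to \(k_D\) to zero.
\(\displaystyle \frac{\partial}{\partial k_D} \sum_{ \langle x_i,y_i \rangle \in D}{(y_i - k_D)^2} = -2 \sum_{ \langle x_i,y_i \rangle \in D}{(y_i - k_D)} = 0 \;\Rightarrow\; k_D = \frac{1}{|D|} \sum_{ \langle x_i,y_i \rangle \in D}{y_i}\)
The second derivative is \(2|D| > 0\), so this stationary point is a minimum.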
If D is split by a logical test s, then
\(\displaystyle Err_s(D) = \frac{|D_s|}{|D|}Err(D_s) + \frac{|D_{\neg s}|}{|D|}Err(D_{\neg s})\)
Then, the reduction in impurity is given by
\(\Delta Err_s(D) = Err(D) - Err_s(D)\)
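The error formulas above can be checked on a small example. The sketch below uses hypothetical toy data (`x`, `y`) and helper names (`err()`, `err_split()`) that are assumptions for illustration. It also confirms numerically, with base R's `optimize()`, that the mean is the best constant.

```r
# Least squares error of y around a constant k (the mean by default)
err <- function(y, k = mean(y)) mean((y - k)^2)

# Weighted error after splitting by a logical test s
err_split <- function(y, s) {
  n <- length(y)
  sum(s) / n * err(y[s]) + sum(!s) / n * err(y[!s])
}

# Hypothetical toy data: two well-separated groups of target values
x <- 1:6
y <- c(1, 2, 3, 10, 11, 12)
s <- x <= 3                      # hypothetical candidate test

err_D  <- err(y)                 # error before the split
err_sD <- err_split(y, s)        # error after the split
delta  <- err_D - err_sD         # reduction in error

# Numerical search for the constant that minimizes the error
k_best <- optimize(function(k) err(y, k), interval = range(y))$minimum

c(Err = err_D, Err_s = err_sD, Delta = delta, k_best = k_best, mean = mean(y))
```

Here the split removes almost all of the error (from about 20.92 to about 0.67), and the numerical minimizer agrees with `mean(y)` = 6.5 up to the optimizer's tolerance.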
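R packages for tree-based models: rpart and party
The rpart package provides rpart() to grow a tree and prune.rpart() to prune it
rpartXse() (from the DMwR2 package) combines rpart() and prune.rpart() in a single call
Models are specified with a formula that lists the predictors explicitly:
\(Y \sim X_1 + X_2 + X_3 + X_4...\)
or that uses all remaining columns as predictors:
\(Y \sim .\)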
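To make the grow-then-prune idea concrete, here is a minimal sketch that uses only the rpart package. It is not DMwR2's actual implementation of rpartXse(). The settings `cp = 0` and `minsplit = 6`, and the rule "pick the smallest tree whose cross-validated error is within `se` standard errors of the best", are assumptions chosen for illustration.

```r
library(rpart)

set.seed(1024)
data(iris)

# Grow a deliberately large tree; Species ~ . uses every other column as a predictor
full <- rpart(Species ~ ., data = iris, cp = 0, minsplit = 6)

# Cost-complexity table: one row per candidate subtree, smallest tree first
cpt <- full$cptable

# Choose the smallest tree whose cross-validated error (xerror) is within
# `se` standard errors (xstd) of the lowest cross-validated error
se        <- 1
best      <- which.min(cpt[, "xerror"])
threshold <- cpt[best, "xerror"] + se * cpt[best, "xstd"]
chosen    <- min(which(cpt[, "xerror"] <= threshold))

# Prune the large tree back to the chosen complexity
pruned <- prune(full, cp = cpt[chosen, "CP"])

c(nodes_full = nrow(full$frame), nodes_pruned = nrow(pruned$frame))
```

A larger `se` tolerates more cross-validated error in exchange for a smaller tree; `se <- 0` keeps the tree with the lowest cross-validated error.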
Parts of the procedure are randomized (for example, the cross-validation used to choose how much to prune), so different runs can produce slightly different trees.
Hence, always set a seed with set.seed() before fitting a tree
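A quick way to see the effect of the seed is to fit the same tree twice and compare the cross-validation results. The sketch below uses rpart() directly; the settings `cp = 0` and `minsplit = 6` are assumptions for illustration.

```r
library(rpart)
data(iris)

# Same seed before each fit: the cross-validation folds are identical
set.seed(1024)
fit_a <- rpart(Species ~ ., data = iris, cp = 0, minsplit = 6)

set.seed(1024)
fit_b <- rpart(Species ~ ., data = iris, cp = 0, minsplit = 6)

# The cross-validated error columns (xerror, xstd) match exactly
same_cv <- identical(fit_a$cptable, fit_b$cptable)
same_cv
```

Without the second set.seed() call, the xerror and xstd columns may differ between the two fits, which in turn can change the tree selected by pruning.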
The rpart.plot package draws decision trees (DTs) with its prp() function
data(iris)
# Default pruning rule
ct1 <- rpartXse(Species ~ ., iris, model = TRUE)
# Pruning with se = 0
ct2 <- rpartXse(Species ~ ., iris, se = 0, model = TRUE)
se = 0 gives less aggressive pruning
par(mfrow = c(1, 2))
prp(ct1, type = 0, extra = 101)
prp(ct2, type = 0, extra = 101)
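# Use 120 of the 150 rows for training and hold out the other 30 for testing
samp <- sample(1:nrow(iris), 120)
tr_set <- iris[samp, ]
tst_set <- iris[-samp, ]
# Fit on the training set with se = 0.5
model <- rpartXse(Species ~ ., tr_set, se = 0.5)
# Predict class labels for the held-out rows
predicted <- predict(model, tst_set, type = "class")
head(predicted)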
##     12     15     35     37     40     43
## setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
table(tst_set$Species, predicted)
##             predicted
##              setosa versicolor virginica
##   setosa          8          0         0
##   versicolor      0         10         1
##   virginica       0          0        11
errorRate <- sum(predicted != tst_set$Species) / nrow(tst_set)
errorRate
## [1] 0.03333333
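The error rate can also be read directly off the confusion matrix. The sketch below re-enters the counts from the table above by hand (rows are the actual classes, columns the predicted ones); the names `cm`, `accuracy`, `error_rate` and `recall` are assumptions for illustration.

```r
# Confusion matrix copied from the output above: rows = actual, columns = predicted
classes <- c("setosa", "versicolor", "virginica")
cm <- matrix(c(8,  0,  0,
               0, 10,  1,
               0,  0, 11),
             nrow = 3, byrow = TRUE,
             dimnames = list(actual = classes, predicted = classes))

# Correct predictions sit on the diagonal
accuracy   <- sum(diag(cm)) / sum(cm)
error_rate <- 1 - accuracy

# Per-class recall: fraction of each actual class that was predicted correctly
recall <- diag(cm) / rowSums(cm)

error_rate
round(recall, 3)
```

The single off-diagonal count (one versicolor predicted as virginica) out of 30 test examples gives the same 1/30 error rate as the direct computation.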