19/04/2020

Packages used in these slides

library(arules)
library(dplyr)
library(arulesViz)
library(cluster)
library(fpc)

Seed used in these slides

set.seed(1024)

Dependency Modeling

Dependency Modeling

  • Finding relations between groups of variables
  • Market basket analysis
    • Associations between products
    • Which products are bought together?
  • Association rules analysis
    • a.k.a. association rule mining

Dependency Modeling

  • We have a set of transactions \(D\)
  • Each transaction is a set of items, \(i \in I\)
    • Each item is effectively a binary variable indicating its presence in the basket
  • An association rule is an implication \[ X \rightarrow Y \] where
    • \(X,Y\subseteq I\)
    • \(X \neq \emptyset, Y \neq \emptyset\)
    • \(X \cap Y = \emptyset\)
  • \(X\) is known as the antecedent
  • \(Y\) is known as the consequent

Support

  • Support of an itemset \(sup(X)\) is the proportion of transactions that contain \(X\).

    \(sup(X) = P(X)\)

  • Support of an association rule is

    \(sup(X \rightarrow Y) = P(X \cup Y)\)

Confidence

  • Confidence of an association rule is

    \(\displaystyle conf(X \rightarrow Y) = \frac{sup(X \cup Y)} {sup(X)}\)

    \(=P(Y|X)\)
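
Both measures are easy to compute by hand; a minimal sketch on a toy basket (toy data, not from the slides):

trans <- list(c("bread", "butter"),
              c("bread"),
              c("bread", "butter", "milk"),
              c("milk"))
supXY <- mean(sapply(trans, function(t) all(c("butter", "bread") %in% t)))
supX  <- mean(sapply(trans, function(t) "butter" %in% t))
supXY          # sup({butter} -> {bread}) = 2/4 = 0.5
supXY / supX   # conf({butter} -> {bread}) = 0.5/0.5 = 1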

Apriori Algorithm

Approach

  • Find all frequent itemsets above a certain minimal support, minsup

  • Generate candidate association rules from these itemsets

  • Problem:

    • For \(k\) different items, there are \(2^k-1\) distinct non-empty itemsets
    • Brute-force search is therefore infeasible, as the quick check below shows
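
Even a modest number of items makes exhaustive enumeration hopeless; a one-line check assuming just 50 items:

2^50 - 1  # about 1.1e15 candidate itemsets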

Apriori Algorithm

  • Agrawal and Srikant, 1994

  • Basic idea

    • If an itemset is frequent then its subsets are also frequent

    \(sup(X\cup Y) \geq minsup \implies sup(X) \geq minsup \wedge sup(Y) \geq minsup\)

    because \(sup(X) \geq sup(X \cup Y)\)

    • Second step: Find a partition \((s, i-s)\) of a frequent itemset \(i\) such that

    \(conf(s \rightarrow (i-s)) \geq minconf\)

    \(\displaystyle conf(s \rightarrow (i-s)) = \frac{sup(s \cup (i-s))}{sup(s)} \geq minconf\)

    \(\displaystyle \frac{sup(i)}{sup(s)} \geq minconf\)
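
A sketch of this second step with assumed (toy) support values: for a frequent itemset \(i = \{A,B,C\}\) and minconf = 0.4, each partition is tested in turn.

sup <- c(ABC = 0.04, A = 0.50, AB = 0.08)  # assumed supports
sup["ABC"] / sup["AB"]  # conf({A,B} -> {C}) = 0.50 >= 0.4, rule kept
sup["ABC"] / sup["A"]   # conf({A} -> {B,C}) = 0.08 <  0.4, rule discarded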

Apriori Algorithm

  • There are many other association rule mining algorithms and implementations
  • Package arules provides an interface to some of these implementations

arules

Example

  • Before applying the apriori algorithm to a dataset, we first need to transform it into transaction format
    • All numeric variables are discretized into categorical variables
      • Requires domain expertise
    • All categorical variables are converted to binary variables
    • Each binary variable then represents the presence of one item (see the toy sketch below)
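
A minimal sketch of this coercion on toy data (assuming arules is loaded): every factor level becomes one binary item labeled variable=level.

toy <- data.frame(size  = factor(c("small", "large", "small")),
                  color = factor(c("red", "red", "blue")))
inspect(as(toy, "transactions"))  # e.g. transaction 1 is {size=small,color=red}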

Example

data(Boston, package="MASS")
b <- Boston
head(b)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
##   medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7

Example

# caution: in MASS::Boston chas == 1 means the tract bounds the river, and
# factor() assigns labels in level order, so "river" here tags chas == 0
b$chas <- factor(b$chas, labels = c("river", "noriver"))
b$rad <- factor(b$rad)
b$black <- cut(b$black, breaks = 4, 
               labels = c(">31.5%", "18.5-31.5%", "8-18.5%", "<8%"))

discr <- function(x) cut(x, breaks = 4,
                         labels = c("low", "medLow", "medHigh", "high"))
b <- select(b, -one_of(c("chas", "rad", "black"))) %>%
  mutate_all(list(~discr(.))) %>%
  bind_cols(select(b, one_of(c("chas", "rad", "black"))))
b <- as(b, "transactions")
b 
## transactions in sparse format with
##  506 transactions (rows) and
##  59 items (columns)

Example

summary(b)
## transactions as itemMatrix in sparse format with
##  506 rows (elements/itemsets/transactions) and
##  59 columns (items) and a density of 0.2372881 
## 
## most frequent items:
##   crim=low chas=river  black=<8%     zn=low    dis=low    (Other) 
##        491        471        452        429        305       4936 
## 
## element (itemset/transaction) length distribution:
## sizes
##  14 
## 506 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      14      14      14      14      14      14 
## 
## includes extended item information - examples:
##         labels variables  levels
## 1     crim=low      crim     low
## 2  crim=medLow      crim  medLow
## 3 crim=medHigh      crim medHigh
## 
## includes extended transaction information - examples:
##   transactionID
## 1             1
## 2             2
## 3             3

Example

itemFrequencyPlot(b, support = 0.3, cex.names = 0.8)

Example

Now we can run the apriori algorithm:

ars <- apriori(b, parameter = list(
  support = 0.025, confidence = 0.75, maxlen = 5))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.75    0.1    1 none FALSE            TRUE       5   0.025      1
##  maxlen target   ext
##       5  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 12 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[59 item(s), 506 transaction(s)] done [0.00s].
## sorting and recoding items ... [52 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.01s].
## writing ... [76499 rule(s)] done [0.01s].
## creating S4 object  ... done [0.01s].

Example

inspect(head(ars))
##     lhs                rhs          support    confidence lift     count
## [1] {}              => {zn=low}     0.84782609 0.8478261  1.000000 429  
## [2] {}              => {black=<8%}  0.89328063 0.8932806  1.000000 452  
## [3] {}              => {chas=river} 0.93083004 0.9308300  1.000000 471  
## [4] {}              => {crim=low}   0.97035573 0.9703557  1.000000 491  
## [5] {black=8-18.5%} => {age=high}   0.02766798 0.9333333  1.802545  14  
## [6] {black=8-18.5%} => {dis=low}    0.02766798 0.9333333  1.548415  14
inspect(head(subset(ars, rhs %in% "medv=high"), 5, by="confidence"))
##     lhs              rhs            support confidence    lift count
## [1] {rm=high,                                                       
##      ptratio=low} => {medv=high} 0.02964427          1 15.8125    15
## [2] {rm=high,                                                       
##      ptratio=low,                                                   
##      lstat=low}   => {medv=high} 0.02964427          1 15.8125    15
## [3] {rm=high,                                                       
##      ptratio=low,                                                   
##      black=<8%}   => {medv=high} 0.02964427          1 15.8125    15
## [4] {crim=low,                                                      
##      rm=high,                                                       
##      ptratio=low} => {medv=high} 0.02964427          1 15.8125    15
## [5] {rm=high,                                                       
##      ptratio=low,                                                   
##      lstat=low,                                                     
##      black=<8%}   => {medv=high} 0.02964427          1 15.8125    15

Lift

\(\displaystyle conf(X \rightarrow Y) = \frac{sup(X \cup Y)} {sup(X)}\)

\(\displaystyle lift(X\rightarrow Y) = \frac{sup(X \cup Y)}{sup(X)sup(Y)}\)

\(\displaystyle lift(X\rightarrow Y) = \frac{conf(X \rightarrow Y)}{sup(Y)}\)

\(\displaystyle lift(X\rightarrow Y) = \frac{P(Y|X)}{P(Y)}\)

  • If \(Y\) occurs very frequently in the dataset, rules with \(Y\) as consequent will usually have high confidence regardless of \(X\)
  • Lift corrects for this
  • Lift exceeds 1 iff conditioning on \(X\) increases the probability of \(Y\) (see the check below)
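
Since lift = confidence / sup(Y), the consequent's support can be recovered from the rule output shown earlier; e.g. for rule [5] of the first inspect() output:

0.9333333 / 1.802545           # sup(age=high), about 0.518
itemFrequency(b)["age=high"]   # should agree with the ratio above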

Example

inspect(head(subset(ars, 
                    subset = lhs %in% "nox=high" | rhs %in% "nox=high"), 
             5, by="confidence"))
##     lhs           rhs             support    confidence lift     count
## [1] {nox=high} => {indus=medHigh} 0.04743083 1          3.066667 24   
## [2] {nox=high} => {age=high}      0.04743083 1          1.931298 24   
## [3] {nox=high} => {dis=low}       0.04743083 1          1.659016 24   
## [4] {nox=high} => {zn=low}        0.04743083 1          1.179487 24   
## [5] {nox=high} => {crim=low}      0.04743083 1          1.030550 24

Visualization

  • Use the arulesViz package
plot(ars)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

Other functions

  • quality() returns a data frame of quality measures, one row per rule/itemset
  • size() returns the number of items in an itemset (e.g., in lhs or rhs)
goodRules <- ars[quality(ars)$confidence > 0.8 & quality(ars)$lift > 5]
inspect(head(goodRules, 3, by="confidence"))
##     lhs                      rhs           support    confidence lift     
## [1] {nox=high,rad=5}      => {ptratio=low} 0.03162055 1           8.724138
## [2] {nox=high,tax=medLow} => {ptratio=low} 0.03162055 1           8.724138
## [3] {rm=high,ptratio=low} => {medv=high}   0.02964427 1          15.812500
##     count
## [1] 16   
## [2] 16   
## [3] 15
shortRules <- subset(ars, subset = size(lhs) + size(rhs) <= 3)
inspect(head(shortRules, 3, by="confidence"))
##     lhs                rhs          support    confidence lift     count
## [1] {black=8-18.5%} => {zn=low}     0.02964427 1          1.179487 15   
## [2] {black=8-18.5%} => {chas=river} 0.02964427 1          1.074310 15   
## [3] {zn=medHigh}    => {nox=low}    0.03162055 1          2.530000 16

Other functions

  • lhs %in% X: lhs contains an item from X
  • lhs %ain% X: lhs contains all items in X
  • lhs %oin% X: lhs contains only items from X
  • lhs %pin% X: lhs contains items whose labels partially match X

Other functions

somerules <- subset(goodRules, subset = 
                      lhs %pin% "crim" &
                      rhs %oin% c("medv=high", "medv=medHigh"))
inspect(head(somerules, 3, by="confidence"))
##     lhs                                         rhs         support   
## [1] {crim=low,rm=high,ptratio=low}           => {medv=high} 0.02964427
## [2] {crim=low,rm=high,ptratio=low,lstat=low} => {medv=high} 0.02964427
## [3] {crim=low,rm=high,ptratio=low,black=<8%} => {medv=high} 0.02964427
##     confidence lift    count
## [1] 1          15.8125 15   
## [2] 1          15.8125 15   
## [3] 1          15.8125 15

Matrix Visualization

somerules <- subset(ars, subset = (lhs %in% "nox=high" & 
                      size(lhs) < 3 & lift >= 1))
plot(somerules, method="matrix", measure="lift")

Graph Visualization

somerules <- subset(ars, subset = rhs %in% "medv=high" & lift > 15)
plot(somerules, method="graph")

Clustering

Clustering

  • Partitioning of a dataset into groups
    • Union of groups is the whole dataset
    • Each observation belongs to exactly one group
    • Observations within a group are similar
    • Observations from different groups are different
  • How to measure similarity and difference?
    • similar = close
    • different = distant

Distance measures

  • Euclidean distance

\(\displaystyle d(x,y) = \sqrt{ \sum^d_{i=1}{(x_i-y_i)^2} }\)

where \(x_i\) is the value of variable \(i\) on observation \(x\).

  • Euclidean distance is a special form of Minkowski distance

\(\displaystyle d(x,y) = \left (\sum^d_{i=1}{|x_i - y_i|^p}\right )^{1/p}\)

where \(x_i\) is the value of variable \(i\) on observation \(x\).

  • p = 1 -> Manhattan distance
  • p = 2 -> Euclidean distance

Distance measures

randDat <- matrix(rnorm(50), nrow = 5)
dist(randDat)     # Euclidean distance by default
##          1        2        3        4
## 2 5.751444                           
## 3 4.167964 4.087795                  
## 4 6.320828 4.483649 5.698951         
## 5 5.842614 4.734937 4.558000 5.846534
dist(randDat, method="manhattan")
##           1         2         3         4
## 2 14.829505                              
## 3  9.538933 10.806397                    
## 4 15.929926 10.904098 16.781092          
## 5 14.420665 10.754002 11.112255 16.059797
dist(randDat, method="minkowski", p = 4)
##          1        2        3        4
## 2 4.296051                           
## 3 3.499051 2.868943                  
## 4 4.331027 3.532843 3.475607         
## 5 4.028562 3.911365 3.583005 3.827518

Distance measures

  • What to do with categorical variables?

    • For ordinals we may still compute Minkowski distances
    • For nominals
      • 1 if values are different
      • 0 if values are the same
    • Hybrid data

    \(\delta_i(v_1, v_2)=\begin{cases} 1 & i \text{ is nominal and }v_1 \neq v_2 \\ 0 & i \text{ is nominal and }v_1 = v_2 \\ \frac{|v_1-v_2|}{range(i)} & i \text{ is numeric} \end{cases}\)

    • This makes sure all distances lie in \([0,1]\)
      • Implemented in the daisy() function of the cluster package; a sketch follows.
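
A minimal sketch of daisy() on toy hybrid data (one numeric and one nominal column):

hyb <- data.frame(height = c(170, 180, 165),
                  eyes   = factor(c("blue", "brown", "blue")))
daisy(hyb, metric = "gower")  # all dissimilarities lie in [0, 1]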

Methods

  • Partitioning methods
    • Partition into k clusters
  • Hierarchical methods
    • Provide alternative clusterings
    • Divisive vs. agglomerative
  • Density-based methods
    • Find high density regions of the feature space
  • Grid-based methods
    • Compute clusters within grid cells
    • Highly efficient
    • Combined with other methods

Measures

  • How to measure the quality of a clustering?

    • Assume \(h(c)\) is the score of cluster \(c\); possible aggregations include

    \(H(C,k) = \sum_{c\in C}{\frac{h(c)}{|C|}}\)

    \(H(C,k) = \min_{c\in C}{h(c)}\)

    \(H(C,k) = \max_{c\in C}{h(c)}\)

Measures

  • How to measure \(h(c)\) ?

    • The sum of squares to the center of each cluster

    \(\displaystyle h(c) = \sum_{x\in c}{\sum_{i=1}^d {(v_{x,i}-\hat{v}_i^c)^2}}\)

    • The \(L_1\) measure

    \(\displaystyle h(c) = \sum_{x\in c}{\sum_{i=1}^d {|v_{x,i}-\tilde{v}_i^c|}}\)
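
Both scores are straightforward to express directly; a sketch, assuming xc is a numeric matrix holding the observations of a single cluster:

h_sq <- function(xc)  # squared distances to the cluster centroid (means)
  sum(scale(xc, center = colMeans(xc), scale = FALSE)^2)
h_L1 <- function(xc)  # absolute distances to the column-wise medians
  sum(abs(sweep(xc, 2, apply(xc, 2, median))))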

k-means

  • Idea
    • Randomly assign observations to k clusters
    • Repeat
      • Compute cluster centers
      • Assign each observation to the closest center
    • Until clusters are stable
  • Tries to minimize the within-cluster sum of squares (equivalently, to maximize between-cluster separation)
    • Converges only to a local optimum; a remedy is sketched below
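
Because the result depends on the random initial assignment, a common remedy is to run several random starts and keep the solution with the lowest total within-cluster sum of squares; kmeans() supports this through its nstart argument (a sketch):

km <- kmeans(iris[, -5], centers = 3, nstart = 25)  # best of 25 random starts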

Example

data(iris)
ir3 <- kmeans(iris[, -5], centers = 3, iter.max = 200)
ir3
## K-means clustering with 3 clusters of sizes 38, 50, 62
## 
## Cluster means:
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     6.850000    3.073684     5.742105    2.071053
## 2     5.006000    3.428000     1.462000    0.246000
## 3     5.901613    2.748387     4.393548    1.433871
## 
## Clustering vector:
##   [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [75] 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 1 1 1 1 3 1 1 1 1
## [112] 1 1 3 3 1 1 1 1 3 1 3 1 3 1 1 3 3 1 1 1 1 1 3 1 1 1 1 3 1 1 1 3 1 1 1 3 1
## [149] 1 3
## 
## Within cluster sum of squares by cluster:
## [1] 23.87947 15.15100 39.82097
##  (between_SS / total_SS =  88.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

Visualization

plot(iris$Sepal.Length, iris$Sepal.Width, col = ir3$cluster)

Validation

  • External information
table(ir3$cluster, iris$Species)
##    
##     setosa versicolor virginica
##   1      0          2        36
##   2     50          0         0
##   3      0         48        14
  • Internal information - the silhouette method

\(s_i=\frac{b_i-a_i}{\max(a_i, b_i)}\)

where

  • \(a_i\) is the average distance of observation i to the other observations in its cluster
  • \(b_i\) is the average distance of observation i to the observations in the nearest other cluster

Validation

  • silhouette() is implemented in the cluster package
s <- silhouette(ir3$cluster, dist(iris[,-5]))
s[1:20, 3]
##  [1] 0.8529551 0.8154948 0.8293151 0.8050139 0.8493016 0.7482804 0.8216509
##  [8] 0.8539051 0.7521501 0.8252940 0.8031030 0.8359126 0.8105639 0.7461505
## [15] 0.7025937 0.6437716 0.7756839 0.8510183 0.7068578 0.8203012
plot(s)