Sefik Ilkin Serengil

A Step by Step CHAID Decision Tree Example

March 18, 2020 / Machine Learning

CHAID is the oldest decision tree algorithm in history. It was introduced in 1980 by Gordon V. Kass. CART followed in 1984, ID3 was proposed in 1986 and C4.5 was announced in 1993. CHAID is an acronym for chi-square automatic interaction detection. Here, chi-square is a metric to find the significance of a feature: the higher the value, the higher the statistical significance. Similar to the others, CHAID builds decision trees for classification problems. This means that it expects data sets having a categorical target variable.

[Figure: Living trees in the Lord of the Rings (2001)]

This blog post covers a detailed explanation of the CHAID algorithm, and we will solve a problem step by step. On the other hand, you might just want to run the CHAID algorithm, and its mathematical background might not attract your attention. Herein, you can find the Python implementation of the CHAID algorithm here. This package supports the most common decision tree algorithms such as ID3, C4.5, CART and regression trees, as well as some bagging methods such as random forest and some boosting methods such as gradient boosting and adaboost. Here, you can find a hands-on video as well.

[Video: CHAID in chefboost for Python]

Objective

Decision rules will be found based on chi-square values of features.

[Video: CHAID Decision Tree Algorithm in Python]

Formula

CHAID uses chi-square tests to find the most dominant feature, whereas ID3 uses information gain, C4.5 uses gain ratio and CART uses the GINI index. Chi-square testing was introduced by Karl Pearson. He is also the founder of correlation.
Today, most programming libraries (e.g. Pandas for Python) use the Pearson metric for correlation by default.

The formula of chi-square testing is easy:

√((y – y')² / y')

where y is the actual value and y' is the expected value.

[Video: The Math Behind CHAID Decision Tree Algorithm]

Data set

We are going to build decision rules for the following data set. The Decision column is the target we would like to find based on some features. BTW, we will ignore the Day column because it just states the row number.

| Day | Outlook | Temp. | Humidity | Wind | Decision |
|-----|----------|-------|----------|--------|----------|
| 1 | Sunny | Hot | High | Weak | No |
| 2 | Sunny | Hot | High | Strong | No |
| 3 | Overcast | Hot | High | Weak | Yes |
| 4 | Rain | Mild | High | Weak | Yes |
| 5 | Rain | Cool | Normal | Weak | Yes |
| 6 | Rain | Cool | Normal | Strong | No |
| 7 | Overcast | Cool | Normal | Strong | Yes |
| 8 | Sunny | Mild | High | Weak | No |
| 9 | Sunny | Cool | Normal | Weak | Yes |
| 10 | Rain | Mild | Normal | Weak | Yes |
| 11 | Sunny | Mild | Normal | Strong | Yes |
| 12 | Overcast | Mild | High | Strong | Yes |
| 13 | Overcast | Hot | Normal | Weak | Yes |
| 14 | Rain | Mild | High | Strong | No |

We need to find the most dominant feature in this data set.

Outlook feature

The outlook feature has 3 classes: sunny, rain and overcast. There are 2 decisions: yes and no. We firstly find the number of yes and no decisions for each class.

| | Yes | No | Total | Expected | Chi-square Yes | Chi-square No |
|----------|-----|----|-------|----------|----------------|---------------|
| Sunny | 2 | 3 | 5 | 2.5 | 0.316 | 0.316 |
| Overcast | 4 | 0 | 4 | 2 | 1.414 | 1.414 |
| Rain | 3 | 2 | 5 | 2.5 | 0.316 | 0.316 |

The Total column is the sum of yes and no decisions in each row. Expected values are half of the Total column because there are 2 classes in the decision. It is easy to calculate the chi-square values based on this table. For example, chi-square yes for the sunny outlook is √((2 – 2.5)² / 2.5) = 0.316, where the actual value is 2 and the expected value is 2.5.

The chi-square value of outlook is the sum of the chi-square yes and no columns:

0.316 + 0.316 + 1.414 + 1.414 + 0.316 + 0.316 = 4.092

Now, we will find the chi-square values of the other features. The feature having the maximum chi-square value will be the decision point.
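As a quick sketch of the formula (not part of the original post), the per-class computation can be written in plain Python. The counts below are copied from the outlook table above; the helper name `chi_square_value` is my own:

```python
import math

# Yes/No counts per outlook class, taken from the table above.
outlook_counts = {
    "Sunny": {"Yes": 2, "No": 3},
    "Overcast": {"Yes": 4, "No": 0},
    "Rain": {"Yes": 3, "No": 2},
}

def chi_square_value(counts):
    """Sum of sqrt((actual - expected)^2 / expected) over every cell."""
    total = 0.0
    for class_counts in counts.values():
        # Expected is half of the row total, since there are 2 decision classes.
        expected = sum(class_counts.values()) / len(class_counts)
        for actual in class_counts.values():
            total += math.sqrt((actual - expected) ** 2 / expected)
    return total

print(round(chi_square_value(outlook_counts), 3))  # 4.093
```

The unrounded sum is 4.093; the post's 4.092 comes from adding the already-rounded per-cell values.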
Temperature feature

This feature has 3 classes: hot, mild and cool. The following table summarizes the chi-square values for these classes.

| | Yes | No | Total | Expected | Chi-square Yes | Chi-square No |
|------|-----|----|-------|----------|----------------|---------------|
| Hot | 2 | 2 | 4 | 2 | 0 | 0 |
| Mild | 4 | 2 | 6 | 3 | 0.577 | 0.577 |
| Cool | 3 | 1 | 4 | 2 | 0.707 | 0.707 |

The chi-square value of the temperature feature will be

0 + 0 + 0.577 + 0.577 + 0.707 + 0.707 = 2.569

This is less than the chi-square value of outlook. This means that the outlook feature is more important than the temperature feature based on chi-square testing.

Humidity feature

Humidity has 2 classes: high and normal. Let's summarize the chi-square values.

| | Yes | No | Total | Expected | Chi-square Yes | Chi-square No |
|--------|-----|----|-------|----------|----------------|---------------|
| High | 3 | 4 | 7 | 3.5 | 0.267 | 0.267 |
| Normal | 6 | 1 | 7 | 3.5 | 1.336 | 1.336 |

So, the chi-square value of the humidity feature is

0.267 + 0.267 + 1.336 + 1.336 = 3.207

This is less than the chi-square value of outlook as well. What about the wind feature?

Wind feature

The wind feature has 2 classes: weak and strong. The following table is the pivot table.

| | Yes | No | Total | Expected | Chi-square Yes | Chi-square No |
|--------|-----|----|-------|----------|----------------|---------------|
| Weak | 5 | 2 | 7 | 3.5 | 0.802 | 0.802 |
| Strong | 3 | 3 | 6 | 3 | 0.000 | 0.000 |

Herein, the chi-square value of the wind feature is

0.802 + 0.802 + 0 + 0 = 1.604

We've found the chi-square values of all features. Let's see them all in a table.

| Feature | Chi-square value |
|-------------|------------------|
| Outlook | 4.092 |
| Temperature | 2.569 |
| Humidity | 3.207 |
| Wind | 1.604 |

As seen, the outlook feature has the highest chi-square value. This means that it is the most significant feature, so we will put it at the root node.

We've filtered the raw data set based on the outlook classes in the illustration above. For example, the overcast branch has only yes decisions in its sub data set. This means that the CHAID tree returns YES if outlook is overcast.

[Figure: Initial form of the CHAID tree]

Both the sunny and rain branches have yes and no decisions. We will apply chi-square tests to these sub data sets.
Outlook = Sunny branch

This branch has 5 instances. Now, we look for the most dominant feature in it. BTW, we will ignore the outlook column now because its values are all the same. In other words, we will find the most dominant feature among temperature, humidity and wind.

| Day | Outlook | Temp. | Humidity | Wind | Decision |
|-----|---------|-------|----------|--------|----------|
| 1 | Sunny | Hot | High | Weak | No |
| 2 | Sunny | Hot | High | Strong | No |
| 8 | Sunny | Mild | High | Weak | No |
| 9 | Sunny | Cool | Normal | Weak | Yes |
| 11 | Sunny | Mild | Normal | Strong | Yes |

Temperature feature for sunny outlook

| | Yes | No | Total | Expected | Chi-square Yes | Chi-square No |
|------|-----|----|-------|----------|----------------|---------------|
| Hot | 0 | 2 | 2 | 1 | 1 | 1 |
| Mild | 1 | 1 | 2 | 1 | 0 | 0 |
| Cool | 1 | 0 | 1 | 0.5 | 0.707 | 0.707 |

So, the chi-square value of the temperature feature for the sunny outlook is

1 + 1 + 0 + 0 + 0.707 + 0.707 = 3.414

Humidity feature for sunny outlook

| | Yes | No | Total | Expected | Chi-square Yes | Chi-square No |
|--------|-----|----|-------|----------|----------------|---------------|
| High | 0 | 3 | 3 | 1.5 | 1.225 | 1.225 |
| Normal | 2 | 0 | 2 | 1 | 1 | 1 |

The chi-square value of the humidity feature for the sunny outlook is

1.225 + 1.225 + 1 + 1 = 4.449

Wind feature for sunny outlook

| | Yes | No | Total | Expected | Chi-square Yes | Chi-square No |
|--------|-----|----|-------|----------|----------------|---------------|
| Weak | 1 | 2 | 3 | 1.5 | 0.408 | 0.408 |
| Strong | 1 | 1 | 2 | 1 | 0 | 0 |

The chi-square value of the wind feature for the sunny outlook is

0.408 + 0.408 + 0 + 0 = 0.816

We've found the chi-square values for the sunny outlook. Let's see them all in a table.

| Feature | Chi-square |
|-------------|------------|
| Temperature | 3.414 |
| Humidity | 4.449 |
| Wind | 0.816 |

Now, humidity is the most dominant feature for the sunny outlook branch. We will put this feature as a decision rule. Both humidity branches for the sunny outlook now lead to a single decision, as illustrated above. The CHAID tree will return NO for sunny outlook and high humidity, and it will return YES for sunny outlook and normal humidity.

[Figure: The second phase of the CHAID tree]

Rain outlook branch

This branch still has both yes and no decisions. We need to apply chi-square tests to this branch to find exact decisions. It has 5 instances, as shown in the following sub data set. Let's find the most dominant feature among temperature, humidity and wind.
| Day | Outlook | Temp. | Humidity | Wind | Decision |
|-----|---------|-------|----------|--------|----------|
| 4 | Rain | Mild | High | Weak | Yes |
| 5 | Rain | Cool | Normal | Weak | Yes |
| 6 | Rain | Cool | Normal | Strong | No |
| 10 | Rain | Mild | Normal | Weak | Yes |
| 14 | Rain | Mild | High | Strong | No |

Temperature feature for rain outlook

This feature has 2 classes here: mild and cool. Notice that even though hot temperature appears in the raw data set, this branch has no hot instance.

| | Yes | No | Total | Expected | Chi-square Yes | Chi-square No |
|------|-----|----|-------|----------|----------------|---------------|
| Mild | 2 | 1 | 3 | 1.5 | 0.408 | 0.408 |
| Cool | 1 | 1 | 2 | 1 | 0 | 0 |

The chi-square value of the temperature feature for the rain outlook is

0.408 + 0.408 + 0 + 0 = 0.816

Humidity feature for rain outlook

This feature in this branch has 2 classes: high and normal.

| | Yes | No | Total | Expected | Chi-square Yes | Chi-square No |
|--------|-----|----|-------|----------|----------------|---------------|
| High | 1 | 1 | 2 | 1 | 0 | 0 |
| Normal | 2 | 1 | 3 | 1.5 | 0.408 | 0.408 |

The chi-square value of the humidity feature for the rain outlook is

0 + 0 + 0.408 + 0.408 = 0.816

Wind feature for rain outlook

This feature in this branch has 2 classes: weak and strong.

| | Yes | No | Total | Expected | Chi-square Yes | Chi-square No |
|--------|-----|----|-------|----------|----------------|---------------|
| Weak | 3 | 0 | 3 | 1.5 | 1.225 | 1.225 |
| Strong | 0 | 2 | 2 | 1 | 1 | 1 |

So, the chi-square value of the wind feature for the rain outlook is

1.225 + 1.225 + 1 + 1 = 4.449

We've found all the chi-square values for the rain outlook branch. Let's see them all in a single table.

| Feature | Chi-square |
|-------------|------------|
| Temperature | 0.816 |
| Humidity | 0.816 |
| Wind | 4.449 |

So, the wind feature is the winner for the rain outlook branch. Put this feature in the related branch and see the corresponding sub data sets. As seen, all branches now have sub data sets with a single decision. So, we can build the CHAID tree as illustrated below.

[Figure: The third phase of the CHAID tree]
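The finished tree can be read off as plain decision rules. The sketch below (my own function name, not from the original post) encodes those rules and checks them against all 14 training rows:

```python
def chaid_predict(outlook, humidity, wind):
    """Decision rules read off the final CHAID tree built above."""
    if outlook == "Overcast":
        return "Yes"
    if outlook == "Sunny":
        # Sunny branch splits on humidity.
        return "Yes" if humidity == "Normal" else "No"
    # Rain branch splits on wind.
    return "Yes" if wind == "Weak" else "No"

# The raw data set: (Outlook, Temp., Humidity, Wind, Decision)
DATA = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]

correct = sum(chaid_predict(o, h, w) == d for o, _t, h, w, d in DATA)
print(f"{correct}/14")  # 14/14
```

The tree reproduces every decision in the training set, which is what we expect here since each leaf's sub data set contains a single decision.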