CI - Problem 3: Heart Disease Classification

 

Problem 3: Heart Disease Classification

This is the accompanying document to the heart disease classification report for the computational intelligence coursework by student 10011412. It details the process used to model and gain insights into the heart disease data.

Formatting Data

First, we import the data into R and label the columns.

Training Data

trainD <- read.table("C:/Users/Manny/Documents/Manny/Studies/Uni 4th yr - 2nd Tri/CI/CourseWork/NN/cwTRAIN.txt", 
    header = FALSE)
colnames(trainD) <- c("Age", "Sex", "CPType", "RestingBP", "SerumChol", "FBS", 
    "RestingECG", "MaxHR", "EIA", "STDep", "PES", "BloodVesNo", "thal", "HD")
summary(trainD)
##       Age            Sex            CPType       RestingBP     SerumChol  
##  Min.   :29.0   Min.   :0.000   Min.   :1.00   Min.   : 94   Min.   :126  
##  1st Qu.:47.8   1st Qu.:0.000   1st Qu.:3.00   1st Qu.:120   1st Qu.:213  
##  Median :55.0   Median :1.000   Median :3.00   Median :130   Median :243  
##  Mean   :54.4   Mean   :0.668   Mean   :3.16   Mean   :131   Mean   :249  
##  3rd Qu.:61.0   3rd Qu.:1.000   3rd Qu.:4.00   3rd Qu.:140   3rd Qu.:275  
##  Max.   :77.0   Max.   :1.000   Max.   :4.00   Max.   :200   Max.   :564  
##       FBS         RestingECG       MaxHR          EIA       
##  Min.   :0.00   Min.   :0.00   Min.   : 71   Min.   :0.000  
##  1st Qu.:0.00   1st Qu.:0.00   1st Qu.:132   1st Qu.:0.000  
##  Median :0.00   Median :2.00   Median :154   Median :0.000  
##  Mean   :0.15   Mean   :1.03   Mean   :149   Mean   :0.345  
##  3rd Qu.:0.00   3rd Qu.:2.00   3rd Qu.:166   3rd Qu.:1.000  
##  Max.   :1.00   Max.   :2.00   Max.   :202   Max.   :1.000  
##      STDep           PES         BloodVesNo         thal     
##  Min.   :0.00   Min.   :1.00   Min.   :0.000   Min.   :3.00  
##  1st Qu.:0.00   1st Qu.:1.00   1st Qu.:0.000   1st Qu.:3.00  
##  Median :0.65   Median :2.00   Median :0.000   Median :3.00  
##  Mean   :1.01   Mean   :1.59   Mean   :0.682   Mean   :4.66  
##  3rd Qu.:1.65   3rd Qu.:2.00   3rd Qu.:1.000   3rd Qu.:7.00  
##  Max.   :5.60   Max.   :3.00   Max.   :3.000   Max.   :7.00  
##        HD       
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :1.000  
##  Mean   :0.564  
##  3rd Qu.:1.000  
##  Max.   :1.000

Testing Data


testD <- read.table("C:/Users/Manny/Documents/Manny/Studies/Uni 4th yr - 2nd Tri/CI/CourseWork/NN/cwTEST.txt", 
    header = FALSE)
colnames(testD) <- c("Age", "Sex", "CPType", "RestingBP", "SerumChol", "FBS", 
    "RestingECG", "MaxHR", "EIA", "STDep", "PES", "BloodVesNo", "thal", "HD")
summary(testD)
##       Age            Sex           CPType       RestingBP     SerumChol  
##  Min.   :35.0   Min.   :0.00   Min.   :1.00   Min.   :110   Min.   :164  
##  1st Qu.:48.2   1st Qu.:0.00   1st Qu.:3.00   1st Qu.:120   1st Qu.:214  
##  Median :55.5   Median :1.00   Median :3.00   Median :134   Median :256  
##  Mean   :54.5   Mean   :0.72   Mean   :3.22   Mean   :135   Mean   :254  
##  3rd Qu.:61.0   3rd Qu.:1.00   3rd Qu.:4.00   3rd Qu.:144   3rd Qu.:286  
##  Max.   :71.0   Max.   :1.00   Max.   :4.00   Max.   :180   Max.   :353  
##       FBS         RestingECG     MaxHR          EIA           STDep     
##  Min.   :0.00   Min.   :0    Min.   : 97   Min.   :0.00   Min.   :0.00  
##  1st Qu.:0.00   1st Qu.:0    1st Qu.:138   1st Qu.:0.00   1st Qu.:0.25  
##  Median :0.00   Median :1    Median :153   Median :0.00   Median :1.10  
##  Mean   :0.14   Mean   :1    Mean   :151   Mean   :0.26   Mean   :1.23  
##  3rd Qu.:0.00   3rd Qu.:2    3rd Qu.:162   3rd Qu.:0.75   3rd Qu.:1.60  
##  Max.   :1.00   Max.   :2    Max.   :190   Max.   :1.00   Max.   :6.20  
##       PES         BloodVesNo        thal            HD      
##  Min.   :1.00   Min.   :0.00   Min.   :3.00   Min.   :0.00  
##  1st Qu.:1.00   1st Qu.:0.00   1st Qu.:3.00   1st Qu.:0.00  
##  Median :1.50   Median :0.00   Median :3.00   Median :1.00  
##  Mean   :1.56   Mean   :0.62   Mean   :4.84   Mean   :0.52  
##  3rd Qu.:2.00   3rd Qu.:1.00   3rd Qu.:7.00   3rd Qu.:1.00  
##  Max.   :3.00   Max.   :3.00   Max.   :7.00   Max.   :1.00

Converting to categorical variables

Some of the variables in the data are categorical, so we label them as factors and leave the nominal ones unordered. This prevents the model from treating arbitrary category codes as if they were ordered numbers, so the tree can fit the data more faithfully.

We then print out the structure of our data frames (objects).

Note that PES is stored as an ordered factor since it is an ordered categorical variable.


trainD$Sex <- as.factor(trainD$Sex)
trainD$CPType <- as.factor(trainD$CPType)
trainD$FBS <- as.factor(trainD$FBS)
trainD$RestingECG <- as.factor(trainD$RestingECG)
trainD$EIA <- as.factor(trainD$EIA)
trainD$PES <- as.ordered(trainD$PES)
trainD$thal <- as.factor(trainD$thal)
trainD$HD <- as.factor(trainD$HD)

str(trainD)
## 'data.frame':    220 obs. of  14 variables:
##  $ Age       : num  70 67 57 64 74 65 56 59 60 63 ...
##  $ Sex       : Factor w/ 2 levels "0","1": 2 1 2 2 1 2 2 2 2 1 ...
##  $ CPType    : Factor w/ 4 levels "1","2","3","4": 4 3 2 4 2 4 3 4 4 4 ...
##  $ RestingBP : num  130 115 124 128 120 120 130 110 140 150 ...
##  $ SerumChol : num  322 564 261 263 269 177 256 239 293 407 ...
##  $ FBS       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...
##  $ RestingECG: Factor w/ 3 levels "0","1","2": 3 3 1 1 3 1 3 3 3 3 ...
##  $ MaxHR     : num  109 160 141 105 121 140 142 142 170 154 ...
##  $ EIA       : Factor w/ 2 levels "0","1": 1 1 1 2 2 1 2 2 1 1 ...
##  $ STDep     : num  2.4 1.6 0.3 0.2 0.2 0.4 0.6 1.2 1.2 4 ...
##  $ PES       : Ord.factor w/ 3 levels "1"<"2"<"3": 2 2 1 2 1 1 2 2 2 2 ...
##  $ BloodVesNo: num  3 0 0 1 1 0 1 1 2 3 ...
##  $ thal      : Factor w/ 3 levels "3","6","7": 1 3 3 3 1 3 2 3 3 3 ...
##  $ HD        : Factor w/ 2 levels "0","1": 1 2 1 2 2 2 1 1 1 1 ...

Let's do the same for the test data.

testD$Sex <- as.factor(testD$Sex)
testD$CPType <- as.factor(testD$CPType)
testD$FBS <- as.factor(testD$FBS)
testD$RestingECG <- as.factor(testD$RestingECG)
testD$EIA <- as.factor(testD$EIA)
testD$PES <- as.ordered(testD$PES)
testD$thal <- as.factor(testD$thal)
testD$HD <- as.factor(testD$HD)

str(testD)
## 'data.frame':    50 obs. of  14 variables:
##  $ Age       : num  54 65 57 63 35 41 62 43 58 52 ...
##  $ Sex       : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 1 1 1 2 ...
##  $ CPType    : Factor w/ 4 levels "1","2","3","4": 4 4 3 4 4 2 3 4 1 1 ...
##  $ RestingBP : num  110 135 150 130 138 135 130 132 150 118 ...
##  $ SerumChol : num  239 254 168 330 183 203 263 341 283 186 ...
##  $ FBS       : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 1 2 2 1 ...
##  $ RestingECG: Factor w/ 2 levels "0","2": 1 2 1 2 1 1 1 2 2 2 ...
##  $ MaxHR     : num  126 127 174 132 182 132 97 136 162 190 ...
##  $ EIA       : Factor w/ 2 levels "0","1": 2 1 1 2 1 1 1 2 1 1 ...
##  $ STDep     : num  2.8 2.8 1.6 1.8 1.4 0 1.2 3 1 0 ...
##  $ PES       : Ord.factor w/ 3 levels "1"<"2"<"3": 2 2 1 1 1 2 2 2 1 2 ...
##  $ BloodVesNo: num  1 1 0 3 0 0 1 0 0 0 ...
##  $ thal      : Factor w/ 3 levels "3","6","7": 3 3 1 3 1 2 3 3 1 2 ...
##  $ HD        : Factor w/ 2 levels "0","1": 1 1 2 1 2 2 1 1 2 2 ...

So it looks like our data is finally prepared. Let's print out a summary and view it.

Data Distribution


summary(trainD)
##       Age       Sex     CPType    RestingBP     SerumChol   FBS    
##  Min.   :29.0   0: 73   1: 18   Min.   : 94   Min.   :126   0:187  
##  1st Qu.:47.8   1:147   2: 33   1st Qu.:120   1st Qu.:213   1: 33  
##  Median :55.0           3: 64   Median :130   Median :243          
##  Mean   :54.4           4:105   Mean   :131   Mean   :249          
##  3rd Qu.:61.0                   3rd Qu.:140   3rd Qu.:275          
##  Max.   :77.0                   Max.   :200   Max.   :564          
##  RestingECG     MaxHR     EIA         STDep      PES       BloodVesNo   
##  0:106      Min.   : 71   0:144   Min.   :0.00   1:105   Min.   :0.000  
##  1:  2      1st Qu.:132   1: 76   1st Qu.:0.00   2:100   1st Qu.:0.000  
##  2:112      Median :154           Median :0.65   3: 15   Median :0.000  
##             Mean   :149           Mean   :1.01           Mean   :0.682  
##             3rd Qu.:166           3rd Qu.:1.65           3rd Qu.:1.000  
##             Max.   :202           Max.   :5.60           Max.   :3.000  
##  thal    HD     
##  3:126   0: 96  
##  6: 10   1:124  
##  7: 84          
##                 
##                 
## 
View(trainD)

summary(testD)
##       Age       Sex    CPType   RestingBP     SerumChol   FBS   
##  Min.   :35.0   0:14   1: 2   Min.   :110   Min.   :164   0:43  
##  1st Qu.:48.2   1:36   2: 9   1st Qu.:120   1st Qu.:214   1: 7  
##  Median :55.5          3:15   Median :134   Median :256         
##  Mean   :54.5          4:24   Mean   :135   Mean   :254         
##  3rd Qu.:61.0                 3rd Qu.:144   3rd Qu.:286         
##  Max.   :71.0                 Max.   :180   Max.   :353         
##  RestingECG     MaxHR     EIA        STDep      PES      BloodVesNo  
##  0:25       Min.   : 97   0:37   Min.   :0.00   1:25   Min.   :0.00  
##  2:25       1st Qu.:138   1:13   1st Qu.:0.25   2:22   1st Qu.:0.00  
##             Median :153          Median :1.10   3: 3   Median :0.00  
##             Mean   :151          Mean   :1.23          Mean   :0.62  
##             3rd Qu.:162          3rd Qu.:1.60          3rd Qu.:1.00  
##             Max.   :190          Max.   :6.20          Max.   :3.00  
##  thal   HD    
##  3:26   0:24  
##  6: 4   1:26  
##  7:20         
##               
##               
## 
View(testD)

It looks like the 'Sex' and 'FBS' variables in the training data are heavily imbalanced. Additionally, the RestingECG '1' category is not represented in the test set.
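
A quick way to quantify this imbalance is to tabulate the class proportions (a short sketch):

round(prop.table(table(trainD$Sex)), 2)   # roughly 1/3 female (0), 2/3 male (1)
round(prop.table(table(trainD$FBS)), 2)   # FBS = 1 is rare (~15%)
table(trainD$RestingECG)                  # level '1' has only 2 training observations
table(testD$RestingECG)                   # level '1' is absent from the test set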

Let's get started by looking at the distribution of some of the key variables in the training set:

hist(trainD$Age)

[Figure: histogram of Age in the training set]

It looks like the majority of the patients in the training set are between 40 and 70 years old.
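
Similar histograms can be drawn for the other continuous variables if desired, for example (plots not shown):

hist(trainD$MaxHR)
hist(trainD$SerumChol)
hist(trainD$STDep)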

So now we have a feel for how our data behaves.

Keeping this in mind, let's attempt to classify heart disease using some tree-based methods.

Decision Trees

We initially attempt to classify using the simplest tree-based model, the decision tree. Let's attach our dataset to the workspace so that we can refer to its variables directly.

You may need to install the tree package: #install.packages("tree").

We then fit a tree to this data with the response variable set to the “Heart Disease” category. Finally, we plot the decision tree and view its detailed summary.

set.seed(2)
require(tree)
## Loading required package: tree
attach(trainD)

tree.trainD = tree(HD ~ ., data = trainD)
plot(tree.trainD)
text(tree.trainD, pretty = 0)

[Figure: the fitted decision tree]

tree.trainD
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 220 300 1 ( 0.44 0.56 )  
##     2) thal: 6,7 94 100 0 ( 0.73 0.27 )  
##       4) BloodVesNo < 0.5 42  60 0 ( 0.52 0.48 )  
##         8) CPType: 4 25  30 0 ( 0.72 0.28 )  
##          16) SerumChol < 238.5 15  20 0 ( 0.53 0.47 )  
##            32) Age < 56 9  10 0 ( 0.78 0.22 ) *
##            33) Age > 56 6   5 1 ( 0.17 0.83 ) *
##          17) SerumChol > 238.5 10   0 0 ( 1.00 0.00 ) *
##         9) CPType: 1,2,3 17  20 1 ( 0.24 0.76 ) *
##       5) BloodVesNo > 0.5 52  30 0 ( 0.90 0.10 )  
##        10) STDep < 1.1 22  20 0 ( 0.77 0.23 )  
##          20) RestingECG: 0 12  20 0 ( 0.58 0.42 )  
##            40) MaxHR < 155 5   5 1 ( 0.20 0.80 ) *
##            41) MaxHR > 155 7   6 0 ( 0.86 0.14 ) *
##          21) RestingECG: 2 10   0 0 ( 1.00 0.00 ) *
##        11) STDep > 1.1 30   0 0 ( 1.00 0.00 ) *
##     3) thal: 3 126 100 1 ( 0.21 0.79 )  
##       6) CPType: 1,4 54  70 1 ( 0.43 0.57 )  
##        12) BloodVesNo < 0.5 31  30 1 ( 0.23 0.77 )  
##          24) RestingBP < 154 26  20 1 ( 0.12 0.88 )  
##            48) Age < 60.5 17   0 1 ( 0.00 1.00 ) *
##            49) Age > 60.5 9  10 1 ( 0.33 0.67 ) *
##          25) RestingBP > 154 5   5 0 ( 0.80 0.20 ) *
##        13) BloodVesNo > 0.5 23  30 0 ( 0.70 0.30 )  
##          26) MaxHR < 120 6   0 0 ( 1.00 0.00 ) *
##          27) MaxHR > 120 17  20 0 ( 0.59 0.41 )  
##            54) Sex: 0 5   5 1 ( 0.20 0.80 ) *
##            55) Sex: 1 12  10 0 ( 0.75 0.25 )  
##             110) CPType: 1 5   7 1 ( 0.40 0.60 ) *
##             111) CPType: 4 7   0 0 ( 1.00 0.00 ) *
##       7) CPType: 2,3 72  30 1 ( 0.06 0.94 )  
##        14) MaxHR < 152.5 20  20 1 ( 0.20 0.80 )  
##          28) Age < 59 12  20 1 ( 0.33 0.67 ) *
##          29) Age > 59 8   0 1 ( 0.00 1.00 ) *
##        15) MaxHR > 152.5 52   0 1 ( 0.00 1.00 ) *

The length of each branch is proportional to the reduction in deviance achieved by the split, so longer branches correspond to more informative splits. As can be seen, thal, CPType and BloodVesNo give the most important splits.
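
A compact summary of the fitted tree, listing the variables actually used in the splits and the training misclassification rate, can also be printed (output not shown):

summary(tree.trainD)   # variables used, number of terminal nodes, residual deviance, misclassification error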

Let's see how well the tree predicts the values it was trained on. We will represent the results in a confusion table and then calculate the classification rate.

tree.pred = predict(tree.trainD, trainD, type = "class")
with(trainD, table(tree.pred, HD))
##          HD
## tree.pred   0   1
##         0  80   4
##         1  16 120
(80 + 120)/220
## [1] 0.9091

We have a base classification rate of 91% on the data that the tree was trained on.
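
Rather than computing the rate by hand from the confusion table each time, a small helper function can do the same calculation (a sketch; class_rate is our own name, not part of any package):

# classification rate = proportion of predictions that match the true labels
class_rate <- function(pred, truth) mean(pred == truth)
class_rate(tree.pred, trainD$HD)   # should give ~0.91 here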

So let's now predict on the test data using this decision tree and see what we get.

We again present the results as a confusion table and calculate the classification rate.

tree.pred = predict(tree.trainD, testD[, c(1:13)], type = "class")
with(testD, table(tree.pred, testD$HD))
##          
## tree.pred  0  1
##         0 18  1
##         1  6 25
(18 + 25)/50
## [1] 0.86

As can be seen, our classification rate drops to 86% when the basic decision tree is applied to the unseen test data.

The problem is that this tree is probably too complex and is overfitting the training data.

We also have to remember that the summaries and histogram above showed that the training data is imbalanced, so we need to simplify the model to classify unseen data more reliably.

All else being equal, simpler models tend to generalise better.

Cross-Validation for tree pruning

The decision tree has one main tuning parameter: its size, i.e. the number of terminal nodes.

To simplify the model, we are going to 'prune' the branches of the tree. We will use cross-validation to decide how far back to prune; in other words, cross-validation will determine the size and complexity of the decision tree.

The function cv.tree does 10-fold cross-validation by default. We guide the pruning by the misclassification rate.

Cross-validation should be used to choose values for tuning parameters.

cv.trainD = cv.tree(tree.trainD, FUN = prune.misclass)
cv.trainD
## $size
## [1] 18 15 11  9  7  6  2  1
## 
## $dev
## [1] 53 54 54 55 56 58 61 96
## 
## $k
## [1] -Inf  0.0  1.0  1.5  2.0  3.0  4.5 44.0
## 
## $method
## [1] "misclass"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"
plot(cv.trainD)

[Figure: cross-validated misclassifications against tree size]

From the plot of the cross-validated misclassifications against the size of the tree (the number of leaf nodes at the bottom of the tree), we see that we can prune the tree down significantly and reduce its complexity by a factor of two (from 18 leaf nodes to 9) without a noticeable increase in cross-validated error.
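
The same information can be plotted explicitly from the components returned by cv.tree (a sketch):

plot(cv.trainD$size, cv.trainD$dev, type = "b",
    xlab = "Tree size (terminal nodes)", ylab = "CV misclassifications")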

A general rule of thumb in statistics is to choose the simpler model, as long as its performance is within one standard error of the best model. This reduces the model's variance.

So let's test the pruned tree on our training data.

prune.trainD = prune.misclass(tree.trainD, best = 9)
tree.pred = predict(prune.trainD, trainD, type = "class")
with(trainD, table(tree.pred, HD))
##          HD
## tree.pred   0   1
##         0  84  15
##         1  12 109
(84 + 109)/220
## [1] 0.8773

Our classification rate against the training data used to build the tree has dropped from 91% to 88%.

Let's see how the new tree does on the unseen test data.

tree.pred = predict(prune.trainD, testD[, c(1:13)], type = "class")
with(testD, table(tree.pred, testD$HD))
##          
## tree.pred  0  1
##         0 19  1
##         1  5 25
(19 + 25)/50
## [1] 0.88

The simpler tree performs better than the original tree on the test data. Our classification rate has increased from 86% to 88%.

Let's plot our new tree.

plot(prune.trainD)
text(prune.trainD, pretty = 0)

[Figure: the pruned decision tree]

As can be seen, this tree is a lot simpler. It only uses 6 of the 13 variables and performs better than the tree that used all the variables. So pruning with cross-validation gave us a simpler yet more accurate tree.

Random Forests

Let us now fit a basic random forest to our data.

You may need to install the randomForest package: #install.packages("randomForest").

set.seed(2)
require(randomForest)
## Loading required package: randomForest
## randomForest 4.6-7
## Type rfNews() to see new features/changes/bug fixes.

rf.trainD = randomForest(HD ~ ., data = trainD)

Let's use this forest to predict on the training data.

rf.pred = predict(rf.trainD, trainD)
with(trainD, table(rf.pred, trainD$HD))
##        
## rf.pred   0   1
##       0  96   0
##       1   0 124
(96 + 124)/220
## [1] 1

As can be seen, we obtain 100% classification on the data used to train the random forest. This behaviour is expected when a random forest is applied to its own training data, so it tells us little by itself. One often-cited advantage of random forests is that they are relatively resistant to overfitting.
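
A more informative in-training measure is the out-of-bag (OOB) error, which the fitted object reports directly (output not shown):

print(rf.trainD)       # includes the OOB error estimate and an OOB confusion matrix
rf.trainD$confusion    # the OOB confusion matrix on its own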

Let's predict on the unseen test set using our random forest.

Note that the first line below adds the missing RestingECG category to the test set. This is needed for the randomForest package to work: it cannot predict on data whose factors have different levels from the training data.

levels(testD$RestingECG)[3] <- "1"
rf.pred = predict(rf.trainD, testD[, c(1:13)])
with(testD, table(rf.pred, testD$HD))
##        
## rf.pred  0  1
##       0 20  1
##       1  4 25
(20 + 25)/50
## [1] 0.9
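
As an aside, an alternative (and arguably safer) way to align the factor levels above is to rebuild the test factor from the training levels, which guarantees the same levels in the same order (a sketch, equivalent here):

testD$RestingECG <- factor(testD$RestingECG, levels = levels(trainD$RestingECG))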

We can see that the basic random forest performs better than our best decision tree on the test data set. It has a classification rate of 90% on the data.

The problem is that this random forest is probably far too complex for a data set of this size. Let's first reduce the number of trees grown in the forest.

plot(rf.trainD, log = "y")
legend("top", colnames(rf.trainD$err.rate), col = 1:4, cex = 0.8, fill = 1:4)

[Figure: OOB and per-class error rates against number of trees (log scale)]

As can be seen, the average Out-Of-Bag (OOB) error doesn't seem to improve much once the tree count goes above 120.

So we shall limit the tree count to 110.
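
The OOB error at any particular number of trees can also be read directly from the error-rate matrix of the 500-tree forest (a sketch):

rf.trainD$err.rate[110, "OOB"]   # OOB error after 110 trees
rf.trainD$err.rate[500, "OOB"]   # OOB error for the full 500-tree forest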

rf.trainD = randomForest(HD ~ ., data = trainD, importance = TRUE, ntree = 110)
rf.trainD = randomForest(HD ~ ., data = trainD, ntree = 110, do.trace = 10)
## ntree      OOB      1      2
##    10:  21.56% 28.72% 16.13%
##    20:  21.36% 27.08% 16.94%
##    30:  21.82% 23.96% 20.16%
##    40:  20.45% 22.92% 18.55%
##    50:  18.64% 21.88% 16.13%
##    60:  20.00% 23.96% 16.94%
##    70:  20.00% 25.00% 16.13%
##    80:  20.00% 25.00% 16.13%
##    90:  19.55% 26.04% 14.52%
##   100:  19.09% 25.00% 14.52%
##   110:  17.73% 25.00% 12.10%
rf.pred = predict(rf.trainD, testD[, c(1:13)])
with(testD, table(rf.pred, testD$HD))
##        
## rf.pred  0  1
##       0 20  1
##       1  4 25
(20 + 25)/50
## [1] 0.9

With this smaller random forest, we still get a 90% classification rate.

The original random forest was more complex; it used the default of 500 trees.

The final random forest uses only 110 trees, but it performs as well as the 500-tree forest and correctly classifies 90% of the unseen test data.

We gained an improvement of 2 percentage points by using a random forest instead of the pruned decision tree.

Random forests are largely a black-box technique, so their behaviour is harder to interpret than a single decision tree's.

So let's see what made this forest good at classifying the data.

plot(rf.trainD, log = "y")
legend("top", colnames(rf.trainD$err.rate), col = 1:4, cex = 0.8, fill = 1:4)

[Figure: OOB and per-class error rates against number of trees for the 110-tree forest]

rf.trainD$importance
##            MeanDecreaseGini
## Age                  8.8435
## Sex                  3.0150
## CPType              13.3204
## RestingBP            9.6944
## SerumChol            8.8971
## FBS                  0.8224
## RestingECG           2.0544
## MaxHR               13.6824
## EIA                  4.8880
## STDep                9.8055
## PES                  4.9971
## BloodVesNo          13.7218
## thal                12.9828

When the forest is fitted with importance = TRUE, the per-class importance columns show that each variable's importance does not differ much between the two classes.

When we look at 'MeanDecreaseAccuracy' (also available when the forest is fitted with importance = TRUE), which measures how much a variable contributes to reducing the classification error rate, we see that 'thal', 'BloodVesNo' and 'CPType' are the most important variables. This is similar to what we observed in the decision tree.

Additionally, if we look at 'MeanDecreaseGini', which measures how well each variable separates the data at the splits, we again see that 'thal', 'BloodVesNo' and 'CPType' are very important, but so are 'STDep' and 'MaxHR'.
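
Note that 'MeanDecreaseAccuracy' and the per-class importances only appear when the forest is refitted with importance = TRUE. A sketch of how to obtain and plot both measures (rf.imp is our own name):

rf.imp = randomForest(HD ~ ., data = trainD, ntree = 110, importance = TRUE)
importance(rf.imp)   # per-class importances, MeanDecreaseAccuracy and MeanDecreaseGini
varImpPlot(rf.imp)   # dot plots of the two headline importance measures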

Cross Validation for Random Forest

Random forests also have one last parameter that we can tune.

It is called "mtry": the number of predictors randomly sampled as split candidates at each node. For classification, its default is the square root of the number of predictors, which is 3 here.

We will use cross-validation to pick this parameter.

You may need to install the caret package: #install.packages("caret"). If caret doesn't behave as expected, you may be missing one of its 21 dependencies.

You can install them with #install.packages("caret", dependencies = c("Depends", "Suggests")).

THIS IS A LONG INSTALL ~15 minutes

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(20)

controller = trainControl(method = "cv", number = 10, repeats = 1, returnResamp = "final")

cv.rf.trainD = train(HD ~ ., data = trainD, method = "rf", trControl = controller, 
    ntree = 110, importance = TRUE, tuneLength = 17)

cv.rf.pred = predict(cv.rf.trainD, testD[, c(1:13)])

with(testD, table(cv.rf.pred, testD$HD))
##           
## cv.rf.pred  0  1
##          0 20  1
##          1  4 25
(20 + 25)/50
## [1] 0.9

10-fold cross-validation over the mtry parameter doesn't seem to have improved the classification rate.

That means that the default value was a good value to use.
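
If we wanted to control the candidate values directly instead of using tuneLength, an explicit grid could be supplied (a sketch, not run here; cv.rf.grid is our own name):

cv.rf.grid = train(HD ~ ., data = trainD, method = "rf", trControl = controller, 
    ntree = 110, tuneGrid = expand.grid(mtry = c(2, 3, 5, 8, 13)))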

Let's look at the details of this model.

cv.rf.trainD$finalModel
## 
## Call:
##  randomForest(x = x, y = y, ntree = 110, mtry = param$mtry, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 110
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 20%
## Confusion matrix:
##    0   1 class.error
## 0 71  25      0.2604
## 1 19 105      0.1532

This basically reiterates the default value of the 'mtry' parameter. We have the same OOB error and the same mtry; the only difference is that this value was chosen via cross-validation.

Let's look in detail at the mtry values that were tried and how much they mattered.

cv.rf.trainD
## Random Forest 
## 
## 220 samples
##  13 predictors
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## 
## Summary of sample sizes: 199, 199, 198, 198, 197, 198, ... 
## 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy  Kappa  Accuracy SD  Kappa SD
##   2     0.8       0.6    0.1          0.3     
##   3     0.8       0.6    0.1          0.3     
##   4     0.8       0.6    0.1          0.2     
##   5     0.8       0.6    0.1          0.3     
##   6     0.8       0.6    0.1          0.3     
##   7     0.8       0.6    0.1          0.3     
##   8     0.8       0.6    0.1          0.3     
##   9     0.8       0.6    0.2          0.3     
##   10    0.8       0.6    0.1          0.3     
##   10    0.8       0.6    0.1          0.3     
##   10    0.8       0.6    0.1          0.2     
##   10    0.8       0.6    0.1          0.3     
##   10    0.8       0.6    0.1          0.3     
##   20    0.8       0.6    0.1          0.3     
##   20    0.8       0.6    0.1          0.3     
##   20    0.8       0.6    0.1          0.2     
##   20    0.8       0.6    0.1          0.3     
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 3.

To the precision shown, all values of mtry give the same accuracy and kappa. This suggests that the 'mtry' tuning parameter has little effect on this forest.

Now let's see which variables it considered important.

varImp(cv.rf.trainD)
## rf variable importance
## 
##             Importance
## BloodVesNo       100.0
## thal7             94.4
## CPType4           87.0
## STDep             52.2
## MaxHR             51.4
## PES.Q             47.8
## Sex1              44.0
## PES.L             41.4
## EIA1              35.7
## CPType3           34.6
## thal6             33.5
## RestingECG2       17.2
## CPType2           16.9
## SerumChol         15.1
## RestingECG1       13.5
## Age               12.4
## RestingBP         12.2
## FBS1               0.0

We again see the variables that are most important for separating the data, this time broken down to the level of individual factor levels (dummy variables).

Our data set and variable count are too small to warrant investigation of other classification techniques such as boosting.

Cross-Validated - Neural Networks

Neural networks are very different from the tree-based approaches used so far.

This neural network has one hidden layer.

It has only two tuning parameters (the number of neurons in the hidden layer and the weight decay).

Let's use 10-fold cross-validation to select the number of neurons in the hidden layer and the weight decay.
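
For reference, a single network could also be fitted directly with nnet, without any tuning. A minimal sketch (assumes the nnet package is loaded; the size and decay values are picked arbitrarily here, and nn.single is our own name):

nn.single = nnet(HD ~ ., data = trainD, size = 3, decay = 0.1, trace = FALSE)
mean(predict(nn.single, testD[, c(1:13)], type = "class") == testD$HD)   # test classification rate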

You may need to install the neural network package nnet: #install.packages("nnet").

require("nnet")
## Loading required package: nnet


controller = trainControl(method = "cv", number = 10, repeats = 1, returnResamp = "final")
nn.net = train(trainD$HD ~ ., method = "nnet", data = trainD, trControl = controller, 
    finalModel = T, trace = FALSE)
nn.pred = predict(nn.net, testD[, c(1:13)])
with(testD, table(nn.pred, testD$HD))
##        
## nn.pred  0  1
##       0 20  3
##       1  4 23
(20 + 23)/50
## [1] 0.86

So the best classification rate we get from this neural network is 86%.

Let's look at this model in more detail.

nn.net$bestTune
##   size decay
## 6    3   0.1
nn.net$finalModel
## a 18-3-1 network with 61 weights
## inputs: Age Sex1 CPType2 CPType3 CPType4 RestingBP SerumChol FBS1 RestingECG1 RestingECG2 MaxHR EIA1 STDep PES.L PES.Q BloodVesNo thal6 thal7 
## output(s): .outcome 
## options were - entropy fitting  decay=0.1

Our best settings are 3 hidden units and a 0.1 weight decay.

Again, if you want the details of how this choice was made, you can look at:

plot(nn.net)

[Figure: cross-validated accuracy against hidden units for each weight decay value]

nn.net$results
##   size decay Accuracy   Kappa AccuracySD KappaSD
## 1    1 0e+00   0.5876 0.07154    0.07859  0.1892
## 2    1 1e-04   0.6477 0.19979    0.13233  0.3225
## 3    1 1e-01   0.7970 0.57519    0.12207  0.2725
## 4    3 0e+00   0.6405 0.21512    0.11149  0.2760
## 5    3 1e-04   0.5855 0.05418    0.06999  0.1713
## 6    3 1e-01   0.8363 0.66245    0.09912  0.2073
## 7    5 0e+00   0.6561 0.25220    0.13016  0.3112
## 8    5 1e-04   0.6387 0.20945    0.11054  0.2818
## 9    5 1e-01   0.8319 0.65634    0.05326  0.1056

Let us see what our final neural network looks like:

You will need the nnet plotting function (plot.nnet) from: #https://gist.github.com/fawda123/7471137

plot(nn.net$finalModel)
## Loading required package: scales
## Loading required package: reshape
## Loading required package: plyr
## 
## Attaching package: 'reshape'
## 
## The following objects are masked from 'package:plyr':
## 
##     rename, round_any

[Figure: structure of the fitted neural network (unscaled data)]

There is something peculiar about our neural network. Sex has suddenly become a very important predictor (highlighted by a dark thick line).

This is inconsistent with our previous models.

Normalising the data

We need to scale the data. This was not an issue for the decision trees and random forests, but it matters for neural networks:

https://stackoverflow.com/questions/8961586/do-i-need-to-normalize-or-scale-data-for-randomforest-r-package

We will standardise all the real-valued variables and retrain the network.
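
Note that the chunk below scales each set with its own means and standard deviations. Strictly, the test set should be transformed with the training-set statistics; a sketch of that alternative using caret's preProcess (not run here, to avoid double-scaling; num_vars and pp are our own names):

num_vars <- c("Age", "RestingBP", "SerumChol", "MaxHR", "STDep", "BloodVesNo")
pp <- preProcess(trainD[, num_vars], method = c("center", "scale"))   # statistics from the training data only
trainD[, num_vars] <- predict(pp, trainD[, num_vars])
testD[, num_vars] <- predict(pp, testD[, num_vars])                   # reuse the training means/sds on the test set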

trainD$Age <- scale(trainD$Age)
trainD$RestingBP <- scale(trainD$RestingBP)
trainD$SerumChol <- scale(trainD$SerumChol)
trainD$MaxHR <- scale(trainD$MaxHR)
trainD$STDep <- scale(trainD$STDep)
trainD$BloodVesNo <- scale(trainD$BloodVesNo)


testD$Age <- scale(testD$Age)
testD$RestingBP <- scale(testD$RestingBP)
testD$SerumChol <- scale(testD$SerumChol)
testD$MaxHR <- scale(testD$MaxHR)
testD$STDep <- scale(testD$STDep)
testD$BloodVesNo <- scale(testD$BloodVesNo)


controller = trainControl(method = "cv", number = 10, repeats = 1, returnResamp = "final")
nn.net = train(trainD$HD ~ ., method = "nnet", data = trainD, trControl = controller, 
    finalModel = T, trace = FALSE)
nn.pred = predict(nn.net, testD[, c(1:13)])
with(testD, table(nn.pred, testD$HD))
##        
## nn.pred  0  1
##       0 19  1
##       1  5 25
(19 + 25)/50
## [1] 0.88

After rescaling, our classification rate is now up to 88%.

nn.net$bestTune
##   size decay
## 3    1   0.1
nn.net$finalModel
## a 18-1-1 network with 21 weights
## inputs: Age Sex1 CPType2 CPType3 CPType4 RestingBP SerumChol FBS1 RestingECG1 RestingECG2 MaxHR EIA1 STDep PES.L PES.Q BloodVesNo thal6 thal7 
## output(s): .outcome 
## options were - entropy fitting  decay=0.1

Our model is also much simpler, with only one neuron in the hidden layer and 21 weights.

Let's also plot how the different values of weight decay affect the accuracy for different numbers of hidden units.

plot(nn.net)

[Figure: cross-validated accuracy against hidden units for each weight decay value (scaled data)]

nn.net$results
##   size decay Accuracy  Kappa AccuracySD KappaSD
## 1    1 0e+00   0.8129 0.6249    0.08079  0.1577
## 2    1 1e-04   0.8127 0.6281    0.07734  0.1497
## 3    1 1e-01   0.8356 0.6653    0.06069  0.1266
## 4    3 0e+00   0.7581 0.5098    0.07456  0.1490
## 5    3 1e-04   0.8224 0.6398    0.05565  0.1153
## 6    3 1e-01   0.8131 0.6192    0.08597  0.1793
## 7    5 0e+00   0.7676 0.5274    0.06438  0.1348
## 8    5 1e-04   0.7546 0.5028    0.07113  0.1449
## 9    5 1e-01   0.8124 0.6178    0.07658  0.1624

Let us also plot our final neural network structure.

plot(nn.net$finalModel)

[Figure: structure of the final neural network (scaled data)]

Note that the black lines are positive weights while the grey ones are negative. Additionally, their thickness represents their magnitude.

We can see that Sex is no longer an overly emphasised predictor.

We can also view our weights.

plot(nn.net$finalModel, wts.only = T)
## $`hidden 1 1`
##  [1]  2.82994  0.39039 -1.13632 -1.21786 -0.04779 -2.01690 -0.55280
##  [8] -0.23061 -0.08214 -0.01823  0.04503  0.72545 -0.44754 -0.33525
## [15]  0.36162  1.19904 -1.51615  0.03882 -1.71760
## 
## $`out 1`
## [1] -2.669  6.000

Note that the first (top) value is always the bias weight, and the remaining values are the weights going into each unit, read from top to bottom.
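
The same weights can also be printed with labels that make each connection explicit (a sketch):

summary(nn.net$finalModel)   # weights labelled b->h1, i1->h1, ..., b->o, h1->o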

With this, we complete our investigation and recommend all of these models for the heart disease classification problem, although we remain biased towards the tree-based methods as they will scale a lot better.

End of Investigation.

Copyright © 2014 Emmanuel Kahembwe
