Problem 3: Heart Disease Classification
This is the accompanying document to the heart disease classification report for the computational intelligence coursework by student 10011412. It details the process of modelling the heart disease data and the insights gained from it.
Formatting Data
First, we import the data into R and label the columns.
Training Data
trainD <- read.table("C:/Users/Manny/Documents/Manny/Studies/Uni 4th yr - 2nd Tri/CI/CourseWork/NN/cwTRAIN.txt",
header = FALSE)
colnames(trainD) <- c("Age", "Sex", "CPType", "RestingBP", "SerumChol", "FBS",
"RestingECG", "MaxHR", "EIA", "STDep", "PES", "BloodVesNo", "thal", "HD")
summary(trainD)
## Age Sex CPType RestingBP SerumChol
## Min. :29.0 Min. :0.000 Min. :1.00 Min. : 94 Min. :126
## 1st Qu.:47.8 1st Qu.:0.000 1st Qu.:3.00 1st Qu.:120 1st Qu.:213
## Median :55.0 Median :1.000 Median :3.00 Median :130 Median :243
## Mean :54.4 Mean :0.668 Mean :3.16 Mean :131 Mean :249
## 3rd Qu.:61.0 3rd Qu.:1.000 3rd Qu.:4.00 3rd Qu.:140 3rd Qu.:275
## Max. :77.0 Max. :1.000 Max. :4.00 Max. :200 Max. :564
## FBS RestingECG MaxHR EIA
## Min. :0.00 Min. :0.00 Min. : 71 Min. :0.000
## 1st Qu.:0.00 1st Qu.:0.00 1st Qu.:132 1st Qu.:0.000
## Median :0.00 Median :2.00 Median :154 Median :0.000
## Mean :0.15 Mean :1.03 Mean :149 Mean :0.345
## 3rd Qu.:0.00 3rd Qu.:2.00 3rd Qu.:166 3rd Qu.:1.000
## Max. :1.00 Max. :2.00 Max. :202 Max. :1.000
## STDep PES BloodVesNo thal
## Min. :0.00 Min. :1.00 Min. :0.000 Min. :3.00
## 1st Qu.:0.00 1st Qu.:1.00 1st Qu.:0.000 1st Qu.:3.00
## Median :0.65 Median :2.00 Median :0.000 Median :3.00
## Mean :1.01 Mean :1.59 Mean :0.682 Mean :4.66
## 3rd Qu.:1.65 3rd Qu.:2.00 3rd Qu.:1.000 3rd Qu.:7.00
## Max. :5.60 Max. :3.00 Max. :3.000 Max. :7.00
## HD
## Min. :0.000
## 1st Qu.:0.000
## Median :1.000
## Mean :0.564
## 3rd Qu.:1.000
## Max. :1.000
Testing Data
testD <- read.table("C:/Users/Manny/Documents/Manny/Studies/Uni 4th yr - 2nd Tri/CI/CourseWork/NN/cwTEST.txt",
header = FALSE)
colnames(testD) <- c("Age", "Sex", "CPType", "RestingBP", "SerumChol", "FBS",
"RestingECG", "MaxHR", "EIA", "STDep", "PES", "BloodVesNo", "thal", "HD")
summary(testD)
## Age Sex CPType RestingBP SerumChol
## Min. :35.0 Min. :0.00 Min. :1.00 Min. :110 Min. :164
## 1st Qu.:48.2 1st Qu.:0.00 1st Qu.:3.00 1st Qu.:120 1st Qu.:214
## Median :55.5 Median :1.00 Median :3.00 Median :134 Median :256
## Mean :54.5 Mean :0.72 Mean :3.22 Mean :135 Mean :254
## 3rd Qu.:61.0 3rd Qu.:1.00 3rd Qu.:4.00 3rd Qu.:144 3rd Qu.:286
## Max. :71.0 Max. :1.00 Max. :4.00 Max. :180 Max. :353
## FBS RestingECG MaxHR EIA STDep
## Min. :0.00 Min. :0 Min. : 97 Min. :0.00 Min. :0.00
## 1st Qu.:0.00 1st Qu.:0 1st Qu.:138 1st Qu.:0.00 1st Qu.:0.25
## Median :0.00 Median :1 Median :153 Median :0.00 Median :1.10
## Mean :0.14 Mean :1 Mean :151 Mean :0.26 Mean :1.23
## 3rd Qu.:0.00 3rd Qu.:2 3rd Qu.:162 3rd Qu.:0.75 3rd Qu.:1.60
## Max. :1.00 Max. :2 Max. :190 Max. :1.00 Max. :6.20
## PES BloodVesNo thal HD
## Min. :1.00 Min. :0.00 Min. :3.00 Min. :0.00
## 1st Qu.:1.00 1st Qu.:0.00 1st Qu.:3.00 1st Qu.:0.00
## Median :1.50 Median :0.00 Median :3.00 Median :1.00
## Mean :1.56 Mean :0.62 Mean :4.84 Mean :0.52
## 3rd Qu.:2.00 3rd Qu.:1.00 3rd Qu.:7.00 3rd Qu.:1.00
## Max. :3.00 Max. :3.00 Max. :7.00 Max. :1.00
Converting to categorical variables
Some of the variables in the data are categorical; we shall label them as such and ensure that they are unordered. This prevents the model from treating arbitrary category codes as ordered values, so the tree fits the data more faithfully.
We will then print out the structure of our data frames (objects).
Note that PES is stored as an ordered factor, since it is an ordered categorical variable.
trainD$Sex <- as.factor(trainD$Sex)
trainD$CPType <- as.factor(trainD$CPType)
trainD$FBS <- as.factor(trainD$FBS)
trainD$RestingECG <- as.factor(trainD$RestingECG)
trainD$EIA <- as.factor(trainD$EIA)
trainD$PES <- as.ordered(trainD$PES)
trainD$thal <- as.factor(trainD$thal)
trainD$HD <- as.factor(trainD$HD)
str(trainD)
## 'data.frame': 220 obs. of 14 variables:
## $ Age : num 70 67 57 64 74 65 56 59 60 63 ...
## $ Sex : Factor w/ 2 levels "0","1": 2 1 2 2 1 2 2 2 2 1 ...
## $ CPType : Factor w/ 4 levels "1","2","3","4": 4 3 2 4 2 4 3 4 4 4 ...
## $ RestingBP : num 130 115 124 128 120 120 130 110 140 150 ...
## $ SerumChol : num 322 564 261 263 269 177 256 239 293 407 ...
## $ FBS : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...
## $ RestingECG: Factor w/ 3 levels "0","1","2": 3 3 1 1 3 1 3 3 3 3 ...
## $ MaxHR : num 109 160 141 105 121 140 142 142 170 154 ...
## $ EIA : Factor w/ 2 levels "0","1": 1 1 1 2 2 1 2 2 1 1 ...
## $ STDep : num 2.4 1.6 0.3 0.2 0.2 0.4 0.6 1.2 1.2 4 ...
## $ PES : Ord.factor w/ 3 levels "1"<"2"<"3": 2 2 1 2 1 1 2 2 2 2 ...
## $ BloodVesNo: num 3 0 0 1 1 0 1 1 2 3 ...
## $ thal : Factor w/ 3 levels "3","6","7": 1 3 3 3 1 3 2 3 3 3 ...
## $ HD : Factor w/ 2 levels "0","1": 1 2 1 2 2 2 1 1 1 1 ...
Let's do the same for the test data:
testD$Sex <- as.factor(testD$Sex)
testD$CPType <- as.factor(testD$CPType)
testD$FBS <- as.factor(testD$FBS)
testD$RestingECG <- as.factor(testD$RestingECG)
testD$EIA <- as.factor(testD$EIA)
testD$PES <- as.ordered(testD$PES)
testD$thal <- as.factor(testD$thal)
testD$HD <- as.factor(testD$HD)
str(testD)
## 'data.frame': 50 obs. of 14 variables:
## $ Age : num 54 65 57 63 35 41 62 43 58 52 ...
## $ Sex : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 1 1 1 2 ...
## $ CPType : Factor w/ 4 levels "1","2","3","4": 4 4 3 4 4 2 3 4 1 1 ...
## $ RestingBP : num 110 135 150 130 138 135 130 132 150 118 ...
## $ SerumChol : num 239 254 168 330 183 203 263 341 283 186 ...
## $ FBS : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 1 2 2 1 ...
## $ RestingECG: Factor w/ 2 levels "0","2": 1 2 1 2 1 1 1 2 2 2 ...
## $ MaxHR : num 126 127 174 132 182 132 97 136 162 190 ...
## $ EIA : Factor w/ 2 levels "0","1": 2 1 1 2 1 1 1 2 1 1 ...
## $ STDep : num 2.8 2.8 1.6 1.8 1.4 0 1.2 3 1 0 ...
## $ PES : Ord.factor w/ 3 levels "1"<"2"<"3": 2 2 1 1 1 2 2 2 1 2 ...
## $ BloodVesNo: num 1 1 0 3 0 0 1 0 0 0 ...
## $ thal : Factor w/ 3 levels "3","6","7": 3 3 1 3 1 2 3 3 1 2 ...
## $ HD : Factor w/ 2 levels "0","1": 1 1 2 1 2 2 1 1 2 2 ...
Our data is now prepared. Let's print out a summary and view it.
Data Distribution
summary(trainD)
## Age Sex CPType RestingBP SerumChol FBS
## Min. :29.0 0: 73 1: 18 Min. : 94 Min. :126 0:187
## 1st Qu.:47.8 1:147 2: 33 1st Qu.:120 1st Qu.:213 1: 33
## Median :55.0 3: 64 Median :130 Median :243
## Mean :54.4 4:105 Mean :131 Mean :249
## 3rd Qu.:61.0 3rd Qu.:140 3rd Qu.:275
## Max. :77.0 Max. :200 Max. :564
## RestingECG MaxHR EIA STDep PES BloodVesNo
## 0:106 Min. : 71 0:144 Min. :0.00 1:105 Min. :0.000
## 1: 2 1st Qu.:132 1: 76 1st Qu.:0.00 2:100 1st Qu.:0.000
## 2:112 Median :154 Median :0.65 3: 15 Median :0.000
## Mean :149 Mean :1.01 Mean :0.682
## 3rd Qu.:166 3rd Qu.:1.65 3rd Qu.:1.000
## Max. :202 Max. :5.60 Max. :3.000
## thal HD
## 3:126 0: 96
## 6: 10 1:124
## 7: 84
##
##
##
View(trainD)
summary(testD)
## Age Sex CPType RestingBP SerumChol FBS
## Min. :35.0 0:14 1: 2 Min. :110 Min. :164 0:43
## 1st Qu.:48.2 1:36 2: 9 1st Qu.:120 1st Qu.:214 1: 7
## Median :55.5 3:15 Median :134 Median :256
## Mean :54.5 4:24 Mean :135 Mean :254
## 3rd Qu.:61.0 3rd Qu.:144 3rd Qu.:286
## Max. :71.0 Max. :180 Max. :353
## RestingECG MaxHR EIA STDep PES BloodVesNo
## 0:25 Min. : 97 0:37 Min. :0.00 1:25 Min. :0.00
## 2:25 1st Qu.:138 1:13 1st Qu.:0.25 2:22 1st Qu.:0.00
## Median :153 Median :1.10 3: 3 Median :0.00
## Mean :151 Mean :1.23 Mean :0.62
## 3rd Qu.:162 3rd Qu.:1.60 3rd Qu.:1.00
## Max. :190 Max. :6.20 Max. :3.00
## thal HD
## 3:26 0:24
## 6: 4 1:26
## 7:20
##
##
##
View(testD)
It looks like the 'Sex' and 'FBS' variables in the training data are heavily imbalanced. Additionally, the RestingECG '1' category is not represented in the test set.
So let's start by looking at the distribution of some of our key variables in the training set:
hist(trainD$Age)
It looks like the majority of the patients in the training set are between 40 and 70 years old.
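We can also check the balance of the outcome itself. A quick sketch, using the HD class counts (96 and 124) taken from the training summary above:

```r
# Outcome balance: HD class counts from summary(trainD) above
hd <- factor(c(rep(0, 96), rep(1, 124)))
round(prop.table(table(hd)), 3)
```

On the real data this is simply `prop.table(table(trainD$HD))`.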
We now have a feel for how our data behaves.
Keeping this in mind, let's attempt to classify heart disease using some tree-based methods.
Decision Trees
We initially attempt to classify using the simplest tree-based model, the decision tree. So let's attach our data set to the workspace so that we can easily call its attributes.
You may need to install the tree package: #install.packages("tree").
We then fit a tree to this data with the response variable set to the “Heart Disease” category. Finally, we plot the decision tree and view its detailed summary.
set.seed(2)
require(tree)
## Loading required package: tree
attach(trainD)
tree.trainD = tree(HD ~ ., data = trainD)
plot(tree.trainD)
text(tree.trainD, pretty = 0)
tree.trainD
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 220 300 1 ( 0.44 0.56 )
## 2) thal: 6,7 94 100 0 ( 0.73 0.27 )
## 4) BloodVesNo < 0.5 42 60 0 ( 0.52 0.48 )
## 8) CPType: 4 25 30 0 ( 0.72 0.28 )
## 16) SerumChol < 238.5 15 20 0 ( 0.53 0.47 )
## 32) Age < 56 9 10 0 ( 0.78 0.22 ) *
## 33) Age > 56 6 5 1 ( 0.17 0.83 ) *
## 17) SerumChol > 238.5 10 0 0 ( 1.00 0.00 ) *
## 9) CPType: 1,2,3 17 20 1 ( 0.24 0.76 ) *
## 5) BloodVesNo > 0.5 52 30 0 ( 0.90 0.10 )
## 10) STDep < 1.1 22 20 0 ( 0.77 0.23 )
## 20) RestingECG: 0 12 20 0 ( 0.58 0.42 )
## 40) MaxHR < 155 5 5 1 ( 0.20 0.80 ) *
## 41) MaxHR > 155 7 6 0 ( 0.86 0.14 ) *
## 21) RestingECG: 2 10 0 0 ( 1.00 0.00 ) *
## 11) STDep > 1.1 30 0 0 ( 1.00 0.00 ) *
## 3) thal: 3 126 100 1 ( 0.21 0.79 )
## 6) CPType: 1,4 54 70 1 ( 0.43 0.57 )
## 12) BloodVesNo < 0.5 31 30 1 ( 0.23 0.77 )
## 24) RestingBP < 154 26 20 1 ( 0.12 0.88 )
## 48) Age < 60.5 17 0 1 ( 0.00 1.00 ) *
## 49) Age > 60.5 9 10 1 ( 0.33 0.67 ) *
## 25) RestingBP > 154 5 5 0 ( 0.80 0.20 ) *
## 13) BloodVesNo > 0.5 23 30 0 ( 0.70 0.30 )
## 26) MaxHR < 120 6 0 0 ( 1.00 0.00 ) *
## 27) MaxHR > 120 17 20 0 ( 0.59 0.41 )
## 54) Sex: 0 5 5 1 ( 0.20 0.80 ) *
## 55) Sex: 1 12 10 0 ( 0.75 0.25 )
## 110) CPType: 1 5 7 1 ( 0.40 0.60 ) *
## 111) CPType: 4 7 0 0 ( 1.00 0.00 ) *
## 7) CPType: 2,3 72 30 1 ( 0.06 0.94 )
## 14) MaxHR < 152.5 20 20 1 ( 0.20 0.80 )
## 28) Age < 59 12 20 1 ( 0.33 0.67 ) *
## 29) Age > 59 8 0 1 ( 0.00 1.00 ) *
## 15) MaxHR > 152.5 52 0 1 ( 0.00 1.00 ) *
In the plot, the length of each branch is proportional to the improvement achieved by its split, so longer branches indicate more informative splitting variables. As can be seen, thal, CPType and BloodVesNo are the most significant variables.
Let's see how good the tree is at predicting the values it was trained on. We will represent the results in a simple misclassification table and then calculate the classification rate.
tree.pred = predict(tree.trainD, trainD, type = "class")
with(trainD, table(tree.pred, HD))
## HD
## tree.pred 0 1
## 0 80 4
## 1 16 120
(80 + 120)/220
## [1] 0.9091
We have a base classification rate of 91% on the data that the tree was trained on.
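The rate arithmetic repeated throughout this report can be wrapped in a small helper. A sketch; `classRate` is our own name, not from any package:

```r
# Fraction of predictions that match the truth
classRate <- function(pred, truth) mean(pred == truth)

# Toy check: 200 correct out of 220, as in the training table above
toy_pred  <- c(rep("a", 200), rep("b", 20))
toy_truth <- rep("a", 220)
classRate(toy_pred, toy_truth)
```

On the real predictions this would be `classRate(tree.pred, trainD$HD)`.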
So let's now predict on the test data using this decision tree and see what we get.
Again, we represent the results in a simple misclassification table and then calculate the classification rate.
tree.pred = predict(tree.trainD, testD[, c(1:13)], type = "class")
with(testD, table(tree.pred, testD$HD))
##
## tree.pred 0 1
## 0 18 1
## 1 6 25
(18 + 25)/50
## [1] 0.86
As can be seen, our classification rate drops down to 86% when using the basic decision tree against the unseen test data.
The problem is that this tree is probably too complex and is overfitting the training data.
We have to remember that the summary and histograms of the data (above) made it very clear that the training data is imbalanced, so we need to simplify the model to classify unseen data sets more reliably.
Simpler models tend to generalise better.
Cross-Validation for tree pruning
The decision tree has one tuning parameter: the depth of the tree.
To simplify the model, we are going to 'prune' the branches of the tree, using cross-validation to decide the depth at which to prune. In other words, cross-validation will determine the size and complexity of the decision tree.
The function cv.tree does 10-fold cross-validation by default. We will then prune the tree according to the misclassification rate.
Cross-validation should always be used to choose values for tuning parameters.
cv.trainD = cv.tree(tree.trainD, FUN = prune.misclass)
cv.trainD
## $size
## [1] 18 15 11 9 7 6 2 1
##
## $dev
## [1] 53 54 54 55 56 58 61 96
##
## $k
## [1] -Inf 0.0 1.0 1.5 2.0 3.0 4.5 44.0
##
## $method
## [1] "misclass"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
plot(cv.trainD)
From the plot of the misclassification rate against the size of the tree (the number of leaf nodes at the bottom of the tree), we see that we can prune the tree down significantly, reducing its complexity by a factor of two (from 18 leaf nodes to 9) without performing significantly worse on the training data.
A general rule of thumb in statistics is to choose the simplest model whose performance is within one standard deviation of the best model. This reduces the model's variance and its tendency to overfit.
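A sketch of this selection rule applied to the cv.trainD output above. Since cv.tree does not report standard errors, `tol` below is a hypothetical stand-in for one standard error:

```r
# Size and deviance vectors copied from the cv.trainD print-out above
size <- c(18, 15, 11, 9, 7, 6, 2, 1)
dev  <- c(53, 54, 54, 55, 56, 58, 61, 96)
tol  <- 2                          # hypothetical stand-in for ~1 SE

# Smallest tree whose CV deviance is within tol of the best
min(size[dev <= min(dev) + tol])
```

With these values the rule selects 9 leaf nodes, the size used for pruning below.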
So let's test the pruned tree on our training data.
prune.trainD = prune.misclass(tree.trainD, best = 9)
tree.pred = predict(prune.trainD, trainD, type = "class")
with(trainD, table(tree.pred, HD))
## HD
## tree.pred 0 1
## 0 84 15
## 1 12 109
(84 + 109)/220
## [1] 0.8773
Our classification rate against the training data used to build the tree has dropped from 91% to 88%.
Let's see how that compares on the unseen test data, using the pruned classification tree.
tree.pred = predict(prune.trainD, testD[, c(1:13)], type = "class")
with(testD, table(tree.pred, testD$HD))
##
## tree.pred 0 1
## 0 19 1
## 1 5 25
(19 + 25)/50
## [1] 0.88
The simpler tree performs better than the original tree on the test data. Our classification rate has increased from 86% to 88%.
Let's plot our new tree:
plot(prune.trainD)
text(prune.trainD, pretty = 0)
As can be seen, this tree is a lot simpler. It uses only 6 of the 13 variables, yet performs better than the tree that used all of them. So pruning with cross-validation gave us a simpler yet more accurate tree.
Random Forests
Let us now fit a basic random forest to our data.
You may need to install the randomForest package: #install.packages("randomForest").
set.seed(2)
require(randomForest)
## Loading required package: randomForest
## randomForest 4.6-7
## Type rfNews() to see new features/changes/bug fixes.
rf.trainD = randomForest(HD ~ ., data = trainD)
Let's use this forest to predict on the training data.
rf.pred = predict(rf.trainD, trainD)
with(trainD, table(rf.pred, trainD$HD))
##
## rf.pred 0 1
## 0 96 0
## 1 0 124
(96 + 124)/220
## [1] 1
As can be seen, we obtain 100% classification on the data used to train the random forest. This behaviour is expected of random forests applied to their own training data. One often-cited advantage of random forests is that they are resistant to overfitting.
Let's predict on the unseen test set using our random forest.
Note that in the first line below we add the category missing from RestingECG. The randomForest package requires this; it cannot predict on data whose factors have different levels from the training data.
levels(testD$RestingECG)[3] <- "1"
rf.pred = predict(rf.trainD, testD[, c(1:13)])
with(testD, table(rf.pred, testD$HD))
##
## rf.pred 0 1
## 0 20 1
## 1 4 25
(20 + 25)/50
## [1] 0.9
We can see that the basic random forest performs better than our best decision tree on the test data set. It has a classification rate of 90% on the data.
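As an aside, the level mismatch handled above can also be fixed by rebuilding the test factor with the training levels, so predict() never sees a mismatch. A sketch on toy factors standing in for the RestingECG columns:

```r
# Toy factors standing in for trainD$RestingECG and testD$RestingECG
train_ecg <- factor(c("0", "1", "2"))
test_ecg  <- factor(c("0", "2"))     # level "1" absent in the test set

# Rebuild the test factor with the full set of training levels;
# existing values are preserved, missing levels are added
test_ecg <- factor(test_ecg, levels = levels(train_ecg))
levels(test_ecg)
```

This avoids editing the levels vector in place and works whichever level happens to be missing.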
The problem is that the random forest is probably far too complex for this type of data set. Let's first reduce the number of trees being grown in this forest.
plot(rf.trainD, log = "y")
legend("top", colnames(rf.trainD$err.rate), col = 1:4, cex = 0.8, fill = 1:4)
As can be seen, the average Out-Of-Bag (OOB) error doesn't seem to improve much after the tree count goes above 120.
So we shall limit the tree count to 110.
rf.trainD = randomForest(HD ~ ., data = trainD, ntree = 110, do.trace = 10)
## ntree OOB 1 2
## 10: 21.56% 28.72% 16.13%
## 20: 21.36% 27.08% 16.94%
## 30: 21.82% 23.96% 20.16%
## 40: 20.45% 22.92% 18.55%
## 50: 18.64% 21.88% 16.13%
## 60: 20.00% 23.96% 16.94%
## 70: 20.00% 25.00% 16.13%
## 80: 20.00% 25.00% 16.13%
## 90: 19.55% 26.04% 14.52%
## 100: 19.09% 25.00% 14.52%
## 110: 17.73% 25.00% 12.10%
rf.pred = predict(rf.trainD, testD[, c(1:13)])
with(testD, table(rf.pred, testD$HD))
##
## rf.pred 0 1
## 0 20 1
## 1 4 25
(20 + 25)/50
## [1] 0.9
With our random forest, we can now get back up to a 90% classification rate using a simpler model.
The original random forest was more complex: it used the default of 500 trees.
The final random forest uses only 110 trees, yet it performs as well as the 500-tree forest and classifies 90% of the unseen test data.
We gained an improvement of 2% using a random forest over the best decision tree.
Random forests are something of a black box, and their behaviour is harder to interpret than that of a single tree.
So, let's see what made this forest good at classifying the data.
plot(rf.trainD, log = "y")
legend("top", colnames(rf.trainD$err.rate), col = 1:4, cex = 0.8, fill = 1:4)
rf.trainD$importance
## MeanDecreaseGini
## Age 8.8435
## Sex 3.0150
## CPType 13.3204
## RestingBP 9.6944
## SerumChol 8.8971
## FBS 0.8224
## RestingECG 2.0544
## MaxHR 13.6824
## EIA 4.8880
## STDep 9.8055
## PES 4.9971
## BloodVesNo 13.7218
## thal 12.9828
As can be seen, the importance of any given variable does not seem to vary much between the two outcome classes.
When we look at 'MeanDecreaseAccuracy', which measures how important a variable is for reducing the classification error rate, we see that 'thal', 'BloodVesNo' and 'CPType' are the most important variables. This is similar to what we observed in the decision tree.
Additionally, if we look at 'MeanDecreaseGini', which measures how important each variable is at separating the data, we see again that 'thal', 'BloodVesNo' and 'CPType' are very important. But so are 'STDep' and 'MaxHR'.
Cross Validation for Random Forest
Random forests also have one last parameter that we can tune.
It is called "mtry": the number of predictors sampled as split candidates at each node. Its default value here is 3.
We will use cross-validation to pick this parameter.
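That default follows from the predictor count: for classification, randomForest samples floor(sqrt(p)) predictors per split, which for our 13 predictors gives:

```r
# Default mtry for classification: floor(sqrt(p)) with p = 13 predictors
floor(sqrt(13))
```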
You may need to install the caret package: #install.packages("caret"). If caret doesn't behave as expected, you may be missing one of its many dependencies.
You can install them through #install.packages("caret", dependencies = c("Depends", "Suggests"))
THIS IS A LONG INSTALL (~15 minutes).
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(20)
controller = trainControl(method = "cv", number = 10, repeats = 1, returnResamp = "final")
cv.rf.trainD = train(HD ~ ., data = trainD, method = "rf", trControl = controller,
ntree = 110, importance = TRUE, tuneLength = 17)
cv.rf.pred = predict(cv.rf.trainD, testD[, c(1:13)])
with(testD, table(cv.rf.pred, testD$HD))
##
## cv.rf.pred 0 1
## 0 20 1
## 1 4 25
(20 + 25)/50
## [1] 0.9
10-fold Cross-validation on the mtry parameter doesn't seem to have improved the classification rate.
That means that the default value was a good value to use.
Let's see the details of this model:
cv.rf.trainD$finalModel
##
## Call:
## randomForest(x = x, y = y, ntree = 110, mtry = param$mtry, importance = TRUE)
## Type of random forest: classification
## Number of trees: 110
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 20%
## Confusion matrix:
## 0 1 class.error
## 0 71 25 0.2604
## 1 19 105 0.1532
This basically reiterates the default value of the 'mtry' parameter. We get the same OOB error and the same mtry; the only difference is that this value was arrived at through cross-validation.
Let's look in detail at how the values of mtry varied and how significant they were.
cv.rf.trainD
## Random Forest
##
## 220 samples
## 13 predictors
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
##
## Summary of sample sizes: 199, 199, 198, 198, 197, 198, ...
##
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa Accuracy SD Kappa SD
## 2 0.8 0.6 0.1 0.3
## 3 0.8 0.6 0.1 0.3
## 4 0.8 0.6 0.1 0.2
## 5 0.8 0.6 0.1 0.3
## 6 0.8 0.6 0.1 0.3
## 7 0.8 0.6 0.1 0.3
## 8 0.8 0.6 0.1 0.3
## 9 0.8 0.6 0.2 0.3
## 10 0.8 0.6 0.1 0.3
## 10 0.8 0.6 0.1 0.3
## 10 0.8 0.6 0.1 0.2
## 10 0.8 0.6 0.1 0.3
## 10 0.8 0.6 0.1 0.3
## 20 0.8 0.6 0.1 0.3
## 20 0.8 0.6 0.1 0.3
## 20 0.8 0.6 0.1 0.2
## 20 0.8 0.6 0.1 0.3
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 3.
All values of mtry have the same accuracy and kappa. This means that the 'mtry' tuning parameter has little effect on this forest.
Now let's see which variables it thought were significant.
varImp(cv.rf.trainD)
## rf variable importance
##
## Importance
## BloodVesNo 100.0
## thal7 94.4
## CPType4 87.0
## STDep 52.2
## MaxHR 51.4
## PES.Q 47.8
## Sex1 44.0
## PES.L 41.4
## EIA1 35.7
## CPType3 34.6
## thal6 33.5
## RestingECG2 17.2
## CPType2 16.9
## SerumChol 15.1
## RestingECG1 13.5
## Age 12.4
## RestingBP 12.2
## FBS1 0.0
We see some of the variables that are most significant at separating the data, but in more detail than before.
Our data set and variable count are too small to warrant investigating other classification techniques, such as boosting.
Cross-Validated Neural Networks
Neural networks are very different from the tree-based approaches used so far.
This neural network has one hidden layer.
It has only two tuning parameters (neurons in hidden layer and weight decay).
Let's use 10-fold cross-validation to select the number of neurons in the hidden layer and the weight decay.
You may need to install the neural network package nnet: #install.packages("nnet")
require("nnet")
## Loading required package: nnet
controller = trainControl(method = "cv", number = 10, repeats = 1, returnResamp = "final")
nn.net = train(trainD$HD ~ ., method = "nnet", data = trainD, trControl = controller,
finalModel = T, trace = FALSE)
nn.pred = predict(nn.net, testD[, c(1:13)])
with(testD, table(nn.pred, testD$HD))
##
## nn.pred 0 1
## 0 20 3
## 1 4 23
(20 + 23)/50
## [1] 0.86
So the neural network's best performance is 86% on the unseen test data.
Let's look at this model in more detail:
nn.net$bestTune
## size decay
## 6 3 0.1
nn.net$finalModel
## a 18-3-1 network with 61 weights
## inputs: Age Sex1 CPType2 CPType3 CPType4 RestingBP SerumChol FBS1 RestingECG1 RestingECG2 MaxHR EIA1 STDep PES.L PES.Q BloodVesNo thal6 thal7
## output(s): .outcome
## options were - entropy fitting decay=0.1
Our best settings are 3 hidden units and a 0.1 weight decay.
Again, if you want the details of how this choice was made, you can look at:
plot(nn.net)
nn.net$results
## size decay Accuracy Kappa AccuracySD KappaSD
## 1 1 0e+00 0.5876 0.07154 0.07859 0.1892
## 2 1 1e-04 0.6477 0.19979 0.13233 0.3225
## 3 1 1e-01 0.7970 0.57519 0.12207 0.2725
## 4 3 0e+00 0.6405 0.21512 0.11149 0.2760
## 5 3 1e-04 0.5855 0.05418 0.06999 0.1713
## 6 3 1e-01 0.8363 0.66245 0.09912 0.2073
## 7 5 0e+00 0.6561 0.25220 0.13016 0.3112
## 8 5 1e-04 0.6387 0.20945 0.11054 0.2818
## 9 5 1e-01 0.8319 0.65634 0.05326 0.1056
Let us see what our final neural network looks like:
You will need the nnet.plot function: #https://gist.github.com/fawda123/7471137
plot(nn.net$finalModel)
## Loading required package: scales
## Loading required package: reshape
## Loading required package: plyr
##
## Attaching package: 'reshape'
##
## The following objects are masked from 'package:plyr':
##
## rename, round_any
There is something peculiar about our neural network. Sex has suddenly become a very important predictor (highlighted by a dark thick line).
This is inconsistent with our previous models.
Normalising the data
We need to scale the data. This was not an issue for decision trees and random forests, but it becomes an issue for neural networks.
We will normalise all real-valued variables and retrain the network.
trainD$Age <- scale(trainD$Age)
trainD$RestingBP <- scale(trainD$RestingBP)
trainD$SerumChol <- scale(trainD$SerumChol)
trainD$MaxHR <- scale(trainD$MaxHR)
trainD$STDep <- scale(trainD$STDep)
trainD$BloodVesNo <- scale(trainD$BloodVesNo)
testD$Age <- scale(testD$Age)
testD$RestingBP <- scale(testD$RestingBP)
testD$SerumChol <- scale(testD$SerumChol)
testD$MaxHR <- scale(testD$MaxHR)
testD$STDep <- scale(testD$STDep)
testD$BloodVesNo <- scale(testD$BloodVesNo)
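One caveat, stated as an assumption rather than a change applied here: standard practice is to standardise the test set with the training set's mean and SD, so both sets share one scale; scaling each set with its own statistics, as above, puts them on slightly different scales. A toy sketch of the alternative (the numbers are made up):

```r
# Toy vectors standing in for an Age column
train_age <- c(29, 48, 55, 61, 77)
test_age  <- c(35, 48, 56, 61, 71)

# Standardise the test column with the *training* mean and SD
mu  <- mean(train_age)
sdv <- sd(train_age)
scaled_test_age <- (test_age - mu) / sdv
```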
controller = trainControl(method = "cv", number = 10, repeats = 1, returnResamp = "final")
nn.net = train(trainD$HD ~ ., method = "nnet", data = trainD, trControl = controller,
finalModel = T, trace = FALSE)
nn.pred = predict(nn.net, testD[, c(1:13)])
with(testD, table(nn.pred, testD$HD))
##
## nn.pred 0 1
## 0 19 1
## 1 5 25
(19 + 25)/50
## [1] 0.88
After rescaling, our classification rate is now up to 88%.
nn.net$bestTune
## size decay
## 3 1 0.1
nn.net$finalModel
## a 18-1-1 network with 21 weights
## inputs: Age Sex1 CPType2 CPType3 CPType4 RestingBP SerumChol FBS1 RestingECG1 RestingECG2 MaxHR EIA1 STDep PES.L PES.Q BloodVesNo thal6 thal7
## output(s): .outcome
## options were - entropy fitting decay=0.1
Our model is also much simpler, with only one neuron in the hidden layer and 21 weights.
Let's also plot how different values of the weight decay affect the accuracy for different numbers of hidden units.
plot(nn.net)
nn.net$results
## size decay Accuracy Kappa AccuracySD KappaSD
## 1 1 0e+00 0.8129 0.6249 0.08079 0.1577
## 2 1 1e-04 0.8127 0.6281 0.07734 0.1497
## 3 1 1e-01 0.8356 0.6653 0.06069 0.1266
## 4 3 0e+00 0.7581 0.5098 0.07456 0.1490
## 5 3 1e-04 0.8224 0.6398 0.05565 0.1153
## 6 3 1e-01 0.8131 0.6192 0.08597 0.1793
## 7 5 0e+00 0.7676 0.5274 0.06438 0.1348
## 8 5 1e-04 0.7546 0.5028 0.07113 0.1449
## 9 5 1e-01 0.8124 0.6178 0.07658 0.1624
Let us also plot our final neural network structure.
plot(nn.net$finalModel)
Note that the black lines are positive weights while the gray ones are negative. Additionally, their thickness represents their magnitude.
We can see that Sex is no longer an overly emphasised predictor.
We can also view our weights.
plot(nn.net$finalModel, wts.only = T)
## $`hidden 1 1`
## [1] 2.82994 0.39039 -1.13632 -1.21786 -0.04779 -2.01690 -0.55280
## [8] -0.23061 -0.08214 -0.01823 0.04503 0.72545 -0.44754 -0.33525
## [15] 0.36162 1.19904 -1.51615 0.03882 -1.71760
##
## $`out 1`
## [1] -2.669 6.000
Note that the first (top) value is always the bias weight, and the remaining weights going into each layer are read from top to bottom.
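To make the read-out concrete, here is a sketch of how the output unit combines the printed 'out 1' weights. The hidden activation h below is hypothetical; the logistic output follows from the entropy fitting noted in the model print-out:

```r
# The 'out 1' weights above are (bias, hidden -> output); the network's
# output is the logistic of their weighted sum
sigmoid <- function(z) 1 / (1 + exp(-z))
w_out <- c(-2.669, 6.000)   # bias and weight, read from the print-out
h     <- 0.9                # a hypothetical hidden-unit activation
sigmoid(w_out[1] + w_out[2] * h)
```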
With this, we complete our investigation and recommend all of these models for the heart disease classification problem. But we remain biased towards the tree-based methods, as they will scale a lot better.
End of Investigation.
Copyright © 2014 Emmanuel Kahembwe