In this work, we use various machine learning techniques to predict the quality of barbell lifts from sensor data collected by wearable fitness devices. The data for this project come from the HAR (Human Activity Recognition) project.
The training data contain 19622 observations of 160 variables, and the goal is to predict the classe variable, a factor with 5 levels labeled A through E. We can treat this as a multi-class classification problem.
We read the training data and select the variables that come from the sensors.
dat <- read.table("pml-training.csv", sep = ',', header = T)
We drop features such as the experiment timestamps and the subject name, as they should not influence our results.
# columns 8:159 hold the raw sensor measurements
features <- c(8:159)
# coerce the sensor columns to numeric and keep only them plus the outcome
dat <- dat %>% mutate_each(funs(as.numeric), features) %>%
    select(c(features, classe))
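Before splitting, we can take a quick look at the cleaned data (a minimal sanity check, not part of the original code; the output is not reproduced here):
# dimensions of the cleaned data and distribution of the five classes
dim(dat)
table(dat$classe)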
We split the data 60%/20%/20% for training, individual model testing/stacking, and validation. This ensures that we do not overfit the training data or underestimate the out-of-sample error.
set.seed(4869)
# first split: 80% for model building, 20% held out for validation
inTrain <- createDataPartition(dat$classe, p = 0.8)[[1]]
training <- dat[inTrain,]
validating <- dat[-inTrain,]
# second split: 75% of the remaining 80% (60% overall) for training, 20% for testing/stacking
inTrain <- createDataPartition(training$classe, p = 0.75)[[1]]
testing <- training[-inTrain,]
training <- training[inTrain,]
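As a rough check on the split (a small sketch, not part of the original code), we can verify that the partitions are approximately 60%/20%/20% of the data:
# fraction of observations in each partition (should be roughly 0.6 / 0.2 / 0.2)
round(c(training = nrow(training),
        testing = nrow(testing),
        validating = nrow(validating)) / nrow(dat), 2)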
We use the preProcess function in the caret package to center and scale our data, and perform k-nearest-neighbour imputation (knnImpute) to remove the NAs.
# center, scale, and knn-impute the sensor columns (column 153 is the outcome classe)
preObj <- preProcess(training[-153], method = c("center", "scale", "knnImpute"))
trainp <- predict(preObj, training)
testingp <- predict(preObj, testing)
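As a sanity check (a small sketch, not part of the original analysis), we can confirm that the imputation removed all missing values from the preprocessed training set:
# should return 0 after knn imputation of the sensor columns
sum(is.na(trainp))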
Here we use a few different methods to train our models. To enhance performance, we use repeated cross-validation with 5 folds and 3 repeats for each method. Note that to ensure reproducibility, we set the random seed just prior to each training run.
train_control <- trainControl(method="repeatedcv", number = 5, repeats= 3)
set.seed(1237)
mdl1 <- train(classe~., trainp, trControl=train_control, method = "gbm", verbose = FALSE)
pred1 <- predict(mdl1, testingp)
confusionMatrix(pred1, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1099 25 0 3 3
## B 10 720 16 2 12
## C 3 12 658 15 11
## D 4 1 9 611 8
## E 0 1 1 12 687
##
## Overall Statistics
##
## Accuracy : 0.9623
## 95% CI : (0.9558, 0.968)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9523
## Mcnemar's Test P-Value : 0.0002317
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9848 0.9486 0.9620 0.9502 0.9528
## Specificity 0.9890 0.9874 0.9873 0.9933 0.9956
## Pos Pred Value 0.9726 0.9474 0.9413 0.9652 0.9800
## Neg Pred Value 0.9939 0.9877 0.9919 0.9903 0.9894
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2801 0.1835 0.1677 0.1557 0.1751
## Detection Prevalence 0.2880 0.1937 0.1782 0.1614 0.1787
## Balanced Accuracy 0.9869 0.9680 0.9747 0.9718 0.9742
set.seed(5289)
mdl2 <- train(classe~., trainp, trControl=train_control, method = "AdaBag", verbose = FALSE)
pred2 <- predict(mdl2, testingp)
confusionMatrix(pred2, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1090 507 663 517 273
## B 26 252 21 126 124
## C 0 0 0 0 0
## D 0 0 0 0 0
## E 0 0 0 0 324
##
## Overall Statistics
##
## Accuracy : 0.4247
## 95% CI : (0.4091, 0.4403)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2189
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9767 0.33202 0.0000 0.0000 0.44938
## Specificity 0.3017 0.90613 1.0000 1.0000 1.00000
## Pos Pred Value 0.3574 0.45902 NaN NaN 1.00000
## Neg Pred Value 0.9702 0.84973 0.8256 0.8361 0.88969
## Prevalence 0.2845 0.19347 0.1744 0.1639 0.18379
## Detection Rate 0.2778 0.06424 0.0000 0.0000 0.08259
## Detection Prevalence 0.7775 0.13994 0.0000 0.0000 0.08259
## Balanced Accuracy 0.6392 0.61907 0.5000 0.5000 0.72469
The AdaBag model performs poorly in this configuration, likely because of poor parameter choices or overfitting of the training data. We could improve it with a wider tuning grid and more cross-validation, but that would take a lot of computation time. Since the other models perform well and we use model stacking to form the final model, this is not a major problem.
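If we did want to pursue this, a wider tuning grid could be passed to train. The sketch below assumes caret's AdaBag tuning parameters mfinal (number of trees) and maxdepth (tree depth); it is not evaluated here because of its computational cost:
# hypothetical wider tuning grid for AdaBag (not run in this report)
ada_grid <- expand.grid(mfinal = c(50, 100, 150), maxdepth = c(3, 5, 10))
# mdl2_tuned <- train(classe ~ ., trainp, trControl = train_control,
#                     method = "AdaBag", tuneGrid = ada_grid)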
set.seed(5420)
mdl3 <- train(classe~., trainp, trControl=train_control, method = "rf", verbose = FALSE)
pred3 <- predict(mdl3, testingp)
confusionMatrix(pred3, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1112 5 0 0 0
## B 4 750 3 0 1
## C 0 4 674 3 4
## D 0 0 7 639 3
## E 0 0 0 1 713
##
## Overall Statistics
##
## Accuracy : 0.9911
## 95% CI : (0.9876, 0.9938)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9887
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9964 0.9881 0.9854 0.9938 0.9889
## Specificity 0.9982 0.9975 0.9966 0.9970 0.9997
## Pos Pred Value 0.9955 0.9894 0.9839 0.9846 0.9986
## Neg Pred Value 0.9986 0.9972 0.9969 0.9988 0.9975
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2835 0.1912 0.1718 0.1629 0.1817
## Detection Prevalence 0.2847 0.1932 0.1746 0.1654 0.1820
## Balanced Accuracy 0.9973 0.9928 0.9910 0.9954 0.9943
The previous results show that the gbm and rf methods give very accurate predictions, while the predictions from AdaBag are not as satisfactory. We use a random forest to combine the three sets of predictions and form our final stacked model.
# stack the base learners: their testing-set predictions become the features of a new model
votes <- data.frame(pred1, pred2, pred3, y = testing$classe)
mdl4 <- train(y~., votes, method = "rf")
We use the validation set to estimate the out-of-sample error of the final model.
# apply the same preprocessing object to the validation set
v <- predict(preObj, validating)
p1 <- predict(mdl1, v)
p2 <- predict(mdl2, v)
p3 <- predict(mdl3, v)
# feed the base-learner predictions into the stacked model
p <- predict(mdl4, data.frame(pred1 = p1, pred2 = p2, pred3 = p3))
confusionMatrix(p, validating$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1113 7 0 0 0
## B 3 751 3 0 1
## C 0 1 675 5 1
## D 0 0 6 636 1
## E 0 0 0 2 718
##
## Overall Statistics
##
## Accuracy : 0.9924
## 95% CI : (0.9891, 0.9948)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9903
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9973 0.9895 0.9868 0.9891 0.9958
## Specificity 0.9975 0.9978 0.9978 0.9979 0.9994
## Pos Pred Value 0.9938 0.9908 0.9897 0.9891 0.9972
## Neg Pred Value 0.9989 0.9975 0.9972 0.9979 0.9991
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2837 0.1914 0.1721 0.1621 0.1830
## Detection Prevalence 0.2855 0.1932 0.1738 0.1639 0.1835
## Balanced Accuracy 0.9974 0.9936 0.9923 0.9935 0.9976
The overall accuracy on the validation set is 99.2%, so the estimated out-of-sample error is about 0.8%.
In this work, we studied the HAR data and made predictions on the quality of barbell lifts. We split the data 60%/20%/20% for training, testing, and validation, and trained three models. We then stacked the models to form a final prediction model. We tested the final model on the validation data and found an overall accuracy of 99.2%, corresponding to an estimated out-of-sample error of about 0.8%.
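For completeness, applying the final stacked model to new observations would reuse the same pipeline. This is a sketch only, assuming a hypothetical data frame newdat with the same 152 sensor columns as the training data:
# newdat is hypothetical new sensor data with the same columns as the training set
newp <- predict(preObj, newdat)
pred_final <- predict(mdl4, data.frame(pred1 = predict(mdl1, newp),
                                       pred2 = predict(mdl2, newp),
                                       pred3 = predict(mdl3, newp)))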