In this work, we use various machine learning techniques to predict the quality of barbell lifts from sensor data collected by wearable fitness devices. The data for this project come from the HAR (Human Activity Recognition) project.
The training data contain 19622 observations of 160 variables, and the goal is to predict the classe variable, a factor with 5 levels labeled A through E. We can treat this as a multi-class classification problem.
We read the training data and select the variables that come from the sensors.
dat <- read.table("pml-training.csv", sep = ',', header = T)
We drop features such as the experiment timestamps and the subject name, as they should not influence our results.
# columns 8:159 hold the raw sensor measurements
features <- c(8:159)
# coerce the sensor columns to numeric and keep only them plus the outcome
dat <- dat %>% mutate_each(funs(as.numeric), features) %>%
    select(c(features, classe))
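Before splitting, we can take a quick look at the cleaned data (a minimal sanity check, not part of the original code; the output is not reproduced here):
# dimensions of the cleaned data and distribution of the five classes
dim(dat)
table(dat$classe)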
We split the data 60%/20%/20% for training, individual model testing/stacking, and validation. This ensures that we do not overfit the training data or underestimate the out-of-sample error.
set.seed(4869)
# first split: 80% for model building, 20% held out for validation
inTrain <- createDataPartition(dat$classe, p = 0.8)[[1]]
training <- dat[inTrain,]
validating <- dat[-inTrain,]
# second split: 75% of the remaining 80% (60% overall) for training, 20% for testing/stacking
inTrain <- createDataPartition(training$classe, p = 0.75)[[1]]
testing <- training[-inTrain,]
training <- training[inTrain,]
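As a rough check on the split (a small sketch, not part of the original code), we can verify that the partitions are approximately 60%/20%/20% of the data:
# fraction of observations in each partition (should be roughly 0.6 / 0.2 / 0.2)
round(c(training = nrow(training),
        testing = nrow(testing),
        validating = nrow(validating)) / nrow(dat), 2)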
We use the preProcess function in the caret package to center and scale our data, and perform k-nearest-neighbour imputation (knnImpute) to remove the NAs.
# center, scale, and knn-impute the sensor columns (column 153 is the outcome classe)
preObj <- preProcess(training[-153], method = c("center", "scale", "knnImpute"))
trainp <- predict(preObj, training)
testingp <- predict(preObj, testing)
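As a sanity check (a small sketch, not part of the original analysis), we can confirm that the imputation removed all missing values from the preprocessed training set:
# should return 0 after knn imputation of the sensor columns
sum(is.na(trainp))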
Here we use a few different methods to train our models. To enhance performance, we use repeated cross-validation with 5 folds and 3 repeats for each method. Note that to ensure reproducibility, we set the random seed just prior to each training run.
train_control <- trainControl(method="repeatedcv", number = 5, repeats= 3)
set.seed(1237)
mdl1 <- train(classe~., trainp, trControl=train_control, method = "gbm", verbose = FALSE)
pred1 <- predict(mdl1, testingp)
confusionMatrix(pred1, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1099 25 0 3 3
## B 10 720 16 2 12
## C 3 12 658 15 11
## D 4 1 9 611 8
## E 0 1 1 12 687
##
## Overall Statistics
##
## Accuracy : 0.9623
## 95% CI : (0.9558, 0.968)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9523
## Mcnemar's Test P-Value : 0.0002317
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9848 0.9486 0.9620 0.9502 0.9528
## Specificity 0.9890 0.9874 0.9873 0.9933 0.9956
## Pos Pred Value 0.9726 0.9474 0.9413 0.9652 0.9800
## Neg Pred Value 0.9939 0.9877 0.9919 0.9903 0.9894
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2801 0.1835 0.1677 0.1557 0.1751
## Detection Prevalence 0.2880 0.1937 0.1782 0.1614 0.1787
## Balanced Accuracy 0.9869 0.9680 0.9747 0.9718 0.9742
set.seed(5289)
mdl2 <- train(classe~., trainp, trControl=train_control, method = "AdaBag", verbose = FALSE)
pred2 <- predict(mdl2, testingp)
confusionMatrix(pred2, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1090 507 663 517 273
## B 26 252 21 126 124
## C 0 0 0 0 0
## D 0 0 0 0 0
## E 0 0 0 0 324
##
## Overall Statistics
##
## Accuracy : 0.4247
## 95% CI : (0.4091, 0.4403)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2189
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9767 0.33202 0.0000 0.0000 0.44938
## Specificity 0.3017 0.90613 1.0000 1.0000 1.00000
## Pos Pred Value 0.3574 0.45902 NaN NaN 1.00000
## Neg Pred Value 0.9702 0.84973 0.8256 0.8361 0.88969
## Prevalence 0.2845 0.19347 0.1744 0.1639 0.18379
## Detection Rate 0.2778 0.06424 0.0000 0.0000 0.08259
## Detection Prevalence 0.7775 0.13994 0.0000 0.0000 0.08259
## Balanced Accuracy 0.6392 0.61907 0.5000 0.5000 0.72469
The AdaBag model performs poorly in this configuration, likely because of poor parameter choices or overfitting of the training data. We could improve it with a wider tuning grid and more cross-validation, but that would take a lot of computation time. Since the other models perform well and we use model stacking to form the final model, this is not a major problem.
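If we did want to pursue this, a wider tuning grid could be passed to train. The sketch below assumes caret's AdaBag tuning parameters mfinal (number of trees) and maxdepth (tree depth); it is not evaluated here because of its computational cost:
# hypothetical wider tuning grid for AdaBag (not run in this report)
ada_grid <- expand.grid(mfinal = c(50, 100, 150), maxdepth = c(3, 5, 10))
# mdl2_tuned <- train(classe ~ ., trainp, trControl = train_control,
#                     method = "AdaBag", tuneGrid = ada_grid)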
set.seed(5420)
mdl3 <- train(classe~., trainp, trControl=train_control, method = "rf", verbose = FALSE)
pred3 <- predict(mdl3, testingp)
confusionMatrix(pred3, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1112 5 0 0 0
## B 4 750 3 0 1
## C 0 4 674 3 4
## D 0 0 7 639 3
## E 0 0 0 1 713
##
## Overall Statistics
##
## Accuracy : 0.9911
## 95% CI : (0.9876, 0.9938)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9887
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9964 0.9881 0.9854 0.9938 0.9889
## Specificity 0.9982 0.9975 0.9966 0.9970 0.9997
## Pos Pred Value 0.9955 0.9894 0.9839 0.9846 0.9986
## Neg Pred Value 0.9986 0.9972 0.9969 0.9988 0.9975
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2835 0.1912 0.1718 0.1629 0.1817
## Detection Prevalence 0.2847 0.1932 0.1746 0.1654 0.1820
## Balanced Accuracy 0.9973 0.9928 0.9910 0.9954 0.9943
The previous results show that the gbm and rf methods give very accurate predictions, while the predictions from AdaBag are not as satisfactory. We use a random forest to combine the three sets of predictions and form our final stacked model.
# stack the base learners: their testing-set predictions become the features of a new model
votes <- data.frame(pred1, pred2, pred3, y = testing$classe)
mdl4 <- train(y~., votes, method = "rf")
We use the validation set to estimate the out-of-sample error of the final model.
# apply the same preprocessing object to the validation set
v <- predict(preObj, validating)
p1 <- predict(mdl1, v)
p2 <- predict(mdl2, v)
p3 <- predict(mdl3, v)
# feed the base-learner predictions into the stacked model
p <- predict(mdl4, data.frame(pred1 = p1, pred2 = p2, pred3 = p3))
confusionMatrix(p, validating$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1113 7 0 0 0
## B 3 751 3 0 1
## C 0 1 675 5 1
## D 0 0 6 636 1
## E 0 0 0 2 718
##
## Overall Statistics
##
## Accuracy : 0.9924
## 95% CI : (0.9891, 0.9948)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9903
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9973 0.9895 0.9868 0.9891 0.9958
## Specificity 0.9975 0.9978 0.9978 0.9979 0.9994
## Pos Pred Value 0.9938 0.9908 0.9897 0.9891 0.9972
## Neg Pred Value 0.9989 0.9975 0.9972 0.9979 0.9991
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2837 0.1914 0.1721 0.1621 0.1830
## Detection Prevalence 0.2855 0.1932 0.1738 0.1639 0.1835
## Balanced Accuracy 0.9974 0.9936 0.9923 0.9935 0.9976
The overall accuracy on the validation set is 99.2%, so the estimated out-of-sample error is about 0.8%.
In this work, we studied the HAR data and made predictions on the quality of barbell lifts. We split the data 60%/20%/20% for training, testing, and validation, and trained three models. We then stacked the models to form a final prediction model. We tested the final model on the validation data and found an overall accuracy of 99.2%, corresponding to an estimated out-of-sample error of about 0.8%.
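For completeness, applying the final stacked model to new observations would reuse the same pipeline. This is a sketch only, assuming a hypothetical data frame newdat with the same 152 sensor columns as the training data:
# newdat is hypothetical new sensor data with the same columns as the training set
newp <- predict(preObj, newdat)
pred_final <- predict(mdl4, data.frame(pred1 = predict(mdl1, newp),
                                       pred2 = predict(mdl2, newp),
                                       pred3 = predict(mdl3, newp)))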