New Package for Ensembling R Models
I’ve written a new R package called caretEnsemble for creating ensembles of caret models in R. It currently works well for regression models, and it has some preliminary support for binary classification models.
At this point, I’ve implemented two algorithms for combining models:
- Greedy stepwise ensembles (returns a weight for each model)
- Stacks of caret models
(You can also manually specify the weights for a greedy ensemble; there’s a quick example of hand-weighted blending at the end of this post.)
The greedy algorithm is based on the work of Caruana et al. (2004), and was inspired by the medley package on GitHub. The stacking algorithm simply builds a second caret model on top of the existing models (using their predictions as input), so it inherits all of the flexibility of the caret package.
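To give a flavor of the greedy algorithm, here’s a rough sketch of the idea in plain R. Here `oof` (a matrix of out-of-fold predictions, one column per model) and `y` (the observed outcome) are stand-ins, and the package’s actual implementation differs in its details:
greedy_weights <- function(oof, y, iter=100L) {
  rmse <- function(pred) sqrt(mean((pred - y)^2))
  counts <- setNames(rep(0L, ncol(oof)), colnames(oof))
  ens <- rep(0, nrow(oof)) #Running sum of the selected models' predictions
  for (i in seq_len(iter)) {
    #Try adding each model (with replacement) and keep whichever helps most
    scores <- apply(oof, 2, function(col) rmse((ens + col) / i))
    best <- which.min(scores)
    counts[best] <- counts[best] + 1L
    ens <- ens + oof[, best]
  }
  counts / sum(counts) #Weights are proportional to selection counts
}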
All of the models in the ensemble must be fit on the same cross-validation folds. Both algorithms then use the models’ out-of-sample predictions: the greedy algorithm to find the weights, and the stacking algorithm to train the stack. Here’s a brief script demonstrating how to use the package:
#Setup
rm(list = ls(all = TRUE))
gc(reset=TRUE)
#>           used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
#> Ncells  485751   26    1037641 55.5         NA   485751   26
#> Vcells  911819    7    8388608 64.0      98304   911819    7
set.seed(42)
#Libraries
library(caret)
library(devtools)
library(caretEnsemble)
#Data
library(mlbench)
data(BostonHousing2)
X <- model.matrix(cmedv~crim+zn+indus+chas+nox+rm+age+dis+
                    rad+tax+ptratio+b+lstat+lat+lon, BostonHousing2)[,-1]
X <- data.frame(X)
Y <- BostonHousing2$cmedv
#Split train/test
train <- runif(nrow(X)) <= .66
#Setup CV Folds
#returnData=FALSE saves some space
#index fixes the folds up front, so every model is fit on identical resamples
folds <- 5
repeats <- 1
myControl <- trainControl(method='cv', number=folds, returnResamp='none',
                          returnData=FALSE, savePredictions=TRUE,
                          verboseIter=FALSE, allowParallel=TRUE,
                          index=createMultiFolds(Y[train], k=folds, times=repeats))
PP <- c('center', 'scale')
#Train some models
model1 <- train(X[train,], Y[train], method='gbm', verbose=FALSE, trControl=myControl,
                tuneGrid=expand.grid(n.trees=500, interaction.depth=15,
                                     shrinkage=0.01, n.minobsinnode=10))
model2 <- train(X[train,], Y[train], method='blackboost', trControl=myControl)
model3 <- train(X[train,], Y[train], method='parRF', trControl=myControl)
model4 <- train(X[train,], Y[train], method='mlpWeightDecay', trControl=myControl, trace=FALSE, preProcess=PP)
model5 <- train(X[train,], Y[train], method='ppr', trControl=myControl, preProcess=PP)
model6 <- train(X[train,], Y[train], method='earth', trControl=myControl, preProcess=PP)
model7 <- train(X[train,], Y[train], method='glm', trControl=myControl, preProcess=PP)
model8 <- train(X[train,], Y[train], method='svmRadial', trControl=myControl, preProcess=PP)
model9 <- train(X[train,], Y[train], method='gam', trControl=myControl, preProcess=PP)
model10 <- train(X[train,], Y[train], method='glmnet', trControl=myControl, preProcess=PP)
#Make a list of all the models
all.models <- list(model1, model2, model3, model4, model5, model6, model7, model8, model9, model10)
names(all.models) <- sapply(all.models, function(x) x$method)
all.models <- as.caretList(all.models)
sort(sapply(all.models, function(x) min(x$results$RMSE)))
#> parRF gbm earth ppr blackboost gam svmRadial glmnet
#> 3.306669 3.478900 3.670235 3.818690 3.888422 3.905597 4.167405 4.921640
#> glm mlpWeightDecay
#> 4.922111 7.377662
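If you want to verify that all the models really did use the same folds, caret keeps the resampling index inside each fitted model, so a quick sanity check looks something like this:
#Confirm every model was fit on the same folds
fold_ids <- lapply(all.models, function(x) x$control$index)
stopifnot(length(unique(fold_ids)) == 1)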
#Make a greedy ensemble - currently can only use RMSE
greedy <- caretEnsemble(all.models)
greedy$error
#> max_iter RMSE Rsquared MAE RMSESD RsquaredSD MAESD
#> 1 100 3.29449 0.8808035 2.154081 0.3982098 0.05101488 0.124263
#Make a linear regression ensemble
linear <- caretStack(all.models, method='glm', trControl=trainControl(method='cv'))
summary(linear$ens_model$finalModel)
#>
#> Call:
#> NULL
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -0.84504 0.56565 -1.494 0.136143
#> gbm -0.13031 0.16813 -0.775 0.438859
#> blackboost -0.32620 0.11365 -2.870 0.004364 **
#> parRF 1.04531 0.18591 5.623 3.99e-08 ***
#> mlpWeightDecay -0.02962 0.03125 -0.948 0.343782
#> ppr 0.16638 0.07715 2.157 0.031759 *
#> earth 0.42152 0.07696 5.477 8.55e-08 ***
#> glm 9.09473 2.76016 3.295 0.001090 **
#> svmRadial 0.20700 0.07664 2.701 0.007269 **
#> gam -0.19904 0.08942 -2.226 0.026686 *
#> glmnet -9.21373 2.77226 -3.324 0.000988 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for gaussian family taken to be 9.305633)
#>
#> Null deviance: 30592.2 on 342 degrees of freedom
#> Residual deviance: 3089.5 on 332 degrees of freedom
#> AIC: 1751.3
#>
#> Number of Fisher Scoring iterations: 2
linear$error
#> parameter RMSE Rsquared MAE RMSESD RsquaredSD MAESD
#> 1 none 3.12004 0.8949955 2.114804 0.8736797 0.04463439 0.3865752
#Predict for test set:
preds <- predict(all.models, newdata=X[!train,])
preds$ENS_greedy <- predict(greedy, newdata=X[!train,])
preds$ENS_linear <- predict(linear, newdata=X[!train,])
sort(sqrt(colMeans((preds - Y[!train]) ^ 2)))
#> ENS_linear ENS_greedy parRF gbm svmRadial ppr gam earth
#> 2.996924 3.057801 3.186682 3.214948 3.437198 3.551648 3.596338 3.655737
#> blackboost glmnet glm mlpWeightDecay
#> 3.804047 4.490592 4.493315 6.965890
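As I mentioned above, you can also blend the models using weights you pick yourself. Outside of the package, that’s nothing more than a weighted average of the prediction columns (the weights below are arbitrary, purely for illustration):
#A hand-weighted blend of a few of the models
#(these weights are made up, not fit by the package)
w <- c(parRF=0.5, gbm=0.3, earth=0.2)
manual <- as.matrix(preds)[, names(w)] %*% w
sqrt(mean((manual - Y[!train])^2))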
Please feel free to submit comments here or on GitHub. I’d also be happy to include any patches you feel like submitting. In particular, I could use some help adding support for multi-class models, writing more tests, and fixing bugs.