@bradleyboehmke
Last active March 24, 2021 13:13
Observing slight differences in resampling schemes between caret and rsample k-fold implementations
library(AmesHousing)
library(caret)
library(rsample)
ames <- make_ames()
ames <- tibble::rownames_to_column(ames)
# create folds
set.seed(123)
folds_caret <- createFolds(ames$Sale_Price, k = 10, returnTrain = TRUE)
set.seed(123)
folds_rsample <- vfold_cv(ames, v = 10)
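# A possible contributor to the mismatch (a sketch, not part of the original
# comparison): createFolds() samples within percentile groups of a numeric
# outcome, whereas vfold_cv() does not stratify unless `strata` is supplied.
# `folds_rsample_strat` is a name introduced here for illustration only.
# Even with stratification the two packages draw their random numbers
# differently, so identical folds should not be expected.
set.seed(123)
folds_rsample_strat <- vfold_cv(ames, v = 10, strata = "Sale_Price")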
# extract the row IDs of the first training (analysis) fold from each package
fold1_caret_row <- as.character(folds_caret$Fold01)
fold1_rsample_row <- analysis(folds_rsample$splits[[1]])$rowname
# check the intersection and differences in the rows used for
# the first folds
same <- intersect(fold1_caret_row, fold1_rsample_row)
different <- setdiff(fold1_caret_row, fold1_rsample_row)
# number of observations that appear in the first training fold of both packages
length(same)
# number of observations in caret's first training fold that are absent from
# rsample's; these mismatches would drive slight differences in results
length(different)
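# One way to remove this source of variation (a sketch, assuming the goal is
# for both workflows to resample identically): convert the rsample folds to
# caret's index format with rsample2caret() and pass them to trainControl().
# `shared_idx` and `ctrl_shared` are illustrative names, not from the original.
shared_idx <- rsample2caret(folds_rsample)
ctrl_shared <- trainControl(
  method = "cv",
  index = shared_idx$index,       # analysis (training) rows for each fold
  indexOut = shared_idx$indexOut  # assessment (held-out) rows for each fold
)
# Passing trControl = ctrl_shared to caret::train() would then fit and
# evaluate on the same rows as the rsample splits.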