3.10 Wrapping up
Feature engineering is a gigantic topic, and the purpose of this chapter was to provide a basic introduction. We illustrated how to create a recipe, or blueprint, for your feature engineering, which can then be applied iteratively to slices of the data (as in a \(k\)-fold cross-validation approach) or to new data with the same structure. This process helps guard against data leakage, where information from outside the training data (for example, from the test set) leaks into the model training process. Let’s look at a quick example by using PCA to retain 95% of the total variation in the data. We’ll use \(k\)-fold cross-validation and see how many components are retained in each fold. First, we’ll create the recipe, which is essentially the same one we developed previously.
rec <- recipe(score ~ ., train) %>%
  step_mutate(tst_dt = lubridate::mdy_hms(tst_dt)) %>%
  update_role(contains("id"), ncessch, new_role = "id vars") %>%
  step_zv(all_predictors()) %>%
  step_unknown(all_nominal()) %>%
  step_medianimpute(all_numeric(), -all_outcomes(), -has_role("id vars")) %>%
  step_normalize(all_numeric(), -all_outcomes(), -has_role("id vars")) %>%
  step_dummy(all_nominal(), -has_role("id vars")) %>%
  step_pca(all_numeric(), -all_outcomes(), -has_role("id vars"),
           threshold = .95)
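As a quick sanity check before moving on (a minimal sketch; the prepped and baked_train names below are just illustrative), we can prep the recipe on the training data it was defined with, apply it back to that data, and count how many principal components step_pca() kept to reach the 95% threshold. Components are named with a "PC" prefix, which is the same pattern we’ll match fold by fold below.
prepped <- prep(rec)
# Apply the trained recipe back to the training data
baked_train <- bake(prepped, new_data = train)
# Count the retained principal components
sum(grepl("^PC", names(baked_train)))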
Now let’s create our \(k\)-fold cross-validation object.
set.seed(42)
cv <- vfold_cv(train)
cv
## #  10-fold cross-validation 
## # A tibble: 10 x 2
##    splits             id    
##    <list>             <chr> 
##  1 <split [2.6K/285]> Fold01
##  2 <split [2.6K/284]> Fold02
##  3 <split [2.6K/284]> Fold03
##  4 <split [2.6K/284]> Fold04
##  5 <split [2.6K/284]> Fold05
##  6 <split [2.6K/284]> Fold06
##  7 <split [2.6K/284]> Fold07
##  8 <split [2.6K/284]> Fold08
##  9 <split [2.6K/284]> Fold09
## 10 <split [2.6K/284]> Fold10
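Each element of the splits column is an rsplit object. As a brief aside (a minimal sketch using rsample’s accessor functions; first_split is just an illustrative name), analysis() returns the portion of the data used for model fitting within a fold, while assessment() returns the held-out portion used for evaluation:
first_split <- cv$splits[[1]]
dim(analysis(first_split))    # data used to fit the model in this fold
dim(assessment(first_split))  # held-out data used to evaluate it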
Now we’ll loop through each split and pull out the assessment data for each fold, that is, the portion of the training data held out from model fitting in that fold.
cv %>%
  mutate(assessment_data = map(splits, assessment))
## #  10-fold cross-validation 
## # A tibble: 10 x 3
##    splits             id     assessment_data    
##    <list>             <chr>  <list>             
##  1 <split [2.6K/285]> Fold01 <tibble [285 × 39]>
##  2 <split [2.6K/284]> Fold02 <tibble [284 × 39]>
##  3 <split [2.6K/284]> Fold03 <tibble [284 × 39]>
##  4 <split [2.6K/284]> Fold04 <tibble [284 × 39]>
##  5 <split [2.6K/284]> Fold05 <tibble [284 × 39]>
##  6 <split [2.6K/284]> Fold06 <tibble [284 × 39]>
##  7 <split [2.6K/284]> Fold07 <tibble [284 × 39]>
##  8 <split [2.6K/284]> Fold08 <tibble [284 × 39]>
##  9 <split [2.6K/284]> Fold09 <tibble [284 × 39]>
## 10 <split [2.6K/284]> Fold10 <tibble [284 × 39]>
Next, we’ll apply the recipe to each fold’s assessment data.
cv %>%
  mutate(assessment_data = map(splits, assessment),
         baked_data = map(assessment_data, ~prep(rec) %>% bake(.x)))
## #  10-fold cross-validation 
## # A tibble: 10 x 4
##    splits             id     assessment_data     baked_data         
##    <list>             <chr>  <list>              <list>             
##  1 <split [2.6K/285]> Fold01 <tibble [285 × 39]> <tibble [285 × 17]>
##  2 <split [2.6K/284]> Fold02 <tibble [284 × 39]> <tibble [284 × 17]>
##  3 <split [2.6K/284]> Fold03 <tibble [284 × 39]> <tibble [284 × 17]>
##  4 <split [2.6K/284]> Fold04 <tibble [284 × 39]> <tibble [284 × 17]>
##  5 <split [2.6K/284]> Fold05 <tibble [284 × 39]> <tibble [284 × 17]>
##  6 <split [2.6K/284]> Fold06 <tibble [284 × 39]> <tibble [284 × 17]>
##  7 <split [2.6K/284]> Fold07 <tibble [284 × 39]> <tibble [284 × 17]>
##  8 <split [2.6K/284]> Fold08 <tibble [284 × 39]> <tibble [284 × 17]>
##  9 <split [2.6K/284]> Fold09 <tibble [284 × 39]> <tibble [284 × 17]>
## 10 <split [2.6K/284]> Fold10 <tibble [284 × 39]> <tibble [284 × 17]>
Finally, let’s see how many principal components are retained in each fold.
cv %>%
  mutate(assessment_data = map(splits, assessment),
         baked_data = map(assessment_data, ~prep(rec) %>% bake(.x)),
         n_components = map_int(baked_data, ~sum(grepl("^PC", names(.x)))))
## #  10-fold cross-validation 
## # A tibble: 10 x 5
##    splits             id     assessment_data     baked_data          n_components
##    <list>             <chr>  <list>              <list>                     <int>
##  1 <split [2.6K/285]> Fold01 <tibble [285 × 39]> <tibble [285 × 17]>            9
##  2 <split [2.6K/284]> Fold02 <tibble [284 × 39]> <tibble [284 × 17]>            9
##  3 <split [2.6K/284]> Fold03 <tibble [284 × 39]> <tibble [284 × 17]>            9
##  4 <split [2.6K/284]> Fold04 <tibble [284 × 39]> <tibble [284 × 17]>            9
##  5 <split [2.6K/284]> Fold05 <tibble [284 × 39]> <tibble [284 × 17]>            9
##  6 <split [2.6K/284]> Fold06 <tibble [284 × 39]> <tibble [284 × 17]>            9
##  7 <split [2.6K/284]> Fold07 <tibble [284 × 39]> <tibble [284 × 17]>            9
##  8 <split [2.6K/284]> Fold08 <tibble [284 × 39]> <tibble [284 × 17]>            9
##  9 <split [2.6K/284]> Fold09 <tibble [284 × 39]> <tibble [284 × 17]>            9
## 10 <split [2.6K/284]> Fold10 <tibble [284 × 39]> <tibble [284 × 17]>            9
In this case, the number of components is the same across all folds. One thing to note, however: because we called prep(rec) without supplying any training data, the recipe’s estimates (the imputation values, the means and standard deviations used for normalization, and the PCA loadings) all come from the full training set stored in the recipe, and only the baking was carried out fold by fold. To keep the estimation itself within each fold, so that nothing outside a fold’s analysis set informs the preprocessing applied to its assessment set, we can prep the recipe on each fold’s analysis data, as in the sketch below.
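Here is a minimal sketch of that fold-wise approach, assuming the rec and cv objects created above; prep() is given each fold’s analysis set, and the trained recipe is then used to bake that fold’s assessment set.
cv %>%
  mutate(
    # Estimate the recipe from this fold's analysis data only,
    # then apply it to the fold's assessment data
    baked_data = map(splits, ~prep(rec, training = analysis(.x)) %>%
                       bake(new_data = assessment(.x))),
    # Count the principal components retained in each fold
    n_components = map_int(baked_data, ~sum(grepl("^PC", names(.x))))
  )
With this fold-wise version, the PCA loadings (and potentially the number of retained components) can differ slightly from fold to fold, because each is estimated from slightly different data.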
Feature engineering can also regularly feel overwhelming. There are so many things you could try in the hope of increasing model performance, and this chapter has only scratched the surface. The sheer number of decisions involved is why many believe feature engineering is the most “art”-like part of machine learning. Strong domain knowledge will absolutely help, as will practice. For those looking to go further, we suggest Kuhn and Johnson’s Feature Engineering and Selection, an entire book on the topic that provides practical advice in an accessible way without skimping on important details.