3.10 Wrapping up
Feature engineering is a gigantic topic, and the purpose of this chapter was to provide a basic introduction. We illustrated how to create a recipe, or blueprint, for your feature engineering, which can then be applied iteratively to slices of the data (as in a \(k\)-fold cross-validation approach) or to new data with the same structure. This process helps guard against data leakage, where information from outside the training data (for example, from the test set) leaks into the model training process. Let’s look at a quick example by using PCA to retain 95% of the total variation in the data. We’ll use \(k\)-fold cross-validation and see how many components are retained in each fold. First, we’ll create the recipe, which is essentially the same one we developed previously.
rec <- recipe(score ~ ., train) %>%
  step_mutate(tst_dt = lubridate::mdy_hms(tst_dt)) %>%
  update_role(contains("id"), ncessch, new_role = "id vars") %>%
  step_zv(all_predictors()) %>%
  step_unknown(all_nominal()) %>%
  step_medianimpute(all_numeric(), -all_outcomes(), -has_role("id vars")) %>%
  step_normalize(all_numeric(), -all_outcomes(), -has_role("id vars")) %>%
  step_dummy(all_nominal(), -has_role("id vars")) %>%
  step_pca(all_numeric(), -all_outcomes(), -has_role("id vars"),
           threshold = .95)
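As a quick sanity check before moving on (a minimal sketch; the prepped and baked_train names below are just illustrative), we can prep the recipe on the training data it was defined with, apply it back to that data, and count how many principal components step_pca() kept to reach the 95% threshold. Components are named with a "PC" prefix, which is the same pattern we’ll match fold by fold below.
prepped <- prep(rec)
# Apply the trained recipe back to the training data
baked_train <- bake(prepped, new_data = train)
# Count the retained principal components
sum(grepl("^PC", names(baked_train)))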
Now let’s create our \(k\)-fold cross-validation object.
set.seed(42)
cv <- vfold_cv(train)
cv
## #  10-fold cross-validation 
## # A tibble: 10 x 2
##    splits             id    
##    <list>             <chr> 
##  1 <split [2.6K/285]> Fold01
##  2 <split [2.6K/284]> Fold02
##  3 <split [2.6K/284]> Fold03
##  4 <split [2.6K/284]> Fold04
##  5 <split [2.6K/284]> Fold05
##  6 <split [2.6K/284]> Fold06
##  7 <split [2.6K/284]> Fold07
##  8 <split [2.6K/284]> Fold08
##  9 <split [2.6K/284]> Fold09
## 10 <split [2.6K/284]> Fold10
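Each element of the splits column is an rsplit object. As a brief aside (a minimal sketch using rsample’s accessor functions; first_split is just an illustrative name), analysis() returns the portion of the data used for model fitting within a fold, while assessment() returns the held-out portion used for evaluation:
first_split <- cv$splits[[1]]
dim(analysis(first_split))    # data used to fit the model in this fold
dim(assessment(first_split))  # held-out data used to evaluate it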
Now we’ll loop through each split and pull out the assessment data for each fold, that is, the portion of the training data held out from model fitting in that fold.
cv %>%
  mutate(assessment_data = map(splits, assessment))
## #  10-fold cross-validation 
## # A tibble: 10 x 3
##    splits             id     assessment_data    
##    <list>             <chr>  <list>             
##  1 <split [2.6K/285]> Fold01 <tibble [285 × 39]>
##  2 <split [2.6K/284]> Fold02 <tibble [284 × 39]>
##  3 <split [2.6K/284]> Fold03 <tibble [284 × 39]>
##  4 <split [2.6K/284]> Fold04 <tibble [284 × 39]>
##  5 <split [2.6K/284]> Fold05 <tibble [284 × 39]>
##  6 <split [2.6K/284]> Fold06 <tibble [284 × 39]>
##  7 <split [2.6K/284]> Fold07 <tibble [284 × 39]>
##  8 <split [2.6K/284]> Fold08 <tibble [284 × 39]>
##  9 <split [2.6K/284]> Fold09 <tibble [284 × 39]>
## 10 <split [2.6K/284]> Fold10 <tibble [284 × 39]>
Next, we’ll apply the recipe to each fold’s assessment data.
cv %>%
  mutate(assessment_data = map(splits, assessment),
         baked_data = map(assessment_data, ~prep(rec) %>% bake(.x)))
## #  10-fold cross-validation 
## # A tibble: 10 x 4
##    splits             id     assessment_data     baked_data         
##    <list>             <chr>  <list>              <list>             
##  1 <split [2.6K/285]> Fold01 <tibble [285 × 39]> <tibble [285 × 17]>
##  2 <split [2.6K/284]> Fold02 <tibble [284 × 39]> <tibble [284 × 17]>
##  3 <split [2.6K/284]> Fold03 <tibble [284 × 39]> <tibble [284 × 17]>
##  4 <split [2.6K/284]> Fold04 <tibble [284 × 39]> <tibble [284 × 17]>
##  5 <split [2.6K/284]> Fold05 <tibble [284 × 39]> <tibble [284 × 17]>
##  6 <split [2.6K/284]> Fold06 <tibble [284 × 39]> <tibble [284 × 17]>
##  7 <split [2.6K/284]> Fold07 <tibble [284 × 39]> <tibble [284 × 17]>
##  8 <split [2.6K/284]> Fold08 <tibble [284 × 39]> <tibble [284 × 17]>
##  9 <split [2.6K/284]> Fold09 <tibble [284 × 39]> <tibble [284 × 17]>
## 10 <split [2.6K/284]> Fold10 <tibble [284 × 39]> <tibble [284 × 17]>
Finally, let’s see how many principal components are retained in each fold.
cv %>%
  mutate(assessment_data = map(splits, assessment),
         baked_data = map(assessment_data, ~prep(rec) %>% bake(.x)),
         n_components = map_int(baked_data, ~sum(grepl("^PC", names(.x)))))
## #  10-fold cross-validation 
## # A tibble: 10 x 5
##    splits             id     assessment_data     baked_data          n_components
##    <list>             <chr>  <list>              <list>                     <int>
##  1 <split [2.6K/285]> Fold01 <tibble [285 × 39]> <tibble [285 × 17]>            9
##  2 <split [2.6K/284]> Fold02 <tibble [284 × 39]> <tibble [284 × 17]>            9
##  3 <split [2.6K/284]> Fold03 <tibble [284 × 39]> <tibble [284 × 17]>            9
##  4 <split [2.6K/284]> Fold04 <tibble [284 × 39]> <tibble [284 × 17]>            9
##  5 <split [2.6K/284]> Fold05 <tibble [284 × 39]> <tibble [284 × 17]>            9
##  6 <split [2.6K/284]> Fold06 <tibble [284 × 39]> <tibble [284 × 17]>            9
##  7 <split [2.6K/284]> Fold07 <tibble [284 × 39]> <tibble [284 × 17]>            9
##  8 <split [2.6K/284]> Fold08 <tibble [284 × 39]> <tibble [284 × 17]>            9
##  9 <split [2.6K/284]> Fold09 <tibble [284 × 39]> <tibble [284 × 17]>            9
## 10 <split [2.6K/284]> Fold10 <tibble [284 × 39]> <tibble [284 × 17]>            9
In this case, the number of components is the same across all folds. One thing to note, however: because we called prep(rec) without supplying any training data, the recipe’s estimates (the imputation values, the means and standard deviations used for normalization, and the PCA loadings) all come from the full training set stored in the recipe, and only the baking was carried out fold by fold. To keep the estimation itself within each fold, so that nothing outside a fold’s analysis set informs the preprocessing applied to its assessment set, we can prep the recipe on each fold’s analysis data, as in the sketch below.
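Here is a minimal sketch of that fold-wise approach, assuming the rec and cv objects created above; prep() is given each fold’s analysis set, and the trained recipe is then used to bake that fold’s assessment set.
cv %>%
  mutate(
    # Estimate the recipe from this fold's analysis data only,
    # then apply it to the fold's assessment data
    baked_data = map(splits, ~prep(rec, training = analysis(.x)) %>%
                       bake(new_data = assessment(.x))),
    # Count the principal components retained in each fold
    n_components = map_int(baked_data, ~sum(grepl("^PC", names(.x))))
  )
With this fold-wise version, the PCA loadings (and potentially the number of retained components) can differ slightly from fold to fold, because each is estimated from slightly different data.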
Feature engineering can also regularly feel overwhelming. There are so many things you could try in the hope of increasing model performance, and this chapter has only scratched the surface. The sheer number of decisions involved is why many believe feature engineering is the most “art”-like part of machine learning. Strong domain knowledge will absolutely help, as will practice. For those looking to go further, we suggest Kuhn and Johnson’s Feature Engineering and Selection, an entire book on the topic that provides practical advice in an accessible way without skimping on important details.