3.10 Wrapping up

Feature engineering is a gigantic topic, and the purpose of this chapter was to provide a basic introduction. We illustrated creating a recipe, or blueprint, for your feature engineering, which can be applied iteratively to slices of the data (as in a \(k\)-fold cross-validation approach) or to new data with the same structure. This process helps guard against data leakage, where information from data held out for evaluation (the test set, or the held-out portion of a fold) influences how the model is trained. Let’s look at a quick example by using PCA to retain the components needed to account for 95% of the total variation in the data. We’ll use \(k\)-fold cross-validation and see how many components we retain in each fold. First, we’ll create the recipe, which is essentially equivalent to what we saw in the previous chapter.

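library(tidymodels) # attaches recipes, rsample, purrr, and dplyr

rec <- recipe(score ~ ., train) %>% 
  # parse the test date string into a date-time
  step_mutate(tst_dt = lubridate::mdy_hms(tst_dt)) %>% 
  # flag ID columns so they are not used as predictors
  update_role(contains("id"), ncessch, new_role = "id vars") %>% 
  # drop zero-variance predictors, then code missing categories as "unknown"
  step_zv(all_predictors()) %>% 
  step_unknown(all_nominal()) %>%
  # note: step_medianimpute() was renamed step_impute_median() in later recipes releases
  step_medianimpute(all_numeric(), -all_outcomes(), -has_role("id vars")) %>%
  step_normalize(all_numeric(), -all_outcomes(), -has_role("id vars")) %>% 
  step_dummy(all_nominal(), -has_role("id vars")) %>% 
  # keep as many principal components as needed to cover 95% of the variance
  step_pca(all_numeric(), -all_outcomes(), -has_role("id vars"), 
           threshold = .95)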
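
To make the idea of a 95% threshold concrete, here is a minimal sketch of how a variance cutoff translates into a number of retained components. It uses base R’s prcomp() on the built-in mtcars data, which is only a stand-in for the chapter’s data (an assumption for illustration). The names predictors, cum_var, and n_keep are hypothetical, and step_pca() may differ in its implementation details.

```r
# mtcars is a stand-in dataset (an assumption, not the chapter's data)
predictors <- mtcars[, setdiff(names(mtcars), "mpg")]

# center and scale first, mirroring step_normalize() before step_pca()
pca <- prcomp(predictors, center = TRUE, scale. = TRUE)

# proportion of total variance carried by each component, and its running total
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# smallest number of components whose cumulative variance reaches 95%
n_keep <- which(cum_var >= 0.95)[1]

round(cum_var, 3)
n_keep
```

Because the cutoff is applied to a cumulative sum estimated from the data, the number of components kept can change whenever the data used for estimation changes.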

Now let’s create our \(k\)-fold cross-validation object.

set.seed(42)
cv <- vfold_cv(train)
cv
## #  10-fold cross-validation 
## # A tibble: 10 x 2
##    splits             id    
##    <list>             <chr> 
##  1 <split [2.6K/285]> Fold01
##  2 <split [2.6K/284]> Fold02
##  3 <split [2.6K/284]> Fold03
##  4 <split [2.6K/284]> Fold04
##  5 <split [2.6K/284]> Fold05
##  6 <split [2.6K/284]> Fold06
##  7 <split [2.6K/284]> Fold07
##  8 <split [2.6K/284]> Fold08
##  9 <split [2.6K/284]> Fold09
## 10 <split [2.6K/284]> Fold10

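Now we’ll loop through each split and pull out the assessment data for that fold, that is, the rows held out from model fitting. (The rows the model would actually be built on are returned by analysis() instead.)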

cv %>% 
  mutate(assessment_data = map(splits, assessment))
## #  10-fold cross-validation 
## # A tibble: 10 x 3
##    splits             id     assessment_data    
##    <list>             <chr>  <list>             
##  1 <split [2.6K/285]> Fold01 <tibble [285 × 39]>
##  2 <split [2.6K/284]> Fold02 <tibble [284 × 39]>
##  3 <split [2.6K/284]> Fold03 <tibble [284 × 39]>
##  4 <split [2.6K/284]> Fold04 <tibble [284 × 39]>
##  5 <split [2.6K/284]> Fold05 <tibble [284 × 39]>
##  6 <split [2.6K/284]> Fold06 <tibble [284 × 39]>
##  7 <split [2.6K/284]> Fold07 <tibble [284 × 39]>
##  8 <split [2.6K/284]> Fold08 <tibble [284 × 39]>
##  9 <split [2.6K/284]> Fold09 <tibble [284 × 39]>
## 10 <split [2.6K/284]> Fold10 <tibble [284 × 39]>
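
Each split carries two pieces: the analysis set, used for fitting, and the assessment set, held out for evaluation. The short sketch below pulls both from a single split to show how they relate. It uses mtcars with four folds as a stand-in for the chapter’s train data (an assumption for illustration), and the object names are hypothetical.

```r
library(rsample)

set.seed(42)
# mtcars stands in for `train` (an assumption, not the chapter's data)
folds <- vfold_cv(mtcars, v = 4)

first_split <- folds$splits[[1]]
fit_rows <- analysis(first_split)    # rows used to estimate the recipe and fit the model
held_out <- assessment(first_split)  # rows held out to evaluate that fit

# together the two pieces account for every row of the original data
nrow(fit_rows) + nrow(held_out) == nrow(mtcars)
```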

And finally, we prep the recipe and apply it to each fold’s assessment data.

cv %>% 
  mutate(assessment_data = map(splits, assessment),
         baked_data = map(assessment_data, ~prep(rec) %>%  bake(.x)))
## #  10-fold cross-validation 
## # A tibble: 10 x 4
##    splits             id     assessment_data     baked_data         
##    <list>             <chr>  <list>              <list>             
##  1 <split [2.6K/285]> Fold01 <tibble [285 × 39]> <tibble [285 × 17]>
##  2 <split [2.6K/284]> Fold02 <tibble [284 × 39]> <tibble [284 × 17]>
##  3 <split [2.6K/284]> Fold03 <tibble [284 × 39]> <tibble [284 × 17]>
##  4 <split [2.6K/284]> Fold04 <tibble [284 × 39]> <tibble [284 × 17]>
##  5 <split [2.6K/284]> Fold05 <tibble [284 × 39]> <tibble [284 × 17]>
##  6 <split [2.6K/284]> Fold06 <tibble [284 × 39]> <tibble [284 × 17]>
##  7 <split [2.6K/284]> Fold07 <tibble [284 × 39]> <tibble [284 × 17]>
##  8 <split [2.6K/284]> Fold08 <tibble [284 × 39]> <tibble [284 × 17]>
##  9 <split [2.6K/284]> Fold09 <tibble [284 × 39]> <tibble [284 × 17]>
## 10 <split [2.6K/284]> Fold10 <tibble [284 × 39]> <tibble [284 × 17]>

Let’s see how many principal components are retained for each fold.

cv %>% 
  mutate(assessment_data = map(splits, assessment),
         baked_data = map(assessment_data, ~prep(rec) %>%  bake(.x)),
         n_components = map_int(baked_data, ~sum(grepl("^PC", names(.x)))))
## #  10-fold cross-validation 
## # A tibble: 10 x 5
##    splits             id     assessment_data     baked_data         n_components
##    <list>             <chr>  <list>              <list>                    <int>
##  1 <split [2.6K/285]> Fold01 <tibble [285 × 39]> <tibble [285 × 17…            9
##  2 <split [2.6K/284]> Fold02 <tibble [284 × 39]> <tibble [284 × 17…            9
##  3 <split [2.6K/284]> Fold03 <tibble [284 × 39]> <tibble [284 × 17…            9
##  4 <split [2.6K/284]> Fold04 <tibble [284 × 39]> <tibble [284 × 17…            9
##  5 <split [2.6K/284]> Fold05 <tibble [284 × 39]> <tibble [284 × 17…            9
##  6 <split [2.6K/284]> Fold06 <tibble [284 × 39]> <tibble [284 × 17…            9
##  7 <split [2.6K/284]> Fold07 <tibble [284 × 39]> <tibble [284 × 17…            9
##  8 <split [2.6K/284]> Fold08 <tibble [284 × 39]> <tibble [284 × 17…            9
##  9 <split [2.6K/284]> Fold09 <tibble [284 × 39]> <tibble [284 × 17…            9
## 10 <split [2.6K/284]> Fold10 <tibble [284 × 39]> <tibble [284 × 17…            9

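In this case, the number of components is the same across all folds. Note, however, that prep(rec) is called here without a training argument, so the recipe is estimated on the full train data each time, and the same nine components are applied to every fold’s assessment set. For the recipe to be estimated independently within each fold, which is what guards against leakage during resampling, each fold’s analysis set would need to be passed to prep() through its training argument. The component loadings, and possibly the number of components retained, could then differ slightly from fold to fold.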
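
One way to make sure a recipe is estimated separately within each fold is to pass that fold’s analysis set to prep() through its training argument, and then bake the fold’s assessment set. The sketch below is a minimal illustration under stated assumptions: mtcars stands in for the chapter’s train data, the recipe is reduced to normalization plus PCA, and pca_rec, folds, and fold_results are hypothetical names.

```r
library(recipes)
library(rsample)
library(purrr)
library(dplyr)

set.seed(42)
# mtcars stands in for `train` (an assumption, not the chapter's data)
folds <- vfold_cv(mtcars, v = 4)

# simplified recipe: normalize the predictors, then keep 95% of their variance
pca_rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors(), threshold = .95)

fold_results <- folds %>%
  mutate(
    # estimate the recipe on this fold's analysis rows only
    prepped = map(splits, ~ prep(pca_rec, training = analysis(.x))),
    # apply those fold-specific estimates to the held-out rows
    baked_data = map2(prepped, splits, ~ bake(.x, new_data = assessment(.y))),
    # count the principal component columns that were retained
    n_components = map_int(baked_data, ~ sum(grepl("^PC", names(.x))))
  )

fold_results$n_components
```

With this arrangement, the means, standard deviations, and loadings used for a fold never see that fold’s held-out rows.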

Feature engineering can also feel overwhelming. There are so many options you could pursue to potentially increase model performance, and this chapter just scratched the surface. The sheer number of decisions is why many believe that feature engineering is the part of machine learning that is most “art”. Strong domain knowledge will absolutely help, as will practice. For those looking to go further with feature engineering, we suggest Kuhn and Johnson, who wrote an entire book on the topic (Feature Engineering and Selection). It provides practical advice in an accessible way without skimping on important details.