3.2 Creating a recipe
Let’s read in some data and begin creating a basic recipe. We’ll work with the simulated statewide testing data introduced previously. This is a fairly large dataset, and since we’re just illustrating concepts here, we’ll pull a random 2% sample of the data to make everything run a bit quicker. We’ll also remove the classification variable, which is just a categorical version of score, our outcome.
In the chunk below, we read in the data, sample a random 2% of the data (being careful to set a seed first so our results are reproducible), split it into training and test sets, and extract just the training dataset. We’ll hold off on splitting it into CV folds for now.
library(tidyverse)
library(tidymodels)
set.seed(8675309)
full_train <- read_csv("https://github.com/uo-datasci-specialization/c4-ml-fall-2020/raw/master/data/train.csv") %>%
  slice_sample(prop = 0.02) %>%
  select(-classification)
splt <- initial_split(full_train)
train <- training(splt)
A quick reminder, the data look like this
And you can see the full data dictionary on the Kaggle website here.
When creating recipes, we can still use the formula interface to define how the data will be modeled. In this case, we’ll say that the score column is predicted by everything else in the data frame.
rec <- recipe(score ~ ., data = train)
Notice that I still declare the dataset, even though this is just a blueprint. It uses the dataset I provide to get the names of the columns from the dataset, but it doesn’t actually do anything with this dataset (unless we ask it to). Let’s look at what this recipe looks like
rec
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 38
Notice it just states that this is a data recipe in which we have specified 1 outcome variable and 38 predictors.
We can prep this recipe to learn more
prep(rec)
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 38
##
## Training data contained 2841 data points and 2841 incomplete rows.
Notice we now get an additional message about how many rows are in the data, and how many of these rows contain missing data (i.e., are incomplete). So the recipe is the blueprint, and we prep the recipe to get it to actually go into the data and conduct the operations. The dataset it has now, however, is just a placeholder that can be swapped for any other dataset with an equivalent structure.
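To make the blueprint idea concrete, here is a minimal sketch on a small made-up data frame. The objects toy_train and toy_new and their column names are hypothetical and are not part of the testing data; the sketch only assumes that recipe(), summary(), prep(), and bake() behave as documented in the recipes package.

```r
library(tidymodels)

# Hypothetical toy data; names are illustrative only
toy_train <- tibble(
  score = c(10, 12, 9, 15),
  hours = c(1, 2, NA, 4),
  group = c("a", "b", "a", "b")
)
toy_new <- tibble(
  score = c(11, 14),
  hours = c(3, 5),
  group = c("b", "a")
)

# The recipe only records column names, types, and roles
rec_toy <- recipe(score ~ ., data = toy_train)
roles <- summary(rec_toy)
roles

# prep() estimates whatever the steps need from the training data
# (there are no steps yet, so nothing is estimated here)
prepped_toy <- prep(rec_toy, training = toy_train)

# The prepped recipe can be applied to any data with the same structure
baked_new <- bake(prepped_toy, new_data = toy_new)
baked_new
```

Calling summary() on a recipe is a handy way to check which role each column has been assigned before any steps are added.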
But of course modeling score as the outcome with everything else predicting it (as is) is not reasonable for multiple reasons. We have many ID variables, for one, and we also have multiple categorical variables. For some methods (like tree-based models) it might be okay to leave these as is, but for others (like any model in the linear regression family) we’ll want to encode them somehow (e.g., dummy code).
We can do these operations by adding steps to our recipe. In the first step, we’ll update the role of all the ID variables so they are not included among the predictors. In the second, we will dummy code all nominal variables.
rec <- recipe(score ~ ., train) %>%
  update_role(contains("id"), ncessch, new_role = "id vars") %>%
  step_dummy(all_nominal())
When updating the roles, we can change the variable label (text passed to the new_role argument) to be anything we want, so long as it’s not "predictor" or "outcome".
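As a small illustration of role updating, the sketch below uses a hypothetical toy data frame (toy and student_id are made-up names) and inspects the roles with summary(). It is only meant to show the mechanics, not the full recipe used in this chapter.

```r
library(tidymodels)

# Hypothetical toy data with one ID-like column
toy <- tibble(
  score = c(10, 12, 9, 15),
  student_id = 1:4,
  hours = c(1, 2, 3, 4)
)

# Any column whose name contains "id" gets a custom role
rec_toy <- recipe(score ~ ., data = toy) %>%
  update_role(contains("id"), new_role = "id vars")

roles <- summary(rec_toy)
roles
```

The student_id column should now be listed with the role "id vars" rather than "predictor", so selectors such as all_predictors() will skip it.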
Notice in the above I am also using helper functions to apply the operations to all variables of a specific type. There are five main helper functions: all_predictors(), all_outcomes(), all_nominal(), all_numeric(), and has_role(). You can use these together, including with negation (e.g., -all_outcomes() to specify the operation should not apply to the outcome variable(s)), to select any set of variables you want to apply the operation to.
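The following sketch shows one way these selectors can be combined with negation. The toy data and the choice of step_normalize() are assumptions made purely for illustration; the point is only that -all_outcomes() keeps the outcome out of each operation.

```r
library(tidymodels)

# Hypothetical toy data; names are illustrative only
toy <- tibble(
  score = c(10, 12, 9, 15, 11, 13),
  hours = c(1, 2, 3, 4, 5, 6),
  group = c("a", "b", "c", "a", "b", "c")
)

rec_toy <- recipe(score ~ ., data = toy) %>%
  # center and scale every numeric column except the outcome
  step_normalize(all_numeric(), -all_outcomes()) %>%
  # dummy code every nominal column except the outcome
  step_dummy(all_nominal(), -all_outcomes())

baked <- bake(prep(rec_toy), new_data = NULL)
baked
```

Here hours is standardized and group is dummy coded, while score is left on its original scale.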
Let’s try prepping this recipe
prep(rec)
## Error: Only one factor level in lang_cd
Uh oh! We have an error. Our recipe is trying to dummy code the lang_cd variable, but it has only one level. It’s kind of hard to dummy-code a constant!
Luckily, we can expand our recipe to first remove any zero-variance predictors, like so
rec <- recipe(score ~ ., train) %>%
  update_role(contains("id"), ncessch, new_role = "id vars") %>%
  step_zv(all_predictors()) %>%
  step_dummy(all_nominal())
The zv part stands for “zero variance” and should take care of this problem. Let’s try again.
prep(rec)
## Data Recipe
##
## Inputs:
##
## role #variables
## id vars 6
## outcome 1
## predictor 32
##
## Training data contained 2841 data points and 2841 incomplete rows.
##
## Operations:
##
## Zero variance filter removed calc_admn_cd, lang_cd [trained]
## Dummy variables from gndr, ethnic_cd, tst_bnch, tst_dt, ... [trained]
Beautiful! Note we do still get a warning here, but I’ve omitted it in the text (we’ll take care of it later). Our recipe says we now have 6 ID variables, 1 outcome, and 32 predictors, with 2841 data points (rows of data). The calc_admn_cd and lang_cd variables have been removed because they have zero variance, and several variables have been dummy coded, including gndr and ethnic_cd, among others.
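The same behavior can be reproduced on a tiny, made-up example. In the sketch below, const is a hypothetical column with a single value; tidy() is used to list what the zero-variance filter removed. The data and names are assumptions for illustration only.

```r
library(tidymodels)

# Hypothetical toy data with one constant column
toy <- tibble(
  score = c(10, 12, 9, 15),
  const = rep("x", 4),
  group = c("a", "b", "a", "b")
)

rec_toy <- recipe(score ~ ., data = toy) %>%
  step_zv(all_predictors()) %>%
  step_dummy(all_nominal())

prepped_toy <- prep(rec_toy)

# Which columns did the first step (the zero-variance filter) remove?
removed <- tidy(prepped_toy, number = 1)
removed

baked <- bake(prepped_toy, new_data = NULL)
names(baked)
```

Because const is dropped before the dummy-coding step runs, step_dummy() only ever sees group.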
Let’s dig just a bit deeper here though. What’s going on with these zero-variance variables? Let’s look back at the training data.
train %>%
  count(calc_admn_cd)
## # A tibble: 1 x 2
##   calc_admn_cd     n
##   <lgl>        <int>
## 1 NA            2841
train %>%
  count(lang_cd)
## # A tibble: 2 x 2
##   lang_cd     n
##   <chr>   <int>
## 1 S          80
## 2 NA       2761
So at least in our sample, calc_admn_cd really is just fully missing, which means it might as well be dropped because it’s providing us exactly nothing. But that’s not the case with lang_cd. It has two values, NA and S. This variable represents the language the test was administered in, and the NA values are actually meaningful here because they are the “default” administration, meaning English. So rather than dropping these, let’s mutate them to transform the NA values to "E" for English. We could reasonably do this inside or outside the recipe, but a good rule of thumb is, if it can go in the recipe, put it in the recipe. It can’t hurt, and doing operations outside of the recipe risks data leakage.
rec <- recipe(score ~ ., train) %>%
  update_role(contains("id"), ncessch, new_role = "id vars") %>%
  step_mutate(lang_cd = ifelse(is.na(lang_cd), "E", lang_cd)) %>%
  step_zv(all_predictors()) %>%
  step_dummy(all_nominal())
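As a brief aside, the sketch below shows why keeping this kind of recoding inside the recipe is convenient: the same step_mutate() is replayed on any new data passed to bake(). The toy data frames and the lang column are hypothetical. The as.character() call is a defensive assumption, in case the column has already been converted to a factor by the time the step runs.

```r
library(tidymodels)

# Hypothetical toy data; NA stands for the default (English) administration
toy_train <- tibble(
  score = c(10, 12, 9, 15),
  lang = c(NA, "S", NA, NA)
)
toy_new <- tibble(
  score = c(11, 14),
  lang = c("S", NA)
)

rec_toy <- recipe(score ~ ., data = toy_train) %>%
  step_mutate(lang = ifelse(is.na(lang), "E", as.character(lang)))

prepped_toy <- prep(rec_toy)

# The recoding defined in the recipe is applied to the new data as well
baked_new <- bake(prepped_toy, new_data = toy_new)
baked_new
```

With the recoding stored in the recipe, there is no separate preprocessing script to remember to run on the test set.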
Let’s take a look at what our data would actually look like when applying this recipe now. First, we’ll prep the recipe
prepped <- prep(rec)
prepped
## Data Recipe
##
## Inputs:
##
## role #variables
## id vars 6
## outcome 1
## predictor 32
##
## Training data contained 2841 data points and 2841 incomplete rows.
##
## Operations:
##
## Variable mutation for lang_cd [trained]
## Zero variance filter removed calc_admn_cd [trained]
## Dummy variables from gndr, ethnic_cd, tst_bnch, tst_dt, ... [trained]
And we see that lang_cd is no longer being caught by the zero variance filter. Next we’ll bake the recipe to actually apply it to our data. If we specify new_data = NULL, bake will apply the operation to the data we specified in the recipe. But we can also pass new data as an additional argument and it will apply the operations to that data instead of the data specified in the recipe.
bake(prepped, new_data = NULL)
## # A tibble: 2,841 x 106
##        id attnd_dist_inst… attnd_schl_inst… enrl_grd partic_dist_ins…
##     <dbl>            <dbl>            <dbl>    <dbl>            <dbl>
##  1  62576             2083             1353        7             2083
##  2  71424             2180              878        6             2180
##  3 179893             2244             1334        3             2244
##  4 136083             2142             4858        5             2142
##  5 196809             2212             1068        3             2212
##  6  13931             2088              581        8             2088
##  7 103344             1926              102        6             1926
##  8 105122             2142              766        6             2142
##  9 172543             1965              197        4             1965
## 10  45153             2083              542        6             2083
## # … with 2,831 more rows, and 101 more variables: partic_schl_inst_id <dbl>,
## #   lang_cd <fct>, ncessch <dbl>, lat <dbl>, lon <dbl>, score <dbl>,
## #   gndr_M <dbl>, ethnic_cd_B <dbl>, ethnic_cd_H <dbl>, ethnic_cd_I <dbl>,
## #   ethnic_cd_M <dbl>, ethnic_cd_P <dbl>, ethnic_cd_W <dbl>,
## #   tst_bnch_X2B <dbl>, tst_bnch_X3B <dbl>, tst_bnch_G4 <dbl>,
## #   tst_bnch_G6 <dbl>, tst_bnch_G7 <dbl>, tst_dt_X3.21.2018.0.00.00 <dbl>,
## #   tst_dt_X3.22.2018.0.00.00 <dbl>, tst_dt_X3.23.2018.0.00.00 <dbl>,
## #   tst_dt_X3.8.2018.0.00.00 <dbl>, tst_dt_X3.9.2018.0.00.00 <dbl>,
## #   tst_dt_X4.10.2018.0.00.00 <dbl>, tst_dt_X4.11.2018.0.00.00 <dbl>,
## #   tst_dt_X4.12.2018.0.00.00 <dbl>, tst_dt_X4.13.2018.0.00.00 <dbl>,
## #   tst_dt_X4.16.2018.0.00.00 <dbl>, tst_dt_X4.17.2018.0.00.00 <dbl>,
## #   tst_dt_X4.18.2018.0.00.00 <dbl>, tst_dt_X4.19.2018.0.00.00 <dbl>,
## #   tst_dt_X4.2.2018.0.00.00 <dbl>, tst_dt_X4.20.2018.0.00.00 <dbl>,
## #   tst_dt_X4.23.2018.0.00.00 <dbl>, tst_dt_X4.24.2018.0.00.00 <dbl>,
## #   tst_dt_X4.25.2018.0.00.00 <dbl>, tst_dt_X4.26.2018.0.00.00 <dbl>,
## #   tst_dt_X4.27.2018.0.00.00 <dbl>, tst_dt_X4.30.2018.0.00.00 <dbl>,
## #   tst_dt_X4.5.2018.0.00.00 <dbl>, tst_dt_X4.6.2018.0.00.00 <dbl>,
## #   tst_dt_X4.9.2018.0.00.00 <dbl>, tst_dt_X5.1.2018.0.00.00 <dbl>,
## #   tst_dt_X5.10.2018.0.00.00 <dbl>, tst_dt_X5.11.2018.0.00.00 <dbl>,
## #   tst_dt_X5.14.2018.0.00.00 <dbl>, tst_dt_X5.15.2018.0.00.00 <dbl>,
## #   tst_dt_X5.16.2018.0.00.00 <dbl>, tst_dt_X5.17.2018.0.00.00 <dbl>,
## #   tst_dt_X5.18.2018.0.00.00 <dbl>, tst_dt_X5.2.2018.0.00.00 <dbl>,
## #   tst_dt_X5.21.2018.0.00.00 <dbl>, tst_dt_X5.22.2018.0.00.00 <dbl>,
## #   tst_dt_X5.23.2018.0.00.00 <dbl>, tst_dt_X5.24.2018.0.00.00 <dbl>,
## #   tst_dt_X5.25.2018.0.00.00 <dbl>, tst_dt_X5.29.2018.0.00.00 <dbl>,
## #   tst_dt_X5.3.2018.0.00.00 <dbl>, tst_dt_X5.30.2018.0.00.00 <dbl>,
## #   tst_dt_X5.31.2018.0.00.00 <dbl>, tst_dt_X5.4.2018.0.00.00 <dbl>,
## #   tst_dt_X5.7.2018.0.00.00 <dbl>, tst_dt_X5.8.2018.0.00.00 <dbl>,
## #   tst_dt_X5.9.2018.0.00.00 <dbl>, tst_dt_X6.1.2018.0.00.00 <dbl>,
## #   tst_dt_X6.4.2018.0.00.00 <dbl>, tst_dt_X6.5.2018.0.00.00 <dbl>,
## #   tst_dt_X6.6.2018.0.00.00 <dbl>, tst_dt_X6.7.2018.0.00.00 <dbl>,
## #   tst_dt_X6.8.2018.0.00.00 <dbl>, migrant_ed_fg_Y <dbl>, ind_ed_fg_Y <dbl>,
## #   sp_ed_fg_Y <dbl>, tag_ed_fg_Y <dbl>, econ_dsvntg_Y <dbl>, ayp_lep_B <dbl>,
## #   ayp_lep_E <dbl>, ayp_lep_F <dbl>, ayp_lep_M <dbl>, ayp_lep_N <dbl>,
## #   ayp_lep_S <dbl>, ayp_lep_W <dbl>, ayp_lep_X <dbl>, ayp_lep_Y <dbl>,
## #   stay_in_dist_Y <dbl>, stay_in_schl_Y <dbl>, dist_sped_Y <dbl>,
## #   trgt_assist_fg_Y <dbl>, ayp_dist_partic_Y <dbl>, ayp_schl_partic_Y <dbl>,
## #   ayp_dist_prfrm_Y <dbl>, ayp_schl_prfrm_Y <dbl>, rc_dist_partic_Y <dbl>,
## #   rc_schl_partic_Y <dbl>, rc_dist_prfrm_Y <dbl>, rc_schl_prfrm_Y <dbl>,
## #   tst_atmpt_fg_Y <dbl>, grp_rpt_dist_partic_Y <dbl>,
## #   grp_rpt_schl_partic_Y <dbl>, grp_rpt_dist_prfrm_Y <dbl>, …
And now we can actually see the dummy-coded categorical variables, along with the other operations we requested. For example, calc_admn_cd is not in the dataset. Notice the ID variables are output though, which makes sense because they are often necessary for joining with other data sources. But it’s important to realize that they are output (i.e., all variables are returned, regardless of role) because if we passed this directly to a model they would be included as predictors. Note that there may be reasons you would want to include a school and/or district level ID variable in your modeling, but you certainly would not want redundant variables.
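If you only want the modeling columns back, one option is to pass selectors to bake(). The sketch below uses a hypothetical toy data frame (toy and student_id are made-up names) and relies on the documented behavior that bake() accepts selector functions after new_data.

```r
library(tidymodels)

# Hypothetical toy data with one ID-like column
toy <- tibble(
  score = c(10, 12, 9, 15),
  student_id = 1:4,
  group = c("a", "b", "a", "b")
)

prepped_toy <- recipe(score ~ ., data = toy) %>%
  update_role(student_id, new_role = "id vars") %>%
  step_dummy(all_nominal()) %>%
  prep()

# Default: every column is returned, regardless of role
all_cols <- names(bake(prepped_toy, new_data = NULL))

# Selectors restrict the output to predictors and outcomes only
model_cols <- names(
  bake(prepped_toy, new_data = NULL, all_predictors(), all_outcomes())
)

all_cols
model_cols
```

The ID column appears in the first result but not the second, which is the version you would hand to a model fitting function directly.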
We do still have one minor issue with this recipe though, which is pretty evident when looking at the column names of our baked dataset. The tst_dt variable, which specifies the date the test was taken, was treated as a categorical variable because it read in as a character vector. That means all the dates are being dummy coded! Let’s fix this by just transforming it to a date within our step_mutate.
rec <- recipe(score ~ ., train) %>%
  update_role(contains("id"), ncessch, new_role = "id vars") %>%
  step_mutate(lang_cd = factor(ifelse(is.na(lang_cd), "E", lang_cd)),
              tst_dt = lubridate::mdy_hms(tst_dt)) %>%
  step_zv(all_predictors()) %>%
  step_dummy(all_nominal())
And now when we prep/bake the dataset it’s still a date variable, which is what we probably want (it will be modeled as a numeric variable).
rec %>%
  prep() %>%
  bake(new_data = NULL)
## # A tibble: 2,841 x 55
##        id attnd_dist_inst… attnd_schl_inst… enrl_grd tst_dt
##     <dbl>            <dbl>            <dbl>    <dbl> <dttm>
##  1  62576             2083             1353        7 2018-05-16 00:00:00
##  2  71424             2180              878        6 2018-04-24 00:00:00
##  3 179893             2244             1334        3 2018-05-25 00:00:00
##  4 136083             2142             4858        5 2018-05-24 00:00:00
##  5 196809             2212             1068        3 2018-05-16 00:00:00
##  6  13931             2088              581        8 2018-06-06 00:00:00
##  7 103344             1926              102        6 2018-06-04 00:00:00
##  8 105122             2142              766        6 2018-05-08 00:00:00
##  9 172543             1965              197        4 2018-05-23 00:00:00
## 10  45153             2083              542        6 2018-05-10 00:00:00
## # … with 2,831 more rows, and 50 more variables: partic_dist_inst_id <dbl>,
## #   partic_schl_inst_id <dbl>, ncessch <dbl>, lat <dbl>, lon <dbl>,
## #   score <dbl>, gndr_M <dbl>, ethnic_cd_B <dbl>, ethnic_cd_H <dbl>,
## #   ethnic_cd_I <dbl>, ethnic_cd_M <dbl>, ethnic_cd_P <dbl>, ethnic_cd_W <dbl>,
## #   tst_bnch_X2B <dbl>, tst_bnch_X3B <dbl>, tst_bnch_G4 <dbl>,
## #   tst_bnch_G6 <dbl>, tst_bnch_G7 <dbl>, migrant_ed_fg_Y <dbl>,
## #   ind_ed_fg_Y <dbl>, sp_ed_fg_Y <dbl>, tag_ed_fg_Y <dbl>,
## #   econ_dsvntg_Y <dbl>, ayp_lep_B <dbl>, ayp_lep_E <dbl>, ayp_lep_F <dbl>,
## #   ayp_lep_M <dbl>, ayp_lep_N <dbl>, ayp_lep_S <dbl>, ayp_lep_W <dbl>,
## #   ayp_lep_X <dbl>, ayp_lep_Y <dbl>, stay_in_dist_Y <dbl>,
## #   stay_in_schl_Y <dbl>, dist_sped_Y <dbl>, trgt_assist_fg_Y <dbl>,
## #   ayp_dist_partic_Y <dbl>, ayp_schl_partic_Y <dbl>, ayp_dist_prfrm_Y <dbl>,
## #   ayp_schl_prfrm_Y <dbl>, rc_dist_partic_Y <dbl>, rc_schl_partic_Y <dbl>,
## #   rc_dist_prfrm_Y <dbl>, rc_schl_prfrm_Y <dbl>, lang_cd_E <dbl>,
## #   tst_atmpt_fg_Y <dbl>, grp_rpt_dist_partic_Y <dbl>,
## #   grp_rpt_schl_partic_Y <dbl>, grp_rpt_dist_prfrm_Y <dbl>,
## #   grp_rpt_schl_prfrm_Y <dbl>
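To see what “modeled as a numeric variable” means for a date-time, the sketch below parses two date strings with lubridate::mdy_hms() and looks at the underlying numbers. The exact string format of tst_dt in the raw file is an assumption here, inferred from the dummy column names above.

```r
library(lubridate)

# Assumed raw format of the date strings (month/day/year hour:minute:second)
raw_dates <- c("4/24/2018 0:00:00", "5/16/2018 0:00:00")

# mdy_hms() returns POSIXct date-times (UTC by default)
parsed <- mdy_hms(raw_dates)
class(parsed)

# Underneath, each date-time is a count of seconds since 1970-01-01 UTC
secs <- as.numeric(parsed)
secs

# Difference between the two test dates, in days
days_apart <- diff(secs) / 86400
days_apart
```

So a model sees the test date as one continuous predictor, where later dates have larger values, rather than as dozens of dummy columns.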
3.2.1 Order matters
It’s important to realize that the order of the steps matters. In our recipe, we first declare ID variables as having a different role than predictors or outcomes, we then modify two variables, remove zero-variance predictors, and finally dummy code all categorical (nominal) variables. What happens if we instead dummy code and then remove zero-variance predictors?
rec <- recipe(score ~ ., train) %>%
  step_dummy(all_nominal()) %>%
  step_zv(all_predictors())
prep(rec)
## Error: Only one factor level in lang_cd
We end up with the error, whereas we don’t if we remove zero variance predictors and then dummy code
rec <- recipe(score ~ ., train) %>%
  step_zv(all_predictors()) %>%
  step_dummy(all_nominal())
prep(rec)
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 38
##
## Training data contained 2841 data points and 2841 incomplete rows.
##
## Operations:
##
## Zero variance filter removed calc_admn_cd, lang_cd [trained]
## Dummy variables from gndr, ethnic_cd, tst_bnch, tst_dt, ... [trained]
This is true for all steps, and may occasionally lead to you needing to apply the same operation at multiple steps (e.g., a near zero variance filter could be applied before and after dummy-coding).
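One situation where a filter after dummy coding can help is a factor level that never appears in the training data. The sketch below is a made-up example (the toy data and the unused level "c" are assumptions): dummy coding creates an all-zero column for the unused level, and a zero-variance filter placed afterwards removes it.

```r
library(tidymodels)

# Hypothetical toy data: level "c" is declared but never observed
toy <- tibble(
  score = c(10, 12, 9, 15),
  group = factor(c("a", "b", "a", "b"), levels = c("a", "b", "c"))
)

rec_toy <- recipe(score ~ ., data = toy) %>%
  # creates group_b and group_c; group_c is all zeros in this data
  step_dummy(all_nominal()) %>%
  # the filter runs after dummy coding, so it can catch group_c
  step_zv(all_predictors())

baked <- bake(prep(rec_toy), new_data = NULL)
names(baked)
```

Had the filter been placed only before step_dummy(), the constant dummy column would have been passed along to the model.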
All of the above serves as a basic introduction to developing a recipe, and what follows goes into more detail on specific feature engineering pieces. For complete documentation on all possible recipe steps, please see the documentation.