3.8 Interactions
An interaction occurs when the slope, or relation, between one variable and the outcome depends upon a second variable. This is also often referred to as moderation or a moderating variable. For example, consider data that look like this
read_csv(here::here("data", "benchmarks.csv"))
benchmarks <-
ggplot(benchmarks, aes(rdg_fall, rdg_spr)) +
geom_point(color = "gray70") +
geom_smooth(aes(color = ell),
method = "lm")
where ell stands for English language learner. The plot above shows the relation between students scores on a reading assessment administered in the fall and the spring by ELL status. Notice that the slope for Non-ELL is markedly steeper than the other two groups. ELL status thus moderates the relation between the fall and spring assessment scores. If we’re fitting a model within a linear regression framework, we’ll want to make sure that we model the slopes separately for each of these groups (otherwise we would be leaving out important structure in the data, and our model performance would suffer). Interactions are powerful and important to consider, particularly if there is strong theory (or empirical evidence) suggesting that an interaction exists. However, they are not needed in many frameworks and can be derived from the data directly (e.g., tree-based methods, where the degree of possible interaction is determined by the depth of the tree). In this section, we therefore focus on interactions within a linear regression framework.
3.8.1 Creating interactions “by hand”
Even with base R, modeling interactions is pretty straightforward. In the above, we would just specify something like
lm(rdg_spr ~ rdg_fall*ell, data = benchmarks)
or equivalently
lm(rdg_spr ~ rdg_fall + ell + rdg_fall:ell, data = benchmarks)
However, this is doing some work under the hood for you which might go unnoticed. First, it’s dummy-coding ell
, then it’s creating, in this case, two new variables that are equal to \(ELL_{Monitor} \times Rdg_{fall}\) and \(ELL_{Non-ELL} \times Rdg_{fall}\) (with \(ELL_{Active}\) set as the reference group by default).
Let’s try doing this manually. First we need to dummy code ell
. To make things a bit more clear, I’ll only select the variables here that we’re using in our modeling.
benchmarks %>%
dummies <- select(rdg_fall, ell, rdg_spr) %>%
mutate(ell_monitor = ifelse(ell == "Monitor", 1, 0),
ell_non = ifelse(ell == "Non-ELL", 1, 0),)
dummies
## [90m# A tibble: 174 x 5[39m
## rdg_fall ell rdg_spr ell_monitor ell_non
## [3m[90m<dbl>[39m[23m [3m[90m<chr>[39m[23m [3m[90m<dbl>[39m[23m [3m[90m<dbl>[39m[23m [3m[90m<dbl>[39m[23m
## [90m 1[39m 181 Active 194 0 0
## [90m 2[39m 166 Non-ELL 159 0 1
## [90m 3[39m 216 Non-ELL 198 0 1
## [90m 4[39m 203 Non-ELL 204 0 1
## [90m 5[39m 198 Active 173 0 0
## [90m 6[39m 188 Active 173 0 0
## [90m 7[39m 202 Monitor 200 1 0
## [90m 8[39m 182 Active 206 0 0
## [90m 9[39m 194 Non-ELL 191 0 1
## [90m10[39m 170 Active 185 0 0
## [90m# … with 164 more rows[39m
Next, we’ll multiply each of these dummy variables by rdg_fall
.
dummies %>%
interactions <- mutate(fall_monitor = rdg_fall * ell_monitor,
fall_non = rdg_fall * ell_non)
interactions
## [90m# A tibble: 174 x 7[39m
## rdg_fall ell rdg_spr ell_monitor ell_non fall_monitor fall_non
## [3m[90m<dbl>[39m[23m [3m[90m<chr>[39m[23m [3m[90m<dbl>[39m[23m [3m[90m<dbl>[39m[23m [3m[90m<dbl>[39m[23m [3m[90m<dbl>[39m[23m [3m[90m<dbl>[39m[23m
## [90m 1[39m 181 Active 194 0 0 0 0
## [90m 2[39m 166 Non-ELL 159 0 1 0 166
## [90m 3[39m 216 Non-ELL 198 0 1 0 216
## [90m 4[39m 203 Non-ELL 204 0 1 0 203
## [90m 5[39m 198 Active 173 0 0 0 0
## [90m 6[39m 188 Active 173 0 0 0 0
## [90m 7[39m 202 Monitor 200 1 0 202 0
## [90m 8[39m 182 Active 206 0 0 0 0
## [90m 9[39m 194 Non-ELL 191 0 1 0 194
## [90m10[39m 170 Active 185 0 0 0 0
## [90m# … with 164 more rows[39m
As would be expected, these values are zero if they are not for the corresponding group, and are otherwise equal to rdg_fall
. If we enter all these variables in our model, then our model intercept will represent the intercept for the active group, with the corresponding slope estimated by rdg_fall
. The ell_monitor
and ell_non
terms represent the intercepts for the monitor and non-ELL groups, respectively, and each of these slopes are estimated by the corresponding interaction. Let’s try.
lm(rdg_spr ~ rdg_fall + ell_monitor + ell_non +
m_byhand <- fall_monitor + fall_non,
data = interactions)
summary(m_byhand)
##
## Call:
## lm(formula = rdg_spr ~ rdg_fall + ell_monitor + ell_non + fall_monitor +
## fall_non, data = interactions)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.812 -7.307 -0.100 8.616 26.693
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 180.29666 32.95228 5.471 1.6e-07 ***
## rdg_fall 0.03939 0.18459 0.213 0.8313
## ell_monitor -0.47286 53.18886 -0.009 0.9929
## ell_non -64.12123 36.80395 -1.742 0.0833 .
## fall_monitor 0.06262 0.28669 0.218 0.8274
## fall_non 0.40801 0.20346 2.005 0.0465 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 168 degrees of freedom
## Multiple R-squared: 0.2652, Adjusted R-squared: 0.2433
## F-statistic: 12.13 on 5 and 168 DF, p-value: 4.874e-10
And we can verify that this does, indeed, get us the same results that we would have obtained with the shortcut syntax with base R.
summary(
lm(rdg_spr ~ rdg_fall*ell, data = benchmarks)
)
##
## Call:
## lm(formula = rdg_spr ~ rdg_fall * ell, data = benchmarks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.812 -7.307 -0.100 8.616 26.693
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 180.29666 32.95228 5.471 1.6e-07 ***
## rdg_fall 0.03939 0.18459 0.213 0.8313
## ellMonitor -0.47286 53.18886 -0.009 0.9929
## ellNon-ELL -64.12123 36.80395 -1.742 0.0833 .
## rdg_fall:ellMonitor 0.06262 0.28669 0.218 0.8274
## rdg_fall:ellNon-ELL 0.40801 0.20346 2.005 0.0465 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 168 degrees of freedom
## Multiple R-squared: 0.2652, Adjusted R-squared: 0.2433
## F-statistic: 12.13 on 5 and 168 DF, p-value: 4.874e-10
3.8.2 Creating interactions with {recipes}
Specifying interactions in a recipe is similar to other steps, with one small exception, which is that the use of the tilde, ~
, is required. Note that, once again, order does matter, and you should always do your dummy coding before specifying interactions with categorical variables. In the example before, this would correspond to
recipe(rdg_spr ~ rdg_fall + ell, benchmarks) %>%
step_dummy(all_nominal()) %>%
step_interact(~rdg_fall:starts_with("ell")) %>%
prep() %>%
bake(new_data = NULL)
## [90m# A tibble: 174 x 6[39m
## rdg_fall rdg_spr ell_Monitor ell_Non.ELL rdg_fall_x_ell_Mo… rdg_fall_x_ell_N…
## [3m[90m<dbl>[39m[23m [3m[90m<dbl>[39m[23m [3m[90m<dbl>[39m[23m [3m[90m<dbl>[39m[23m [3m[90m<dbl>[39m[23m [3m[90m<dbl>[39m[23m
## [90m 1[39m 181 194 0 0 0 0
## [90m 2[39m 166 159 0 1 0 166
## [90m 3[39m 216 198 0 1 0 216
## [90m 4[39m 203 204 0 1 0 203
## [90m 5[39m 198 173 0 0 0 0
## [90m 6[39m 188 173 0 0 0 0
## [90m 7[39m 202 200 1 0 202 0
## [90m 8[39m 182 206 0 0 0 0
## [90m 9[39m 194 191 0 1 0 194
## [90m10[39m 170 185 0 0 0 0
## [90m# … with 164 more rows[39m
There are a few things to mention here. First, the data do look identical (minus the column names) to what we created by hand, so that’s good news. However, we also used starts_with
here, rather than just specifying something like step_interact(~rdg_fall:ell)
. This is because after dummy coding, the column names change (as you can see in the above). By specifying starts_with
, we are ensuring that we get all the interactions we need. If we forget this step, we end up with the wrong output, along with a warning.
recipe(rdg_spr ~ rdg_fall + ell, benchmarks) %>%
step_dummy(all_nominal()) %>%
step_interact(~rdg_fall:ell) %>%
prep() %>%
bake(new_data = NULL)
## [90m# A tibble: 174 x 4[39m
## rdg_fall rdg_spr ell_Monitor ell_Non.ELL
## [3m[90m<dbl>[39m[23m [3m[90m<dbl>[39m[23m [3m[90m<dbl>[39m[23m [3m[90m<dbl>[39m[23m
## [90m 1[39m 181 194 0 0
## [90m 2[39m 166 159 0 1
## [90m 3[39m 216 198 0 1
## [90m 4[39m 203 204 0 1
## [90m 5[39m 198 173 0 0
## [90m 6[39m 188 173 0 0
## [90m 7[39m 202 200 1 0
## [90m 8[39m 182 206 0 0
## [90m 9[39m 194 191 0 1
## [90m10[39m 170 185 0 0
## [90m# … with 164 more rows[39m
This is because its trying to find a column called ell
to multiply with rdg_fall
, but no such variable exists after dummy coding.