Multiple regression

Multiple regression means that more than one independent variable is included in the model. Including additional variables changes how each coefficient is interpreted, so it is not surprising that the coefficients themselves may also change. Below, the first two models are simple regressions, whereas the last model includes both variables. As you can see, the estimated effect of runtime on budget changes (from 1.48 to 0.80) when screens is also included in the equation:

## (Intercept)     runtime 
## -102.454571    1.484581
## (Intercept)     screens 
## -10.4402624   0.2008701
## (Intercept)     runtime     screens 
## -89.0357469   0.8018946   0.1749951
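The code behind the three outputs above is not shown. A minimal sketch of how such a comparison could be set up is given here; it runs on simulated stand-in data (the data frame `sim` and the model names are hypothetical), so the numbers will not match the imdb results.

```r
# Simulated stand-in for the imdb data (hypothetical values, not the real sample)
set.seed(1)
n <- 200
screens <- round(runif(n, 7, 1000))
runtime <- 80 + 0.03 * screens + rnorm(n, sd = 12)  # runtime correlates with screens
budget  <- -90 + 0.8 * runtime + 0.17 * screens + rnorm(n, sd = 20)
sim <- data.frame(budget, runtime, screens)

m_runtime <- lm(budget ~ runtime, data = sim)            # simple regression
m_screens <- lm(budget ~ screens, data = sim)            # simple regression
m_both    <- lm(budget ~ runtime + screens, data = sim)  # multiple regression

coef(m_runtime)
coef(m_screens)
coef(m_both)
```

Because runtime and screens are correlated in this simulation, the runtime coefficient in the simple model also picks up part of the effect of screens, and it shrinks once screens is added.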

Intercept

This change in effect can be explained by the change in interpretation. Let’s start with the intercept. As with simple regression, the intercept shows the mean level of Y where all Xs are zero. For the simple model including runtime, this means that the mean level of budget where runtime is zero is -102.45. Budget in the sample runs from 1.5 to 250, so an actual budget cannot be negative. However, the observed values of runtime do not include zero (minimum value = 80). The intercept therefore shows a hypothetical value of budget: the value you get when you extend the straight line estimated by the slope back to the point where the x-axis is zero (that is, where runtime would hypothetically be zero). With two Xs, the intercept represents the situation where both runtime and screens (minimum value = 7) are zero. This explains the change in intercept (from -102.45 to -89.04). The graphs below show the situation of simple regression of budget explained by runtime.
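To make the extrapolation concrete, the simple model with runtime can be evaluated at the smallest observed runtime. The coefficients are taken from the output above and rounded, so the result is approximate:

\[\widehat{\text{budget}} = -102.45 + 1.48 \times \text{runtime}\]

\[\text{runtime} = 80: \quad -102.45 + 1.4846 \times 80 \approx 16.3\]

Within the observed range of runtime the predicted budget is positive; the negative value only appears when the line is extended to runtime = 0.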

Controlling for a variable

Including multiple variables also changes how each effect should be interpreted. The beta coefficient shows the effect of a 1-point increase in runtime on the mean level of budget when the other X is held constant. This last part is what makes the beta coefficient of runtime differ between simple and multiple regression, and it is referred to as controlling for a variable. To be precise, holding constant means that the effect of a 1-point increase in runtime on budget is assumed to be the same at all levels of the other independent variable (in this case, the number of screens on which the movie opens).
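Written out with the (rounded) coefficients of the model that includes both variables, the fitted equation is:

\[\widehat{\text{budget}} = -89.04 + 0.80 \times \text{runtime} + 0.175 \times \text{screens}\]

For two movies with the same number of screens whose runtimes differ by 1 point, the screens term cancels and the predicted budgets differ by the runtime coefficient alone:

\[\left(-89.04 + 0.80\,(r+1) + 0.175\,s\right) - \left(-89.04 + 0.80\,r + 0.175\,s\right) = 0.80\]

Here \(r\) and \(s\) are placeholder symbols for a given runtime and number of screens; the difference is 0.80 whatever value \(s\) takes.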

Standardization

When you want to compare the effects of two independent variables measured on different scales, it makes sense to rescale them so that a 1-point increase means the same thing for both. This is done by standardization, using this formula: \[z_i = \frac{X_i-\bar{X}}{s_X}\] The formula consists of two parts: mean centering and dividing by the standard deviation. Mean centering is the numerator of the equation, while the standard deviation is the denominator. For a more elaborate discussion of standardization and centering, see stackoverflow.

Note that all variables need to be standardized for standardized regression, not just the independent variables. After standardization, the effect of runtime still represents a 1-point increase in runtime, but 1 point is now equal to 1 standard deviation. Thus, a 1 standard deviation increase in runtime results in a 0.25 standard deviation increase in budget, holding screens constant. Now the effects of runtime and screens can be compared: screens has the larger effect, as a 1 standard deviation increase in screens results in a larger increase in budget (0.62 standard deviations) than a 1 standard deviation increase in runtime (0.25).

First we need to remove all missing values, because mean centering has to be done on the same set of observations for every variable. If we mean center each variable separately, we calculate the mean across 484 observations for budget and runtime, but across 439 observations for screens. By applying listwise deletion before mean centering, the means are calculated across the same 439 complete observations for each variable.

imdb2 <- na.omit(imdb[, c("budget", "runtime", "screens")])
##   (Intercept)      zruntime      zscreens 
## -4.158816e-17  2.509043e-01  6.205415e-01
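The standardization and model-fitting steps that produced this output are not shown. The sketch below is one way they could look, assuming the base R functions `scale()` and `lm()`; it uses simulated stand-in data with hypothetical names (`sim`, `sim2`, `zbudget`), not the imdb sample.

```r
# Simulated stand-in data with missing values in screens (hypothetical)
set.seed(1)
n <- 200
screens <- round(runif(n, 7, 1000))
runtime <- 80 + 0.03 * screens + rnorm(n, sd = 12)
budget  <- -90 + 0.8 * runtime + 0.17 * screens + rnorm(n, sd = 20)
sim <- data.frame(budget, runtime, screens)
sim$screens[sample(n, 20)] <- NA

# Listwise deletion first, so every mean and sd uses the same rows
sim2 <- na.omit(sim)

# scale() subtracts the mean and divides by the standard deviation;
# as.numeric() drops the matrix attributes that scale() returns
sim2$zbudget  <- as.numeric(scale(sim2$budget))
sim2$zruntime <- as.numeric(scale(sim2$runtime))
sim2$zscreens <- as.numeric(scale(sim2$screens))

m_std <- lm(zbudget ~ zruntime + zscreens, data = sim2)
coef(m_std)  # the intercept is zero up to rounding error
```

The near-zero intercept mirrors the output above: when all variables are centered, the regression line passes through the origin.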

When you run a simple regression between two standardized interval variables, the regression coefficient equals the correlation coefficient (below you can find the result using the cor function). Note that the cor function deals with missing values differently: when you calculate correlation coefficients for multiple variables, you can choose between pairwise and listwise deletion.

##    (Intercept) imdb2$zruntime 
##  -1.357198e-16   4.778320e-01
## [1] 0.477832
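The equality of the standardized slope and the correlation coefficient can be checked with a short sketch. It again uses simulated data with hypothetical names, and it shows the `use` argument of `cor()` that selects listwise or pairwise deletion.

```r
# Simulated stand-in data (hypothetical values)
set.seed(1)
n <- 200
runtime <- rnorm(n, mean = 110, sd = 15)
budget  <- -100 + 1.5 * runtime + rnorm(n, sd = 40)

zruntime <- as.numeric(scale(runtime))
zbudget  <- as.numeric(scale(budget))

# Slope of the standardized simple regression versus the correlation
slope <- coef(lm(zbudget ~ zruntime))[["zruntime"]]
r     <- cor(budget, runtime)
all.equal(slope, r)

# With missing values, cor() needs an explicit choice of deletion method
d <- data.frame(budget, runtime, screens = runif(n, 7, 1000))
d$screens[1:20] <- NA
cor(d, use = "complete.obs")           # listwise: drops rows with any NA
cor(d, use = "pairwise.complete.obs")  # pairwise: drops rows per pair of variables
```

With pairwise deletion, the correlation between budget and runtime still uses all rows, because neither variable has missing values in this example.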

Moderation

If the coefficients change a lot when another variable is included, it may be that the assumption behind holding constant, namely that the effect of runtime is the same at every level of screens, is too strong. For example, the effect of runtime may be very different for movies that are shown on many screens on opening night. This can be tested by moderation, i.e. testing the interaction effect of runtime and screens on budget. The interaction effect shows how the effect of runtime on budget depends on screens (and vice versa) and can be tested by including an interaction term. This term is calculated by multiplying runtime and screens, either by creating a new variable or directly in the lm function. You still need to include runtime and screens separately, as you want to know both the separate influences and the joint influence of runtime and screens. Note that you do not necessarily need to standardize your variables for moderation.

##     (Intercept)         runtime         screens runtime:screens 
##   -10.184585529     0.088856741    -0.009507462     0.001616179
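Both ways of specifying the interaction term can be sketched as follows. The data are simulated and the names (`sim`, `runtime_x_screens`, `m_int`, `m_int2`) are hypothetical; only the formula syntax of `lm()` is the point here.

```r
# Simulated stand-in data with a built-in interaction (hypothetical values)
set.seed(1)
n <- 200
screens <- round(runif(n, 7, 1000))
runtime <- rnorm(n, mean = 110, sd = 15)
budget  <- -10 + 0.1 * runtime + 0.0015 * runtime * screens + rnorm(n, sd = 20)
sim <- data.frame(budget, runtime, screens)

# Option 1: directly in lm(); runtime * screens expands to
# runtime + screens + runtime:screens
m_int <- lm(budget ~ runtime * screens, data = sim)

# Option 2: create the product as a new variable first
sim$runtime_x_screens <- sim$runtime * sim$screens
m_int2 <- lm(budget ~ runtime + screens + runtime_x_screens, data = sim)

coef(m_int)
coef(m_int2)  # same estimates, only the name of the last term differs
```

The `*` operator in a formula keeps both main effects in the model automatically, which avoids accidentally fitting the product term on its own.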

Centering

With moderation, the interpretation of the beta coefficients of the main effects, i.e. of runtime and screens separately, changes. The coefficient of runtime now shows the effect of runtime where the other X is exactly zero. Thus, the fact that runtime became insignificant means only that runtime has no detectable effect on budget for a movie shown on zero screens, a value outside the observed range (minimum value = 7). The interaction coefficient shows how much the effect of runtime on budget changes when screens increases by 1 point (and, equivalently, how much the effect of screens changes when runtime increases by 1 point). This effect is significant, but the coefficients are now much smaller than before. To interpret the main effects, i.e. the effects when the other X is exactly zero, we should rescale the variables so that zero is a meaningful value. In this case we rescale to the mean, i.e. mean centering, by subtracting the mean from each value of X. This is done for all independent variables. Note that the interaction term should be calculated using the rescaled variables.

##       (Intercept)          mruntime          mscreens mruntime:mscreens 
##      61.688513374       0.685753398       0.168158147       0.001616179
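The relation between the uncentered and the mean-centered model can be verified with a small sketch on simulated data (all names are hypothetical, and the numbers will differ from the imdb output). It checks two things: the interaction coefficient does not change under centering, and the centered main effect of runtime equals the uncentered main effect plus the interaction coefficient times the mean of screens.

```r
# Simulated stand-in data with a built-in interaction (hypothetical values)
set.seed(1)
n <- 200
screens <- round(runif(n, 7, 1000))
runtime <- rnorm(n, mean = 110, sd = 15)
budget  <- -10 + 0.1 * runtime + 0.0015 * runtime * screens + rnorm(n, sd = 20)
sim <- data.frame(budget, runtime, screens)

# Mean centering: subtract the mean from each value
sim$mruntime <- sim$runtime - mean(sim$runtime)
sim$mscreens <- sim$screens - mean(sim$screens)

m_raw <- lm(budget ~ runtime * screens, data = sim)
m_cen <- lm(budget ~ mruntime * mscreens, data = sim)
b  <- coef(m_raw)
bc <- coef(m_cen)

# The interaction coefficient is unchanged by centering
all.equal(bc[["mruntime:mscreens"]], b[["runtime:screens"]], tolerance = 1e-6)

# The centered main effect is the effect of runtime at the mean of screens
at_mean <- b[["runtime"]] + b[["runtime:screens"]] * mean(sim$screens)
all.equal(bc[["mruntime"]], at_mean, tolerance = 1e-6)
```

Centering therefore does not change the model fit; it only moves the point at which the main effects are evaluated from zero to the mean.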

Now the coefficients are easier to interpret. The coefficient of runtime is now the effect of a 1-point increase in runtime at the mean level of screens, and the other way around. Note that the interaction coefficient (0.0016) is the same as before centering. Textbooks usually argue that main effects should not be interpreted when a moderation effect is estimated. The reason is that a main effect, e.g. the separate effect of runtime, only makes sense when there could in principle be an effect of runtime when screens is zero. In some cases this is nonsensical (for instance when the other X represents time and you want to measure the effect of a treatment, see analysisfactor). Note that standardization, as defined above, already includes mean centering.