Simple regression

Simple regression is a linear regression with only one independent variable. Linear regression is used to estimate how one variable is related to another. You can explore this relationship visually with a scatterplot. Although regression cannot test for causality, one variable is usually taken as the independent variable (also called explanatory variable or predictor) and is argued to explain the variance of the dependent variable. In a scatterplot, the independent variable is displayed on the x-axis and the dependent variable on the y-axis. A scatterplot is created with plot, for which the first argument is the independent variable and the second the dependent variable.

plot(imdb$runtime,imdb$budget)

Coefficients

Here we use the function lm to estimate the beta coefficient and the intercept. The data=imdb argument tells lm to look up the variables in imdb, so you do not have to write imdb$ before each variable. If you want to analyze only part of the data, you can subset inside the argument, for example data=imdb[imdb$budget>85,] to run the regression for the high-budget movies only. If you wrap lm in summary you get a more complete overview, with standard errors, significance levels and the explained variance. If you save the result you can extract the coefficients (and other information) from it. Use str (short for structure) to see everything stored in the result. Typing summary.lm prints the source code of that function, which shows how every element in the summary is calculated.

The beta coefficient is unstandardized, meaning that a 1 point increase in runtime is associated with a 1.48 point increase in budget. The intercept shows the level of the dependent variable, budget, when the independent variable, runtime, is zero. As runtime starts from 80, the intercept is a hypothetical value: the budget would be -102.45 if the regression line were extended to the point where runtime (the x-axis) reaches zero.

summary(imdb$runtime)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    80.0    97.0   107.0   109.7   120.0   180.0
summary(imdb$budget)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.50   21.00   40.00   60.33   80.00  250.00
summary(lm(budget~runtime, data=imdb))
## 
## Call:
## lm(formula = budget ~ runtime, data = imdb)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -90.62 -33.43 -13.61  23.34 170.09 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -102.4546    14.1601  -7.235 1.84e-12 ***
## runtime        1.4846     0.1276  11.638  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 48.54 on 482 degrees of freedom
## Multiple R-squared:  0.2194, Adjusted R-squared:  0.2177 
## F-statistic: 135.4 on 1 and 482 DF,  p-value: < 2.2e-16
fit<-summary(lm(budget~runtime, data=imdb))
fit$coefficients
##                Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) -102.454571 14.1600546 -7.235464 1.839374e-12
## runtime        1.484581  0.1275605 11.638248 9.246340e-28
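
As a worked example of how the two coefficients combine: with the estimates above, the predicted budget for a runtime of 120 is -102.45 + 1.4846 * 120, or about 75.70. The sketch below repeats that arithmetic in R. It does not use the imdb data; the data frame movies is simulated purely for illustration, and its coefficients will differ from the ones reported here.

```r
# Hypothetical stand-in for the imdb data (simulated, not the real data set)
set.seed(1)
movies <- data.frame(runtime = round(runif(200, 80, 180)))
movies$budget <- -100 + 1.5 * movies$runtime + rnorm(200, sd = 40)

fit <- lm(budget ~ runtime, data = movies)
b <- coef(fit)                      # named vector: "(Intercept)" and "runtime"

# Predicted budget at runtime = 120, computed by hand from the two coefficients
by_hand <- unname(b["(Intercept)"] + b["runtime"] * 120)

# The same prediction through predict()
by_predict <- unname(predict(fit, newdata = data.frame(runtime = 120)))

# The intercept is the prediction at runtime = 0, outside the observed range
at_zero <- unname(predict(fit, newdata = data.frame(runtime = 0)))

all.equal(by_hand, by_predict)
all.equal(at_zero, unname(b["(Intercept)"]))
```

Both comparisons should return TRUE: predict applies the same intercept-plus-slope formula, and the intercept is simply the prediction at a runtime of zero.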

Interpretation R2

The multiple R2 is the proportion of explained variance, calculated as the ratio of explained variance to total variance. The adjusted R2 is the explained variance corrected for the number of coefficients and the sample size. It can be calculated as 1-(1-R2) * ((n-1)/(n-k-1)), where n is the sample size and k is the number of independent variables. In the case of a simple regression, the multiple R2 equals the squared standardized beta coefficient, that is, the coefficient obtained when both variables are first converted to z-scores (zruntime and zbudget below):

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.7130 -0.7307 -0.1531  0.0000  0.5977  4.0630
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.0720 -0.7167 -0.3705  0.0000  0.3584  3.4560
## 
## Call:
## lm(formula = zbudget ~ zruntime, data = imdb)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6513 -0.6091 -0.2479  0.4252  3.0993 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.894e-17  4.020e-02    0.00        1    
## zruntime     4.684e-01  4.024e-02   11.64   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8844 on 482 degrees of freedom
## Multiple R-squared:  0.2194, Adjusted R-squared:  0.2177 
## F-statistic: 135.4 on 1 and 482 DF,  p-value: < 2.2e-16
fit<-summary(lm(zbudget~zruntime, data=imdb))
fit$coefficients[2,1]*fit$coefficients[2,1] # multiple R2
## [1] 0.2193685
1-(1-fit$r.squared) * ((nrow(imdb)-1)/(nrow(imdb)-1-1)) # adjusted R2 
## [1] 0.2177489
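
The output above uses zruntime and zbudget without showing how they were created. One plausible way, assuming they are ordinary z-scores, is scale(); the original code may have done this differently. The sketch below uses a simulated data frame (movies, not the imdb data) and checks both identities from this section.

```r
# Hypothetical stand-in for the imdb data (simulated, not the real data set)
set.seed(2)
movies <- data.frame(runtime = round(runif(200, 80, 180)))
movies$budget <- -100 + 1.5 * movies$runtime + rnorm(200, sd = 40)

# z-scores: subtract the mean and divide by the standard deviation.
# as.numeric() drops the matrix attributes that scale() returns.
movies$zruntime <- as.numeric(scale(movies$runtime))
movies$zbudget  <- as.numeric(scale(movies$budget))

zfit <- summary(lm(zbudget ~ zruntime, data = movies))
std_beta <- zfit$coefficients["zruntime", "Estimate"]

n <- nrow(movies)
k <- 1                                   # one independent variable
adj_by_hand <- 1 - (1 - zfit$r.squared) * ((n - 1) / (n - k - 1))

all.equal(std_beta^2, zfit$r.squared)                    # squared beta = R2
all.equal(std_beta, cor(movies$runtime, movies$budget))  # beta = Pearson r
all.equal(adj_by_hand, zfit$adj.r.squared)               # formula = reported value
```

The second comparison shows why the identity holds: with one predictor, the standardized beta is the Pearson correlation, and R2 is that correlation squared.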

For more information, see the Stack Overflow discussion on this topic.

Including estimates in plot

A straight line is estimated to account for the covariation between the two variables. The slope of the line is given by the beta coefficient, and the intercept is the point where the line crosses the y-axis at x = 0. I have stretched the x-axis to start at the hypothetical value of zero, so that the point where the line crosses the y-axis represents the intercept value (-102.45). To add the line of a simple regression, you can use abline with the result of lm. The extra legend line adds the intercept and beta coefficient to the plot. The paste command combines text with the value, and the round command makes sure at most three decimals are shown.

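fit<-lm(budget~runtime, data=imdb)
# stretch the x-axis to zero and the y-axis down to the intercept so the crossing point is visible
plot(imdb$runtime,imdb$budget, xlim=c(0,max(imdb$runtime)), ylim=c(min(0,coef(fit)[1]),max(imdb$budget)))
abline(fit,lty="dashed", col="red")
legend("topleft", legend=c(paste("intercept=",round(coef(fit)[1],digits=3)),paste("beta=",round(coef(fit)[2],digits=3))))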

Simple regression with dummy variable

If you include a dummy variable (i.e. a nominal variable with two levels coded 0 and 1) in a simple regression, you get the same results as in a Student's t-test. This is because the intercept is the mean level of the dependent variable when the independent variable is zero, and the beta coefficient shows the change in the dependent variable when the independent variable increases by 1. An increase of 1 is exactly the step from one level of the dummy variable to the other. In the following example, I compare the mean budget of Action and Comedy movies using a variable labeled dummy (1 for Action, 0 for Comedy movies).

summary(lm(budget~dummy,data=imdb))
## 
## Call:
## lm(formula = budget ~ dummy, data = imdb)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -90.43 -32.43 -10.16  27.57 157.57 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   34.160      5.139   6.647 1.53e-10 ***
## dummy         58.268      6.496   8.970  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 53.16 on 284 degrees of freedom
##   (198 observations deleted due to missingness)
## Multiple R-squared:  0.2208, Adjusted R-squared:  0.218 
## F-statistic: 80.46 on 1 and 284 DF,  p-value: < 2.2e-16

The intercept, i.e. the mean budget of Comedy movies (dummy = 0), is 34.16. The beta shows the change in budget for a 1 point increase in dummy, i.e. the difference between Action and Comedy movies. Thus, Action movies have a mean budget of 34.16 + 58.27 = 92.43. We can check this using t.test:

t.test(imdb$budget~imdb$dummy, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  imdb$budget by imdb$dummy
## t = -8.97, df = 284, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -71.05362 -45.48150
## sample estimates:
## mean in group 0 mean in group 1 
##        34.15981        92.42737
34.160 + 58.268
## [1] 92.428
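
The equivalence can also be checked programmatically. The sketch below is an illustration on simulated data, not the imdb data: the genre column and the way dummy is built from it are assumptions, chosen so that movies outside the two genres become NA and are dropped, as in the regression output above.

```r
# Hypothetical stand-in for the imdb data (simulated, not the real data set)
set.seed(3)
movies <- data.frame(
  genre = sample(c("Action", "Comedy", "Drama"), 300, replace = TRUE)
)
movies$budget <- ifelse(movies$genre == "Action", 90, 35) + rnorm(300, sd = 30)

# 1 for Action, 0 for Comedy, NA for every other genre (dropped by lm and t.test)
movies$dummy <- ifelse(movies$genre == "Action", 1,
                       ifelse(movies$genre == "Comedy", 0, NA))

fit <- summary(lm(budget ~ dummy, data = movies))$coefficients
tt  <- t.test(budget ~ dummy, data = movies, var.equal = TRUE)

group_means <- tapply(movies$budget, movies$dummy, mean)  # names "0" and "1"

# Intercept = mean of group 0; intercept + beta = mean of group 1
all.equal(fit["(Intercept)", "Estimate"], unname(group_means["0"]))
all.equal(fit["(Intercept)", "Estimate"] + fit["dummy", "Estimate"],
          unname(group_means["1"]))

# t.test takes group 0 minus group 1, so the sign of t is flipped
all.equal(fit["dummy", "t value"], -unname(tt$statistic))
```

All three comparisons should return TRUE. The sign flip in the last line is the same one visible above, where lm reports t = 8.970 and t.test reports t = -8.97.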

Note that the argument var.equal=TRUE indicates that the pooled variance should be used, which means that Student's t-test is reported. Student's t-test assumes equal variances; if this assumption is not met, the test is inaccurate. Therefore, Welch's test is preferred in many situations, especially when group sizes and variances differ between groups (see the Wikipedia article on Welch's t-test). With var.equal=FALSE, which is the default of the t.test function, the Welch approach to the t-value and degrees of freedom is reported. This gives a different t-value and p-value than the linear model. Note also that the t-value from t.test has the opposite sign of the one in the regression, because t.test subtracts the mean of group 1 from the mean of group 0.
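
To make the difference between the two tests concrete, here is a small sketch on simulated data with deliberately unequal group sizes and variances. The vectors budget and dummy are made up for the illustration and are not taken from imdb.

```r
# Hypothetical two-group data with unequal sizes and variances (simulated)
set.seed(4)
budget <- c(rnorm(40, mean = 35, sd = 15), rnorm(120, mean = 90, sd = 60))
dummy  <- rep(c(0, 1), times = c(40, 120))

student <- t.test(budget ~ dummy, var.equal = TRUE)   # pooled variance
welch   <- t.test(budget ~ dummy)                     # default: var.equal = FALSE

student$parameter   # df = n - 2 = 158, the same as the residual df of lm
welch$parameter     # Welch-Satterthwaite df, generally not an integer

# Only the Student version reproduces the t-value of the dummy regression
lm_t <- summary(lm(budget ~ dummy))$coefficients["dummy", "t value"]
all.equal(lm_t, -unname(student$statistic))
```

In data like these, where the smaller group also has the smaller variance, the Welch statistic and degrees of freedom differ from the pooled ones, which is why only the var.equal=TRUE call matches the linear model.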