So far we have considered examples in which we regress a continuous outcome variable (e.g., Math Achievement) on one or more continuous predictors (SES, Approaches to Learning). In this chapter we consider how regression can be used with categorical predictors.
In an experimental context, the canonical example of a categorical predictor is treatment status (e.g., 1 = treatment group, 0 = control group). Examples of other categorical predictors commonly used in education research include:
Geographical region / school district (Orange, CH-C, Wake, …)
Type of school (public, private, charter, religious)
Which classroom, teacher, or school a student was assigned to
Gender (if recorded as categorical)
Race / ethnicity (if recorded as categorical)
Free / reduced price lunch status
English language learner status
Individualized learning plan status
…
In regression, we deal with these types of predictors using an approach called contrast coding (also called effect coding or dummy coding). In particular, we will address three types of contrast coding:
The special case of a single binary predictor
Reference-group coding (called “treatment” contrasts in R)
Deviation coding (called “sum” contrasts in R)
There are other types of contrast coding out there. We do not attempt an exhaustive review. However, the techniques we use in this chapter can be used with other types of contrast coding.
Along the way we will see that regression with a categorical predictor encompasses the independent samples t-tests of means and one-way ANOVA, both of which we discussed last semester. In the next chapter we will address how to combine categorical and continuous predictors in the same model.
4.1 Focus on interpretation
Categorical predictors can be challenging to understand because, depending on the contrast coding used, the model results can appear quite different.
For example, the two models below use the same data and the same variables (Reading Achievement regressed on Urbanicity), but their regression coefficients have different values. Why? Because the models used different contrast coding for Urbanicity. In the first model, Urban and Suburban students are compared to Rural students. In the second model, Rural and Suburban students are compared to the unweighted average across all three groups. We talk in more detail about these interpretations in the following sections. For now, the main thing to notice is just that “different contrast coding = different results”.
Code
load("NELS.RData")attach(NELS)# run model with default contrast (treatment / dummy coding)egA <-lm(achrdg08 ~ urban)# change to sum / deviation contrasts and run againcontrasts(urban) <-contr.sum(n =3)colnames(contrasts(urban)) <-c("Rural", "Suburban")egB <-lm(achrdg08 ~ urban)# printsummary(egA)
Call:
lm(formula = achrdg08 ~ urban)
Residuals:
Min 1Q Median 3Q Max
-22.3002 -6.1620 0.2098 6.7948 15.5573
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 54.9927 0.6885 79.876 < 2e-16 ***
urbanSuburban 0.6675 0.9117 0.732 0.46439
urbanUrban 3.1275 1.0480 2.984 0.00298 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.763 on 497 degrees of freedom
Multiple R-squared: 0.01904, Adjusted R-squared: 0.0151
F-statistic: 4.824 on 2 and 497 DF, p-value: 0.008411
Code
summary(egB)
Call:
lm(formula = achrdg08 ~ urban)
Residuals:
Min 1Q Median 3Q Max
-22.3002 -6.1620 0.2098 6.7948 15.5573
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 56.2577 0.4021 139.897 <2e-16 ***
urbanRural -1.2650 0.5654 -2.237 0.0257 *
urbanSuburban -0.5975 0.5299 -1.128 0.2600
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.763 on 497 degrees of freedom
Multiple R-squared: 0.01904, Adjusted R-squared: 0.0151
F-statistic: 4.824 on 2 and 497 DF, p-value: 0.008411
Code
# Reset to original contrasts
contrasts(urban) <- contr.treatment(n = 3)
colnames(contrasts(urban)) <- c("Suburban", "Urban")
Note that the lm output doesn’t explicitly tell us what kind of coding was used. So, we need to know what is going on “under the hood” in order to interpret the output correctly. Also note that while the Coefficients tables of the two models are different, they both explain the same amount of variance in the outcome. So, we could say that both models “fit” the data equally well, but they provide different interpretations of the relationship between Reading Achievement and Urbanicity.
The choice of which contrast to use is up to the researcher. This choice should reflect the research questions you want to address. For example, if we wanted to know how Urban and Suburban students differ from Rural students, that is what the first output shows us. If we wanted to know how Suburban and Rural students differ from the overall average, that is what the second output shows us. As we will discuss below, we can’t make all possible comparisons among groups in a single model, so we have to make some compromises when choosing contrasts. Understanding how to align our research questions with the choice of contrasts is a big part of what this chapter is about!
4.2 Data and social constructs
Before getting into the math, let’s consider some conceptual points.
First, some terminology. Binary means that a variable can take on only two values: 1 and 0. If a variable takes on two values but these are represented with other numbers (e.g., 1 and 2) or with non-numeric values (“male”, “female”), it is called dichotomous rather than binary. Otherwise stated, a binary variable is a dichotomous variable whose values are 1 and 0. The term categorical is used to describe variables with two or more categories – it includes binary and dichotomous variables, as well as variables with more categories.
Note that encoding a variable as dichotomous does not imply that the underlying social construct is dichotomous. For example, we can encode educational attainment as a dichotomous variable indicating whether or not a person has graduated high school. This does not imply that educational attainment has only two values “irl”, or even that educational attainment is best conceptualized in terms of years of formal education. Nonetheless, for many outcomes of interest it can be meaningful to consider whether individuals have completed high school (e.g., https://www.ssa.gov/policy/docs/research-summaries/education-earnings.html).
In general, the way that a variable is encoded in a dataset is not a statement about reality – it reflects a choice made by researchers about how to represent social constructs. When conducting quantitative research, we are often faced with less-than-ideal encodings of social constructs, especially demographic variables. For example, both NELS and ECLS conceptualize gender as dichotomous and use a limited set of mutually exclusive categories for race. These representations are not well aligned with current literature on gender and racial identity. Some recent perspectives on how these issues play into quantitative research are available here: https://www.sree.org/critical-perspectives.
Despite the conceptual issues inherent in categorizing social constructs, I would argue that categorical variables often have utility, especially in the study of social inequality. Here is an example of why I think gender qua “male/female” is a flawed but important consideration in global education: https://www.unicef.org/education/girls-education.
Please take a moment to write down your thoughts on the tensions that arise when conceptualizing social constructs such as gender or race as categorical, and I will invite you to share your thoughts in class.
4.3 Binary predictors
Let’s start our interpretation of categorical predictors with the simplest case: a single binary predictor.
Figure 4.1 uses the NELS data to illustrate the regression of Reading Achievement in Grade 8 (achrdg08) on a binary encoding of Gender (female = 0, male = 1). There isn’t a lot going on in the plot! However, we can see the conditional distributions of Reading Achievement for each value of Gender, and the means of the two groups are indicated.
Code
# Don't read this unless you really like working on graphics :)
binary_gender <- (gender == "Male") * 1
plot(binary_gender, achrdg08, col = "#4B9CD3",
     xlab = "binary gender (Female = 0, Male = 1)",
     ylab = "Reading (Grade 8)")
means <- tapply(achrdg08, binary_gender, mean)
labels <- c(expression(bar(Y)[0]), expression(bar(Y)[1]))
text(x = c(.1, .9), y = means, labels = labels, cex = 1.5)
text(x = c(0, 1), y = means, labels = c("_", "_"), cex = 2)
We can represent the relationship between Reading Achievement and Gender using the same simple regression equation from Section 2.2
\[\widehat Y = b_0 + b_1 X, \]
and we can still interpret the regression slope in terms of a one-unit increase in \(X\). However, since \(X\) can only take on two values (0 or 1), there are also other interpretations of the regression coefficients. In this section we are interested in these “more specific” interpretations of regression coefficients when \(X\) is binary.
The general strategy for approaching the interpretation of regression coefficients with categorical predictors has two steps:
Step 1. Plug the values for \(X\) into the regression equation to get the predicted values for each group \[
\begin{align}
\widehat Y (Female) & = b_0 + b_1 (0) = b_0 \\
\widehat Y (Male) & = b_0 + b_1 (1) = b_0 + b_1
\end{align}
\]
Step 2. Solve for the coefficients in terms of the predicted values.
\[ b_0 = \widehat Y (Female) \tag{4.1}\]
\[ b_1 = \widehat Y (Male) - b_0 = \widehat Y (Male) - \widehat Y (Female) \tag{4.2}\]
Looking at Equation 4.1 we can see that the intercept (\(b_0\)) is equal to the predicted value of Reading Achievement for Females, and Equation 4.2 shows that the regression slope (\(b_1\)) is equal to the difference between predicted Reading Achievement for Males and Females.
There is one last detail that is important for interpreting these equations: for a single categorical predictor, the predicted values for each category are just the group means on the \(Y\) variable (see Section 4.7, which is optional). So, using the notation of Figure 4.1, we can re-write Equation 4.1 and Equation 4.2 as
\[ b_0 = \bar Y_0 \tag{4.3}\]
\[ b_1 = \bar Y_1 - \bar Y_0 \tag{4.4}\]
We will use the equivalence between predicted values and group means throughout this chapter. However, it is important to note that this equivalence holds only when there is a single categorical predictor in the model, and no other predictors. Additional predictors are discussed in Section 5.2.
For the example data, regression coefficients are:
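Code
# Convert "Female / Male" coding to binary
gender <- NELS$gender
binary_gender <- (gender == "Male") * 1
mod_binary <- lm(achrdg08 ~ binary_gender)
coef(mod_binary)   # (Intercept) 56.4678, binary_gender -0.9223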
Please take a moment and write down how these two numbers are related to Figure 4.1. In particular, what is \(\bar Y_0\) equal to, what is \(\bar Y_1\) equal to, and what is their difference equal to?
4.3.1 Relation with t-tests of means
Simple regression with a binary predictor is equivalent to conducting an independent samples t-test in which the \(X\) variable (Gender) is the grouping variable and the \(Y\) variable (Reading Achievement) is the outcome. The following output illustrates this.
For the regression model (same as above):
Code
summary(mod_binary)
Call:
lm(formula = achrdg08 ~ binary_gender)
Residuals:
Min 1Q Median 3Q Max
-20.7278 -6.1472 0.3784 6.9765 15.0045
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 56.4678 0.5342 105.703 <2e-16 ***
binary_gender -0.9223 0.7928 -1.163 0.245
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.827 on 498 degrees of freedom
Multiple R-squared: 0.00271, Adjusted R-squared: 0.0007076
F-statistic: 1.353 on 1 and 498 DF, p-value: 0.2452
For the independent samples t-test (with homogeneity of variance assumed):
Code
t.test(achrdg08 ~ binary_gender, var.equal = T)
Two Sample t-test
data: achrdg08 by binary_gender
t = 1.1633, df = 498, p-value = 0.2452
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
-0.6353793 2.4800586
sample estimates:
mean in group 0 mean in group 1
56.46780 55.54546
We will go through the comparison between these two outputs in class together. If you have any questions about the relation between these two sets of output, please note them now and be prepared to ask them in class.
4.3.2 Summary
When doing regression with a binary predictor:
The regression intercept is equal to the mean of the group coded “0”.
The regression slope is equal to the mean of the group coded “0” subtracted from the mean of the group coded “1”.
Testing \(H_0: b_1 = 0\) is equivalent to testing the mean difference \(H_0: \mu_1 - \mu_0 = 0\)
i.e., regression with a binary variable is the same as a t-test of means for independent samples
4.4 Reference-group coding
Now that we know about regression with a binary predictor, let’s consider how to extend this approach to contrast coding of categorical predictors with \(C ≥ 2\) categories.
The basic idea is to represent the \(C\) categories of the predictor in terms of \(C – 1\) “dummy variables.” Binary coding of a dichotomous predictor is one example of this: We represented a categorical variable with \(C = 2\) categories using \(1\) binary predictor.
For predictors with \(C ≥ 2\) categories, the most common approach to contrast coding is called reference-group coding. In R, the approach is called treatment contrasts and is the default coding for categorical predictors.
It is called reference-group coding because:
The researcher chooses a reference-group.
The intercept is interpreted as the mean of the reference-group.
The \(C – 1\) regression slopes are interpreted as the mean differences between the \(C – 1\) other groups and the reference-group.
Note that reference-group coding is a generalization of binary coding. In the example from Section 4.3:
Females were the reference-group.
The intercept was equal to the mean Reading Achievement for females.
The regression coefficient was equal to the mean difference between males and females.
The rest of this section considers how to generalize this approach to more than 2 groups.
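To see what this looks like in practice, here is a minimal sketch using the NELS Urbanicity factor from Section 4.1: model.matrix shows the design matrix that lm builds, with an intercept column plus \(C - 1\) dummy columns.
Code
# Peek at the design matrix lm() builds from a 3-level factor:
# an intercept column plus C - 1 = 2 indicator (dummy) columns
head(model.matrix(~ urban))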
4.4.1 A hypothetical example
Figure 4.2 presents a toy example. The data show the Age and marital status (Mstatus) of 16 hypothetical individuals. Marital status is encoded as
Single
Married
Divorced
These 3 categories are represented by two binary variables, denoted \(X_1\) and \(X_2\).
\(X_1\) is a binary variable that is equal to 1 when Mstatus is “married”, and equal to 0 otherwise.
\(X_2\) is a binary variable that is equal to 1 when Mstatus is “divorced”, and equal to 0 otherwise.
The binary variables are often called dummies or indicators. For example, \(X_1\) is a dummy or indicator for married respondents.
In reference-group coding, the group that does not have a dummy variable is the reference group. It is also the group that is coded zero on all of the included dummies. In this example, “Single” is the reference group.
We can more compactly write down the contrast coding in the example using a contrast matrix:
\[ \begin{matrix} & X_1 & X_2 \\ \text{Married} & 1 & 0 \\ \text{Divorced} & 0 & 1 \\ \text{Single} & 0 & 0 \end{matrix} \]
This notation is an abbreviated version of the data matrix in Figure 4.2. It summarizes how the two indicator variables correspond to the different levels of the categorical variable. As we will see in Section 4.9, contrast matrices are useful for programming in R.
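As a small preview, here is a sketch (with made-up values for the hypothetical Mstatus variable) of how R stores this contrast matrix for a factor:
Code
# Hypothetical marital status data; listing "Single" first makes it the
# reference group under R's default treatment contrasts
mstatus <- factor(c("Single", "Married", "Divorced", "Married"),
                  levels = c("Single", "Married", "Divorced"))
contrasts(mstatus)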
4.4.2 Interpreting the regression coefficients
Regressing Age on the dummies we have:
\[ \widehat Y = b_0 + b_1 X_1 + b_2 X_2. \]
In order to interpret the regression coefficients we can use the same two steps as in Section 4.3.
Step 1. Plug the values for the \(X\) variables into the regression equation to get the predicted values for each group
\[\begin{align}
\widehat Y (Single) & = b_0 + b_1 (0) + b_2 (0) = b_0 \\
\widehat Y (Married) & = b_0 + b_1 (1) + b_2 (0) = b_0 + b_1 \\
\widehat Y (Divorced) & = b_0 + b_1 (0) + b_2 (1) = b_0 + b_2
\end{align}\]
Step 2. Solve for the model parameters in terms of the predicted values.
\[\begin{align}
b_0 & = \widehat Y (Single) \\
b_1 & = \widehat Y (Married) - b_0 = \widehat Y (Married) - \widehat Y (Single) \\
b_2 & = \widehat Y (Divorced) - b_0 = \widehat Y (Divorced) - \widehat Y (Single)
\end{align}\]
Using the above equations, please write down an interpretation of the regression parameters for the hypothetical example. (Note: this question is not asking for a numerical answer, it is just asking you to put the above equations into words.)
4.4.3 More than 3 categories
Figure 4.3 extends the example by adding another category for Mstatus (“widowed”). Here we are interested in the model
\[\widehat Y = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3\]
Please work through the following questions and be prepared to share your answers in class
How should \(X_3\) be coded so that “single” is the reference-group?
Write down the contrast matrix for the example.
Using the two-step approach illustrated above, write out the interpretation of the regression coefficients.
Let’s imagine that we included an additional dummy variable for “Single” in the model. In this case we would have \(C = 4\) dummies in the model rather than \(C-1 = 3\) dummies. Please take a moment to write down what you think will happen (Hint: something goes wrong)
4.4.4 Summary
In reference-group coding with a single categorical variable:
The reference group is chosen by the analyst – it is the group that is coded zero on all dummies, or the one that has its indicator omitted from the model.
The intercept is interpreted as the mean of the reference group.
The regression coefficients of the dummy variables are interpreted as the difference between the mean of the indicated group and the mean of the reference group.
4.5 Deviation coding
In some research scenarios, there is a clear reference group. For example, in an experiment, all treatment groups can be compared to the control group. But in other cases, it is less clear what the reference group should be. In both of the examples we have considered so far (gender and marital status), the choice of reference group was arbitrary.
When there is not a clear reference group, it can be preferable to use other types of contrast coding than reference-group coding. One such approach is called deviation coding. In R, this is called sum-to-zero contrasts, or sum contrasts for short.
The main difference between deviation coding and reference-group coding is the interpretation of the intercept. In deviation coding, it is no longer the mean of the reference-group, but instead represents the mean of all groups’ predicted values. When the groups have equal sample size, the deviation-coded intercept is also equal to overall mean on \(Y\).
More specifically, in deviation coding:
The intercept is equal to the mean of the predicted values for each category, i.e.,
\[ b_0 = \frac{1}{C} \sum_{c=1}^{C} \widehat Y (\text{Group } c) \tag{4.5}\]
The regression slopes compare each group to the intercept. Consequently, when the groups have equal sample sizes, the regression slopes are interpreted as the deviation of each group mean from the overall mean, which is why it is called deviation coding.
When the groups have unequal sample size, the situation is a bit more complicated. In particular, we have to weight the predicted values in Equation 4.5 by the group sample sizes. This is addressed in Section 4.5.4 (optional). To clarify that the intercept in deviation coding is not always equal to \(\bar Y\), we refer to it as an “unweighted mean” of group means / predicted scores.
It is important to note that there are still only \(C-1\) regression slopes. So, one group gets left out of the analysis, and the researcher has to choose which one. This is a shortcoming of deviation coding, which is addressed in Section 4.5.5 (optional).
4.5.1 A hypothetical example
The International Development and Early Learning Assessment (IDELA) is an assessment designed to measure young children’s development in literacy, numeracy, social-emotional, and motor domains, in international settings. Figure 4.4 shows the countries in which the IDELA had been used as of 2017 (for more info, see https://www.savethechildren.net/sites/default/files/libraries/GS_0.pdf).
If our goal was to compare countries’ IDELA scores, agreeing on which country should serve as the reference group would be politically fraught. Therefore, it would be preferable to avoid the problem of choosing a reference group altogether. In particular, deviation coding lets us compare each country’s mean IDELA score to the (unweighted) mean over all of the countries.
Figure 4.5 presents a toy example. The data show the IDELA scores and Country for 16 hypothetical individuals. The countries considered in this example are
Ethiopia
Vietnam
Bolivia
These 3 countries are represented by two binary variables, denoted \(X_1\) and \(X_2\).
Note that the dummy variables are different than for the case of reference-group coding discussed in Section 4.4. In deviation coding, the dummies always take on values \(1, 0, -1\). The same group must receive the code \(-1\) for all dummies. The group with the value \(-1\) is the group that gets left out of the analysis.
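In R, this is what the built-in contr.sum function produces; a quick sketch for three groups:
Code
# Sum-to-zero (deviation) contrasts for 3 groups: rows are groups,
# and the last group is coded -1 on both dummies
contr.sum(n = 3)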
Please take a moment to write down the contrast matrix for the IDELA example.
4.5.2 Interpreting the regression coefficients
Regressing IDELA on the dummies we have:
\[\widehat Y = b_0 + b_1 X_1 + b_2 X_2\]
In order to interpret the regression coefficients, we proceed using the same two steps as in Section 4.3 and Section 4.4.
Step 1. Plug the values for \(X\) into the regression equation to get the predicted values for each group.
\[\begin{align}
\widehat Y (Ethiopia) & = b_0 + b_1 (1) + b_2 (0) = b_0 + b_1 \\
\widehat Y (Vietnam) & = b_0 + b_1 (0) + b_2 (1) = b_0 + b_2 \\
\widehat Y (Bolivia) & = b_0 + b_1 (-1) + b_2 (-1) = b_0 - b_1 - b_2
\end{align}\]
Step 2. Solve for the model parameters in terms of the predicted values.
\[\begin{align}
b_1 &= \widehat Y (Ethiopia) - b_0 \\
b_2 &= \widehat Y (Vietnam) - b_0 \\
b_0 & = \widehat Y (Bolivia) + b_1 + b_2 \\
\end{align}\]
At this point, we want to use the first two lines to substitute in for \(b_1\) and \(b_2\) in the last line:
\[\begin{align}
b_0 & = \widehat Y (Bolivia) + \widehat Y (Ethiopia) - b_0 + \widehat Y (Vietnam) - b_0 \\
\implies & \\
3b_0 & = \widehat Y (Bolivia) + \widehat Y (Ethiopia) + \widehat Y (Vietnam) \\
\implies & \\
b_0 & = \frac{\widehat Y (Bolivia) + \widehat Y (Ethiopia) + \widehat Y (Vietnam)}{3}
\end{align}\]
In the last line we see that \(b_0\) is equal to the (unweighted) mean of the predicted values.
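As a quick numerical check, here is a sketch using the NELS Urbanicity example from Section 4.1: the deviation-coded intercept (56.2577 in the output of egB) is just the unweighted mean of the three group means.
Code
# Unweighted mean of the group means equals the deviation-coded intercept
group_means <- tapply(achrdg08, urban, mean)
mean(group_means)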
Using the above equations, please write down an interpretation of the regression parameters for the hypothetical example. (Note: this question is not asking for a numerical answer, it is just asking you to put the above equations into words.)
4.5.3 Summary
In deviation coding with a single categorical variable:
The intercept is interpreted as the unweighted mean of the groups’ means, which is equal to the overall mean on the \(Y\) variable when the groups have equal sample sizes.
The regression slopes on the dummy variables are interpreted as the difference between the mean of the indicated group and the unweighted mean of the groups.
There are still only \(C - 1\) regression coefficients, so one group gets left out (see extra material for how to get around this).
4.5.4 Extra: Deviation coding with unequal sample sizes*
When groups have unequal sample size, the unweighted mean of the group means is not equal to the overall mean of the \(Y\) variable. This is not always a problem. In particular, in the IDELA example, it is reasonable that each country should receive equal weight, even if the size of the samples in each country differed.
However, if you want to compare each group’s mean to the overall mean on \(Y\), deviation coding can be adjusted by replacing the dummy-coded value \(-1\) with minus the ratio of the indicated group’s sample size to the omitted group’s sample size. An example for 3 groups, with group 3 omitted, is shown below.
\[ \begin{matrix} & X_1 & X_2 \\ \text{Group 1} & 1 & 0 \\ \text{Group 2} & 0 & 1 \\ \text{Group 3} & -n_1/n_3 & -n_2/n_3 \end{matrix} \]
Replacing \(\widehat Y( \text{Group }c )\) with \(\bar Y_c\) you can also show that \(b_0 = \bar Y\), using the rules of summation algebra.
The punchline: unlike unweighted deviation coding, the intercept in weighted deviation coding is always equal to the overall mean on the outcome variable, regardless of the sample sizes of the groups. In R, you can use the package wec for weighted effect coding.
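Here is a minimal base-R sketch of building such a contrast matrix by hand, assuming hypothetical group sizes of 40, 25, and 35 (the wec package automates this):
Code
# Weighted deviation contrasts for 3 groups with hypothetical sizes;
# group 3 is omitted, so its -1 codes become -n1/n3 and -n2/n3
n1 <- 40; n2 <- 25; n3 <- 35
W <- rbind(c(1, 0),
           c(0, 1),
           c(-n1 / n3, -n2 / n3))
rownames(W) <- c("Group1", "Group2", "Group3")
W
# To use it, assign to a 3-level factor before calling lm():
# contrasts(my_factor) <- W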
4.5.5 Extra: Deviation coding with all groups included*
Another shortcoming of deviation coding is that it requires leaving one group out of the model. As a workaround, one can instead use the following approach. Note that this approach will affect the statistical significance of R-squared, so you should only use it if you aren’t interested in reporting R-squared (or are willing to do some more fiddling around with R).
Step 1: Center the \(Y\) variable so that \(\bar Y = 0\).
Step 2: Compute binary dummy variables (reference-group coding) for all \(C\) groups, \(X_1, X_2, \dots, X_C\)
Step 3: Regress \(Y\) on all \(C\) dummy variables, without the intercept in the model:
\[\hat Y = b_1 X_1 + b_2 X_2 + \dots + b_C X_C.\]
Using the two step approach, it is easy to show that the regression coefficients are just the means of the indicated group. Since the overall mean of \(Y\) is zero (because of Step 1), the group means can be interpreted as deviations from the overall mean on \(Y\).
Using the lm function in R, you can remove the intercept from the model using -1 in the formula syntax:
Y ~ -1 + X1 + ...
As noted, this will affect the statistical significance of R-squared, as reported in summary(lm). So, you shouldn’t use this approach when you want to report R-squared. The next two paragraphs explain why, but the explanation requires some material we won’t discuss until Chapter 6.
The F-test of R-squared can be computed by comparing two different models, a reference model with no predictors and a focal model with the predictors. The reference model used by lm depends on whether the intercept is included in the focal model. If the focal model has an intercept, the reference model also has an intercept. If the focal model omits the intercept, the reference model also omits the intercept.
In the present context, the focal model omits the intercept and uses \(C\) dummies. This model has the same R-squared as if we had used the intercept and \(C-1\) dummies. However, the F-statistic for the test of R-squared (as reported by summary(lm)) is different for the two approaches, because of the choice of reference model. You can change the reference model, but it takes a bit of fiddling around. If you are interested, just ask in class.
4.6 Relation with ANOVA
As mentioned in Chapter 3, the F-test of R-squared is equivalent to simultaneously testing whether all of the regression slopes are equal to zero. In symbols:
Testing
\[H_0: R^2 = 0\]
is equivalent to testing
\[H_0: b_1 = b_2 = \cdots = b_{C-1} = 0\]
We have seen above that the regression coefficients of dummy variables can be interpreted in terms of group-mean differences. For example, under reference-group coding, we can re-write the null hypothesis above as:
\[H_0: \mu_1 - \mu_C = \mu_2 - \mu_C = \cdots = \mu_{C-1} - \mu_C = 0\] In this notation, we let \(C\) denote the reference group. If we add \(\mu_C\) to each equality in the null hypothesis, we have
\[H_0: \mu_1 = \mu_2 = \cdots = \mu_{C-1} = \mu_C\]
Note that this is the same null hypothesis as one-way ANOVA – i.e., all the group means are equal. Consequently, the F-test of R-squared using a categorical predictor is equivalent to the F-test of the omnibus hypothesis in one-way ANOVA (assuming homoskedasticity). In Section 4.9, we show how the two approaches produce the same numerical values.
Although regression with a categorical predictor is mathematically equivalent to one-way ANOVA, you may have noticed that the two approaches are not used in the same way. In ANOVA, we first conduct an F-test of the omnibus hypothesis, and, if the test is significant, we follow up by comparing the groups using procedures that control the familywise error rate. In regression, we just report the F-test along with the tests of the individual contrasts, without any discussion of familywise error rate control. Which approach is correct? It depends who you ask….
4.7 More about categorical data*
Check back later for:
mean and variance of binary variables
covariance between binary variables and between binary and continuous
regression coefficient when X is binary
4.8 Workbook
This section collects the questions asked in this chapter. The lesson for this chapter will focus on discussing these questions and then working on the exercises in Section 4.9. The lesson will not be a lecture that reviews all of the material in the chapter! So, if you haven’t written down / thought about the answers to these questions before class, the lesson will not be very useful for you. Please engage with each question by writing down one or more answers, asking clarifying questions about related material, posing follow up questions, etc.
Please take a moment to write down your thoughts on the tensions that arise when conceptualizing social constructs such as gender or race as categorical, and I will invite you to share your thoughts in class.
Please take a moment to write down how the regression output is related to the Figure. In particular, what is \(\bar Y_0\) equal to, what is \(\bar Y_1\) equal to, and what is their difference equal to?
Code
# Don't read this unless you really like working on graphics :)
binary_gender <- (gender == "Male") * 1
plot(binary_gender, achrdg08, col = "#4B9CD3",
     xlab = "binary gender (Female = 0, Male = 1)",
     ylab = "Reading (Grade 8)")
means <- tapply(achrdg08, binary_gender, mean)
labels <- c(expression(bar(Y)[0]), expression(bar(Y)[1]))
text(x = c(.1, .9), y = means, labels = labels, cex = 1.5)
text(x = c(0, 1), y = means, labels = c("_", "_"), cex = 2)
We will go through the comparison between these two outputs in class together. If you have any questions about the relation between these two sets of output, please note them now and be prepared to ask them in class.
Regression with a binary predictor:
Code
summary(mod_binary)
Call:
lm(formula = achrdg08 ~ binary_gender)
Residuals:
Min 1Q Median 3Q Max
-20.7278 -6.1472 0.3784 6.9765 15.0045
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 56.4678 0.5342 105.703 <2e-16 ***
binary_gender -0.9223 0.7928 -1.163 0.245
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.827 on 498 degrees of freedom
Multiple R-squared: 0.00271, Adjusted R-squared: 0.0007076
F-statistic: 1.353 on 1 and 498 DF, p-value: 0.2452
Independent samples t-test (with homogeneity of variance assumed):
Code
t.test(achrdg08 ~ binary_gender, var.equal = T)
Two Sample t-test
data: achrdg08 by binary_gender
t = 1.1633, df = 498, p-value = 0.2452
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
-0.6353793 2.4800586
sample estimates:
mean in group 0 mean in group 1
56.46780 55.54546
Step 2. Solve for the model parameters in terms of the predicted values.
\[\begin{align}
b_0 & = \widehat Y (Single) \\
b_1 & = \widehat Y (Married) - b_0 = \widehat Y (Married) - \widehat Y (Single) \\
b_2 & = \widehat Y (Divorced) - b_0 = \widehat Y (Divorced) - \widehat Y (Single)
\end{align}\]
Using the above equations, please write down an interpretation of the regression parameters for the hypothetical example. (Note: this question is not asking for a numerical answer, it is just asking you to put the above equations into words.)
Please work through the following questions and be prepared to share your answers in class
How should \(X_3\) be coded so that “single” is the reference-group?
Write down the contrast matrix for the example.
Using the two-step approach illustrated above, write out the interpretation of the regression coefficients.
Let’s imagine that we included an additional dummy variable for “Single” in the model. In this case we would have \(C = 4\) dummies in the model rather than \(C-1 = 3\) dummies. Please take a moment to write down what you think will happen (Hint: something goes wrong).
Step 2. Solve for the model parameters in terms of the predicted values.
\[\begin{align}
b_1 &= \widehat Y (Ethiopia) - b_0 \\
b_2 &= \widehat Y (Vietnam) - b_0 \\
b_0 & = \widehat Y (Bolivia) + b_1 + b_2 \\
\end{align}\]
At this point, we want to use the first two lines to substitute in for \(b_1\) and \(b_2\) in the last line, which leads to
\[\begin{align}
b_0 & = \frac{\widehat Y (Bolivia) + \widehat Y (Ethiopia) + \widehat Y (Vietnam)}{3}
\end{align}\]
In the last line we see that \(b_0\) is equal to the (unweighted) mean of the predicted values.
Using the above equations, please write down an interpretation of the regression parameters for the hypothetical example. (Note: this question is not asking for a numerical answer, it is just asking you to put the above equations into words.)
4.9 Exercises
These exercises provide an overview of contrast coding with categorical predictors in R. Two preliminary topics are also discussed: linear regression with a binary predictor, and the factor data class in R. We will go through this material in class together, so you don’t need to work on it before class (but you can if you want.)
Before starting this section, you may find it useful to scroll to the top of the page, click on the “</> Code” menu, and select “Show All Code.”
4.9.1 A single binary predictor
Regression with a single binary predictor is equivalent to the independent samples t-test of group means. Let’s illustrate this using the NELS dataset with Reading achievement in grade 8 (achrdg08) as the outcome and gender as the predictor. Although gender need not be conceptualized as dichotomous or even categorical, the variable gender reported in NELS data is dichotomous, with values “Female” and “Male”.
To start with, let’s recode gender to binary_gender by setting Female = 0 and Male = 1. Below we use a little trick for doing this. First we create a logical vector (gender == "Male") and then coerce the logical vector to binary by multiplying it by 1. There are other ways to do this but this approach is pretty easy.
Code
# load("NELS.RData")# attach(NELS)# Note that gender is non-numeric -- it is a factor with levels "Female" and "Male"head(gender)
[1] Male Female Male Female Male Female
Levels: Female Male
Code
class(gender)
[1] "factor"
Code
# A trick to create a binary indicator for males
binary_gender <- (gender == "Male") * 1

# Check that the two variables are telling us the same thing
table(gender, binary_gender)
binary_gender
gender 0 1
Female 273 0
Male 0 227
Next we regress Reading Achievement (achrdg08) on binary_gender and compare this to an independent samples t-test using the same two variables.
Code
# Regression with a binary variable
mod1 <- lm(achrdg08 ~ binary_gender)
summary(mod1)
Call:
lm(formula = achrdg08 ~ binary_gender)
Residuals:
Min 1Q Median 3Q Max
-20.7278 -6.1472 0.3784 6.9765 15.0045
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 56.4678 0.5342 105.703 <2e-16 ***
binary_gender -0.9223 0.7928 -1.163 0.245
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.827 on 498 degrees of freedom
Multiple R-squared: 0.00271, Adjusted R-squared: 0.0007076
F-statistic: 1.353 on 1 and 498 DF, p-value: 0.2452
Code
# Compare to the output of t-test
t.test(achrdg08 ~ binary_gender, var.equal = T)
Two Sample t-test
data: achrdg08 by binary_gender
t = 1.1633, df = 498, p-value = 0.2452
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
-0.6353793 2.4800586
sample estimates:
mean in group 0 mean in group 1
56.46780 55.54546
Note that:
The intercept in mod1 is equal to the mean of the group coded 0 (females) in the t-test:
\[b_0 = \bar Y_0 = 56.4678\]
The b-weight in mod1 is equal to the difference between the means:
\[ b_1 = \bar Y_1 - \bar Y_0 = 55.54546 - 56.46780 = -0.9223 \]
The t-test of the b-weight, and its p-value, are equivalent to the t-test of the mean difference (except for the sign):
\[ t(498) = 1.1633, p = .245\]
In summary, a t-test of the b-weight of a binary predictor is equivalent to an independent samples t-test of means.
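As a quick check, here is a small sketch that computes the two group means directly and differences them; the result matches the b-weight for binary_gender.
Code
# Compute the two group means and their difference
means <- tapply(achrdg08, binary_gender, mean)
means          # 56.46780 (group 0) and 55.54546 (group 1)
diff(means)    # -0.92234, the b-weight for binary_gender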
4.9.2 Reference-group coding
As discussed in Section 4.4, the basic idea of contrast coding is to replace a categorical variable with \(C\) categories with \(C-1\) “dummy” variables. In reference-group coding, the dummy variables are binary, and the resulting interpretation is:
The intercept is interpreted as the mean of the reference group. The reference group is chosen by the analyst – it is the group that is coded zero on all dummies, or the group whose dummy variable is “left out” of the \(C-1\) dummy variables.
The regression coefficients of the dummy variables are interpreted as the difference between the mean of the indicated group and the mean of the reference group
Reference-group coding is the default contrast coding in R. However, in order for contrast coding to be implemented, our categorical predictor needs to be represented in R as a special data type called a factor.
4.9.3 Factors in R
Factors are the way that R deals with categorical data. If you want to know if your variable is a factor or not, you can use the functions class or is.factor. Let’s illustrate this with the urban variable from NELS.
Code
# Two ways of checking what type a variable is
class(urban)
[1] "factor"
Code
is.factor(urban)
[1] TRUE
Code
# Find out the levels of a factor using "levels"
levels(urban)
[1] "Rural" "Suburban" "Urban"
Code
# Find out what contrast coding R is using for a factor with "contrasts"
contrasts(urban)
Suburban Urban
Rural 0 0
Suburban 1 0
Urban 0 1
In the above code, we see that urban is a factor with 3 levels and the default reference-group contrasts are set up so that Rural is the reference group. Below we will show how to change the contrasts for a variable.
If we are working with a variable that is not a factor, but we want R to treat it as a factor, we can use the factor command. Let’s illustrate this by turning binary_gender into a factor.
Code
# Change a numeric variable into a factor
class(binary_gender)
factor_gender <- factor(binary_gender)
class(factor_gender)
We can also use the levels function to tell R what labels we want it to use for our factor.
Code
# Change the levels of factor_gender to "Female" and "Male"
levels(factor_gender) <- c("Female", "Male")
levels(factor_gender)
[1] "Female" "Male"
4.9.4 Back to reference-group coding
OK, back to reference coding. Let’s see what lm does when we regress achrdg08 on urban.
Code
mod2 <- lm(achrdg08 ~ urban)
summary(mod2)
Call:
lm(formula = achrdg08 ~ urban)
Residuals:
Min 1Q Median 3Q Max
-22.3002 -6.1620 0.2098 6.7948 15.5573
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 54.9927 0.6885 79.876 < 2e-16 ***
urbanSuburban 0.6675 0.9117 0.732 0.46439
urbanUrban 3.1275 1.0480 2.984 0.00298 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.763 on 497 degrees of freedom
Multiple R-squared: 0.01904, Adjusted R-squared: 0.0151
F-statistic: 4.824 on 2 and 497 DF, p-value: 0.008411
In the output, we see that two regression coefficients are reported, one for Suburban and one for Urban. As discussed in Section 4.4, these coefficients are the mean difference between the indicated group and the reference group (Rural).
We can see that Urban students scored significantly higher than Rural students (about 3.13 percentage points), but there was no significant difference between Rural and Suburban students.
The intercept is the mean of the reference group (Rural) – about 55% on the reading test.
Note the R-squared – Urbanicity accounts for about 2% of the variation in reading achievement. As usual, the F-test of R-squared has degrees of freedom \(K\) and \(N - K - 1\), but now \(K\) (the number of predictors) is equal to \(C - 1\) – the number of categories minus one.
4.9.5 Changing the reference group
What if we wanted to use a group other than Rural as the reference group? We can choose a different reference group using the contr.treatment function. This function takes two arguments:
n tells R how many levels there are
base tells R which level should be the reference group
Code
# The current reference group is Rural
contrasts(urban)
Suburban Urban
Rural 0 0
Suburban 1 0
Urban 0 1
Code
# Change the reference group to Urban (i.e., the last level)
contrasts(urban) <- contr.treatment(n = 3, base = 3)
contrasts(urban)
1 2
Rural 1 0
Suburban 0 1
Urban 0 0
Note that when we first ran contrasts(urban), the column names were the names of the levels. But after changing the reference group, the column names are just the numbers 1 and 2. To help interpret the lm output, it is helpful to name the contrast columns appropriately.
Code
# Naming our new contrasts
colnames(contrasts(urban)) <- c("Rural", "Suburban")
contrasts(urban)
Rural Suburban
Rural 1 0
Suburban 0 1
Urban 0 0
Now we are ready to run our regression again, this time using a different reference group.
Code
mod3 <- lm(achrdg08 ~ urban)
summary(mod3)
Call:
lm(formula = achrdg08 ~ urban)
Residuals:
Min 1Q Median 3Q Max
-22.3002 -6.1620 0.2098 6.7948 15.5573
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 58.1202 0.7901 73.559 < 2e-16 ***
urbanRural -3.1275 1.0480 -2.984 0.00298 **
urbanSuburban -2.4600 0.9907 -2.483 0.01335 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.763 on 497 degrees of freedom
Multiple R-squared: 0.01904, Adjusted R-squared: 0.0151
F-statistic: 4.824 on 2 and 497 DF, p-value: 0.008411
Compared to the output from mod2, note that
The intercept now represents the mean reading scores of the Urban group, because this is the new reference group.
The regression coefficients now represent the mean difference between the indicated group and the new reference group.
The R-square and F-test stay the same – in other words, the total amount of variation explained by the variable Urban does not change, just because we changed the reference group.
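An alternative worth knowing about (a sketch, not used above): base R’s relevel function moves a chosen level to the front of a factor, so that the default treatment contrasts use it as the reference group.
Code
# Make Urban the reference group by reordering the factor levels
# (starting from the original NELS factor, whose contrasts we haven't modified)
urban2 <- relevel(NELS$urban, ref = "Urban")
contrasts(urban2)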
4.9.6 Deviation coding
It is possible to change R’s default contrast coding to one of the other built-in contrasts (see help(contrasts) for more information on the built-in contrasts).
For instance, to change to deviation coding, we use R’s contr.sum function and tell it how many levels there are for the factor (n). In deviation coding, the intercept is equal to the unweighted mean of the predicted values, and the regression coefficients are the differences between the indicated group and the unweighted mean.
Code
# Change to sum (deviation) contrasts
contrasts(urban) <- contr.sum(n = 3)

# As above, it is helpful to name the contrasts using "colnames"
colnames(contrasts(urban)) <- c("Rural", "Suburban")
Now we are all set to use deviation coding with lm.
Code
mod4 <- lm(achrdg08 ~ urban)
summary(mod4)
Call:
lm(formula = achrdg08 ~ urban)
Residuals:
Min 1Q Median 3Q Max
-22.3002 -6.1620 0.2098 6.7948 15.5573
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 56.2577 0.4021 139.897 <2e-16 ***
urbanRural -1.2650 0.5654 -2.237 0.0257 *
urbanSuburban -0.5975 0.5299 -1.128 0.2600
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.763 on 497 degrees of freedom
Multiple R-squared: 0.01904, Adjusted R-squared: 0.0151
F-statistic: 4.824 on 2 and 497 DF, p-value: 0.008411
Note the following things about the output:
The regression coefficients compare each group’s mean to the unweighted mean of the groups. Rural and Suburban students are below the unweighted mean, but the difference is only significant for Rural.
Although the output looks similar to that of mod3, the coefficients are all different and they all have different interpretations. The lm output doesn’t tell us this; we have to know what is going on under the hood.
The R-square does not change from mod2 – again, the type of contrast coding used doesn’t affect how much variation is explained by the predictor.
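One more property worth noting (a small sketch): with sum-to-zero coding, the deviation of the omitted group (Urban) can be recovered as minus the sum of the reported coefficients.
Code
# Recover the Urban group's deviation from the unweighted mean of group means
b <- coef(mod4)
-(b["urbanRural"] + b["urbanSuburban"])   # 1.8625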
4.9.7 Extra: Relation to ANOVA
As our next exercise, let’s compare the output of lm and the output of aov – R’s function for Analysis of Variance. If regression and ANOVA are really doing the same thing, we should be able to illustrate it with these two functions. Note that if you check out help(aov), it explicitly states that aov uses the lm function, so finding that the two approaches give similar output shouldn’t be a big surprise!
Code
# Run our model as an ANOVA
aov1 <- aov(achrdg08 ~ urban)

# Compare the output with lm
summary(aov1)
Df Sum Sq Mean Sq F value Pr(>F)
urban 2 741 370.5 4.824 0.00841 **
Residuals 497 38163 76.8
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
summary(mod2)
Call:
lm(formula = achrdg08 ~ urban)
Residuals:
Min 1Q Median 3Q Max
-22.3002 -6.1620 0.2098 6.7948 15.5573
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 54.9927 0.6885 79.876 < 2e-16 ***
urbanSuburban 0.6675 0.9117 0.732 0.46439
urbanUrban 3.1275 1.0480 2.984 0.00298 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.763 on 497 degrees of freedom
Multiple R-squared: 0.01904, Adjusted R-squared: 0.0151
F-statistic: 4.824 on 2 and 497 DF, p-value: 0.008411
If we compare the output from aov to the F-test reported by lm we see that the F-test, degrees of freedom, and p-value are all identical. If we compute eta-squared from aov, we also find that it is equal to the R-squared value from lm.
Code
# Compute eta-squared from aov output
741 / (741 + 38163)
[1] 0.01904688
In short, ANOVA and regression are doing the same thing: R-squared is the same as eta-squared and the ANOVA omnibus F-test is the same as the F-test of R-squared.
4.9.8 Extra: Weighted deviation coding with all groups
This section presents a “hack” for addressing the two main issues with deviation coding noted in Section 4.5. This is a hack in the sense that we are working around R’s usual procedures. The result of this hack is to provide deviation coding in which comparisons are made to the grand mean on the outcome for all \(C\) categories, not just \(C-1\) categories. As a side effect, the F-test for the R-squared statistic reported in summary(lm) is no longer correct, so we show how to fix that too.
First, note that if we omit the intercept term, all of the reference-group coded dummies can be included in the model and the regression coefficients now correspond to the means of each group. We omit the intercept using -1 in the model formula.
Code
# Omit the intercept using -1
mod5 <- lm(achrdg08 ~ -1 + urban)
summary(mod5)
Call:
lm(formula = achrdg08 ~ -1 + urban)
Residuals:
Min 1Q Median 3Q Max
-22.3002 -6.1620 0.2098 6.7948 15.5573
Coefficients:
Estimate Std. Error t value Pr(>|t|)
urbanRural 54.9927 0.6885 79.88 <2e-16 ***
urbanSuburban 55.6602 0.5976 93.14 <2e-16 ***
urbanUrban 58.1202 0.7901 73.56 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.763 on 497 degrees of freedom
Multiple R-squared: 0.9763, Adjusted R-squared: 0.9761
F-statistic: 6822 on 3 and 497 DF, p-value: < 2.2e-16
Note the following things about the output:
The coefficients are no longer interpreted as mean differences, just the raw means.
The R-square and F-test do change from mod2. When the intercept is omitted, R-squared can no longer be interpreted as a proportion of variance unless the \(Y\) variable is centered. Omitting the intercept also changes the degrees of freedom of the F-test, because there are now 3 predictors instead of 2. In general, when the intercept is omitted, R-square and its F-test do not have the usual interpretation, so we need to be careful.
By itself, the “group-means” approach to contrast coding is not very interesting – we don’t usually want to test whether the group means are different from zero. However, this approach can be used to provide an alternative to deviation coding if we center the \(Y\) variable before using the above approach. After centering, the grand mean of \(Y\) is zero, and so the group means represent deviations from the grand mean (i.e., deviations from zero) and the tests of the regression coefficients are tests of whether the group means are different from the grand mean.
Code
# Center the outcome variable
dev_achrdg08 <- achrdg08 - mean(achrdg08)

# Run the regression with group mean coding
mod6 <- lm(dev_achrdg08 ~ -1 + urban)
summary(mod6)
Call:
lm(formula = dev_achrdg08 ~ -1 + urban)
Residuals:
Min 1Q Median 3Q Max
-22.3002 -6.1620 0.2098 6.7948 15.5573
Coefficients:
Estimate Std. Error t value Pr(>|t|)
urbanRural -1.0564 0.6885 -1.534 0.12556
urbanSuburban -0.3889 0.5976 -0.651 0.51554
urbanUrban 2.0711 0.7901 2.621 0.00903 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.763 on 497 degrees of freedom
Multiple R-squared: 0.01904, Adjusted R-squared: 0.01312
F-statistic: 3.216 on 3 and 497 DF, p-value: 0.02264
Note the following things about the output:
The regression coefficients compare each group’s mean to the overall mean on the outcome. Urban students are significantly above average, Rural and Suburban students are below average but the difference is not significant.
The R-square is the same as mod2 because the \(Y\) variable is centered, but the F test is not the same as from mod2. The next two paragraphs explain why, but the explanation requires some material we won’t discuss until Chapter 6.
The F-test of R-squared can be computed by comparing two different models, a reference model with no predictors and a focal model with the predictors. The reference model used by lm depends on whether the intercept is included in the focal model. If the focal model has an intercept, the reference model also has an intercept. If the focal model omits the intercept, the reference model also omits the intercept.
In the present context, the focal model omits the intercept and uses \(C\) dummies. This model has the same R-squared as if we had used the intercept and \(C-1\) dummies. However, the F-statistic for the test of R-squared (as reported by summary(lm)) is different for the two approaches, because of the choice of reference model. You can change the reference model, but it takes a bit of fiddling around. If you are interested, just ask in class.
What all of this means is that we can compute the “correct” F-test of R-squared by comparing mod6 to a model with just the intercept. The following code shows how to do this.
Code
# A model with only the intercept
mod7 <- lm(dev_achrdg08 ~ 1)

# Compute the F-test comparing mod6 to mod7
anova(mod6, mod7)
Analysis of Variance Table
Model 1: dev_achrdg08 ~ -1 + urban
Model 2: dev_achrdg08 ~ 1
Res.Df RSS Df Sum of Sq F Pr(>F)
1 497 38163
2 499 38904 -2 -740.91 4.8244 0.008411 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This is the same F-test reported for mod2, mod3 and mod4 above.
---fold: trueeditor: markdown: wrap: 72---# Categorical predictors {#sec-chap-4}So far we have considered examples in which we regress a continuous outcome variable (e.g., Math Achievement) on one or more continuous predictors (SES, Approaches to Learning). In this chapter we consider how regression can be used with categorical predictors. In an experimental context, the canonical example of a categorical predictor is treatment status (e.g., 1 = treatment group, 0 = control group). Examples of other categorical predictors commonly used in education research include: * Geographical region / school district (Orange, CH-C, Wake, …)* Type of school (public, private, charter, religious)* Which classroom, teacher, or school a student was assigned to* Gender (if recorded as categorical)* Race / ethnicity (if recorded as categorical) * Free / reduced price lunch status * English language learner status* Individualized learning plan status* ...In regression, we deal with these types of predictors using an approach called *contrast coding* (also called *effect coding* or *dummy coding*). In particular, we will address three types of contrast coding: * The special case of a single binary predictor* Reference-group coding (called "treatment" contrasts in R)* Deviation coding (called "sum" contrasts in R)There are other types of contrast coding out there. We do not attempt an exhaustive review. However, the techniques we use in this chapter can be used with other types of contrast coding. Along the way we will see that regression with a categorical predictor encompasses the independent samples t-tests of means and one-way ANOVA, both of which we discussed last semester. In the next chapter we will address how to combine categorical and continuous predictors in the same model. ## Focus on interpretation {#sec-interpretations-4}Categorical predictors can be challenging to understand because, depending on the contrast coding used, the model results can appear quite different.For example, the two models below use the same data and the same variables (Math Achievement regressed on Urbanicity), but their regression coefficients have different values. Why? Because the models used different contrast coding for Urbanicity. In the first model, Urban and Suburban students are compared to Rural students. In the second model, Rural and Suburban students are compared to the unweighted average across all three groups. We talk in more detail about these interpretations in the following sections. For now, the main thing to notice is just that "different contrast coding = different results". ```{r}load("NELS.RData")attach(NELS)# run model with default contrast (treatment / dummy coding)egA <-lm(achrdg08 ~ urban)# change to sum / deviation contrasts and run againcontrasts(urban) <-contr.sum(n =3)colnames(contrasts(urban)) <-c("Rural", "Suburban")egB <-lm(achrdg08 ~ urban)# printsummary(egA)summary(egB)# Reset to original contrastscontrasts(urban) <-contr.treatment(n =3)colnames(contrasts(urban)) <-c("Suburban", "Urban")```Note that the `lm` output doesn’t explicitly tell us what kind of coding was used. So, we need to know what is going on “under the hood” in order to interpret the output correctly. Also note that the while the *Coefficients* tables of the two models are different, they both explain the same amount of variance in the outcome. So, we could say that both models "fit" the data equally well, but they provide different interpretations of the relationship between Math Achievement and Urbanicity. 
The choice of which contrast to use is up to the researcher. This choice should reflect the research questions you want to address. For example, if we wanted to know how Urban and Suburban students differ from Rural students, that is what the first output shows us. If we wanted to know how Suburban and Rural students differ from the overall average, that is what the second output shows us. As we will discuss below, we can't make all possible comparisons among groups in a single model, so we have to make some compromises when choosing contrasts. Understanding how to align our research questions with the choice of contrasts is a big part of what this chapter is about!## Data and social constructs {#sec-social-constructs-4}Before getting into the math, let's consider some conceptual points. First, some terminology. *Binary* means that a variable can take on only two values: 1 and 0. If a variable takes on two values but these are represented with other numbers (e.g., 1 and 2) or with non-numeric values ("male", "female"), it is called *dichotomous* rather than binary. Otherwise stated, a binary variable is a dichotomous variable whose values are 1 and 0. The term *categorical* is used to describe variables with two or more categories -- it includes binary and dichotomous variables, as well as variables with more categories. Note that encoding a variable as dichotomous does not imply that the underlying social construct is dichotomous. For example, we can encode educational attainment as a dichotomous variable indicating whether or not a person has graduated high school. This does not imply that educational attainment has only two values "irl", or even that educational attainment is best conceptualized in terms of years of formal education. Nonetheless, for many outcomes of interest it can be meaningful to consider whether individuals have completed high school (e.g., [https://www.ssa.gov/policy/docs/research-summaries/education-earnings.html](https://www.ssa.gov/policy/docs/research-summaries/education-earnings.html)).In general, the way that a variable is encoded in a dataset is not a statement about reality -- it reflects a choice made by researchers about how to represent social constructs. When conducting quantitative research, we are often faced with less-than-ideal encodings of social constructs, especially demographic variables. For example, both NELS and ECLS conceptualize gender as dichotomous and use a limited set of mutually exclusive categories for race. These representations are not well aligned with current literature on gender and racial identity. Some recent perspectives on how these issues play into quantitative research are available here: [https://www.sree.org/critical-perspectives](https://www.sree.org/critical-perspectives). Despite the conceptual issues inherent in categorizing social constructs, I would argue that categorical variables often have utility, *especially* in the study of social inequality. 
Here is an example of why I think gender qua "male/female" is a flawed but important consideration in global education: [https://www.unicef.org/education/girls-education](https://www.unicef.org/education/girls-education).**Please take a moment to write down your thoughts on the tensions that arise when conceptualizing social constructs such as gender or race as categorical, and I will invite you to share you thoughts in class.**## Binary predictors {#sec-binary-predictors-4}Let's start our interpretation of categorical predictors with the simplest case: a single binary predictor. @fig-reading-on-gender-4 uses the NELS data to illustrate the regression of Reading Achievement in Grade 8 (`achrdg08`) on a binary encoding of Gender (female = 0, male = 1). There isn't a lot going on the plot! However, we can see the conditional distributions of Reading Achievement for each value of Gender, and the means of the two groups are indicated. ```{r fig-reading-on-gender-4, fig.cap = 'Reading Achievement on Binary Gender.', fig.align = 'center', fig.width=5}# Don't read this unless you really like working on graphics :) binary_gender <- (gender =="Male") *1plot(binary_gender, achrdg08, col ="#4B9CD3", xlab ="binary gender (Female = 0, Male = 1)",ylab ="Reading (Grade 8)")means <-tapply(achrdg08, binary_gender, mean)labels <-c(expression(bar(Y)[0]), expression(bar(Y)[1]))text(x =c(.1, .9), y = means, labels = labels, cex =1.5) text(x =c(0, 1), y = means, labels =c("_", "_"), cex =2) ```We can represent the relationship between Reading Achievement and Gender using the same simple regression equation from @sec-regression-line-2$$\widehat Y = b_0 + b_1 X, $$and we can still interpret the regression slope in terms of a one-unit increase in $X$. However, since $X$ can only take on two values (0 or 1), there are also other interpretations of the regression coefficients. In this section we are interested in the these "more specific" interpretations of regression coefficients when $X$ is binary. The general strategy for approaching the interpretation of regression coefficients with categorical predictors has two steps:* *Step 1.* Plug the values for $X$ into the regression equation to get the predicted values for each group$$\begin{align}\widehat Y (Female) & = b_0 + b_1 (0) = b_0 \\\widehat Y (Male) & = b_0 + b_1 (1) = b_0 + b_1 \end{align}$$* *Step 2.* Solve for the coefficients in terms the predicted values. $$ b_0 = \widehat Y (Female)$$ {#eq-binary-1}$$ b_1 = \widehat Y (Male) - b_0 = \widehat Y (Male) - \widehat Y (Female) $$ {#eq-binary-2}Looking at @eq-binary-1 we can see that intercept ($b_0$) is equal to the predicted value of Reading Achievement for Females, and @eq-binary-2 shows that the regression slope ($b_1$) is equal to the difference between predicted Reading Achievement for Males and Females. There is one last detail that is important for interpreting these equations: for a single categorical predictor, the predicted values for each category are just the group means on the $Y$ variable (see @sec-categorical-results-4, which is optional). So, using the notation of @fig-reading-on-gender-4, we can re-write @eq-binary-1 and @eq-binary-2 as$$b_0 = \bar Y_0 $$ {#eq-binary-3} $$ b_1 = \bar Y_1 - \bar Y_0 $${#eq-binary-3} We will use the equivalence between predicted values and group means throughout this chapter. However, it is important to note this equivalence holds only when there is single categorical predictor in the model, and no other predictors. 
For the example data, the regression coefficients are:

```{r}
# convert "Female / Male" coding to binary
gender <- NELS$gender
binary_gender <- (gender == "Male") * 1
mod_binary <- lm(achrdg08 ~ binary_gender)
coef(mod_binary)
```

**Please take a moment and write down how these two numbers are related to @fig-reading-on-gender-4. In particular, what is $\bar Y_0$ equal to, what is $\bar Y_1$ equal to, and what is their difference equal to?**

### Relation with t-tests of means {#sec-t-test-4}

Simple regression with a binary predictor is equivalent to conducting an independent samples t-test in which the $X$ variable (Gender) is the grouping variable and the $Y$ variable (Reading Achievement) is the outcome. The following output illustrates this.

For the regression model (same as above):

```{r}
summary(mod_binary)
```

For the independent samples t-test (with homogeneity of variance assumed):

```{r}
t.test(achrdg08 ~ binary_gender, var.equal = T)
```

We will go through the comparison between these two outputs in class together. **If you have any questions about the relation between these two sets of output, please note them now and be prepared to ask them in class.**

### Summary

When doing regression with a binary predictor:

* The regression intercept is equal to the mean of the group coded "0".
* The regression slope is equal to the mean of the group coded "0" subtracted from the mean of the group coded "1".
* Testing $H_0: b_1 = 0$ is equivalent to testing the mean difference $H_0: \mu_1 - \mu_0 = 0$
  * i.e., regression with a binary variable is the same as a t-test of means for independent samples

## Reference-group coding {#sec-reference-group-coding-4}

Now that we know about regression with a binary predictor, let's consider how to extend this approach to contrast coding of categorical predictors with $C \geq 2$ categories. The basic idea is to represent the $C$ categories of the predictor in terms of $C - 1$ "dummy variables." Binary coding of a dichotomous predictor is one example of this: we represented a categorical variable with $C = 2$ categories using one binary predictor.

For predictors with $C \geq 2$ categories, the most common approach to contrast coding is called *reference-group* coding. In R, this approach is called *treatment contrasts* and is the default coding for categorical predictors. It is called reference-group coding because:

* The researcher chooses a reference group.
* The intercept is interpreted as the mean of the reference group.
* The $C - 1$ regression slopes are interpreted as the mean differences between the $C - 1$ other groups and the reference group.

Note that reference-group coding is a generalization of binary coding. In the example from @sec-binary-predictors-4:

* Females were the reference group.
* The intercept was equal to the mean Reading Achievement for females.
* The regression coefficient was equal to the mean difference between males and females.

The rest of this section considers how to generalize this approach to more than 2 groups.

### A hypothetical example

@fig-martital-status1 presents a toy example. The data show the Age and marital status (Mstatus) of 16 hypothetical individuals. Marital status is encoded as

* Single
* Married
* Divorced

```{r fig-martital-status1, fig.cap = 'Toy Marital Status Example.', fig.align = 'center'}
knitr::include_graphics("files/images/marital_status1.png")
```

These 3 categories are represented by two binary variables, denoted $X_1$ and $X_2$.
* $X_1$ is a binary variable that is equal to 1 when Mstatus is "married", and equal to 0 otherwise.
* $X_2$ is a binary variable that is equal to 1 when Mstatus is "divorced", and equal to 0 otherwise.

The binary variables are often called *dummies* or *indicators*. For example, $X_1$ is a dummy or indicator for married respondents. In reference-group coding, the group that does not have a dummy variable is the reference group. It is also the group that is coded zero on all of the included dummies. In this example, "Single" is the reference group.

We can more compactly write down the contrast coding in the example using a *contrast matrix*:

$$ \begin{matrix}
 & X_1 & X_2 \\
\text{Married} & 1 & 0 \\
\text{Divorced} & 0 & 1 \\
\text{Single} & 0 & 0 \\
\end{matrix} $$

This notation is an abbreviated version of the data matrix in @fig-martital-status1. It summarizes how the two indicator variables correspond to the different levels of the categorical variable. As we will see in @sec-exercises-4, contrast matrices are useful for programming in R (a short code sketch of this matrix appears just before the summary below).

### Interpreting the regression coefficients {#sec-reference-group-interpretation-4}

Regressing Age on the dummies, we have:

$$ \widehat Y = b_0 + b_1 X_1 + b_2 X_2. $$

In order to interpret the regression coefficients we can use the same two steps as in @sec-binary-predictors-4

* *Step 1.* Plug the values for the $X$ variables into the regression equation to get the predicted values for each group

\begin{align}
\widehat Y (Single) & = b_0 + b_1 (0) + b_2 (0) = b_0 \\
\widehat Y (Married) & = b_0 + b_1 (1) + b_2 (0) = b_0 + b_1 \\
\widehat Y (Divorced) & = b_0 + b_1 (0) + b_2 (1) = b_0 + b_2
\end{align}

* *Step 2.* Solve for the model parameters in terms of the predicted values.

\begin{align}
b_0 & = \widehat Y (Single) \\
b_1 & = \widehat Y (Married) - b_0 = \widehat Y (Married) - \widehat Y (Single) \\
b_2 & = \widehat Y (Divorced) - b_0 = \widehat Y (Divorced) - \widehat Y (Single)
\end{align}

**Using the above equations, please write down an interpretation of the regression parameters for the hypothetical example. (Note: this question is not asking for a numerical answer, it is just asking you to put the above equations into words.)**

### More than 3 categories {#sec-reference-group-more-than-three-4}

@fig-martital-status2 extends the example by adding another category for Mstatus ("widowed").

```{r fig-martital-status2, fig.cap = 'Toy Marital Status Example, Part 2.', fig.align = 'center'}
knitr::include_graphics("files/images/marital_status2.png")
```

Here we are interested in the model

$$\widehat Y = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3$$

**Please work through the following questions and be prepared to share your answers in class**

* **How should $X_3$ be coded so that "single" is the reference group?**
* **Write down the contrast matrix for the example.**
* **Using the two-step approach illustrated above, write out the interpretation of the regression coefficients.**
* **Let's imagine that we included an additional dummy variable for "Single" in the model. In this case we would have $C = 4$ dummies in the model rather than $C - 1 = 3$ dummies. Please take a moment to write down what you think will happen (Hint: something goes wrong).**
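As promised above, here is a small sketch (shown just for reference; the full R workflow is covered in @sec-exercises-4) of how R generates the reference-group contrast matrix for the three-category version of the Mstatus example. The level names are taken from the toy example; `base = 1` makes the first level, "Single", the reference group.

```{r}
# Reference-group ("treatment") contrast matrix for the toy Mstatus example.
# The rows appear in level order, but the coding matches the matrix shown above.
contr.treatment(c("Single", "Married", "Divorced"), base = 1)
```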
### Summary

In reference-group coding with a single categorical variable:

* The reference group is chosen by the analyst: it is the group that is coded zero on all dummies, or the one that has its indicator omitted from the model.
* The intercept is interpreted as the mean of the reference group.
* The regression coefficients of the dummy variables are interpreted as the difference between the mean of the indicated group and the mean of the reference group.

## Deviation coding {#sec-deviation-coding-4}

In some research scenarios, there is a clear reference group. For example, in an experiment, all treatment groups can be compared to the control group. But in other cases, it is less clear what the reference group should be. In both of the examples we have considered so far (gender and marital status), the choice of reference group was arbitrary.

When there is not a clear reference group, it can be preferable to use other types of contrast coding than reference-group coding. One such approach is called *deviation coding*. In R, this is called *sum-to-zero contrasts*, or *sum* contrasts for short.

The main difference between deviation coding and reference-group coding is the interpretation of the intercept. In deviation coding, the intercept is no longer the mean of the reference group; instead it represents the mean of all groups' predicted values. When the groups have equal sample sizes, the deviation-coded intercept is also equal to the overall mean on $Y$. More specifically, in deviation coding:

* The intercept is equal to the mean of the predicted values for each category, i.e.,

$$b_0 = \frac{\sum_{c=1}^C \widehat Y_c} {C} $$ {#eq-unweighted-mean}

* When the groups have equal sample sizes ($n$), the intercept is also equal to the overall mean on the outcome variable:

\begin{equation}
b_0 = \frac{\sum_{c=1}^C \widehat Y_c} {C} = \frac{\sum_{c=1}^C \bar Y_c}{C} = \frac{\sum_{c=1}^C \left(\frac{\sum_{i=1}^n Y_{ic}}{n}\right)} {C} = \frac{\sum_{c=1}^C \sum_{i=1}^n Y_{ic}}{nC} = \bar Y
\end{equation}

The regression slopes compare each group to the intercept. Consequently, when the groups have equal sample sizes, the regression slopes are interpreted as the deviation of each group mean from the overall mean, which is why it is called deviation coding.

When the groups have unequal sample sizes, the situation is a bit more complicated. In particular, we have to weight the predicted values in @eq-unweighted-mean by the group sample sizes. This is addressed in @sec-extra1-4 (optional). To clarify that the intercept in deviation coding is not always equal to $\bar Y$, we refer to it as an "unweighted mean" of the group means / predicted scores.

It is important to note that there are still only $C - 1$ regression slopes. So, one group gets left out of the analysis, and the researcher has to choose which one. This is a shortcoming of deviation coding, which is addressed in @sec-extra2-4 (optional).

### A hypothetical example {#sec-idela-4}

The International Development and Early Learning Assessment (IDELA) is an assessment designed to measure young children's development in literacy, numeracy, social-emotional, and motor domains, in international settings. @fig-idela1 shows the countries in which the IDELA had been used as of 2017 (for more info, see [https://www.savethechildren.net/sites/default/files/libraries/GS_0.pdf](https://www.savethechildren.net/sites/default/files/libraries/GS_0.pdf)).

```{r fig-idela1, fig.cap = 'IDELA Worldwide Usage, 2017.', fig.align = 'center'}
knitr::include_graphics("files/images/idela_map.png")
```

If our goal was to compare countries' IDELA scores, agreeing on which country should serve as the reference group would be politically fraught.
Therefore, it would be preferable to avoid the problem of choosing a reference group altogether. In particular, deviation coding lets us compare each country's mean IDELA score to the (unweighted) mean over all of the countries.

@fig-idela2 presents a toy example. The data show the IDELA scores and Country for 16 hypothetical individuals. The countries considered in this example are

* Ethiopia
* Vietnam
* Bolivia

These 3 countries are represented by two binary variables, denoted $X_1$ and $X_2$.

* $X_1$ is an indicator for Ethiopia
* $X_2$ is an indicator for Vietnam

```{r fig-idela2, fig.cap = 'Toy IDELA Example.', fig.align = 'center'}
knitr::include_graphics("files/images/idela1.png")
```

Note that the dummy variables are different from those used in reference-group coding (@sec-reference-group-coding-4). In deviation coding, the dummies always take on the values $1, 0, -1$. The same group must receive the code $-1$ for all dummies. The group with the value $-1$ is the group that gets left out of the analysis.

**Please take a moment to write down the contrast matrix for the IDELA example.**

### Interpreting the regression coefficients {#sec-deviation-interpretation-4}

Regressing IDELA on the dummies, we have:

$$\widehat Y = b_0 + b_1 X_1 + b_2 X_2$$

In order to interpret the regression coefficients, we proceed using the same two steps as in @sec-binary-predictors-4 and @sec-reference-group-coding-4.

* *Step 1.* Plug the values for $X$ into the regression equation to get the predicted values for each group.

\begin{align}
\widehat Y (Ethiopia) & = b_0 + b_1 (1) + b_2 (0) = b_0 + b_1 \\
\widehat Y (Vietnam) & = b_0 + b_1 (0) + b_2 (1) = b_0 + b_2 \\
\widehat Y (Bolivia) & = b_0 + b_1 (-1) + b_2 (-1) = b_0 - b_1 - b_2
\end{align}

* *Step 2.* Solve for the model parameters in terms of the predicted values.

\begin{align}
b_1 &= \widehat Y (Ethiopia) - b_0 \\
b_2 &= \widehat Y (Vietnam) - b_0 \\
b_0 & = \widehat Y (Bolivia) + b_1 + b_2 \\
\end{align}

At this point, we want to use the first two lines to substitute in for $b_1$ and $b_2$ in the last line:

\begin{align}
b_0 & = \widehat Y (Bolivia) + \widehat Y (Ethiopia) - b_0 + \widehat Y (Vietnam) - b_0 \\
\implies & \\
3b_0 & = \widehat Y (Bolivia) + \widehat Y (Ethiopia) + \widehat Y (Vietnam) \\
\implies & \\
b_0 & = \frac{\widehat Y (Bolivia) + \widehat Y (Ethiopia) + \widehat Y (Vietnam)}{3}
\end{align}

In the last line we see that $b_0$ is equal to the (unweighted) mean of the predicted values.

**Using the above equations, please write down an interpretation of the regression parameters for the hypothetical example. (Note: this question is not asking for a numerical answer, it is just asking you to put the above equations into words.)**

### Summary

In deviation coding with a single categorical variable:

* The intercept is interpreted as the unweighted mean of the groups' means, which is equal to the overall mean on the $Y$ variable when the groups have equal sample sizes.
* The regression slopes on the dummy variables are interpreted as the difference between the mean of the indicated group and the unweighted mean of the groups.
* There are still only $C - 1$ regression coefficients, so one group gets left out (see the extra material for how to get around this).
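As a numerical check on this interpretation, here is a minimal sketch (assuming the NELS data are loaded as in the earlier chunks) that fits the Urbanicity model with sum contrasts and compares the intercept to the unweighted mean of the three group means. The sketch works on a copy of `urban` so that it does not change the contrasts attached to the original variable.

```{r}
# Check that the deviation-coded intercept equals the unweighted mean of the group means
urban_dev <- NELS$urban
contrasts(urban_dev) <- contr.sum(n = 3)
fit_dev <- lm(NELS$achrdg08 ~ urban_dev)
coef(fit_dev)[1]                              # intercept under deviation coding
mean(tapply(NELS$achrdg08, urban_dev, mean))  # unweighted mean of the group means
mean(NELS$achrdg08)                           # overall mean on Y (matches only if group sizes are equal)
```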
### Extra: Deviation coding with unequal sample sizes* {#sec-extra1-4}

When the groups have unequal sample sizes, the unweighted mean of the group means is not equal to the overall mean of the $Y$ variable. This is not always a problem. In particular, in the IDELA example, it is reasonable that each country should receive equal weight, even if the samples from each country differ in size.

However, if you want to compare each group's mean to the overall mean on $Y$, deviation coding can be adjusted by replacing the value $-1$ with minus the ratio of the indicated group's sample size to the omitted group's sample size. An example for 3 groups is shown below.

$$ \begin{matrix}
 & \text{Dummy 1} & \text{Dummy 2} \\
\text{Group 1} & 1 & 0 \\
\text{Group 2} & 0 & 1 \\
\text{Group 3} & - n_1 / n_3 & - n_2 / n_3 \\
\end{matrix} $$

You can use the 2-step procedure to show that this coding, called *weighted deviation coding*, results in

\begin{equation}
b_0 = \frac{n_1 \widehat Y( \text{Group 1}) + n_2 \widehat Y( \text{Group 2}) + n_3 \widehat Y( \text{Group 3})}{n_1 + n_2 + n_3}
\end{equation}

Replacing $\widehat Y( \text{Group }c )$ with $\bar Y_c$, you can also show that $b_0 = \bar Y$, using the rules of summation algebra. The punchline: unlike deviation coding, the intercept in *weighted deviation coding* is always equal to the overall mean on the outcome variable, regardless of the sample sizes of the groups. In R, you can use the package `wec` for weighted effect coding (a by-hand sketch appears at the end of the next subsection).

### Extra: Deviation coding with all groups included* {#sec-extra2-4}

Another shortcoming of deviation coding is that it requires leaving one group out of the model. As a workaround, one can instead use the following approach. Note that this approach will affect the statistical significance of R-squared, so you should only use it if you aren't interested in reporting R-squared (or are willing to do some more fiddling around with R).

* Step 1: Center the $Y$ variable so that $\bar Y = 0$.
* Step 2: Compute binary dummy variables (reference-group coding) for all $C$ groups, $X_1, X_2, \dots, X_C$.
* Step 3: Regress $Y$ on all $C$ dummy variables, without the intercept in the model:

$$\hat Y = b_1 X_1 + b_2 X_2 + \dots + b_C X_C.$$

Using the two-step approach, it is easy to show that the regression coefficients are just the means of the indicated groups. Since the overall mean of $Y$ is zero (because of Step 1), the group means can be interpreted as deviations from the overall mean on $Y$.

Using the `lm` function in R, you can remove the intercept from the model using `-1` in the formula syntax: `Y ~ -1 + X1 + ...`

As noted, this will affect the statistical significance of R-squared, as reported in `summary(lm)`. So, you shouldn't use this approach when you want to report R-squared. The next two paragraphs explain why, but the explanation requires some material we won't discuss until @sec-chap-6.

The F-test of R-squared can be computed by comparing two different models, a reference model with no predictors and a focal model with the predictors. The reference model used by `lm` depends on whether the intercept is included in the focal model. If the focal model has an intercept, the reference model also has an intercept. If the focal model omits the intercept, the reference model also omits the intercept.

In the present context, the focal model omits the intercept and uses $C$ dummies. This model has the same R-squared as if we had used the intercept and $C - 1$ dummies. However, the F-statistic for the test of R-squared (as reported by `summary(lm)`) is different for the two approaches, because of the choice of reference model. You can change the reference model, but it takes a bit of fiddling around. If you are interested, just ask in class.
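To make weighted deviation coding concrete, here is a by-hand sketch (the object names are mine; in practice the `wec` package automates this) that builds the weighted deviation contrast matrix for the three NELS urbanicity groups and checks that the intercept recovers the overall mean on the outcome.

```{r}
# By-hand weighted deviation coding for the NELS urbanicity groups.
# The omitted (third) group is coded -n_c / n_3 in each column.
n_c <- table(NELS$urban)                  # group sizes, in level order
W <- rbind(c(1, 0),
           c(0, 1),
           c(-n_c[1] / n_c[3], -n_c[2] / n_c[3]))
rownames(W) <- levels(NELS$urban)

urban_wdev <- NELS$urban
contrasts(urban_wdev) <- W
fit_wdev <- lm(NELS$achrdg08 ~ urban_wdev)
coef(fit_wdev)[1]      # intercept under weighted deviation coding
mean(NELS$achrdg08)    # overall mean on Y; the two should match (up to rounding)
```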
## Relation with ANOVA

As mentioned in @sec-chap-3, the F-test of R-squared is equivalent to simultaneously testing whether all of the regression slopes are equal to zero. In symbols: testing

$$H_0: R^2 = 0$$

is equivalent to testing

$$H_0: b_1 = b_2 = \cdots = b_{C-1} = 0$$

We have seen above that the regression coefficients of dummy variables can be interpreted in terms of group-mean differences. For example, under reference-group coding, we can re-write the null hypothesis above as:

$$H_0: \mu_1 - \mu_C = \mu_2 - \mu_C = \cdots = \mu_{C-1} - \mu_C = 0$$

In this notation, we let $C$ denote the reference group. If we add $\mu_C$ to each equality in the null hypothesis, we have

$$H_0: \mu_1 = \mu_2 = \cdots = \mu_{C-1} = \mu_C$$

Note that this is the same null hypothesis as one-way ANOVA -- i.e., all the group means are equal. Consequently, the F-test of R-squared using a categorical predictor is equivalent to the F-test of the omnibus hypothesis in one-way ANOVA (assuming homoskedasticity). In @sec-exercises-4, we show how the two approaches produce the same numerical values.

Although regression with a categorical predictor is mathematically equivalent to one-way ANOVA, you may have noticed that the two approaches are not used in the same way. In ANOVA, we first conduct an F-test of the omnibus hypothesis and, if the test is significant, we follow up by comparing the groups using procedures that control the familywise error rate. In regression, we just report the F-test along with the tests of the individual contrasts, without any discussion of familywise error rate control. Which approach is correct? Depends who you ask...

## More about categorical data* {#sec-categorical-results-4}

Check back later for:

- mean and variance of binary variables
- covariance between binary variables and between binary and continuous variables
- the regression coefficient when $X$ is binary

## Workbook {#sec-workbook-4}

This section collects the questions asked in this chapter. The lesson for this chapter will focus on discussing these questions and then working on the exercises in @sec-exercises-4. The lesson will **not** be a lecture that reviews all of the material in the chapter! So, if you haven't written down / thought about the answers to these questions before class, the lesson will not be very useful for you. Please engage with each question by writing down one or more answers, asking clarifying questions about related material, posing follow-up questions, etc.

@sec-social-constructs-4

Please take a moment to write down your thoughts on the tensions that arise when conceptualizing social constructs such as gender or race as categorical, and I will invite you to share your thoughts in class.

@sec-binary-predictors-4

Please take a moment to write down how the regression output is related to the Figure.
In particular, what is $\bar Y_0$ equal to, what is $\bar Y_1$ equal to, and what is their difference equal to?

```{r fig.align = 'center', fig.width=5}
# Don't read this unless you really like working on graphics :)
binary_gender <- (gender == "Male") * 1
plot(binary_gender, achrdg08,
     col = "#4B9CD3",
     xlab = "binary gender (Female = 0, Male = 1)",
     ylab = "Reading (Grade 8)")
means <- tapply(achrdg08, binary_gender, mean)
labels <- c(expression(bar(Y)[0]), expression(bar(Y)[1]))
text(x = c(.1, .9), y = means, labels = labels, cex = 1.5)
text(x = c(0, 1), y = means, labels = c("_", "_"), cex = 2)
```

```{r}
# convert "Female / Male" coding to binary
gender <- NELS$gender
binary_gender <- (gender == "Male") * 1
mod_binary <- lm(achrdg08 ~ binary_gender)
coef(mod_binary)
```

@sec-t-test-4

We will go through the comparison between these two outputs in class together. If you have any questions about the relation between these two sets of output, please note them now and be prepared to ask them in class.

Regression with a binary predictor:

```{r}
summary(mod_binary)
```

Independent samples t-test (with homogeneity of variance assumed):

```{r}
t.test(achrdg08 ~ binary_gender, var.equal = T)
```

@sec-reference-group-interpretation-4

Regressing Age on the dummies, we have:

$$ \widehat Y = b_0 + b_1 X_1 + b_2 X_2 $$

In order to interpret the regression coefficients we can use the same two steps as in @sec-binary-predictors-4

* *Step 1.* Plug the values for the $X$ variables into the regression equation to get the predicted values for each group

\begin{align}
\widehat Y (Single) & = b_0 + b_1 (0) + b_2 (0) = b_0 \\
\widehat Y (Married) & = b_0 + b_1 (1) + b_2 (0) = b_0 + b_1 \\
\widehat Y (Divorced) & = b_0 + b_1 (0) + b_2 (1) = b_0 + b_2
\end{align}

* *Step 2.* Solve for the model parameters in terms of the predicted values.

\begin{align}
b_0 & = \widehat Y (Single) \\
b_1 & = \widehat Y (Married) - b_0 = \widehat Y (Married) - \widehat Y (Single) \\
b_2 & = \widehat Y (Divorced) - b_0 = \widehat Y (Divorced) - \widehat Y (Single)
\end{align}

Using the above equations, please write down an interpretation of the regression parameters for the hypothetical example. (Note: this question is not asking for a numerical answer, it is just asking you to put the above equations into words.)

@sec-reference-group-more-than-three-4

The Mstatus example with 4 categories:

```{r, fig.align = 'center'}
knitr::include_graphics("files/images/marital_status2.png")
```

Here we are interested in the model:

$$\widehat Y = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3$$

Please work through the following questions and be prepared to share your answers in class

* How should $X_3$ be coded so that "single" is the reference group?
* Write down the contrast matrix for the example.
* Using the two-step approach illustrated above, write out the interpretation of the regression coefficients.
* Let's imagine that we included an additional dummy variable for "Single" in the model. In this case we would have $C = 4$ dummies in the model rather than $C - 1 = 3$ dummies. Please take a moment to write down what you think will happen (Hint: something goes wrong).

@sec-idela-4

Please take a moment to write down the contrast matrix for the IDELA example:

$$ \begin{matrix}
 & X_1 & X_2 \\
\text{Ethiopia} & ? & ? \\
\text{Vietnam} & ? & ? \\
\text{Bolivia} & ? & ? \\
\end{matrix} $$
@sec-deviation-interpretation-4

Regressing IDELA on the dummies, we have:

$$\widehat Y = b_0 + b_1 X_1 + b_2 X_2$$

In order to interpret the regression coefficients, we proceed using the same two steps as in @sec-binary-predictors-4 and @sec-reference-group-coding-4.

* *Step 1.* Plug the values for $X$ into the regression equation to get the predicted values for each group.

\begin{align}
\widehat Y (Ethiopia) & = b_0 + b_1 (1) + b_2 (0) = b_0 + b_1 \\
\widehat Y (Vietnam) & = b_0 + b_1 (0) + b_2 (1) = b_0 + b_2 \\
\widehat Y (Bolivia) & = b_0 + b_1 (-1) + b_2 (-1) = b_0 - b_1 - b_2
\end{align}

* *Step 2.* Solve for the model parameters in terms of the predicted values.

\begin{align}
b_1 &= \widehat Y (Ethiopia) - b_0 \\
b_2 &= \widehat Y (Vietnam) - b_0 \\
b_0 & = \widehat Y (Bolivia) + b_1 + b_2 \\
\end{align}

At this point, we want to use the first two lines to substitute in for $b_1$ and $b_2$ in the last line, which leads to

\begin{align}
b_0 & = \frac{\widehat Y (Bolivia) + \widehat Y (Ethiopia) + \widehat Y (Vietnam)}{3}
\end{align}

In the last line we see that $b_0$ is equal to the (unweighted) mean of the predicted values.

Using the above equations, please write down an interpretation of the regression parameters for the hypothetical example. (Note: this question is not asking for a numerical answer, it is just asking you to put the above equations into words.)

## Exercises {#sec-exercises-4}

These exercises provide an overview of contrast coding with categorical predictors in R. Two preliminary topics are also discussed: linear regression with a binary predictor, and the `factor` data class in R. We will go through this material in class together, so you don't need to work on it before class (but you can if you want). Before starting this section, you may find it useful to scroll to the top of the page, click on the "</> Code" menu, and select "Show All Code."

### A single binary predictor

Regression with a single binary predictor is equivalent to the independent samples t-test of group means. Let's illustrate this using the NELS dataset with Reading Achievement in grade 8 (`achrdg08`) as the outcome and `gender` as the predictor. Although gender need not be conceptualized as dichotomous or even categorical, the variable `gender` reported in the NELS data is dichotomous, with values "Female" and "Male".

To start with, let's recode `gender` to `binary_gender` by setting `Female = 0` and `Male = 1`. Below we use a little trick for doing this. First we create a logical vector (`gender == "Male"`) and then coerce the logical vector to binary by multiplying it by 1. There are other ways to do this, but this approach is pretty easy.
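For reference, here is a sketch of two equivalent alternatives (these objects are not used in the rest of the chapter; the chunk below sticks with the multiplication trick):

```{r}
# Equivalent ways to create the same 0/1 variable (for reference only)
binary_gender_alt1 <- ifelse(NELS$gender == "Male", 1, 0)
binary_gender_alt2 <- as.integer(NELS$gender == "Male")
```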
```{r}
# load("NELS.RData")
# attach(NELS)

# Note that gender is non-numeric -- it is a factor with levels "Female" and "Male"
head(gender)
class(gender)

# A trick to create a binary indicator for males
binary_gender <- (gender == "Male") * 1

# Check that the two variables are telling us the same thing
table(gender, binary_gender)
```

Next we regress Reading Achievement (`achrdg08`) on `binary_gender` and compare this to an independent samples t-test using the same two variables.

```{r}
# Regression with a binary variable
mod1 <- lm(achrdg08 ~ binary_gender)
summary(mod1)

# Compare to the output of the t-test
t.test(achrdg08 ~ binary_gender, var.equal = T)
```

Note that:

* The intercept in `mod1` is equal to the mean of the group coded 0 (females) in the t-test: $$b_0 = \bar Y_0 = 56.4678$$
* The b-weight in `mod1` is equal to the difference between the means: $$b_1 = \bar Y_1 - \bar Y_0 = 55.54546 - 56.4678 = -0.9223$$
* The t-test of the b-weight, and its p-value, are equivalent to the t-test of the mean difference (except for the sign): $$ t(498) = 1.1633, p = .245$$

In summary, a t-test of the b-weight of a binary predictor is equivalent to an independent samples t-test of means.

### Reference-group coding

As discussed in @sec-reference-group-coding-4, the basic idea of contrast coding is to replace a categorical variable with $C$ categories with $C - 1$ "dummy" variables. In reference-group coding, the dummy variables are binary, and the resulting interpretation is:

* The intercept is interpreted as the mean of the reference group. The reference group is chosen by the analyst: it is the group that is coded zero on all dummies, or the group whose dummy variable is "left out" of the $C - 1$ dummy variables.
* The regression coefficients of the dummy variables are interpreted as the difference between the mean of the indicated group and the mean of the reference group.

Reference-group coding is the default contrast coding in R. However, in order for contrast coding to be implemented, our categorical predictor needs to be represented in R as a special data type called a `factor`.

### Factors in R

Factors are the way that R deals with categorical data. If you want to know if your variable is a factor or not, you can use the functions `class` or `is.factor`. Let's illustrate this with the `urban` variable from NELS.

```{r}
# Two ways of checking what type a variable is
class(urban)
is.factor(urban)

# Find out the levels of a factor using "levels"
levels(urban)

# Find out what contrast coding R is using for a factor using "contrasts"
contrasts(urban)
```

In the above code, we see that `urban` is a factor with 3 levels and the default reference-group contrasts are set up so that `Rural` is the reference group. Below we will show how to change the contrasts for a variable.

If we are working with a variable that is not a factor, but we want R to treat it as a factor, we can use the `factor` command. Let's illustrate this by turning `binary_gender` into a factor.

```{r}
# Change a numeric variable into a factor
class(binary_gender)
factor_gender <- factor(binary_gender)
class(factor_gender)
levels(factor_gender)
```

We can also use the `levels` function to tell R what labels we want it to use for our factor.

```{r}
# Change the levels of factor_gender to "Female" and "Male"
levels(factor_gender) <- c("Female", "Male")
levels(factor_gender)
```

### Back to reference-group coding

OK, back to reference-group coding. Let's see what `lm` does when we regress `achrdg08` on `urban`.
```{r}
mod2 <- lm(achrdg08 ~ urban)
summary(mod2)
```

In the output, we see that two regression coefficients are reported, one for `Suburban` and one for `Urban`. As discussed in @sec-reference-group-coding-4, these coefficients are the mean differences between the indicated group and the reference group (`Rural`). We can see that Urban students scored significantly higher than Rural students (3.125 percentage points), but there was no significant difference between Rural and Suburban students. The intercept is the mean of the reference group (`Rural`) -- about 55% on the reading test.

Note the R-squared -- Urbanicity accounts for about 2% of the variation in reading achievement. As usual, the F-test of R-squared has degrees of freedom $K$ and $N - K - 1$, but now $K$ (the number of predictors) is equal to $C - 1$ -- the number of categories minus one.

### Changing the reference group

What if we wanted to use a group other than `Rural` as the reference group? We can choose a different reference group using the `contr.treatment` function. This function takes two arguments:

* `n` tells R how many levels there are
* `base` tells R which level should be the reference group

```{r}
# The current reference group is Rural
contrasts(urban)

# Change the reference group to Urban (i.e., the last level)
contrasts(urban) <- contr.treatment(n = 3, base = 3)
contrasts(urban)
```

Note that when we first ran `contrasts(urban)`, the column names were the names of the levels. But after changing the reference group, the column names are just the numbers 1 and 2. To help interpret the `lm` output, it is helpful to name the contrasts appropriately.

```{r}
# Naming our new contrasts
colnames(contrasts(urban)) <- c("Rural", "Suburban")
contrasts(urban)
```

Now we are ready to run our regression again, this time using a different reference group.

```{r}
mod3 <- lm(achrdg08 ~ urban)
summary(mod3)
```

Compared to the output from `mod2`, note that

* The intercept now represents the mean reading score of the Urban group, because this is the new reference group.
* The regression coefficients now represent the mean differences between the indicated groups and the new reference group.
* The R-squared and F-test stay the same -- in other words, the total amount of variation explained by `urban` does not change just because we changed the reference group.

### Deviation coding

It is possible to change R's default contrast coding to one of the other built-in contrasts (see `help(contrasts)` for more information on the built-in contrasts). For instance, to change to deviation coding, we use R's `contr.sum` function and tell it how many levels there are for the factor (`n`). In deviation coding, the intercept is equal to the unweighted mean of the predicted values, and the regression coefficients are the differences between the indicated groups and the unweighted mean.

```{r}
contrasts(urban) <- contr.sum(n = 3)
contrasts(urban)

# As above, it is helpful to name the contrasts using "colnames"
colnames(contrasts(urban)) <- c("Rural", "Suburban")
```

Now we are all set to use deviation coding with `lm`.

```{r}
mod4 <- lm(achrdg08 ~ urban)
summary(mod4)
```

Note the following things about the output:

* The regression coefficients compare each group's mean to the unweighted mean of the groups. Rural and Suburban students are below the unweighted mean, but the difference is only significant for Rural.
* Although the output looks similar to that of `mod3`, the coefficients are all different and they all have different interpretations.
The `lm` output doesn't tell us this; we have to know what is going on under the hood.
* The R-squared does not change from `mod2` -- again, the type of contrast coding used doesn't affect how much variation is explained by the predictor.

### Extra: Relation to ANOVA

As our next exercise, let's compare the output of `lm` and the output of `aov` -- R's function for Analysis of Variance. If regression and ANOVA are really doing the same thing, we should be able to illustrate it with these two functions. Note that if you check out `help(aov)`, it explicitly states that `aov` uses the `lm` function, so finding that the two approaches give similar output shouldn't be a big surprise!

```{r}
# Run our model as an ANOVA
aov1 <- aov(achrdg08 ~ urban)

# Compare the output with lm
summary(aov1)
summary(mod2)
```

If we compare the output from `aov` to the F-test reported by `lm`, we see that the F-statistic, degrees of freedom, and p-value are all identical. If we compute eta-squared from `aov`, we also find that it is equal to the R-squared value from `lm`.

```{r}
# Compute eta-squared from the aov output
741 / (741 + 38163)
```

In short, ANOVA and regression are doing the same thing: R-squared is the same as eta-squared, and the ANOVA omnibus F-test is the same as the F-test of R-squared.

### Extra: Weighted deviation coding with all groups

This section presents a "hack" for addressing the two main issues with deviation coding noted in @sec-deviation-coding-4. This is a hack in the sense that we are working around R's usual procedures. The result of this hack is to provide deviation coding in which comparisons are made to the grand mean on the outcome for all $C$ categories, not just $C - 1$ categories. As a side effect, the F-test for the R-squared statistic reported in `summary(lm)` is no longer correct, so we show how to fix that too.

First, note that if we omit the intercept term, all of the reference-group coded dummies can be included in the model, and the regression coefficients now correspond to the means of each group. We omit the intercept using `-1` in the model formula.

```{r}
# Omit the intercept using -1
mod5 <- lm(achrdg08 ~ -1 + urban)
summary(mod5)
```

Note the following things about the output:

* The coefficients are **no longer interpreted as mean differences**, just the raw means.
* The R-squared and F-test **do** change from `mod2`. When the intercept is omitted, R-squared can no longer be interpreted as a proportion of variance unless the $Y$ variable is centered. Omitting the intercept also changes the degrees of freedom of the F-test, because there are now 3 predictors instead of 2. In general, when the intercept is omitted, R-squared and its F-test do not have the usual interpretation, so we need to be careful.

By itself, the "group-means" approach to contrast coding is not very interesting -- we don't usually want to test whether the group means are different from zero. However, this approach can be used to provide an alternative to deviation coding if we center the $Y$ variable before using the above approach. After centering, the grand mean of $Y$ is zero, so the group means represent deviations from the grand mean (i.e., deviations from zero), and the tests of the regression coefficients are tests of whether the group means are different from the grand mean.
```{r}
# Center the outcome variable
dev_achrdg08 <- achrdg08 - mean(achrdg08)

# Run the regression with group-mean coding
mod6 <- lm(dev_achrdg08 ~ -1 + urban)
summary(mod6)
```

Note the following things about the output:

* The regression coefficients compare each group's mean to the overall mean on the outcome. Urban students are significantly above average; Rural and Suburban students are below average, but the difference is not significant.
* The R-squared is the same as for `mod2` because the $Y$ variable is centered, but the F-test is not the same as for `mod2`. The next two paragraphs explain why, but the explanation requires some material we won't discuss until @sec-chap-6.

The F-test of R-squared can be computed by comparing two different models, a reference model with no predictors and a focal model with the predictors. The reference model used by `lm` depends on whether the intercept is included in the focal model. If the focal model has an intercept, the reference model also has an intercept. If the focal model omits the intercept, the reference model also omits the intercept.

In the present context, the focal model omits the intercept and uses $C$ dummies. This model has the same R-squared as if we had used the intercept and $C - 1$ dummies. However, the F-statistic for the test of R-squared (as reported by `summary(lm)`) is different for the two approaches, because of the choice of reference model. You can change the reference model, but it takes a bit of fiddling around. If you are interested, just ask in class.

What all of this means is that we can compute the "correct" F-test of R-squared by comparing `mod6` to a model with just the intercept. The following code shows how to do this.

```{r}
# A model with only the intercept
mod7 <- lm(dev_achrdg08 ~ 1)

# Compute the F-test comparing mod6 to mod7
anova(mod6, mod7)
```

This is the same F-test reported for `mod2`, `mod3`, and `mod4` above.

```{r, echo = F}
detach(NELS)
```