5.5. Dummy Variables
So far we have been modeling `Salary` as a function of `Age` and `Rating`, which are both quantitative variables. However, there are also categorical variables in our data set that we might want to include in the model. For example, we might be interested in exploring whether there is a gender bias in our salary data. To do this, we need to somehow include `Gender` as an independent variable in our regression. However, this variable is not measured numerically; it takes on the values "Male" or "Female". How, then, can we include it in our regression model?
To do this we need to create a dummy variable: a binary, quantitative variable that is used to represent categorical data. Dummy variables use the numeric quantities 0 and 1 to represent the categories of interest. In our case the categorical variable is `Gender`, so our dummy needs to encode one of the genders as 0 and the other as 1. (Note that it is largely arbitrary which category we assign to 1 and which we assign to 0.) Imagine we created a new variable in our data set called `male_dummy`, which equals 1 if the employee is male and 0 if the employee is female:
```r
employees$male_dummy <- ifelse(employees$Gender == "Male", 1, 0)
head(employees)
```
| ID | Name | Gender | Age | Rating | Degree | Start_Date | Retired | Division | Salary | male_dummy |
|---|---|---|---|---|---|---|---|---|---|---|
| 6881 | al-Rahimi, Tayyiba | Female | 51 | 10 | High School | 1990-02-23 | FALSE | Operations | 108804 | 0 |
| 2671 | Lewis, Austin | Male | 34 | 4 | Ph.D | 2007-02-23 | FALSE | Engineering | 182343 | 1 |
| 8925 | el-Jaffer, Manaal | Female | 50 | 10 | Master's | 1991-02-23 | FALSE | Engineering | 206770 | 0 |
| 2769 | Soto, Michael | Male | 52 | 10 | High School | 1987-02-23 | FALSE | Sales | 183407 | 1 |
| 2658 | al-Ebrahimi, Mamoon | Male | 55 | 8 | Ph.D | 1985-02-23 | FALSE | Corporate | 236240 | 1 |
| 1933 | Medina, Brandy | Female | 62 | 7 | Associate's | 1979-02-23 | TRUE | Sales | NA | 0 |
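To confirm the encoding worked as intended, one quick check (a minimal sketch, assuming the `employees` data frame from above) is to cross-tabulate the new dummy against the original variable:

```r
# Cross-tabulate Gender against male_dummy; every "Male" row should fall
# under the 1 column and every "Female" row under the 0 column
table(employees$Gender, employees$male_dummy)
```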
Now that we’ve created a dummy variable for gender, we can add it to our regression model:
```r
modelMaleDummy <- lm(Salary ~ Age + Rating + male_dummy, data = employees)
summary(modelMaleDummy)
```
```
Call:
lm(formula = Salary ~ Age + Rating + male_dummy, data = employees)

Residuals:
   Min     1Q Median     3Q    Max 
-91483 -21741    803  22130  85908 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 27047.67    5472.39   4.943 9.17e-07 ***
Age          1964.33      92.04  21.343  < 2e-16 ***
Rating       5520.34     526.47  10.486  < 2e-16 ***
male_dummy   8278.08    2017.04   4.104 4.42e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 30570 on 916 degrees of freedom
  (80 observations deleted due to missingness)
Multiple R-squared:  0.4025,	Adjusted R-squared:  0.4006 
F-statistic: 205.7 on 3 and 916 DF,  p-value: < 2.2e-16
```
This produces the following estimated regression equation:

\[\widehat{Salary} = 27047.67 + 1964.33(Age) + 5520.34(Rating) + 8278.08(male\_dummy)\]
The interpretation of the coefficients on `Age` and `Rating` is the same as before; for example, we would say that on average, salary goes up by $5,520.34 for each additional point in an employee's rating, assuming all other variables in the model are kept constant. However, a similar interpretation is not possible for the coefficient on `male_dummy`, as this variable can only take on the values 0 and 1. Instead, we interpret this coefficient as follows: on average, men at the company are paid $8,278.08 more than women at the company, assuming all other variables in the model are kept constant. Because the p-value on this coefficient is quite small, we might conclude from these results that there does appear to be a gender bias in the company's salary data.
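To go beyond a point estimate for this gap, one option (a sketch, assuming `modelMaleDummy` was fit as above) is to compute a confidence interval for the coefficient with `confint()`:

```r
# 95% confidence interval for the male_dummy coefficient; because the
# p-value is small, we expect the interval to exclude zero
confint(modelMaleDummy, "male_dummy", level = 0.95)
```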
To better understand the coefficient on the dummy variable, consider two employees, one male and one female. Assume that they are both 35, and both received an 8 in their last performance evaluation. For the male employee, our model would predict his salary to be:

\[\widehat{Salary} = 27047.67 + 1964.33(35) + 5520.34(8) + 8278.08(1) = \$148,240.02\]
Conversely, for the female employee, our model would predict her salary to be:

\[\widehat{Salary} = 27047.67 + 1964.33(35) + 5520.34(8) + 8278.08(0) = \$139,961.94\]
From this example, we can see that when the other variables are held constant, the difference between the predicted salary for men and women is the value of the coefficient on our dummy variable ($8,278.08).
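We can reproduce this arithmetic with `predict()`. The sketch below assumes the `modelMaleDummy` fit from above; `newEmployees` is a hypothetical data frame holding the two employees:

```r
# Two hypothetical employees: both 35 years old with a rating of 8,
# differing only in the value of male_dummy
newEmployees <- data.frame(Age = c(35, 35), Rating = c(8, 8), male_dummy = c(1, 0))

# The two predictions should differ by exactly the male_dummy coefficient
predict(modelMaleDummy, newdata = newEmployees)
```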
In this example, we manually created a dummy variable for gender (called `male_dummy`) and used that variable in our model. However, this was not actually necessary; the `lm()` function automatically converts any categorical variables you include into dummy variables behind the scenes. If we specify our model as before but include `Gender` instead of `male_dummy`, we get the same results:
```r
modelMaleDummy <- lm(Salary ~ Age + Rating + Gender, data = employees)
summary(modelMaleDummy)
```
```
Call:
lm(formula = Salary ~ Age + Rating + Gender, data = employees)

Residuals:
   Min     1Q Median     3Q    Max 
-91483 -21741    803  22130  85908 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 27047.67    5472.39   4.943 9.17e-07 ***
Age          1964.33      92.04  21.343  < 2e-16 ***
Rating       5520.34     526.47  10.486  < 2e-16 ***
GenderMale   8278.08    2017.04   4.104 4.42e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 30570 on 916 degrees of freedom
  (80 observations deleted due to missingness)
Multiple R-squared:  0.4025,	Adjusted R-squared:  0.4006 
F-statistic: 205.7 on 3 and 916 DF,  p-value: < 2.2e-16
```
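Note that the coefficient is now labeled `GenderMale` rather than `male_dummy`. To see the encoding R chose (a sketch, assuming `Gender` is stored as text or a factor), we can inspect the factor's contrasts; by default, R treats the alphabetically first level ("Female") as the baseline:

```r
# Show the dummy coding R uses for Gender: "Female" (the baseline) maps
# to 0 and "Male" to 1, which is why lm() names the coefficient GenderMale
contrasts(factor(employees$Gender))
```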
Dummy variables are relatively straightforward for binary categorical variables such as `Gender`. But what about a variable like `Degree`, which can take on several different values (`"High School"`, `"Associate's"`, `"Bachelor's"`, `"Master's"`, and `"Ph.D"`)? It might be natural to assume that you could just assign a unique integer to each category. In other words, our dummy for degree could represent "High School" as 0, "Associate's" as 1, "Bachelor's" as 2, "Master's" as 3, and "Ph.D" as 4. However, this method of coding categorical variables is problematic. First, it implies an ordering to the categories that may not be correct. There might be an ordering to `Degree`, but consider the `Division` variable; there is no inherent ordering to the divisions of a company, so any ordering implied by a dummy variable would be arbitrary. Second, this method of coding implies a fixed difference between each category. There is no reason to believe that the difference between an associate's and a bachelor's is the same as the difference between a bachelor's and a master's, for example.
How, then, do we incorporate multinomial categorical variables into our regression model? The answer is by creating separate 0/1 dummy variables for each of the variable's categories. For example, we will need one dummy variable (`DegreeAssociate's`) that equals 1 for observations where `Degree` equals "Associate's", and 0 if not. We will need another dummy variable (`DegreeBachelor's`) that equals 1 for observations where `Degree` equals "Bachelor's" and 0 if not, and so on. The table below shows how all possible values of `Degree` can be represented through four binary dummy variables:
| Degree | DegreeAssociate's | DegreeBachelor's | DegreeMaster's | DegreePh.D |
|---|---|---|---|---|
| High School | 0 | 0 | 0 | 0 |
| Associate's | 1 | 0 | 0 | 0 |
| Bachelor's | 0 | 1 | 0 | 0 |
| Master's | 0 | 0 | 1 | 0 |
| Ph.D | 0 | 0 | 0 | 1 |
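We do not need to build these columns by hand to inspect them. As a sketch (assuming `Degree` is stored as a factor whose first level is "High School", consistent with the regression output below), `model.matrix()`, the function `lm()` uses internally to build its design matrix, will show the dummy columns it generates:

```r
# Build the design matrix for Degree alone; besides the intercept, it
# contains one 0/1 column for each non-baseline degree category
head(model.matrix(~ Degree, data = employees))
```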
Note that we do not need a fifth dummy variable to represent the "High School" category. This is because this information is already implicitly captured in the other four dummy variables; if `DegreeAssociate's`, `DegreeBachelor's`, `DegreeMaster's`, and `DegreePh.D` all equal zero, we know the employee must hold a high school diploma, so there is no need for an additional `DegreeHighSchool` variable. In general, a \(k\)-category variable can be represented with \(k-1\) dummy variables.
Warning
For regression modeling, categorical variables that take on \(k\) values must be converted into \(k-1\) binary dummy variables.
As noted above, the `lm()` command automatically creates dummy variables behind the scenes, so we can simply include `Degree` in our call to `lm()`:
```r
modelDegree <- lm(Salary ~ Age + Rating + Gender + Degree, data = employees)
summary(modelDegree)
```
```
Call:
lm(formula = Salary ~ Age + Rating + Gender + Degree, data = employees)

Residuals:
   Min     1Q Median     3Q    Max 
-64403 -16227    352  15917  70513 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)           4.427   4502.240   0.001 0.999216    
Age                2006.083     70.078  28.627  < 2e-16 ***
Rating             5181.073    401.489  12.905  < 2e-16 ***
GenderMale         8220.111   1532.334   5.364 1.03e-07 ***
DegreeAssociate's  9477.556   2444.091   3.878 0.000113 ***
DegreeBachelor's  33065.808   2426.033  13.630  < 2e-16 ***
DegreeMaster's    40688.574   2410.054  16.883  < 2e-16 ***
DegreePh.D        53730.605   2408.267  22.311  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 23220 on 912 degrees of freedom
  (80 observations deleted due to missingness)
Multiple R-squared:  0.6568,	Adjusted R-squared:  0.6541 
F-statistic: 249.3 on 7 and 912 DF,  p-value: < 2.2e-16
```
To interpret the coefficients on the dummy variables for degree, we must first acknowledge that they are all relative to the implicit baseline category, `"High School"`. The baseline (or reference) category is the one that is not given its own dummy variable; in this case, we do not have a separate dummy for `"High School"`, so it is our baseline. With this in mind, we interpret the coefficients on our dummy variables as follows:
- On average, employees with an Associate's degree are paid $9,477.56 more than employees with a high school diploma, assuming all other variables in the model are kept constant.
- On average, employees with a Bachelor's degree are paid $33,065.81 more than employees with a high school diploma, assuming all other variables in the model are kept constant.
- On average, employees with a Master's degree are paid $40,688.57 more than employees with a high school diploma, assuming all other variables in the model are kept constant.
- On average, employees with a Ph.D are paid $53,730.61 more than employees with a high school diploma, assuming all other variables in the model are kept constant.
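Finally, the baseline category is ours to choose. As a sketch (assuming `Degree` is an unordered factor; `modelDegreePhD` is a hypothetical name), we could re-fit the model with "Ph.D" as the reference using `relevel()`, after which each `Degree` coefficient would measure the average salary difference relative to Ph.D holders:

```r
# Make "Ph.D" the baseline category instead of "High School"
employees$Degree <- relevel(factor(employees$Degree), ref = "Ph.D")

# Re-fit the model; Degree coefficients are now relative to Ph.D holders
modelDegreePhD <- lm(Salary ~ Age + Rating + Gender + Degree, data = employees)
summary(modelDegreePhD)
```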