Filtering Rows¶
Instead of just sorting the rows in your data, you might want to filter out rows based on a set of conditions. You can do this with the filter()
function, which uses the following syntax:
Syntax
tidyverse::filter(df, condition1, condition2, condition3, ...)
Required arguments
df
: The tibble (data frame) with the data you would like to filter.condition1
: The logical condition that identifies the rows you would like to keep.
Optional arguments
condition2, condition3, ...
: Any additional conditions that identify the rows you would like to keep.
The conditions specified in filter()
can use a variety of comparison operators:
>
(greater than),<
(less than)>=
(greater than or equal to),<=
(less than or equal to)==
(equal to); note that a single equals sign (=
) will not work!=
(not equal to)
For example, imagine we wanted to create a new data frame with only those employees who are retired. We could do this with filter()
by writing a condition that specifies that the Retired
variable equals TRUE
:
retiredEmployees <- filter(employees, Retired == TRUE)
head(retiredEmployees)
dim(retiredEmployees)
ID | Name | Gender | Age | Rating | Degree | Start_Date | Retired | Division | Salary |
---|---|---|---|---|---|---|---|---|---|
1933 | Medina, Brandy | Female | 62 | 7 | Associate's | 1979-02-23 | TRUE | Sales | NA |
9259 | Armantrout, Allison | Female | 61 | 7 | Associate's | 1980-02-23 | TRUE | Operations | NA |
6223 | Ali, Michelle | Female | 60 | 7 | Ph.D | 1982-02-23 | TRUE | Corporate | NA |
5955 | al-Younes, Ashqar | Male | 63 | 7 | High School | 1979-02-23 | TRUE | Sales | NA |
6195 | Tolbert, Kinae | Female | 60 | 8 | High School | 1983-02-23 | TRUE | Sales | NA |
9620 | Medina, Rogelio | Male | 61 | 2 | Master's | 1980-02-23 | TRUE | Accounting | NA |
- 80
- 10
dim(retiredEmployees)
[1] 80 10
By specifying multiple conditions in our call to filter()
, we can filter by more than one rule. Let’s say we wanted a data set with employees who:
Are still working, and
Started on or after January 1, 1995, and
Are not in human resources.
employeesSub <- filter(employees,
Retired == FALSE, Start_Date >= "1995-01-01", Division != "Operations")
head(employeesSub)
dim(employeesSub)
ID | Name | Gender | Age | Rating | Degree | Start_Date | Retired | Division | Salary |
---|---|---|---|---|---|---|---|---|---|
2671 | Lewis, Austin | Male | 34 | 4 | Ph.D | 2007-02-23 | FALSE | Engineering | 182343 |
7821 | Holleman, Shaquaisha | Female | 43 | 8 | Master's | 1999-02-23 | FALSE | Human Resources | 149468 |
5915 | Rogers, Arielle | Female | 27 | 2 | Bachelor's | 2013-02-23 | FALSE | Human Resources | 79183 |
9871 | Sontag, Taylor | Female | 30 | 7 | Ph.D | 2011-02-23 | FALSE | Engineering | 164384 |
2828 | el-Ullah, Saabira | Female | 42 | 5 | Bachelor's | 1999-02-23 | FALSE | Sales | 138973 |
2836 | Ochoa, Luis | Male | 41 | 7 | Bachelor's | 1999-02-23 | FALSE | Corporate | 152011 |
- 376
- 10
dim(employeesSub)
[1] 376 10
When you list multiple conditions in filter()
, those conditions are combined with “and”. In the previous example, our new data frame employeesSub
contains the r nrow(employeesSub)
employees who are still working, and who started on or after January 1, 1995, and who are not in human resources.
However, you might want to filter based on one condition or another. For example, imagine we wanted to find all employees who have a master’s, or who started on or before December 31, 2000, or who make less than $100,000. To do this, instead of listing each condition as a separate argument, we combine the conditions with the |
character, which evaluates to “or”. For example:
employeesSubOr <- filter(employees,
Degree == "Master's" | Start_Date <= "2000-12-31" | Salary < 100000)
head(employeesSubOr)
dim(employeesSubOr)
ID | Name | Gender | Age | Rating | Degree | Start_Date | Retired | Division | Salary |
---|---|---|---|---|---|---|---|---|---|
6881 | al-Rahimi, Tayyiba | Female | 51 | 10 | High School | 1990-02-23 | FALSE | Operations | 108804 |
8925 | el-Jaffer, Manaal | Female | 50 | 10 | Master's | 1991-02-23 | FALSE | Engineering | 206770 |
2769 | Soto, Michael | Male | 52 | 10 | High School | 1987-02-23 | FALSE | Sales | 183407 |
2658 | al-Ebrahimi, Mamoon | Male | 55 | 8 | Ph.D | 1985-02-23 | FALSE | Corporate | 236240 |
1933 | Medina, Brandy | Female | 62 | 7 | Associate's | 1979-02-23 | TRUE | Sales | NA |
3570 | Troftgruben, Tierra | Female | 47 | 8 | High School | 1995-02-23 | FALSE | Operations | 101138 |
- 741
- 10
dim(employeesSubOr)
[1] 741 10
Now let’s create a data frame with the employees who have a Master’s or a Ph.D. We could do this using the or operator |
:
employeesGrad <- filter(employees, Degree == "Master's" | Degree == "Ph.D")
head(employeesGrad)
dim(employeesGrad)
ID | Name | Gender | Age | Rating | Degree | Start_Date | Retired | Division | Salary |
---|---|---|---|---|---|---|---|---|---|
2671 | Lewis, Austin | Male | 34 | 4 | Ph.D | 2007-02-23 | FALSE | Engineering | 182343 |
8925 | el-Jaffer, Manaal | Female | 50 | 10 | Master's | 1991-02-23 | FALSE | Engineering | 206770 |
2658 | al-Ebrahimi, Mamoon | Male | 55 | 8 | Ph.D | 1985-02-23 | FALSE | Corporate | 236240 |
7821 | Holleman, Shaquaisha | Female | 43 | 8 | Master's | 1999-02-23 | FALSE | Human Resources | 149468 |
9871 | Sontag, Taylor | Female | 30 | 7 | Ph.D | 2011-02-23 | FALSE | Engineering | 164384 |
2687 | Benavidez, Juan | Male | 45 | 8 | Ph.D | 1997-02-23 | FALSE | Engineering | 181924 |
- 400
- 10
dim(employeesGrad)
[1] 400 10
Now imagine we wanted to find all employees who have a Master’s, a Ph.D, or a Bachelor’s. We could add another |
to our condition and specify that Degree == "Bachelor's"
. Alternatively, we could make our code more compact by re-writing the condition as var_name %in% values
. This will filter to only those rows where var_name
is equal to one of the values specified in the atomic vector values
. For example:
employeesCollege <- filter(employees, Degree %in% c("Bachelor's", "Master's", "Ph.D"))
head(employeesCollege)
dim(employeesCollege)
ID | Name | Gender | Age | Rating | Degree | Start_Date | Retired | Division | Salary |
---|---|---|---|---|---|---|---|---|---|
2671 | Lewis, Austin | Male | 34 | 4 | Ph.D | 2007-02-23 | FALSE | Engineering | 182343 |
8925 | el-Jaffer, Manaal | Female | 50 | 10 | Master's | 1991-02-23 | FALSE | Engineering | 206770 |
2658 | al-Ebrahimi, Mamoon | Male | 55 | 8 | Ph.D | 1985-02-23 | FALSE | Corporate | 236240 |
7821 | Holleman, Shaquaisha | Female | 43 | 8 | Master's | 1999-02-23 | FALSE | Human Resources | 149468 |
5915 | Rogers, Arielle | Female | 27 | 2 | Bachelor's | 2013-02-23 | FALSE | Human Resources | 79183 |
9871 | Sontag, Taylor | Female | 30 | 7 | Ph.D | 2011-02-23 | FALSE | Engineering | 164384 |
- 600
- 10
dim(employeesCollege)
[1] 600 10
Warning
Be careful filtering data when you have missing values (NA
).
The filter()
function keeps only those rows where the specified condition(s) evaluate(s) to TRUE
. This is complicated by the presence of missing values, as it is impossible to determine whether a condition is TRUE
or FALSE
if the relevant information is missing. In our employees data set, we are missing a Salary
for around 5% of the employees. If we wanted to filter to only those individuals who made more than $100,000, how would filter()
treat the 5% of employees with NA
values for Salary
? Because the condition does not evaluate to TRUE
for these rows, they are dropped.
employees100k <- filter(employees, Salary >= 100000)
head(employees100k)
ID | Name | Gender | Age | Rating | Degree | Start_Date | Retired | Division | Salary |
---|---|---|---|---|---|---|---|---|---|
6881 | al-Rahimi, Tayyiba | Female | 51 | 10 | High School | 1990-02-23 | FALSE | Operations | 108804 |
2671 | Lewis, Austin | Male | 34 | 4 | Ph.D | 2007-02-23 | FALSE | Engineering | 182343 |
8925 | el-Jaffer, Manaal | Female | 50 | 10 | Master's | 1991-02-23 | FALSE | Engineering | 206770 |
2769 | Soto, Michael | Male | 52 | 10 | High School | 1987-02-23 | FALSE | Sales | 183407 |
2658 | al-Ebrahimi, Mamoon | Male | 55 | 8 | Ph.D | 1985-02-23 | FALSE | Corporate | 236240 |
3570 | Troftgruben, Tierra | Female | 47 | 8 | High School | 1995-02-23 | FALSE | Operations | 101138 |
In this example, employees100k
would not include any of the employees whose salary is unknown. Note that R provides no warning that these rows are being excluded as well; it is up to the user to recognize that the data contains missing values, and that this will affect how data are filtered.
If you did not want R to drop the missing values, you could explicitly state in the condition to keep all rows where Salary
is greater than or equal to $100,000, or where Salary
is missing. As we saw before, we can combine conditions with “or” using the |
character. It may be tempting to assume we should add Salary == NA
in order to capture the rows where Salary
is missing; however, this will not work. We cannot use logical operators like ==
to compare something to NA
. Instead have to use the is.na()
, as follows:
employees100kNA <- filter(employees, Salary >= 100000 | is.na(Salary))
head(employees100kNA)
ID | Name | Gender | Age | Rating | Degree | Start_Date | Retired | Division | Salary |
---|---|---|---|---|---|---|---|---|---|
6881 | al-Rahimi, Tayyiba | Female | 51 | 10 | High School | 1990-02-23 | FALSE | Operations | 108804 |
2671 | Lewis, Austin | Male | 34 | 4 | Ph.D | 2007-02-23 | FALSE | Engineering | 182343 |
8925 | el-Jaffer, Manaal | Female | 50 | 10 | Master's | 1991-02-23 | FALSE | Engineering | 206770 |
2769 | Soto, Michael | Male | 52 | 10 | High School | 1987-02-23 | FALSE | Sales | 183407 |
2658 | al-Ebrahimi, Mamoon | Male | 55 | 8 | Ph.D | 1985-02-23 | FALSE | Corporate | 236240 |
1933 | Medina, Brandy | Female | 62 | 7 | Associate's | 1979-02-23 | TRUE | Sales | NA |
tail(employees100kNA)
ID | Name | Gender | Age | Rating | Degree | Start_Date | Retired | Division | Salary |
---|---|---|---|---|---|---|---|---|---|
5271 | Leschinsky, Johanna | Female | 36 | 4 | Ph.D | 2004-02-23 | FALSE | Corporate | 160364 |
6681 | Bruns, Austin | Male | 41 | 8 | Ph.D | 2002-02-23 | FALSE | Engineering | 188656 |
2031 | Martinez, Caleb | Male | 57 | 8 | Ph.D | 1984-02-23 | FALSE | Engineering | 218430 |
3239 | Larson, Trusten | Male | 37 | 5 | Bachelor's | 2002-02-23 | FALSE | Human Resources | 149789 |
3717 | Levy-Minter, Quintin | Male | 53 | 10 | Bachelor's | 1989-02-23 | FALSE | Operations | 172703 |
4209 | Dena, Gianna | Female | 49 | 6 | Master's | 1991-02-23 | FALSE | Accounting | 185445 |