Filtering Rows¶

Instead of just sorting the rows in your data, you might want to filter out rows based on a set of conditions. You can do this with the filter() function, which uses the following syntax:

Syntax

tidyverse::filter(df, condition1, condition2, condition3, ...)

Required arguments
- df: The tibble (data frame) with the data you would like to filter.
- condition1: The logical condition that identifies the rows you would like to keep.
Optional arguments
- condition2, condition3, ...: Any additional conditions that identify the rows you would like to keep.

The conditions specified in filter() can use a variety of comparison operators:

> (greater than), < (less than)
>= (greater than or equal to), <= (less than or equal to)
== (equal to); note that a single equals sign (=) will not work
!= (not equal to)

For example, imagine we wanted to create a new data frame with only those employees who are retired. We could do this with filter() by writing a condition that specifies that the Retired variable equals TRUE:

retiredEmployees <- filter(employees, Retired == TRUE)
head(retiredEmployees)
dim(retiredEmployees)

ID	Name	Gender	Age	Rating	Degree	Start_Date	Retired	Division	Salary
1933	Medina, Brandy	Female	62	7	Associate's	1979-02-23	TRUE	Sales	NA
9259	Armantrout, Allison	Female	61	7	Associate's	1980-02-23	TRUE	Operations	NA
6223	Ali, Michelle	Female	60	7	Ph.D	1982-02-23	TRUE	Corporate	NA
5955	al-Younes, Ashqar	Male	63	7	High School	1979-02-23	TRUE	Sales	NA
6195	Tolbert, Kinae	Female	60	8	High School	1983-02-23	TRUE	Sales	NA
9620	Medina, Rogelio	Male	61	2	Master's	1980-02-23	TRUE	Accounting	NA

80
10

dim(retiredEmployees)

[1] 80 10

By specifying multiple conditions in our call to filter(), we can filter by more than one rule. Let’s say we wanted a data set with employees who:

Are still working, and
Started on or after January 1, 1995, and
Are not in human resources.

employeesSub <- filter(employees, 
                       Retired == FALSE, Start_Date >= "1995-01-01", Division != "Operations")
head(employeesSub)
dim(employeesSub)

ID	Name	Gender	Age	Rating	Degree	Start_Date	Retired	Division	Salary
2671	Lewis, Austin	Male	34	4	Ph.D	2007-02-23	FALSE	Engineering	182343
7821	Holleman, Shaquaisha	Female	43	8	Master's	1999-02-23	FALSE	Human Resources	149468
5915	Rogers, Arielle	Female	27	2	Bachelor's	2013-02-23	FALSE	Human Resources	79183
9871	Sontag, Taylor	Female	30	7	Ph.D	2011-02-23	FALSE	Engineering	164384
2828	el-Ullah, Saabira	Female	42	5	Bachelor's	1999-02-23	FALSE	Sales	138973
2836	Ochoa, Luis	Male	41	7	Bachelor's	1999-02-23	FALSE	Corporate	152011

376
10

dim(employeesSub)

[1] 376  10

When you list multiple conditions in filter(), those conditions are combined with “and”. In the previous example, our new data frame employeesSub contains the r nrow(employeesSub) employees who are still working, and who started on or after January 1, 1995, and who are not in human resources.

However, you might want to filter based on one condition or another. For example, imagine we wanted to find all employees who have a master’s, or who started on or before December 31, 2000, or who make less than $100,000. To do this, instead of listing each condition as a separate argument, we combine the conditions with the | character, which evaluates to “or”. For example:

employeesSubOr <- filter(employees, 
                         Degree == "Master's" | Start_Date <= "2000-12-31" | Salary < 100000)
head(employeesSubOr)
dim(employeesSubOr)

ID	Name	Gender	Age	Rating	Degree	Start_Date	Retired	Division	Salary
6881	al-Rahimi, Tayyiba	Female	51	10	High School	1990-02-23	FALSE	Operations	108804
8925	el-Jaffer, Manaal	Female	50	10	Master's	1991-02-23	FALSE	Engineering	206770
2769	Soto, Michael	Male	52	10	High School	1987-02-23	FALSE	Sales	183407
2658	al-Ebrahimi, Mamoon	Male	55	8	Ph.D	1985-02-23	FALSE	Corporate	236240
1933	Medina, Brandy	Female	62	7	Associate's	1979-02-23	TRUE	Sales	NA
3570	Troftgruben, Tierra	Female	47	8	High School	1995-02-23	FALSE	Operations	101138

741
10

dim(employeesSubOr)

[1] 741  10

Now let’s create a data frame with the employees who have a Master’s or a Ph.D. We could do this using the or operator |:

employeesGrad <- filter(employees, Degree == "Master's" | Degree == "Ph.D")
head(employeesGrad)
dim(employeesGrad)

ID	Name	Gender	Age	Rating	Degree	Start_Date	Retired	Division	Salary
2671	Lewis, Austin	Male	34	4	Ph.D	2007-02-23	FALSE	Engineering	182343
8925	el-Jaffer, Manaal	Female	50	10	Master's	1991-02-23	FALSE	Engineering	206770
2658	al-Ebrahimi, Mamoon	Male	55	8	Ph.D	1985-02-23	FALSE	Corporate	236240
7821	Holleman, Shaquaisha	Female	43	8	Master's	1999-02-23	FALSE	Human Resources	149468
9871	Sontag, Taylor	Female	30	7	Ph.D	2011-02-23	FALSE	Engineering	164384
2687	Benavidez, Juan	Male	45	8	Ph.D	1997-02-23	FALSE	Engineering	181924

400
10

dim(employeesGrad)

[1] 400  10

Now imagine we wanted to find all employees who have a Master’s, a Ph.D, or a Bachelor’s. We could add another | to our condition and specify that Degree == "Bachelor's". Alternatively, we could make our code more compact by re-writing the condition as var_name %in% values. This will filter to only those rows where var_name is equal to one of the values specified in the atomic vector values. For example:

employeesCollege <- filter(employees, Degree %in% c("Bachelor's", "Master's", "Ph.D"))
head(employeesCollege)
dim(employeesCollege)

ID	Name	Gender	Age	Rating	Degree	Start_Date	Retired	Division	Salary
2671	Lewis, Austin	Male	34	4	Ph.D	2007-02-23	FALSE	Engineering	182343
8925	el-Jaffer, Manaal	Female	50	10	Master's	1991-02-23	FALSE	Engineering	206770
2658	al-Ebrahimi, Mamoon	Male	55	8	Ph.D	1985-02-23	FALSE	Corporate	236240
7821	Holleman, Shaquaisha	Female	43	8	Master's	1999-02-23	FALSE	Human Resources	149468
5915	Rogers, Arielle	Female	27	2	Bachelor's	2013-02-23	FALSE	Human Resources	79183
9871	Sontag, Taylor	Female	30	7	Ph.D	2011-02-23	FALSE	Engineering	164384

600
10

dim(employeesCollege)

[1] 600  10

Warning

Be careful filtering data when you have missing values (NA).

The filter() function keeps only those rows where the specified condition(s) evaluate(s) to TRUE. This is complicated by the presence of missing values, as it is impossible to determine whether a condition is TRUE or FALSE if the relevant information is missing. In our employees data set, we are missing a Salary for around 5% of the employees. If we wanted to filter to only those individuals who made more than $100,000, how would filter() treat the 5% of employees with NA values for Salary? Because the condition does not evaluate to TRUE for these rows, they are dropped.

employees100k <- filter(employees, Salary >= 100000)
head(employees100k)

ID	Name	Gender	Age	Rating	Degree	Start_Date	Retired	Division	Salary
6881	al-Rahimi, Tayyiba	Female	51	10	High School	1990-02-23	FALSE	Operations	108804
2671	Lewis, Austin	Male	34	4	Ph.D	2007-02-23	FALSE	Engineering	182343
8925	el-Jaffer, Manaal	Female	50	10	Master's	1991-02-23	FALSE	Engineering	206770
2769	Soto, Michael	Male	52	10	High School	1987-02-23	FALSE	Sales	183407
2658	al-Ebrahimi, Mamoon	Male	55	8	Ph.D	1985-02-23	FALSE	Corporate	236240
3570	Troftgruben, Tierra	Female	47	8	High School	1995-02-23	FALSE	Operations	101138

In this example, employees100k would not include any of the employees whose salary is unknown. Note that R provides no warning that these rows are being excluded as well; it is up to the user to recognize that the data contains missing values, and that this will affect how data are filtered.

If you did not want R to drop the missing values, you could explicitly state in the condition to keep all rows where Salary is greater than or equal to $100,000, or where Salary is missing. As we saw before, we can combine conditions with “or” using the | character. It may be tempting to assume we should add Salary == NA in order to capture the rows where Salary is missing; however, this will not work. We cannot use logical operators like == to compare something to NA. Instead have to use the is.na(), as follows:

employees100kNA <- filter(employees, Salary >= 100000 | is.na(Salary))
head(employees100kNA)

ID	Name	Gender	Age	Rating	Degree	Start_Date	Retired	Division	Salary
6881	al-Rahimi, Tayyiba	Female	51	10	High School	1990-02-23	FALSE	Operations	108804
2671	Lewis, Austin	Male	34	4	Ph.D	2007-02-23	FALSE	Engineering	182343
8925	el-Jaffer, Manaal	Female	50	10	Master's	1991-02-23	FALSE	Engineering	206770
2769	Soto, Michael	Male	52	10	High School	1987-02-23	FALSE	Sales	183407
2658	al-Ebrahimi, Mamoon	Male	55	8	Ph.D	1985-02-23	FALSE	Corporate	236240
1933	Medina, Brandy	Female	62	7	Associate's	1979-02-23	TRUE	Sales	NA

tail(employees100kNA)

ID	Name	Gender	Age	Rating	Degree	Start_Date	Retired	Division	Salary
5271	Leschinsky, Johanna	Female	36	4	Ph.D	2004-02-23	FALSE	Corporate	160364
6681	Bruns, Austin	Male	41	8	Ph.D	2002-02-23	FALSE	Engineering	188656
2031	Martinez, Caleb	Male	57	8	Ph.D	1984-02-23	FALSE	Engineering	218430
3239	Larson, Trusten	Male	37	5	Bachelor's	2002-02-23	FALSE	Human Resources	149789
3717	Levy-Minter, Quintin	Male	53	10	Bachelor's	1989-02-23	FALSE	Operations	172703
4209	Dena, Gianna	Female	49	6	Master's	1991-02-23	FALSE	Accounting	185445

Data Science for Managers

Filtering Rows¶