Filtering Rows

Instead of just sorting the rows in your data, you might want to filter out rows based on a set of conditions. You can do this with the filter() function, which uses the following syntax:

Syntax

tidyverse::filter(df, condition1, condition2, condition3, ...)

  • Required arguments

    • df: The tibble (data frame) with the data you would like to filter.

    • condition1: The logical condition that identifies the rows you would like to keep.

  • Optional arguments

    • condition2, condition3, ...: Any additional conditions that identify the rows you would like to keep.

The conditions specified in filter() can use a variety of comparison operators:

  • > (greater than), < (less than)

  • >= (greater than or equal to), <= (less than or equal to)

  • == (equal to); note that a single equals sign (=) will not work

  • != (not equal to)

For example, imagine we wanted to create a new data frame with only those employees who are retired. We could do this with filter() by writing a condition that specifies that the Retired variable equals TRUE:

retiredEmployees <- filter(employees, Retired == TRUE)
head(retiredEmployees)
dim(retiredEmployees)
IDNameGenderAgeRatingDegreeStart_DateRetiredDivisionSalary
1933 Medina, Brandy Female 62 7 Associate's 1979-02-23 TRUE Sales NA
9259 Armantrout, AllisonFemale 61 7 Associate's 1980-02-23 TRUE Operations NA
6223 Ali, Michelle Female 60 7 Ph.D 1982-02-23 TRUE Corporate NA
5955 al-Younes, Ashqar Male 63 7 High School 1979-02-23 TRUE Sales NA
6195 Tolbert, Kinae Female 60 8 High School 1983-02-23 TRUE Sales NA
9620 Medina, Rogelio Male 61 2 Master's 1980-02-23 TRUE Accounting NA
  1. 80
  2. 10
dim(retiredEmployees)
[1] 80 10

By specifying multiple conditions in our call to filter(), we can filter by more than one rule. Let’s say we wanted a data set with employees who:

  • Are still working, and

  • Started on or after January 1, 1995, and

  • Are not in human resources.

employeesSub <- filter(employees, 
                       Retired == FALSE, Start_Date >= "1995-01-01", Division != "Operations")
head(employeesSub)
dim(employeesSub)
IDNameGenderAgeRatingDegreeStart_DateRetiredDivisionSalary
2671 Lewis, Austin Male 34 4 Ph.D 2007-02-23 FALSE Engineering 182343
7821 Holleman, ShaquaishaFemale 43 8 Master's 1999-02-23 FALSE Human Resources 149468
5915 Rogers, Arielle Female 27 2 Bachelor's 2013-02-23 FALSE Human Resources 79183
9871 Sontag, Taylor Female 30 7 Ph.D 2011-02-23 FALSE Engineering 164384
2828 el-Ullah, Saabira Female 42 5 Bachelor's 1999-02-23 FALSE Sales 138973
2836 Ochoa, Luis Male 41 7 Bachelor's 1999-02-23 FALSE Corporate 152011
  1. 376
  2. 10
dim(employeesSub)
[1] 376  10

When you list multiple conditions in filter(), those conditions are combined with “and”. In the previous example, our new data frame employeesSub contains the r nrow(employeesSub) employees who are still working, and who started on or after January 1, 1995, and who are not in human resources.

However, you might want to filter based on one condition or another. For example, imagine we wanted to find all employees who have a master’s, or who started on or before December 31, 2000, or who make less than $100,000. To do this, instead of listing each condition as a separate argument, we combine the conditions with the | character, which evaluates to “or”. For example:

employeesSubOr <- filter(employees, 
                         Degree == "Master's" | Start_Date <= "2000-12-31" | Salary < 100000)
head(employeesSubOr)
dim(employeesSubOr)
IDNameGenderAgeRatingDegreeStart_DateRetiredDivisionSalary
6881 al-Rahimi, Tayyiba Female 51 10 High School 1990-02-23 FALSE Operations 108804
8925 el-Jaffer, Manaal Female 50 10 Master's 1991-02-23 FALSE Engineering 206770
2769 Soto, Michael Male 52 10 High School 1987-02-23 FALSE Sales 183407
2658 al-Ebrahimi, MamoonMale 55 8 Ph.D 1985-02-23 FALSE Corporate 236240
1933 Medina, Brandy Female 62 7 Associate's 1979-02-23 TRUE Sales NA
3570 Troftgruben, TierraFemale 47 8 High School 1995-02-23 FALSE Operations 101138
  1. 741
  2. 10
dim(employeesSubOr)
[1] 741  10

Now let’s create a data frame with the employees who have a Master’s or a Ph.D. We could do this using the or operator |:

employeesGrad <- filter(employees, Degree == "Master's" | Degree == "Ph.D")
head(employeesGrad)
dim(employeesGrad)
IDNameGenderAgeRatingDegreeStart_DateRetiredDivisionSalary
2671 Lewis, Austin Male 34 4 Ph.D 2007-02-23 FALSE Engineering 182343
8925 el-Jaffer, Manaal Female 50 10 Master's 1991-02-23 FALSE Engineering 206770
2658 al-Ebrahimi, Mamoon Male 55 8 Ph.D 1985-02-23 FALSE Corporate 236240
7821 Holleman, ShaquaishaFemale 43 8 Master's 1999-02-23 FALSE Human Resources 149468
9871 Sontag, Taylor Female 30 7 Ph.D 2011-02-23 FALSE Engineering 164384
2687 Benavidez, Juan Male 45 8 Ph.D 1997-02-23 FALSE Engineering 181924
  1. 400
  2. 10
dim(employeesGrad)
[1] 400  10

Now imagine we wanted to find all employees who have a Master’s, a Ph.D, or a Bachelor’s. We could add another | to our condition and specify that Degree == "Bachelor's". Alternatively, we could make our code more compact by re-writing the condition as var_name %in% values. This will filter to only those rows where var_name is equal to one of the values specified in the atomic vector values. For example:

employeesCollege <- filter(employees, Degree %in% c("Bachelor's", "Master's", "Ph.D"))
head(employeesCollege)
dim(employeesCollege)
IDNameGenderAgeRatingDegreeStart_DateRetiredDivisionSalary
2671 Lewis, Austin Male 34 4 Ph.D 2007-02-23 FALSE Engineering 182343
8925 el-Jaffer, Manaal Female 50 10 Master's 1991-02-23 FALSE Engineering 206770
2658 al-Ebrahimi, Mamoon Male 55 8 Ph.D 1985-02-23 FALSE Corporate 236240
7821 Holleman, ShaquaishaFemale 43 8 Master's 1999-02-23 FALSE Human Resources 149468
5915 Rogers, Arielle Female 27 2 Bachelor's 2013-02-23 FALSE Human Resources 79183
9871 Sontag, Taylor Female 30 7 Ph.D 2011-02-23 FALSE Engineering 164384
  1. 600
  2. 10
dim(employeesCollege)
[1] 600  10

Warning

Be careful filtering data when you have missing values (NA).

The filter() function keeps only those rows where the specified condition(s) evaluate(s) to TRUE. This is complicated by the presence of missing values, as it is impossible to determine whether a condition is TRUE or FALSE if the relevant information is missing. In our employees data set, we are missing a Salary for around 5% of the employees. If we wanted to filter to only those individuals who made more than $100,000, how would filter() treat the 5% of employees with NA values for Salary? Because the condition does not evaluate to TRUE for these rows, they are dropped.

employees100k <- filter(employees, Salary >= 100000)
head(employees100k)
IDNameGenderAgeRatingDegreeStart_DateRetiredDivisionSalary
6881 al-Rahimi, Tayyiba Female 51 10 High School 1990-02-23 FALSE Operations 108804
2671 Lewis, Austin Male 34 4 Ph.D 2007-02-23 FALSE Engineering 182343
8925 el-Jaffer, Manaal Female 50 10 Master's 1991-02-23 FALSE Engineering 206770
2769 Soto, Michael Male 52 10 High School 1987-02-23 FALSE Sales 183407
2658 al-Ebrahimi, MamoonMale 55 8 Ph.D 1985-02-23 FALSE Corporate 236240
3570 Troftgruben, TierraFemale 47 8 High School 1995-02-23 FALSE Operations 101138

In this example, employees100k would not include any of the employees whose salary is unknown. Note that R provides no warning that these rows are being excluded as well; it is up to the user to recognize that the data contains missing values, and that this will affect how data are filtered.

If you did not want R to drop the missing values, you could explicitly state in the condition to keep all rows where Salary is greater than or equal to $100,000, or where Salary is missing. As we saw before, we can combine conditions with “or” using the | character. It may be tempting to assume we should add Salary == NA in order to capture the rows where Salary is missing; however, this will not work. We cannot use logical operators like == to compare something to NA. Instead have to use the is.na(), as follows:

employees100kNA <- filter(employees, Salary >= 100000 | is.na(Salary))
head(employees100kNA)
IDNameGenderAgeRatingDegreeStart_DateRetiredDivisionSalary
6881 al-Rahimi, Tayyiba Female 51 10 High School 1990-02-23 FALSE Operations 108804
2671 Lewis, Austin Male 34 4 Ph.D 2007-02-23 FALSE Engineering 182343
8925 el-Jaffer, Manaal Female 50 10 Master's 1991-02-23 FALSE Engineering 206770
2769 Soto, Michael Male 52 10 High School 1987-02-23 FALSE Sales 183407
2658 al-Ebrahimi, MamoonMale 55 8 Ph.D 1985-02-23 FALSE Corporate 236240
1933 Medina, Brandy Female 62 7 Associate's 1979-02-23 TRUE Sales NA
tail(employees100kNA)
IDNameGenderAgeRatingDegreeStart_DateRetiredDivisionSalary
5271 Leschinsky, Johanna Female 36 4 Ph.D 2004-02-23 FALSE Corporate 160364
6681 Bruns, Austin Male 41 8 Ph.D 2002-02-23 FALSE Engineering 188656
2031 Martinez, Caleb Male 57 8 Ph.D 1984-02-23 FALSE Engineering 218430
3239 Larson, Trusten Male 37 5 Bachelor's 2002-02-23 FALSE Human Resources 149789
3717 Levy-Minter, QuintinMale 53 10 Bachelor's 1989-02-23 FALSE Operations 172703
4209 Dena, Gianna Female 49 6 Master's 1991-02-23 FALSE Accounting 185445