{ "cells": [ { "cell_type": "code", "execution_count": 1, "id": "039fc568", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Registered S3 methods overwritten by 'ggplot2':\n", " method from \n", " [.quosures rlang\n", " c.quosures rlang\n", " print.quosures rlang\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Registered S3 method overwritten by 'rvest':\n", " method from\n", " read_xml.response xml2\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "-- Attaching packages --------------------------------------- tidyverse 1.2.1 --\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "v ggplot2 3.1.1 v purrr 0.3.2 \n", "v tibble 2.1.1 v dplyr 0.8.0.1\n", "v tidyr 0.8.3 v stringr 1.4.0 \n", "v readr 1.3.1 v forcats 0.4.0 \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "-- Conflicts ------------------------------------------ tidyverse_conflicts() --\n", "x dplyr::filter() masks stats::filter()\n", "x dplyr::lag() masks stats::lag()\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Parsed with column specification:\n", "cols(\n", " ID = col_double(),\n", " Name = col_character(),\n", " Gender = col_character(),\n", " Age = col_double(),\n", " Rating = col_double(),\n", " Degree = col_character(),\n", " Start_Date = col_character(),\n", " Retired = col_logical(),\n", " Division = col_character(),\n", " Salary = col_character()\n", ")\n" ] } ], "source": [ "library(tidyverse)\n", "employees <- read_csv(\"../../_build/data/employee_data.csv\")\n", "employees$Salary <- parse_number(employees$Salary)\n", "employees$Start_Date <- parse_date(employees$Start_Date, format = \"%m/%d/%Y\")" ] }, { "cell_type": "markdown", "id": "1b9b39b3", "metadata": {}, "source": [ "# Filtering Rows\n", "\n", "Instead of just sorting the rows in your data, you might want to filter out rows based on a set of conditions. You can do this with the `filter()` function, which uses the following syntax:\n", "\n", "```{admonition} Syntax\n", "`tidyverse::filter(df, condition1, condition2, condition3, ...)`\n", "+ *Required arguments*\n", " - `df`: The tibble (data frame) with the data you would like to filter. \n", " - `condition1`: The logical condition that identifies the rows you would like to keep.\n", "+ *Optional arguments*\n", " - `condition2, condition3, ...`: Any additional conditions that identify the rows you would like to keep.\n", "```\n", " \n", "The conditions specified in `filter()` can use a variety of **comparison operators**:\n", "\n", "+ `>` (greater than), `<` (less than)\n", "+ `>=` (greater than or equal to), `<=` (less than or equal to)\n", "+ `==` (equal to); note that a single equals sign (`=`) will *not* work\n", "+ `!=` (not equal to)\n", "\n", "For example, imagine we wanted to create a new data frame with only those employees who are retired. We could do this with `filter()` by writing a condition that specifies that the `Retired` variable equals `TRUE`:" ] }, { "cell_type": "code", "execution_count": 2, "id": "2a618398", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
IDNameGenderAgeRatingDegreeStart_DateRetiredDivisionSalary
1933 Medina, Brandy Female 62 7 Associate's 1979-02-23 TRUE Sales NA
9259 Armantrout, AllisonFemale 61 7 Associate's 1980-02-23 TRUE Operations NA
6223 Ali, Michelle Female 60 7 Ph.D 1982-02-23 TRUE Corporate NA
5955 al-Younes, Ashqar Male 63 7 High School 1979-02-23 TRUE Sales NA
6195 Tolbert, Kinae Female 60 8 High School 1983-02-23 TRUE Sales NA
9620 Medina, Rogelio Male 61 2 Master's 1980-02-23 TRUE Accounting NA
\n" ], "text/latex": [ "\\begin{tabular}{r|llllllllll}\n", " ID & Name & Gender & Age & Rating & Degree & Start\\_Date & Retired & Division & Salary\\\\\n", "\\hline\n", "\t 1933 & Medina, Brandy & Female & 62 & 7 & Associate's & 1979-02-23 & TRUE & Sales & NA \\\\\n", "\t 9259 & Armantrout, Allison & Female & 61 & 7 & Associate's & 1980-02-23 & TRUE & Operations & NA \\\\\n", "\t 6223 & Ali, Michelle & Female & 60 & 7 & Ph.D & 1982-02-23 & TRUE & Corporate & NA \\\\\n", "\t 5955 & al-Younes, Ashqar & Male & 63 & 7 & High School & 1979-02-23 & TRUE & Sales & NA \\\\\n", "\t 6195 & Tolbert, Kinae & Female & 60 & 8 & High School & 1983-02-23 & TRUE & Sales & NA \\\\\n", "\t 9620 & Medina, Rogelio & Male & 61 & 2 & Master's & 1980-02-23 & TRUE & Accounting & NA \\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "| ID | Name | Gender | Age | Rating | Degree | Start_Date | Retired | Division | Salary |\n", "|---|---|---|---|---|---|---|---|---|---|\n", "| 1933 | Medina, Brandy | Female | 62 | 7 | Associate's | 1979-02-23 | TRUE | Sales | NA |\n", "| 9259 | Armantrout, Allison | Female | 61 | 7 | Associate's | 1980-02-23 | TRUE | Operations | NA |\n", "| 6223 | Ali, Michelle | Female | 60 | 7 | Ph.D | 1982-02-23 | TRUE | Corporate | NA |\n", "| 5955 | al-Younes, Ashqar | Male | 63 | 7 | High School | 1979-02-23 | TRUE | Sales | NA |\n", "| 6195 | Tolbert, Kinae | Female | 60 | 8 | High School | 1983-02-23 | TRUE | Sales | NA |\n", "| 9620 | Medina, Rogelio | Male | 61 | 2 | Master's | 1980-02-23 | TRUE | Accounting | NA |\n", "\n" ], "text/plain": [ " ID Name Gender Age Rating Degree Start_Date Retired\n", "1 1933 Medina, Brandy Female 62 7 Associate's 1979-02-23 TRUE \n", "2 9259 Armantrout, Allison Female 61 7 Associate's 1980-02-23 TRUE \n", "3 6223 Ali, Michelle Female 60 7 Ph.D 1982-02-23 TRUE \n", "4 5955 al-Younes, Ashqar Male 63 7 High School 1979-02-23 TRUE \n", "5 6195 Tolbert, Kinae Female 60 8 High School 1983-02-23 TRUE \n", "6 9620 Medina, Rogelio Male 61 2 Master's 1980-02-23 TRUE \n", " Division Salary\n", "1 Sales NA \n", "2 Operations NA \n", "3 Corporate NA \n", "4 Sales NA \n", "5 Sales NA \n", "6 Accounting NA " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
    \n", "\t
  1. 80
  2. \n", "\t
  3. 10
  4. \n", "
\n" ], "text/latex": [ "\\begin{enumerate*}\n", "\\item 80\n", "\\item 10\n", "\\end{enumerate*}\n" ], "text/markdown": [ "1. 80\n", "2. 10\n", "\n", "\n" ], "text/plain": [ "[1] 80 10" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "retiredEmployees <- filter(employees, Retired == TRUE)\n", "head(retiredEmployees)\n", "dim(retiredEmployees)" ] }, { "cell_type": "code", "execution_count": 3, "id": "64191079", "metadata": { "tags": [ "remove-output" ] }, "outputs": [ { "data": { "text/html": [ "
    \n", "\t
  1. 80
  2. \n", "\t
  3. 10
  4. \n", "
\n" ], "text/latex": [ "\\begin{enumerate*}\n", "\\item 80\n", "\\item 10\n", "\\end{enumerate*}\n" ], "text/markdown": [ "1. 80\n", "2. 10\n", "\n", "\n" ], "text/plain": [ "[1] 80 10" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "dim(retiredEmployees)" ] }, { "cell_type": "code", "execution_count": 4, "id": "097bc54d", "metadata": { "tags": [ "remove-input" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] 80 10\n" ] } ], "source": [ "print(dim(retiredEmployees))" ] }, { "cell_type": "markdown", "id": "01f5562b", "metadata": {}, "source": [ "By specifying multiple conditions in our call to `filter()`, we can filter by more than one rule. Let's say we wanted a data set with employees who:\n", "\n", "+ Are still working, *and*\n", "+ Started on or after January 1, 1995, *and*\n", "+ Are *not* in human resources." ] }, { "cell_type": "code", "execution_count": 5, "id": "fe960340", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
IDNameGenderAgeRatingDegreeStart_DateRetiredDivisionSalary
2671 Lewis, Austin Male 34 4 Ph.D 2007-02-23 FALSE Engineering 182343
7821 Holleman, ShaquaishaFemale 43 8 Master's 1999-02-23 FALSE Human Resources 149468
5915 Rogers, Arielle Female 27 2 Bachelor's 2013-02-23 FALSE Human Resources 79183
9871 Sontag, Taylor Female 30 7 Ph.D 2011-02-23 FALSE Engineering 164384
2828 el-Ullah, Saabira Female 42 5 Bachelor's 1999-02-23 FALSE Sales 138973
2836 Ochoa, Luis Male 41 7 Bachelor's 1999-02-23 FALSE Corporate 152011
\n" ], "text/latex": [ "\\begin{tabular}{r|llllllllll}\n", " ID & Name & Gender & Age & Rating & Degree & Start\\_Date & Retired & Division & Salary\\\\\n", "\\hline\n", "\t 2671 & Lewis, Austin & Male & 34 & 4 & Ph.D & 2007-02-23 & FALSE & Engineering & 182343 \\\\\n", "\t 7821 & Holleman, Shaquaisha & Female & 43 & 8 & Master's & 1999-02-23 & FALSE & Human Resources & 149468 \\\\\n", "\t 5915 & Rogers, Arielle & Female & 27 & 2 & Bachelor's & 2013-02-23 & FALSE & Human Resources & 79183 \\\\\n", "\t 9871 & Sontag, Taylor & Female & 30 & 7 & Ph.D & 2011-02-23 & FALSE & Engineering & 164384 \\\\\n", "\t 2828 & el-Ullah, Saabira & Female & 42 & 5 & Bachelor's & 1999-02-23 & FALSE & Sales & 138973 \\\\\n", "\t 2836 & Ochoa, Luis & Male & 41 & 7 & Bachelor's & 1999-02-23 & FALSE & Corporate & 152011 \\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "| ID | Name | Gender | Age | Rating | Degree | Start_Date | Retired | Division | Salary |\n", "|---|---|---|---|---|---|---|---|---|---|\n", "| 2671 | Lewis, Austin | Male | 34 | 4 | Ph.D | 2007-02-23 | FALSE | Engineering | 182343 |\n", "| 7821 | Holleman, Shaquaisha | Female | 43 | 8 | Master's | 1999-02-23 | FALSE | Human Resources | 149468 |\n", "| 5915 | Rogers, Arielle | Female | 27 | 2 | Bachelor's | 2013-02-23 | FALSE | Human Resources | 79183 |\n", "| 9871 | Sontag, Taylor | Female | 30 | 7 | Ph.D | 2011-02-23 | FALSE | Engineering | 164384 |\n", "| 2828 | el-Ullah, Saabira | Female | 42 | 5 | Bachelor's | 1999-02-23 | FALSE | Sales | 138973 |\n", "| 2836 | Ochoa, Luis | Male | 41 | 7 | Bachelor's | 1999-02-23 | FALSE | Corporate | 152011 |\n", "\n" ], "text/plain": [ " ID Name Gender Age Rating Degree Start_Date Retired\n", "1 2671 Lewis, Austin Male 34 4 Ph.D 2007-02-23 FALSE \n", "2 7821 Holleman, Shaquaisha Female 43 8 Master's 1999-02-23 FALSE \n", "3 5915 Rogers, Arielle Female 27 2 Bachelor's 2013-02-23 FALSE \n", "4 9871 Sontag, Taylor Female 30 7 Ph.D 2011-02-23 FALSE \n", "5 2828 el-Ullah, Saabira Female 42 5 Bachelor's 1999-02-23 FALSE \n", "6 2836 Ochoa, Luis Male 41 7 Bachelor's 1999-02-23 FALSE \n", " Division Salary\n", "1 Engineering 182343\n", "2 Human Resources 149468\n", "3 Human Resources 79183\n", "4 Engineering 164384\n", "5 Sales 138973\n", "6 Corporate 152011" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
    \n", "\t
  1. 376
  2. \n", "\t
  3. 10
  4. \n", "
\n" ], "text/latex": [ "\\begin{enumerate*}\n", "\\item 376\n", "\\item 10\n", "\\end{enumerate*}\n" ], "text/markdown": [ "1. 376\n", "2. 10\n", "\n", "\n" ], "text/plain": [ "[1] 376 10" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "employeesSub <- filter(employees, \n", " Retired == FALSE, Start_Date >= \"1995-01-01\", Division != \"Operations\")\n", "head(employeesSub)\n", "dim(employeesSub)" ] }, { "cell_type": "code", "execution_count": 6, "id": "b0f1e781", "metadata": { "tags": [ "remove-output" ] }, "outputs": [ { "data": { "text/html": [ "
    \n", "\t
  1. 376
  2. \n", "\t
  3. 10
  4. \n", "
\n" ], "text/latex": [ "\\begin{enumerate*}\n", "\\item 376\n", "\\item 10\n", "\\end{enumerate*}\n" ], "text/markdown": [ "1. 376\n", "2. 10\n", "\n", "\n" ], "text/plain": [ "[1] 376 10" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "dim(employeesSub)" ] }, { "cell_type": "code", "execution_count": 7, "id": "286519d9", "metadata": { "tags": [ "remove-input" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] 376 10\n" ] } ], "source": [ "print(dim(employeesSub))" ] }, { "cell_type": "markdown", "id": "1a6fdf1c", "metadata": {}, "source": [ "When you list multiple conditions in `filter()`, those conditions are combined with \"and\". In the previous example, our new data frame `employeesSub` contains the `r nrow(employeesSub)` employees who are still working, *and* who started on or after January 1, 1995, *and* who are not in human resources. \n", "\n", "However, you might want to filter based on one condition *or* another. For example, imagine we wanted to find all employees who have a master's, *or* who started on or before December 31, 2000, *or* who make less than $100,000. To do this, instead of listing each condition as a separate argument, we combine the conditions with the `|` character, which evaluates to \"or\". For example:" ] }, { "cell_type": "code", "execution_count": 8, "id": "1f43e29a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
IDNameGenderAgeRatingDegreeStart_DateRetiredDivisionSalary
6881 al-Rahimi, Tayyiba Female 51 10 High School 1990-02-23 FALSE Operations 108804
8925 el-Jaffer, Manaal Female 50 10 Master's 1991-02-23 FALSE Engineering 206770
2769 Soto, Michael Male 52 10 High School 1987-02-23 FALSE Sales 183407
2658 al-Ebrahimi, MamoonMale 55 8 Ph.D 1985-02-23 FALSE Corporate 236240
1933 Medina, Brandy Female 62 7 Associate's 1979-02-23 TRUE Sales NA
3570 Troftgruben, TierraFemale 47 8 High School 1995-02-23 FALSE Operations 101138
\n" ], "text/latex": [ "\\begin{tabular}{r|llllllllll}\n", " ID & Name & Gender & Age & Rating & Degree & Start\\_Date & Retired & Division & Salary\\\\\n", "\\hline\n", "\t 6881 & al-Rahimi, Tayyiba & Female & 51 & 10 & High School & 1990-02-23 & FALSE & Operations & 108804 \\\\\n", "\t 8925 & el-Jaffer, Manaal & Female & 50 & 10 & Master's & 1991-02-23 & FALSE & Engineering & 206770 \\\\\n", "\t 2769 & Soto, Michael & Male & 52 & 10 & High School & 1987-02-23 & FALSE & Sales & 183407 \\\\\n", "\t 2658 & al-Ebrahimi, Mamoon & Male & 55 & 8 & Ph.D & 1985-02-23 & FALSE & Corporate & 236240 \\\\\n", "\t 1933 & Medina, Brandy & Female & 62 & 7 & Associate's & 1979-02-23 & TRUE & Sales & NA \\\\\n", "\t 3570 & Troftgruben, Tierra & Female & 47 & 8 & High School & 1995-02-23 & FALSE & Operations & 101138 \\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "| ID | Name | Gender | Age | Rating | Degree | Start_Date | Retired | Division | Salary |\n", "|---|---|---|---|---|---|---|---|---|---|\n", "| 6881 | al-Rahimi, Tayyiba | Female | 51 | 10 | High School | 1990-02-23 | FALSE | Operations | 108804 |\n", "| 8925 | el-Jaffer, Manaal | Female | 50 | 10 | Master's | 1991-02-23 | FALSE | Engineering | 206770 |\n", "| 2769 | Soto, Michael | Male | 52 | 10 | High School | 1987-02-23 | FALSE | Sales | 183407 |\n", "| 2658 | al-Ebrahimi, Mamoon | Male | 55 | 8 | Ph.D | 1985-02-23 | FALSE | Corporate | 236240 |\n", "| 1933 | Medina, Brandy | Female | 62 | 7 | Associate's | 1979-02-23 | TRUE | Sales | NA |\n", "| 3570 | Troftgruben, Tierra | Female | 47 | 8 | High School | 1995-02-23 | FALSE | Operations | 101138 |\n", "\n" ], "text/plain": [ " ID Name Gender Age Rating Degree Start_Date Retired\n", "1 6881 al-Rahimi, Tayyiba Female 51 10 High School 1990-02-23 FALSE \n", "2 8925 el-Jaffer, Manaal Female 50 10 Master's 1991-02-23 FALSE \n", "3 2769 Soto, Michael Male 52 10 High School 1987-02-23 FALSE \n", "4 2658 al-Ebrahimi, Mamoon Male 55 8 Ph.D 1985-02-23 FALSE \n", "5 1933 Medina, Brandy Female 62 7 Associate's 1979-02-23 TRUE \n", "6 3570 Troftgruben, Tierra Female 47 8 High School 1995-02-23 FALSE \n", " Division Salary\n", "1 Operations 108804\n", "2 Engineering 206770\n", "3 Sales 183407\n", "4 Corporate 236240\n", "5 Sales NA\n", "6 Operations 101138" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
    \n", "\t
  1. 741
  2. \n", "\t
  3. 10
  4. \n", "
\n" ], "text/latex": [ "\\begin{enumerate*}\n", "\\item 741\n", "\\item 10\n", "\\end{enumerate*}\n" ], "text/markdown": [ "1. 741\n", "2. 10\n", "\n", "\n" ], "text/plain": [ "[1] 741 10" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "employeesSubOr <- filter(employees, \n", " Degree == \"Master's\" | Start_Date <= \"2000-12-31\" | Salary < 100000)\n", "head(employeesSubOr)\n", "dim(employeesSubOr)" ] }, { "cell_type": "code", "execution_count": 9, "id": "20f7d48d", "metadata": { "tags": [ "remove-output" ] }, "outputs": [ { "data": { "text/html": [ "
    \n", "\t
  1. 741
  2. \n", "\t
  3. 10
  4. \n", "
\n" ], "text/latex": [ "\\begin{enumerate*}\n", "\\item 741\n", "\\item 10\n", "\\end{enumerate*}\n" ], "text/markdown": [ "1. 741\n", "2. 10\n", "\n", "\n" ], "text/plain": [ "[1] 741 10" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "dim(employeesSubOr)" ] }, { "cell_type": "code", "execution_count": 10, "id": "cffa42eb", "metadata": { "tags": [ "remove-input" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] 741 10\n" ] } ], "source": [ "print(dim(employeesSubOr))" ] }, { "cell_type": "markdown", "id": "d2ef30ba", "metadata": {}, "source": [ "Now let's create a data frame with the employees who have a Master's or a Ph.D. We could do this using the or operator `|`:" ] }, { "cell_type": "code", "execution_count": 11, "id": "87f2d4fb", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
IDNameGenderAgeRatingDegreeStart_DateRetiredDivisionSalary
2671 Lewis, Austin Male 34 4 Ph.D 2007-02-23 FALSE Engineering 182343
8925 el-Jaffer, Manaal Female 50 10 Master's 1991-02-23 FALSE Engineering 206770
2658 al-Ebrahimi, Mamoon Male 55 8 Ph.D 1985-02-23 FALSE Corporate 236240
7821 Holleman, ShaquaishaFemale 43 8 Master's 1999-02-23 FALSE Human Resources 149468
9871 Sontag, Taylor Female 30 7 Ph.D 2011-02-23 FALSE Engineering 164384
2687 Benavidez, Juan Male 45 8 Ph.D 1997-02-23 FALSE Engineering 181924
\n" ], "text/latex": [ "\\begin{tabular}{r|llllllllll}\n", " ID & Name & Gender & Age & Rating & Degree & Start\\_Date & Retired & Division & Salary\\\\\n", "\\hline\n", "\t 2671 & Lewis, Austin & Male & 34 & 4 & Ph.D & 2007-02-23 & FALSE & Engineering & 182343 \\\\\n", "\t 8925 & el-Jaffer, Manaal & Female & 50 & 10 & Master's & 1991-02-23 & FALSE & Engineering & 206770 \\\\\n", "\t 2658 & al-Ebrahimi, Mamoon & Male & 55 & 8 & Ph.D & 1985-02-23 & FALSE & Corporate & 236240 \\\\\n", "\t 7821 & Holleman, Shaquaisha & Female & 43 & 8 & Master's & 1999-02-23 & FALSE & Human Resources & 149468 \\\\\n", "\t 9871 & Sontag, Taylor & Female & 30 & 7 & Ph.D & 2011-02-23 & FALSE & Engineering & 164384 \\\\\n", "\t 2687 & Benavidez, Juan & Male & 45 & 8 & Ph.D & 1997-02-23 & FALSE & Engineering & 181924 \\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "| ID | Name | Gender | Age | Rating | Degree | Start_Date | Retired | Division | Salary |\n", "|---|---|---|---|---|---|---|---|---|---|\n", "| 2671 | Lewis, Austin | Male | 34 | 4 | Ph.D | 2007-02-23 | FALSE | Engineering | 182343 |\n", "| 8925 | el-Jaffer, Manaal | Female | 50 | 10 | Master's | 1991-02-23 | FALSE | Engineering | 206770 |\n", "| 2658 | al-Ebrahimi, Mamoon | Male | 55 | 8 | Ph.D | 1985-02-23 | FALSE | Corporate | 236240 |\n", "| 7821 | Holleman, Shaquaisha | Female | 43 | 8 | Master's | 1999-02-23 | FALSE | Human Resources | 149468 |\n", "| 9871 | Sontag, Taylor | Female | 30 | 7 | Ph.D | 2011-02-23 | FALSE | Engineering | 164384 |\n", "| 2687 | Benavidez, Juan | Male | 45 | 8 | Ph.D | 1997-02-23 | FALSE | Engineering | 181924 |\n", "\n" ], "text/plain": [ " ID Name Gender Age Rating Degree Start_Date Retired\n", "1 2671 Lewis, Austin Male 34 4 Ph.D 2007-02-23 FALSE \n", "2 8925 el-Jaffer, Manaal Female 50 10 Master's 1991-02-23 FALSE \n", "3 2658 al-Ebrahimi, Mamoon Male 55 8 Ph.D 1985-02-23 FALSE \n", "4 7821 Holleman, Shaquaisha Female 43 8 Master's 1999-02-23 FALSE \n", "5 9871 Sontag, Taylor Female 30 7 Ph.D 2011-02-23 FALSE \n", "6 2687 Benavidez, Juan Male 45 8 Ph.D 1997-02-23 FALSE \n", " Division Salary\n", "1 Engineering 182343\n", "2 Engineering 206770\n", "3 Corporate 236240\n", "4 Human Resources 149468\n", "5 Engineering 164384\n", "6 Engineering 181924" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
    \n", "\t
  1. 400
  2. \n", "\t
  3. 10
  4. \n", "
\n" ], "text/latex": [ "\\begin{enumerate*}\n", "\\item 400\n", "\\item 10\n", "\\end{enumerate*}\n" ], "text/markdown": [ "1. 400\n", "2. 10\n", "\n", "\n" ], "text/plain": [ "[1] 400 10" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "employeesGrad <- filter(employees, Degree == \"Master's\" | Degree == \"Ph.D\")\n", "head(employeesGrad)\n", "dim(employeesGrad)" ] }, { "cell_type": "code", "execution_count": 12, "id": "8e32fdae", "metadata": { "tags": [ "remove-output" ] }, "outputs": [ { "data": { "text/html": [ "
    \n", "\t
  1. 400
  2. \n", "\t
  3. 10
  4. \n", "
\n" ], "text/latex": [ "\\begin{enumerate*}\n", "\\item 400\n", "\\item 10\n", "\\end{enumerate*}\n" ], "text/markdown": [ "1. 400\n", "2. 10\n", "\n", "\n" ], "text/plain": [ "[1] 400 10" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "dim(employeesGrad)" ] }, { "cell_type": "code", "execution_count": 13, "id": "7848ec83", "metadata": { "tags": [ "remove-input" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] 400 10\n" ] } ], "source": [ "print(dim(employeesGrad))" ] }, { "cell_type": "markdown", "id": "96b55d43", "metadata": {}, "source": [ "Now imagine we wanted to find all employees who have a Master's, a Ph.D, or a Bachelor's. We could add another `|` to our condition and specify that `Degree == \"Bachelor's\"`. Alternatively, we could make our code more compact by re-writing the condition as `var_name %in% values`. This will filter to only those rows where `var_name` is equal to one of the values specified in the atomic vector `values`. For example:" ] }, { "cell_type": "code", "execution_count": 14, "id": "84c03d6f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
IDNameGenderAgeRatingDegreeStart_DateRetiredDivisionSalary
2671 Lewis, Austin Male 34 4 Ph.D 2007-02-23 FALSE Engineering 182343
8925 el-Jaffer, Manaal Female 50 10 Master's 1991-02-23 FALSE Engineering 206770
2658 al-Ebrahimi, Mamoon Male 55 8 Ph.D 1985-02-23 FALSE Corporate 236240
7821 Holleman, ShaquaishaFemale 43 8 Master's 1999-02-23 FALSE Human Resources 149468
5915 Rogers, Arielle Female 27 2 Bachelor's 2013-02-23 FALSE Human Resources 79183
9871 Sontag, Taylor Female 30 7 Ph.D 2011-02-23 FALSE Engineering 164384
\n" ], "text/latex": [ "\\begin{tabular}{r|llllllllll}\n", " ID & Name & Gender & Age & Rating & Degree & Start\\_Date & Retired & Division & Salary\\\\\n", "\\hline\n", "\t 2671 & Lewis, Austin & Male & 34 & 4 & Ph.D & 2007-02-23 & FALSE & Engineering & 182343 \\\\\n", "\t 8925 & el-Jaffer, Manaal & Female & 50 & 10 & Master's & 1991-02-23 & FALSE & Engineering & 206770 \\\\\n", "\t 2658 & al-Ebrahimi, Mamoon & Male & 55 & 8 & Ph.D & 1985-02-23 & FALSE & Corporate & 236240 \\\\\n", "\t 7821 & Holleman, Shaquaisha & Female & 43 & 8 & Master's & 1999-02-23 & FALSE & Human Resources & 149468 \\\\\n", "\t 5915 & Rogers, Arielle & Female & 27 & 2 & Bachelor's & 2013-02-23 & FALSE & Human Resources & 79183 \\\\\n", "\t 9871 & Sontag, Taylor & Female & 30 & 7 & Ph.D & 2011-02-23 & FALSE & Engineering & 164384 \\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "| ID | Name | Gender | Age | Rating | Degree | Start_Date | Retired | Division | Salary |\n", "|---|---|---|---|---|---|---|---|---|---|\n", "| 2671 | Lewis, Austin | Male | 34 | 4 | Ph.D | 2007-02-23 | FALSE | Engineering | 182343 |\n", "| 8925 | el-Jaffer, Manaal | Female | 50 | 10 | Master's | 1991-02-23 | FALSE | Engineering | 206770 |\n", "| 2658 | al-Ebrahimi, Mamoon | Male | 55 | 8 | Ph.D | 1985-02-23 | FALSE | Corporate | 236240 |\n", "| 7821 | Holleman, Shaquaisha | Female | 43 | 8 | Master's | 1999-02-23 | FALSE | Human Resources | 149468 |\n", "| 5915 | Rogers, Arielle | Female | 27 | 2 | Bachelor's | 2013-02-23 | FALSE | Human Resources | 79183 |\n", "| 9871 | Sontag, Taylor | Female | 30 | 7 | Ph.D | 2011-02-23 | FALSE | Engineering | 164384 |\n", "\n" ], "text/plain": [ " ID Name Gender Age Rating Degree Start_Date Retired\n", "1 2671 Lewis, Austin Male 34 4 Ph.D 2007-02-23 FALSE \n", "2 8925 el-Jaffer, Manaal Female 50 10 Master's 1991-02-23 FALSE \n", "3 2658 al-Ebrahimi, Mamoon Male 55 8 Ph.D 1985-02-23 FALSE \n", "4 7821 Holleman, Shaquaisha Female 43 8 Master's 1999-02-23 FALSE \n", "5 5915 Rogers, Arielle Female 27 2 Bachelor's 2013-02-23 FALSE \n", "6 9871 Sontag, Taylor Female 30 7 Ph.D 2011-02-23 FALSE \n", " Division Salary\n", "1 Engineering 182343\n", "2 Engineering 206770\n", "3 Corporate 236240\n", "4 Human Resources 149468\n", "5 Human Resources 79183\n", "6 Engineering 164384" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
    \n", "\t
  1. 600
  2. \n", "\t
  3. 10
  4. \n", "
\n" ], "text/latex": [ "\\begin{enumerate*}\n", "\\item 600\n", "\\item 10\n", "\\end{enumerate*}\n" ], "text/markdown": [ "1. 600\n", "2. 10\n", "\n", "\n" ], "text/plain": [ "[1] 600 10" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "employeesCollege <- filter(employees, Degree %in% c(\"Bachelor's\", \"Master's\", \"Ph.D\"))\n", "head(employeesCollege)\n", "dim(employeesCollege)" ] }, { "cell_type": "code", "execution_count": 15, "id": "24a26596", "metadata": { "tags": [ "remove-output" ] }, "outputs": [ { "data": { "text/html": [ "
    \n", "\t
  1. 600
  2. \n", "\t
  3. 10
  4. \n", "
\n" ], "text/latex": [ "\\begin{enumerate*}\n", "\\item 600\n", "\\item 10\n", "\\end{enumerate*}\n" ], "text/markdown": [ "1. 600\n", "2. 10\n", "\n", "\n" ], "text/plain": [ "[1] 600 10" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "dim(employeesCollege)" ] }, { "cell_type": "code", "execution_count": 16, "id": "40be76da", "metadata": { "tags": [ "remove-input" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] 600 10\n" ] } ], "source": [ "print(dim(employeesCollege))" ] }, { "cell_type": "markdown", "id": "f6998fd6", "metadata": {}, "source": [ "```{warning}\n", "Be careful filtering data when you have missing values (`NA`).\n", "```\n", "\n", "The `filter()` function keeps only those rows where the specified condition(s) evaluate(s) to `TRUE`. This is complicated by the presence of missing values, as it is impossible to determine whether a condition is `TRUE` or `FALSE` if the relevant information is missing. In our employees data set, we are missing a `Salary` for around 5% of the employees. If we wanted to filter to only those individuals who made more than \\$100,000, how would `filter()` treat the 5% of employees with `NA` values for `Salary`? Because the condition does not evaluate to `TRUE` for these rows, they are dropped." ] }, { "cell_type": "code", "execution_count": 17, "id": "befed129", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
IDNameGenderAgeRatingDegreeStart_DateRetiredDivisionSalary
6881 al-Rahimi, Tayyiba Female 51 10 High School 1990-02-23 FALSE Operations 108804
2671 Lewis, Austin Male 34 4 Ph.D 2007-02-23 FALSE Engineering 182343
8925 el-Jaffer, Manaal Female 50 10 Master's 1991-02-23 FALSE Engineering 206770
2769 Soto, Michael Male 52 10 High School 1987-02-23 FALSE Sales 183407
2658 al-Ebrahimi, MamoonMale 55 8 Ph.D 1985-02-23 FALSE Corporate 236240
3570 Troftgruben, TierraFemale 47 8 High School 1995-02-23 FALSE Operations 101138
\n" ], "text/latex": [ "\\begin{tabular}{r|llllllllll}\n", " ID & Name & Gender & Age & Rating & Degree & Start\\_Date & Retired & Division & Salary\\\\\n", "\\hline\n", "\t 6881 & al-Rahimi, Tayyiba & Female & 51 & 10 & High School & 1990-02-23 & FALSE & Operations & 108804 \\\\\n", "\t 2671 & Lewis, Austin & Male & 34 & 4 & Ph.D & 2007-02-23 & FALSE & Engineering & 182343 \\\\\n", "\t 8925 & el-Jaffer, Manaal & Female & 50 & 10 & Master's & 1991-02-23 & FALSE & Engineering & 206770 \\\\\n", "\t 2769 & Soto, Michael & Male & 52 & 10 & High School & 1987-02-23 & FALSE & Sales & 183407 \\\\\n", "\t 2658 & al-Ebrahimi, Mamoon & Male & 55 & 8 & Ph.D & 1985-02-23 & FALSE & Corporate & 236240 \\\\\n", "\t 3570 & Troftgruben, Tierra & Female & 47 & 8 & High School & 1995-02-23 & FALSE & Operations & 101138 \\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "| ID | Name | Gender | Age | Rating | Degree | Start_Date | Retired | Division | Salary |\n", "|---|---|---|---|---|---|---|---|---|---|\n", "| 6881 | al-Rahimi, Tayyiba | Female | 51 | 10 | High School | 1990-02-23 | FALSE | Operations | 108804 |\n", "| 2671 | Lewis, Austin | Male | 34 | 4 | Ph.D | 2007-02-23 | FALSE | Engineering | 182343 |\n", "| 8925 | el-Jaffer, Manaal | Female | 50 | 10 | Master's | 1991-02-23 | FALSE | Engineering | 206770 |\n", "| 2769 | Soto, Michael | Male | 52 | 10 | High School | 1987-02-23 | FALSE | Sales | 183407 |\n", "| 2658 | al-Ebrahimi, Mamoon | Male | 55 | 8 | Ph.D | 1985-02-23 | FALSE | Corporate | 236240 |\n", "| 3570 | Troftgruben, Tierra | Female | 47 | 8 | High School | 1995-02-23 | FALSE | Operations | 101138 |\n", "\n" ], "text/plain": [ " ID Name Gender Age Rating Degree Start_Date Retired\n", "1 6881 al-Rahimi, Tayyiba Female 51 10 High School 1990-02-23 FALSE \n", "2 2671 Lewis, Austin Male 34 4 Ph.D 2007-02-23 FALSE \n", "3 8925 el-Jaffer, Manaal Female 50 10 Master's 1991-02-23 FALSE \n", "4 2769 Soto, Michael Male 52 10 High School 1987-02-23 FALSE \n", "5 2658 al-Ebrahimi, Mamoon Male 55 8 Ph.D 1985-02-23 FALSE \n", "6 3570 Troftgruben, Tierra Female 47 8 High School 1995-02-23 FALSE \n", " Division Salary\n", "1 Operations 108804\n", "2 Engineering 182343\n", "3 Engineering 206770\n", "4 Sales 183407\n", "5 Corporate 236240\n", "6 Operations 101138" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "employees100k <- filter(employees, Salary >= 100000)\n", "head(employees100k)" ] }, { "cell_type": "markdown", "id": "7b6bbb11", "metadata": {}, "source": [ "In this example, `employees100k` would *not* include any of the employees whose salary is unknown. Note that R provides no warning that these rows are being excluded as well; it is up to the user to recognize that the data contains missing values, and that this will affect how data are filtered. \n", "\n", "If you did not want R to drop the missing values, you could explicitly state in the condition to keep all rows where `Salary` is greater than or equal to $100,000, *or* where `Salary` is missing. As we saw before, we can combine conditions with \"or\" using the `|` character. It may be tempting to assume we should add `Salary == NA` in order to capture the rows where `Salary` is missing; however, this will not work. We cannot use logical operators like `==` to compare something to `NA`. Instead have to use the `is.na()`, as follows:" ] }, { "cell_type": "code", "execution_count": 18, "id": "4c47d93e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
IDNameGenderAgeRatingDegreeStart_DateRetiredDivisionSalary
6881 al-Rahimi, Tayyiba Female 51 10 High School 1990-02-23 FALSE Operations 108804
2671 Lewis, Austin Male 34 4 Ph.D 2007-02-23 FALSE Engineering 182343
8925 el-Jaffer, Manaal Female 50 10 Master's 1991-02-23 FALSE Engineering 206770
2769 Soto, Michael Male 52 10 High School 1987-02-23 FALSE Sales 183407
2658 al-Ebrahimi, MamoonMale 55 8 Ph.D 1985-02-23 FALSE Corporate 236240
1933 Medina, Brandy Female 62 7 Associate's 1979-02-23 TRUE Sales NA
\n" ], "text/latex": [ "\\begin{tabular}{r|llllllllll}\n", " ID & Name & Gender & Age & Rating & Degree & Start\\_Date & Retired & Division & Salary\\\\\n", "\\hline\n", "\t 6881 & al-Rahimi, Tayyiba & Female & 51 & 10 & High School & 1990-02-23 & FALSE & Operations & 108804 \\\\\n", "\t 2671 & Lewis, Austin & Male & 34 & 4 & Ph.D & 2007-02-23 & FALSE & Engineering & 182343 \\\\\n", "\t 8925 & el-Jaffer, Manaal & Female & 50 & 10 & Master's & 1991-02-23 & FALSE & Engineering & 206770 \\\\\n", "\t 2769 & Soto, Michael & Male & 52 & 10 & High School & 1987-02-23 & FALSE & Sales & 183407 \\\\\n", "\t 2658 & al-Ebrahimi, Mamoon & Male & 55 & 8 & Ph.D & 1985-02-23 & FALSE & Corporate & 236240 \\\\\n", "\t 1933 & Medina, Brandy & Female & 62 & 7 & Associate's & 1979-02-23 & TRUE & Sales & NA \\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "| ID | Name | Gender | Age | Rating | Degree | Start_Date | Retired | Division | Salary |\n", "|---|---|---|---|---|---|---|---|---|---|\n", "| 6881 | al-Rahimi, Tayyiba | Female | 51 | 10 | High School | 1990-02-23 | FALSE | Operations | 108804 |\n", "| 2671 | Lewis, Austin | Male | 34 | 4 | Ph.D | 2007-02-23 | FALSE | Engineering | 182343 |\n", "| 8925 | el-Jaffer, Manaal | Female | 50 | 10 | Master's | 1991-02-23 | FALSE | Engineering | 206770 |\n", "| 2769 | Soto, Michael | Male | 52 | 10 | High School | 1987-02-23 | FALSE | Sales | 183407 |\n", "| 2658 | al-Ebrahimi, Mamoon | Male | 55 | 8 | Ph.D | 1985-02-23 | FALSE | Corporate | 236240 |\n", "| 1933 | Medina, Brandy | Female | 62 | 7 | Associate's | 1979-02-23 | TRUE | Sales | NA |\n", "\n" ], "text/plain": [ " ID Name Gender Age Rating Degree Start_Date Retired\n", "1 6881 al-Rahimi, Tayyiba Female 51 10 High School 1990-02-23 FALSE \n", "2 2671 Lewis, Austin Male 34 4 Ph.D 2007-02-23 FALSE \n", "3 8925 el-Jaffer, Manaal Female 50 10 Master's 1991-02-23 FALSE \n", "4 2769 Soto, Michael Male 52 10 High School 1987-02-23 FALSE \n", "5 2658 al-Ebrahimi, Mamoon Male 55 8 Ph.D 1985-02-23 FALSE \n", "6 1933 Medina, Brandy Female 62 7 Associate's 1979-02-23 TRUE \n", " Division Salary\n", "1 Operations 108804\n", "2 Engineering 182343\n", "3 Engineering 206770\n", "4 Sales 183407\n", "5 Corporate 236240\n", "6 Sales NA" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "employees100kNA <- filter(employees, Salary >= 100000 | is.na(Salary))\n", "head(employees100kNA)" ] }, { "cell_type": "code", "execution_count": 19, "id": "b0f2b17a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
IDNameGenderAgeRatingDegreeStart_DateRetiredDivisionSalary
5271 Leschinsky, Johanna Female 36 4 Ph.D 2004-02-23 FALSE Corporate 160364
6681 Bruns, Austin Male 41 8 Ph.D 2002-02-23 FALSE Engineering 188656
2031 Martinez, Caleb Male 57 8 Ph.D 1984-02-23 FALSE Engineering 218430
3239 Larson, Trusten Male 37 5 Bachelor's 2002-02-23 FALSE Human Resources 149789
3717 Levy-Minter, QuintinMale 53 10 Bachelor's 1989-02-23 FALSE Operations 172703
4209 Dena, Gianna Female 49 6 Master's 1991-02-23 FALSE Accounting 185445
\n" ], "text/latex": [ "\\begin{tabular}{r|llllllllll}\n", " ID & Name & Gender & Age & Rating & Degree & Start\\_Date & Retired & Division & Salary\\\\\n", "\\hline\n", "\t 5271 & Leschinsky, Johanna & Female & 36 & 4 & Ph.D & 2004-02-23 & FALSE & Corporate & 160364 \\\\\n", "\t 6681 & Bruns, Austin & Male & 41 & 8 & Ph.D & 2002-02-23 & FALSE & Engineering & 188656 \\\\\n", "\t 2031 & Martinez, Caleb & Male & 57 & 8 & Ph.D & 1984-02-23 & FALSE & Engineering & 218430 \\\\\n", "\t 3239 & Larson, Trusten & Male & 37 & 5 & Bachelor's & 2002-02-23 & FALSE & Human Resources & 149789 \\\\\n", "\t 3717 & Levy-Minter, Quintin & Male & 53 & 10 & Bachelor's & 1989-02-23 & FALSE & Operations & 172703 \\\\\n", "\t 4209 & Dena, Gianna & Female & 49 & 6 & Master's & 1991-02-23 & FALSE & Accounting & 185445 \\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "| ID | Name | Gender | Age | Rating | Degree | Start_Date | Retired | Division | Salary |\n", "|---|---|---|---|---|---|---|---|---|---|\n", "| 5271 | Leschinsky, Johanna | Female | 36 | 4 | Ph.D | 2004-02-23 | FALSE | Corporate | 160364 |\n", "| 6681 | Bruns, Austin | Male | 41 | 8 | Ph.D | 2002-02-23 | FALSE | Engineering | 188656 |\n", "| 2031 | Martinez, Caleb | Male | 57 | 8 | Ph.D | 1984-02-23 | FALSE | Engineering | 218430 |\n", "| 3239 | Larson, Trusten | Male | 37 | 5 | Bachelor's | 2002-02-23 | FALSE | Human Resources | 149789 |\n", "| 3717 | Levy-Minter, Quintin | Male | 53 | 10 | Bachelor's | 1989-02-23 | FALSE | Operations | 172703 |\n", "| 4209 | Dena, Gianna | Female | 49 | 6 | Master's | 1991-02-23 | FALSE | Accounting | 185445 |\n", "\n" ], "text/plain": [ " ID Name Gender Age Rating Degree Start_Date Retired\n", "1 5271 Leschinsky, Johanna Female 36 4 Ph.D 2004-02-23 FALSE \n", "2 6681 Bruns, Austin Male 41 8 Ph.D 2002-02-23 FALSE \n", "3 2031 Martinez, Caleb Male 57 8 Ph.D 1984-02-23 FALSE \n", "4 3239 Larson, Trusten Male 37 5 Bachelor's 2002-02-23 FALSE \n", "5 3717 Levy-Minter, Quintin Male 53 10 Bachelor's 1989-02-23 FALSE \n", "6 4209 Dena, Gianna Female 49 6 Master's 1991-02-23 FALSE \n", " Division Salary\n", "1 Corporate 160364\n", "2 Engineering 188656\n", "3 Engineering 218430\n", "4 Human Resources 149789\n", "5 Operations 172703\n", "6 Accounting 185445" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "tail(employees100kNA)" ] } ], "metadata": { "jupytext": { "cell_metadata_filter": "-all", "formats": "md:myst", "text_representation": { "extension": ".md", "format_name": "myst", "format_version": 0.13, "jupytext_version": "1.11.5" } }, "kernelspec": { "display_name": "R", "language": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "3.6.1" }, "source_map": [ 16, 22, 46, 52, 57, 60, 68, 75, 80, 83, 89, 96, 101, 104, 108, 114, 119, 122, 126, 132, 137, 140, 148, 151, 157, 162 ] }, "nbformat": 4, "nbformat_minor": 5 }