{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "89b5438f",
"metadata": {
"tags": [
"remove-cell"
]
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Registered S3 methods overwritten by 'ggplot2':\n",
" method from \n",
" [.quosures rlang\n",
" c.quosures rlang\n",
" print.quosures rlang\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Registered S3 method overwritten by 'rvest':\n",
" method from\n",
" read_xml.response xml2\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"-- Attaching packages --------------------------------------- tidyverse 1.2.1 --\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"v ggplot2 3.1.1 v purrr 0.3.2 \n",
"v tibble 2.1.1 v dplyr 0.8.0.1\n",
"v tidyr 0.8.3 v stringr 1.4.0 \n",
"v readr 1.3.1 v forcats 0.4.0 \n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"-- Conflicts ------------------------------------------ tidyverse_conflicts() --\n",
"x dplyr::filter() masks stats::filter()\n",
"x dplyr::lag() masks stats::lag()\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Parsed with column specification:\n",
"cols(\n",
" ID = col_double(),\n",
" Name = col_character(),\n",
" Gender = col_character(),\n",
" Age = col_double(),\n",
" Rating = col_double(),\n",
" Degree = col_character(),\n",
" Start_Date = col_character(),\n",
" Retired = col_logical(),\n",
" Division = col_character(),\n",
" Salary = col_character()\n",
")\n"
]
}
],
"source": [
"library(tidyverse)\n",
"employees <- read_csv(\"../_build/data/employee_data.csv\")\n",
"employees$Salary <- parse_number(employees$Salary)\n",
"employees$Start_Date <- parse_date(employees$Start_Date, format = \"%m/%d/%Y\")\n",
"degreeLevels <- c(\"High School\", \"Associate's\", \"Bachelor's\", \"Master's\", \"Ph.D\")\n",
"employees$Degree <- parse_factor(employees$Degree, levels = degreeLevels, ordered = TRUE)\n",
"offices <- read.csv(\"../_build/data/office.csv\")\n",
"employees <- inner_join(employees, offices, by=\"ID\")"
]
},
{
"cell_type": "markdown",
"id": "658969c7",
"metadata": {},
"source": [
"# Summarising Data\n",
"\n",
"Now that we know how to use the pipe, we can use it to quickly and efficiently summarise data. To start we first need to introduce the `summarise()` function from the `tidyverse`, which we can use to summarise one or more columns in a data frame. This function uses the following syntax:\n",
"\n",
"```{admonition} Syntax\n",
"`tidyverse::summarise(df, summaryStat1 = ..., summaryStat2 = ..., ...)`\n",
"+ *Required arguments*\n",
" - `df`: The data frame with the data. \n",
" - `summaryStat1 = ...`: The summary statistic we would like to calculate.\n",
"+ *Optional arguments*\n",
" - `summaryStat2 = ..., ...`: Any additional summary statistics we would like to calculate.\n",
"```\n",
"\n",
"For example, we can use `summarise()` to calculate all of the following at once from `employees`:\n",
"\n",
"+ The average of `Salary`\n",
"+ The standard deviation of `Salary`\n",
"+ The minimum `Age`\n",
"+ The maximum `Age`"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "88325c62",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"meanSalary | sdSalary | minAge | maxAge |
\n",
"\n",
"\t158034.3 | 39677.02 | 25 | 65 |
\n",
"\n",
"
\n"
],
"text/latex": [
"\\begin{tabular}{r|llll}\n",
" meanSalary & sdSalary & minAge & maxAge\\\\\n",
"\\hline\n",
"\t 158034.3 & 39677.02 & 25 & 65 \\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"| meanSalary | sdSalary | minAge | maxAge |\n",
"|---|---|---|---|\n",
"| 158034.3 | 39677.02 | 25 | 65 |\n",
"\n"
],
"text/plain": [
" meanSalary sdSalary minAge maxAge\n",
"1 158034.3 39677.02 25 65 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"summarise(employees, meanSalary = mean(Salary, na.rm = TRUE),\n",
" sdSalary = sd(Salary, na.rm = TRUE),\n",
" minAge = min(Age),\n",
" maxAge = max(Age))"
]
},
{
"cell_type": "markdown",
"id": "0498c861",
"metadata": {},
"source": [
"It is often useful to include the helper function `n()` within `summarise()`, which will calculate the number of observations in the data set. Note that this is similar to the `nrow()` function that we saw in the bootcamp, but `n()` only works within other `tidyverse` functions."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "aa675e6e",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"meanSalary | sdSalary | minAge | maxAge | nObs |
\n",
"\n",
"\t158034.3 | 39677.02 | 25 | 65 | 908 |
\n",
"\n",
"
\n"
],
"text/latex": [
"\\begin{tabular}{r|lllll}\n",
" meanSalary & sdSalary & minAge & maxAge & nObs\\\\\n",
"\\hline\n",
"\t 158034.3 & 39677.02 & 25 & 65 & 908 \\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"| meanSalary | sdSalary | minAge | maxAge | nObs |\n",
"|---|---|---|---|---|\n",
"| 158034.3 | 39677.02 | 25 | 65 | 908 |\n",
"\n"
],
"text/plain": [
" meanSalary sdSalary minAge maxAge nObs\n",
"1 158034.3 39677.02 25 65 908 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"summarise(employees, meanSalary = mean(Salary, na.rm = TRUE),\n",
" sdSalary = sd(Salary, na.rm = TRUE),\n",
" minAge = min(Age),\n",
" maxAge = max(Age),\n",
" nObs = n())"
]
},
{
"cell_type": "markdown",
"id": "72a2c67c",
"metadata": {},
"source": [
"The `summarise()` function is useful for calculating summary statistics, but it becomes even more powerful when we combine it with `group_by()`. \n",
"\n",
"```{admonition} Syntax\n",
"`tidyverse::group_by()`\n",
"```\n",
"\n",
"Imagine that we wanted to calculate separate summary statistics for each of the three offices (`New York`, `Boston`, and `Detroit`) separately, not across the entire data set. To accomplish this, we can use the pipe to pass the data through `group_by()` first, then pass it through `summarise()`. Any variable(s) we specify in `group_by()` will be used to separate the data into distinct groups, and `summarise()` will be applied to each one of those groups separately. For example:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "35c335e4",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"office | meanSalary | sdSalary | minAge | maxAge | nObs |
\n",
"\n",
"\tBoston | 157957.9 | 37388.57 | 25 | 65 | 294 |
\n",
"\tDetroit | 137587.2 | 38510.39 | 25 | 65 | 166 |
\n",
"\tNew York | 165628.4 | 38978.08 | 25 | 65 | 448 |
\n",
"\n",
"
\n"
],
"text/latex": [
"\\begin{tabular}{r|llllll}\n",
" office & meanSalary & sdSalary & minAge & maxAge & nObs\\\\\n",
"\\hline\n",
"\t Boston & 157957.9 & 37388.57 & 25 & 65 & 294 \\\\\n",
"\t Detroit & 137587.2 & 38510.39 & 25 & 65 & 166 \\\\\n",
"\t New York & 165628.4 & 38978.08 & 25 & 65 & 448 \\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"| office | meanSalary | sdSalary | minAge | maxAge | nObs |\n",
"|---|---|---|---|---|---|\n",
"| Boston | 157957.9 | 37388.57 | 25 | 65 | 294 |\n",
"| Detroit | 137587.2 | 38510.39 | 25 | 65 | 166 |\n",
"| New York | 165628.4 | 38978.08 | 25 | 65 | 448 |\n",
"\n"
],
"text/plain": [
" office meanSalary sdSalary minAge maxAge nObs\n",
"1 Boston 157957.9 37388.57 25 65 294 \n",
"2 Detroit 137587.2 38510.39 25 65 166 \n",
"3 New York 165628.4 38978.08 25 65 448 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"employees %>%\n",
"\n",
" group_by(office) %>%\n",
"\n",
" summarise(meanSalary = mean(Salary, na.rm=TRUE),\n",
" sdSalary = sd(Salary, na.rm=TRUE),\n",
" minAge = min(Age),\n",
" maxAge = max(Age),\n",
" nObs = n())"
]
},
{
"cell_type": "markdown",
"id": "db707170",
"metadata": {},
"source": [
"From the output we can see that this calculate the summary statistics within each value of the `office` variable. We can also include more than one variable within `group_by()`. For example, imagine we wanted to calculate these summary statistics by gender within each office. All we would need to do is add `Gender` to the `group_by()`:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "b064f0a2",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"office | Gender | meanSalary | sdSalary | minAge | maxAge | nObs |
\n",
"\n",
"\tBoston | Female | 152778.1 | 34104.53 | 25 | 65 | 114 |
\n",
"\tBoston | Male | 161317.0 | 39106.56 | 25 | 65 | 180 |
\n",
"\tDetroit | Female | 133720.1 | 35552.39 | 25 | 65 | 69 |
\n",
"\tDetroit | Male | 140251.2 | 40401.41 | 25 | 64 | 97 |
\n",
"\tNew York | Female | 160560.3 | 39787.98 | 25 | 65 | 220 |
\n",
"\tNew York | Male | 170647.5 | 37584.82 | 25 | 65 | 228 |
\n",
"\n",
"
\n"
],
"text/latex": [
"\\begin{tabular}{r|lllllll}\n",
" office & Gender & meanSalary & sdSalary & minAge & maxAge & nObs\\\\\n",
"\\hline\n",
"\t Boston & Female & 152778.1 & 34104.53 & 25 & 65 & 114 \\\\\n",
"\t Boston & Male & 161317.0 & 39106.56 & 25 & 65 & 180 \\\\\n",
"\t Detroit & Female & 133720.1 & 35552.39 & 25 & 65 & 69 \\\\\n",
"\t Detroit & Male & 140251.2 & 40401.41 & 25 & 64 & 97 \\\\\n",
"\t New York & Female & 160560.3 & 39787.98 & 25 & 65 & 220 \\\\\n",
"\t New York & Male & 170647.5 & 37584.82 & 25 & 65 & 228 \\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"| office | Gender | meanSalary | sdSalary | minAge | maxAge | nObs |\n",
"|---|---|---|---|---|---|---|\n",
"| Boston | Female | 152778.1 | 34104.53 | 25 | 65 | 114 |\n",
"| Boston | Male | 161317.0 | 39106.56 | 25 | 65 | 180 |\n",
"| Detroit | Female | 133720.1 | 35552.39 | 25 | 65 | 69 |\n",
"| Detroit | Male | 140251.2 | 40401.41 | 25 | 64 | 97 |\n",
"| New York | Female | 160560.3 | 39787.98 | 25 | 65 | 220 |\n",
"| New York | Male | 170647.5 | 37584.82 | 25 | 65 | 228 |\n",
"\n"
],
"text/plain": [
" office Gender meanSalary sdSalary minAge maxAge nObs\n",
"1 Boston Female 152778.1 34104.53 25 65 114 \n",
"2 Boston Male 161317.0 39106.56 25 65 180 \n",
"3 Detroit Female 133720.1 35552.39 25 65 69 \n",
"4 Detroit Male 140251.2 40401.41 25 64 97 \n",
"5 New York Female 160560.3 39787.98 25 65 220 \n",
"6 New York Male 170647.5 37584.82 25 65 228 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"employees %>%\n",
"\n",
" group_by(office, Gender) %>%\n",
"\n",
" summarise(meanSalary = mean(Salary, na.rm=TRUE),\n",
" sdSalary = sd(Salary, na.rm=TRUE),\n",
" minAge = min(Age),\n",
" maxAge = max(Age),\n",
" nObs = n())"
]
}
],
"metadata": {
"jupytext": {
"cell_metadata_filter": "-all",
"formats": "md:myst",
"text_representation": {
"extension": ".md",
"format_name": "myst",
"format_version": 0.13,
"jupytext_version": "1.11.5"
}
},
"kernelspec": {
"display_name": "R",
"language": "R",
"name": "ir"
},
"language_info": {
"codemirror_mode": "r",
"file_extension": ".r",
"mimetype": "text/x-r-source",
"name": "R",
"pygments_lexer": "r",
"version": "3.6.1"
},
"source_map": [
16,
26,
48,
53,
57,
63,
73,
83,
87
]
},
"nbformat": 4,
"nbformat_minor": 5
}