{ "cells": [ { "cell_type": "code", "execution_count": 1, "id": "89b5438f", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Registered S3 methods overwritten by 'ggplot2':\n", " method from \n", " [.quosures rlang\n", " c.quosures rlang\n", " print.quosures rlang\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Registered S3 method overwritten by 'rvest':\n", " method from\n", " read_xml.response xml2\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "-- Attaching packages --------------------------------------- tidyverse 1.2.1 --\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "v ggplot2 3.1.1 v purrr 0.3.2 \n", "v tibble 2.1.1 v dplyr 0.8.0.1\n", "v tidyr 0.8.3 v stringr 1.4.0 \n", "v readr 1.3.1 v forcats 0.4.0 \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "-- Conflicts ------------------------------------------ tidyverse_conflicts() --\n", "x dplyr::filter() masks stats::filter()\n", "x dplyr::lag() masks stats::lag()\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Parsed with column specification:\n", "cols(\n", " ID = col_double(),\n", " Name = col_character(),\n", " Gender = col_character(),\n", " Age = col_double(),\n", " Rating = col_double(),\n", " Degree = col_character(),\n", " Start_Date = col_character(),\n", " Retired = col_logical(),\n", " Division = col_character(),\n", " Salary = col_character()\n", ")\n" ] } ], "source": [ "library(tidyverse)\n", "employees <- read_csv(\"../_build/data/employee_data.csv\")\n", "employees$Salary <- parse_number(employees$Salary)\n", "employees$Start_Date <- parse_date(employees$Start_Date, format = \"%m/%d/%Y\")\n", "degreeLevels <- c(\"High School\", \"Associate's\", \"Bachelor's\", \"Master's\", \"Ph.D\")\n", "employees$Degree <- parse_factor(employees$Degree, levels = degreeLevels, ordered = TRUE)\n", "offices <- read.csv(\"../_build/data/office.csv\")\n", "employees <- inner_join(employees, offices, by=\"ID\")" ] }, { "cell_type": "markdown", "id": "658969c7", "metadata": {}, "source": [ "# Summarising Data\n", "\n", "Now that we know how to use the pipe, we can use it to quickly and efficiently summarise data. To start we first need to introduce the `summarise()` function from the `tidyverse`, which we can use to summarise one or more columns in a data frame. This function uses the following syntax:\n", "\n", "```{admonition} Syntax\n", "`tidyverse::summarise(df, summaryStat1 = ..., summaryStat2 = ..., ...)`\n", "+ *Required arguments*\n", " - `df`: The data frame with the data. \n", " - `summaryStat1 = ...`: The summary statistic we would like to calculate.\n", "+ *Optional arguments*\n", " - `summaryStat2 = ..., ...`: Any additional summary statistics we would like to calculate.\n", "```\n", "\n", "For example, we can use `summarise()` to calculate all of the following at once from `employees`:\n", "\n", "+ The average of `Salary`\n", "+ The standard deviation of `Salary`\n", "+ The minimum `Age`\n", "+ The maximum `Age`" ] }, { "cell_type": "code", "execution_count": 2, "id": "88325c62", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\n", "
meanSalarysdSalaryminAgemaxAge
158034.339677.0225 65
\n" ], "text/latex": [ "\\begin{tabular}{r|llll}\n", " meanSalary & sdSalary & minAge & maxAge\\\\\n", "\\hline\n", "\t 158034.3 & 39677.02 & 25 & 65 \\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "| meanSalary | sdSalary | minAge | maxAge |\n", "|---|---|---|---|\n", "| 158034.3 | 39677.02 | 25 | 65 |\n", "\n" ], "text/plain": [ " meanSalary sdSalary minAge maxAge\n", "1 158034.3 39677.02 25 65 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "summarise(employees, meanSalary = mean(Salary, na.rm = TRUE),\n", " sdSalary = sd(Salary, na.rm = TRUE),\n", " minAge = min(Age),\n", " maxAge = max(Age))" ] }, { "cell_type": "markdown", "id": "0498c861", "metadata": {}, "source": [ "It is often useful to include the helper function `n()` within `summarise()`, which will calculate the number of observations in the data set. Note that this is similar to the `nrow()` function that we saw in the bootcamp, but `n()` only works within other `tidyverse` functions." ] }, { "cell_type": "code", "execution_count": 3, "id": "aa675e6e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\n", "
meanSalarysdSalaryminAgemaxAgenObs
158034.339677.0225 65 908
\n" ], "text/latex": [ "\\begin{tabular}{r|lllll}\n", " meanSalary & sdSalary & minAge & maxAge & nObs\\\\\n", "\\hline\n", "\t 158034.3 & 39677.02 & 25 & 65 & 908 \\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "| meanSalary | sdSalary | minAge | maxAge | nObs |\n", "|---|---|---|---|---|\n", "| 158034.3 | 39677.02 | 25 | 65 | 908 |\n", "\n" ], "text/plain": [ " meanSalary sdSalary minAge maxAge nObs\n", "1 158034.3 39677.02 25 65 908 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "summarise(employees, meanSalary = mean(Salary, na.rm = TRUE),\n", " sdSalary = sd(Salary, na.rm = TRUE),\n", " minAge = min(Age),\n", " maxAge = max(Age),\n", " nObs = n())" ] }, { "cell_type": "markdown", "id": "72a2c67c", "metadata": {}, "source": [ "The `summarise()` function is useful for calculating summary statistics, but it becomes even more powerful when we combine it with `group_by()`. \n", "\n", "```{admonition} Syntax\n", "`tidyverse::group_by()`\n", "```\n", "\n", "Imagine that we wanted to calculate separate summary statistics for each of the three offices (`New York`, `Boston`, and `Detroit`) separately, not across the entire data set. To accomplish this, we can use the pipe to pass the data through `group_by()` first, then pass it through `summarise()`. Any variable(s) we specify in `group_by()` will be used to separate the data into distinct groups, and `summarise()` will be applied to each one of those groups separately. For example:" ] }, { "cell_type": "code", "execution_count": 4, "id": "35c335e4", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\n", "
officemeanSalarysdSalaryminAgemaxAgenObs
Boston 157957.937388.5725 65 294
Detroit 137587.238510.3925 65 166
New York165628.438978.0825 65 448
\n" ], "text/latex": [ "\\begin{tabular}{r|llllll}\n", " office & meanSalary & sdSalary & minAge & maxAge & nObs\\\\\n", "\\hline\n", "\t Boston & 157957.9 & 37388.57 & 25 & 65 & 294 \\\\\n", "\t Detroit & 137587.2 & 38510.39 & 25 & 65 & 166 \\\\\n", "\t New York & 165628.4 & 38978.08 & 25 & 65 & 448 \\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "| office | meanSalary | sdSalary | minAge | maxAge | nObs |\n", "|---|---|---|---|---|---|\n", "| Boston | 157957.9 | 37388.57 | 25 | 65 | 294 |\n", "| Detroit | 137587.2 | 38510.39 | 25 | 65 | 166 |\n", "| New York | 165628.4 | 38978.08 | 25 | 65 | 448 |\n", "\n" ], "text/plain": [ " office meanSalary sdSalary minAge maxAge nObs\n", "1 Boston 157957.9 37388.57 25 65 294 \n", "2 Detroit 137587.2 38510.39 25 65 166 \n", "3 New York 165628.4 38978.08 25 65 448 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "employees %>%\n", "\n", " group_by(office) %>%\n", "\n", " summarise(meanSalary = mean(Salary, na.rm=TRUE),\n", " sdSalary = sd(Salary, na.rm=TRUE),\n", " minAge = min(Age),\n", " maxAge = max(Age),\n", " nObs = n())" ] }, { "cell_type": "markdown", "id": "db707170", "metadata": {}, "source": [ "From the output we can see that this calculate the summary statistics within each value of the `office` variable. We can also include more than one variable within `group_by()`. For example, imagine we wanted to calculate these summary statistics by gender within each office. All we would need to do is add `Gender` to the `group_by()`:" ] }, { "cell_type": "code", "execution_count": 5, "id": "b064f0a2", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
officeGendermeanSalarysdSalaryminAgemaxAgenObs
Boston Female 152778.134104.5325 65 114
Boston Male 161317.039106.5625 65 180
Detroit Female 133720.135552.3925 65 69
Detroit Male 140251.240401.4125 64 97
New YorkFemale 160560.339787.9825 65 220
New YorkMale 170647.537584.8225 65 228
\n" ], "text/latex": [ "\\begin{tabular}{r|lllllll}\n", " office & Gender & meanSalary & sdSalary & minAge & maxAge & nObs\\\\\n", "\\hline\n", "\t Boston & Female & 152778.1 & 34104.53 & 25 & 65 & 114 \\\\\n", "\t Boston & Male & 161317.0 & 39106.56 & 25 & 65 & 180 \\\\\n", "\t Detroit & Female & 133720.1 & 35552.39 & 25 & 65 & 69 \\\\\n", "\t Detroit & Male & 140251.2 & 40401.41 & 25 & 64 & 97 \\\\\n", "\t New York & Female & 160560.3 & 39787.98 & 25 & 65 & 220 \\\\\n", "\t New York & Male & 170647.5 & 37584.82 & 25 & 65 & 228 \\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "| office | Gender | meanSalary | sdSalary | minAge | maxAge | nObs |\n", "|---|---|---|---|---|---|---|\n", "| Boston | Female | 152778.1 | 34104.53 | 25 | 65 | 114 |\n", "| Boston | Male | 161317.0 | 39106.56 | 25 | 65 | 180 |\n", "| Detroit | Female | 133720.1 | 35552.39 | 25 | 65 | 69 |\n", "| Detroit | Male | 140251.2 | 40401.41 | 25 | 64 | 97 |\n", "| New York | Female | 160560.3 | 39787.98 | 25 | 65 | 220 |\n", "| New York | Male | 170647.5 | 37584.82 | 25 | 65 | 228 |\n", "\n" ], "text/plain": [ " office Gender meanSalary sdSalary minAge maxAge nObs\n", "1 Boston Female 152778.1 34104.53 25 65 114 \n", "2 Boston Male 161317.0 39106.56 25 65 180 \n", "3 Detroit Female 133720.1 35552.39 25 65 69 \n", "4 Detroit Male 140251.2 40401.41 25 64 97 \n", "5 New York Female 160560.3 39787.98 25 65 220 \n", "6 New York Male 170647.5 37584.82 25 65 228 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "employees %>%\n", "\n", " group_by(office, Gender) %>%\n", "\n", " summarise(meanSalary = mean(Salary, na.rm=TRUE),\n", " sdSalary = sd(Salary, na.rm=TRUE),\n", " minAge = min(Age),\n", " maxAge = max(Age),\n", " nObs = n())" ] } ], "metadata": { "jupytext": { "cell_metadata_filter": "-all", "formats": "md:myst", "text_representation": { "extension": ".md", "format_name": "myst", "format_version": 0.13, "jupytext_version": "1.11.5" } }, "kernelspec": { "display_name": "R", "language": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "3.6.1" }, "source_map": [ 16, 26, 48, 53, 57, 63, 73, 83, 87 ] }, "nbformat": 4, "nbformat_minor": 5 }