2.2. Summarising Data
Now that we know how to use the pipe, we can use it to quickly and efficiently summarise data. To start, we first need to introduce the summarise() function from the tidyverse (it lives in the dplyr package, which is part of the tidyverse), which we can use to summarise one or more columns in a data frame. This function uses the following syntax:
Syntax
dplyr::summarise(df, summaryStat1 = ..., summaryStat2 = ..., ...)
Required arguments
df: The data frame with the data.
summaryStat1 = ...: The summary statistic we would like to calculate.
Optional arguments
summaryStat2 = ..., ...: Any additional summary statistics we would like to calculate.
For example, we can use summarise() to calculate all of the following at once from employees:
The average of Salary
The standard deviation of Salary
The minimum Age
The maximum Age
summarise(employees, meanSalary = mean(Salary, na.rm = TRUE),
          sdSalary = sd(Salary, na.rm = TRUE),
          minAge = min(Age),
          maxAge = max(Age))
meanSalary | sdSalary | minAge | maxAge |
---|---|---|---|
158034.3 | 39677.02 | 25 | 65 |
It is often useful to include the helper function n() within summarise(), which counts the number of observations in the data set. Note that this is similar to the nrow() function that we saw in the bootcamp, but n() takes no arguments and only works inside tidyverse functions such as summarise().
summarise(employees, meanSalary = mean(Salary, na.rm = TRUE),
          sdSalary = sd(Salary, na.rm = TRUE),
          minAge = min(Age),
          maxAge = max(Age),
          nObs = n())
meanSalary | sdSalary | minAge | maxAge | nObs |
---|---|---|---|---|
158034.3 | 39677.02 | 25 | 65 | 908 |
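The relationship between n() and nrow() can be checked on a small example. The sketch below uses a made-up data frame called toy (it is not the employees data, and its values are invented) and assumes the dplyr package is installed. It is only meant to illustrate that n() counts every row, including rows with missing values, and that it cannot be called on its own.

```r
library(dplyr)

# Hypothetical toy data standing in for `employees`; the values are made up.
toy <- data.frame(
  Salary = c(100, 120, NA, 90),
  Age    = c(30, 45, 52, 28)
)

# n() counts rows inside summarise(); nrow() counts rows of a data frame.
res <- summarise(toy,
                 meanSalary = mean(Salary, na.rm = TRUE),
                 nObs = n())
res$nObs == nrow(toy)   # TRUE: both count all 4 rows, including the NA row

# n() does not skip missing values, so counting the non-missing salaries
# needs an explicit check.
nonMissing <- summarise(toy, nSalary = sum(!is.na(Salary)))
nonMissing$nSalary      # 3

# Calling n() outside a dplyr verb raises an error.
outsideFails <- inherits(tryCatch(n(), error = function(e) e), "error")
outsideFails            # TRUE
```

If the number of non-missing values in a column matters, a count such as sum(!is.na(Salary)) is usually the safer choice than n().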
The summarise() function is useful for calculating summary statistics, but it becomes even more powerful when we combine it with group_by().
Syntax
dplyr::group_by(df, groupVar1, groupVar2, ...)
Imagine that we wanted to calculate summary statistics separately for each of the three offices (New York, Boston, and Detroit) rather than across the entire data set. To accomplish this, we can use the pipe to pass the data through group_by() first, then pass it through summarise(). Any variable(s) we specify in group_by() will be used to separate the data into distinct groups, and summarise() will be applied to each of those groups separately. For example:
employees %>%
  group_by(office) %>%
  summarise(meanSalary = mean(Salary, na.rm = TRUE),
            sdSalary = sd(Salary, na.rm = TRUE),
            minAge = min(Age),
            maxAge = max(Age),
            nObs = n())
office | meanSalary | sdSalary | minAge | maxAge | nObs |
---|---|---|---|---|---|
Boston | 157957.9 | 37388.57 | 25 | 65 | 294 |
Detroit | 137587.2 | 38510.39 | 25 | 65 | 166 |
New York | 165628.4 | 38978.08 | 25 | 65 | 448 |
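To see what group_by() contributes on its own, it can help to look at a small example. The sketch below again uses a made-up data frame called toy (hypothetical values, not the employees data) and assumes dplyr is installed. It suggests that group_by() leaves the rows untouched and only records the grouping, and that the piped form gives the same result as nesting the two calls.

```r
library(dplyr)

# Hypothetical toy data; the values are made up for illustration.
toy <- data.frame(
  office = c("Boston", "Boston", "Detroit", "New York", "New York"),
  Salary = c(100, 120, 90, 150, NA),
  stringsAsFactors = FALSE
)

# group_by() keeps every row and only marks which variable defines the groups.
grouped <- group_by(toy, office)
nrow(grouped) == nrow(toy)   # TRUE
group_vars(grouped)          # "office"

# Piped form: group first, then summarise each group.
piped <- toy %>%
  group_by(office) %>%
  summarise(meanSalary = mean(Salary, na.rm = TRUE),
            nObs = n())

# Equivalent nested form without the pipe.
nested <- summarise(group_by(toy, office),
                    meanSalary = mean(Salary, na.rm = TRUE),
                    nObs = n())

all.equal(piped, nested)     # TRUE
piped$meanSalary             # 110  90 150
```

The piped version reads in the order the steps happen, which is the main reason to prefer it once more than one step is involved.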
From the output we can see that this calculates the summary statistics within each value of the office variable. We can also include more than one variable within group_by(). For example, imagine we wanted to calculate these summary statistics by gender within each office. All we would need to do is add Gender to the group_by() call:
employees %>%
  group_by(office, Gender) %>%
  summarise(meanSalary = mean(Salary, na.rm = TRUE),
            sdSalary = sd(Salary, na.rm = TRUE),
            minAge = min(Age),
            maxAge = max(Age),
            nObs = n())
office | Gender | meanSalary | sdSalary | minAge | maxAge | nObs |
---|---|---|---|---|---|---|
Boston | Female | 152778.1 | 34104.53 | 25 | 65 | 114 |
Boston | Male | 161317.0 | 39106.56 | 25 | 65 | 180 |
Detroit | Female | 133720.1 | 35552.39 | 25 | 65 | 69 |
Detroit | Male | 140251.2 | 40401.41 | 25 | 64 | 97 |
New York | Female | 160560.3 | 39787.98 | 25 | 65 | 220 |
New York | Male | 170647.5 | 37584.82 | 25 | 65 | 228 |
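When grouping by more than one variable, summarise() by default removes only the last grouping variable, so the result above is still grouped by office. The sketch below illustrates this on a made-up data frame called toy (hypothetical values, not the employees data). It assumes dplyr version 1.0.0 or later, where summarise() accepts the .groups argument.

```r
library(dplyr)

# Hypothetical toy data; the values are made up for illustration.
toy <- data.frame(
  office = c("Boston", "Boston", "Boston", "Detroit", "Detroit"),
  Gender = c("Female", "Male", "Male", "Female", "Male"),
  Salary = c(100, 120, 140, 90, 110),
  stringsAsFactors = FALSE
)

# Default behaviour made explicit: drop only the last grouping variable.
byBoth <- toy %>%
  group_by(office, Gender) %>%
  summarise(meanSalary = mean(Salary),
            nObs = n(),
            .groups = "drop_last")
group_vars(byBoth)   # "office": the result is still grouped by office

# Drop all grouping so that later steps operate on the whole table.
flat <- toy %>%
  group_by(office, Gender) %>%
  summarise(meanSalary = mean(Salary),
            nObs = n(),
            .groups = "drop")
group_vars(flat)     # character(0): no grouping left
flat$meanSalary      # 100 130  90 110
```

Setting .groups explicitly also silences the informational message that summarise() prints about the grouping of its output. Calling ungroup() on the result is an alternative way to remove any remaining grouping.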