2.2. Summarising Data

Now that we know how to use the pipe, we can use it to summarise data quickly and efficiently. To start, we need to introduce the summarise() function from the tidyverse (specifically, the dplyr package), which calculates summary statistics for one or more columns in a data frame. This function uses the following syntax:

Syntax

dplyr::summarise(df, summaryStat1 = ..., summaryStat2 = ..., ...)

  • Required arguments

    • df: The data frame with the data.

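    • summaryStat1 = ...: The summary statistic we would like to calculate. The name on the left becomes the column name in the output, and the expression on the right calculates the value (for example, meanSalary = mean(Salary)).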

  • Optional arguments

    • summaryStat2 = ..., ...: Any additional summary statistics we would like to calculate.

For example, we can use summarise() to calculate all of the following at once from employees:

  • The average of Salary

  • The standard deviation of Salary

  • The minimum Age

  • The maximum Age

summarise(employees, meanSalary = mean(Salary, na.rm = TRUE),
                     sdSalary = sd(Salary, na.rm = TRUE),
                     minAge = min(Age),
                     maxAge = max(Age))

meanSalary  sdSalary  minAge  maxAge
  158034.3  39677.02      25      65
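
The na.rm = TRUE argument used for Salary matters whenever a column contains missing values (presumably the case for Salary, although the text does not say so). The following is a minimal sketch using a small made-up vector, not the employees data:

```r
# Hypothetical salary values with one missing entry (not from employees)
salaries <- c(100000, 150000, NA, 200000)

# Without na.rm, a single NA makes the whole result NA
meanWithNA <- mean(salaries)

# With na.rm = TRUE, missing values are dropped before calculating
meanNoNA <- mean(salaries, na.rm = TRUE)   # 150000
sdNoNA   <- sd(salaries, na.rm = TRUE)     # 50000
```

The same logic applies to min() and max(): if Age contained missing values, min(Age) would also return NA unless na.rm = TRUE were added.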

It is often useful to include the helper function n() within summarise(), which counts the number of observations (rows) being summarised. Note that this is similar to the nrow() function that we saw in the bootcamp, but n() takes no arguments and only works inside certain tidyverse functions such as summarise().

summarise(employees, meanSalary = mean(Salary, na.rm = TRUE),
                     sdSalary = sd(Salary, na.rm = TRUE),
                     minAge = min(Age),
                     maxAge = max(Age),
                     nObs = n())

meanSalary  sdSalary  minAge  maxAge  nObs
  158034.3  39677.02      25      65   908
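
To make the difference between nrow() and n() concrete, here is a short sketch on a made-up data frame (the name toy and its values are assumptions for illustration only):

```r
library(dplyr)

# Hypothetical toy data (not the employees data set)
toy <- data.frame(Age = c(25, 40, 65))

# nrow() is base R and is called directly on a data frame
totalRows <- nrow(toy)

# n() takes no arguments and is called inside a dplyr verb
counts <- summarise(toy, nObs = n())
```

Both approaches count three rows here. Calling n() on its own, outside a function like summarise(), results in an error.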

The summarise() function is useful for calculating summary statistics, but it becomes even more powerful when we combine it with group_by().

Syntax

dplyr::group_by()

Imagine that we wanted to calculate summary statistics separately for each of the three offices (New York, Boston, and Detroit), rather than across the entire data set. To accomplish this, we can use the pipe to pass the data through group_by() first, then pass it through summarise(). Any variable(s) we specify in group_by() will be used to split the data into distinct groups, and summarise() will then be applied to each of those groups. For example:

employees %>%
  group_by(office) %>%
  summarise(meanSalary = mean(Salary, na.rm = TRUE),
            sdSalary = sd(Salary, na.rm = TRUE),
            minAge = min(Age),
            maxAge = max(Age),
            nObs = n())

office    meanSalary  sdSalary  minAge  maxAge  nObs
Boston      157957.9  37388.57      25      65   294
Detroit     137587.2  38510.39      25      65   166
New York    165628.4  38978.08      25      65   448
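
Because the employees data is not reproduced here, the following self-contained sketch repeats the same pattern on a small made-up data frame. The column names mirror employees, but the values are invented for illustration:

```r
library(dplyr)

# Hypothetical toy data with the same column names as employees
toy <- data.frame(
  office = c("Boston", "Boston", "Detroit", "Detroit", "Detroit"),
  Salary = c(100000, 120000, 90000, NA, 110000),
  Age    = c(30, 45, 28, 52, 39)
)

# One output row per office; each statistic is computed within the group
byOffice <- toy %>%
  group_by(office) %>%
  summarise(meanSalary = mean(Salary, na.rm = TRUE),
            maxAge = max(Age),
            nObs = n())
```

Note that n() counts every row in the group, including the Detroit row whose Salary is missing, so nObs for Detroit is 3 even though only two salaries enter the mean.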

From the output we can see that this calculates the summary statistics within each value of the office variable. We can also include more than one variable within group_by(). For example, imagine we wanted to calculate these summary statistics by gender within each office. All we would need to do is add Gender to group_by():

employees %>%
  group_by(office, Gender) %>%
  summarise(meanSalary = mean(Salary, na.rm = TRUE),
            sdSalary = sd(Salary, na.rm = TRUE),
            minAge = min(Age),
            maxAge = max(Age),
            nObs = n())

office    Gender  meanSalary  sdSalary  minAge  maxAge  nObs
Boston    Female    152778.1  34104.53      25      65   114
Boston    Male      161317.0  39106.56      25      65   180
Detroit   Female    133720.1  35552.39      25      65    69
Detroit   Male      140251.2  40401.41      25      64    97
New York  Female    160560.3  39787.98      25      65   220
New York  Male      170647.5  37584.82      25      65   228
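
When grouping by more than one variable, recent versions of dplyr keep the result grouped by all but the last grouping variable (here, office) and print a message saying so. If an ungrouped result is preferred, the .groups argument of summarise() can be set explicitly. A minimal sketch on made-up data (the toy data frame and its values are assumptions, not taken from employees):

```r
library(dplyr)

# Hypothetical toy data with the same column names as employees
toy <- data.frame(
  office = c("Boston", "Boston", "Boston", "Detroit", "Detroit"),
  Gender = c("Female", "Male", "Male", "Female", "Male"),
  Salary = c(100000, 120000, 140000, 90000, 110000)
)

# One output row per office/Gender combination
byOfficeGender <- toy %>%
  group_by(office, Gender) %>%
  summarise(meanSalary = mean(Salary, na.rm = TRUE),
            nObs = n(),
            .groups = "drop")   # return a plain, ungrouped result
```

Dropping the grouping is a design choice rather than a requirement: it simply ensures that any later steps in the pipe operate on the whole summary table instead of office by office.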