2.2. Summarising Data

Now that we know how to use the pipe, we can use it to summarise data quickly and efficiently. We first need to introduce the summarise() function from the tidyverse (it is provided by the dplyr package), which collapses one or more columns of a data frame into summary statistics. This function uses the following syntax:

Syntax

dplyr::summarise(df, summaryStat1 = ..., summaryStat2 = ..., ...)

Required arguments
- df: The data frame with the data.
- summaryStat1 = ...: The summary statistic we would like to calculate.
Optional arguments
- summaryStat2 = ..., ...: Any additional summary statistics we would like to calculate.

For example, we can use summarise() to calculate all of the following at once from employees:

- The average of Salary
- The standard deviation of Salary
- The minimum Age
- The maximum Age

summarise(employees,
          meanSalary = mean(Salary, na.rm = TRUE),
          sdSalary = sd(Salary, na.rm = TRUE),
          minAge = min(Age),
          maxAge = max(Age))

meanSalary	sdSalary	minAge	maxAge
158034.3	39677.02	25	65
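The calls to mean() and sd() above pass na.rm = TRUE, which tells R to ignore missing values. As a minimal sketch of why this matters, the example below uses a small made-up data frame (toy and its values are hypothetical and are not taken from employees) in which one Salary is missing:

```r
library(dplyr)

# Hypothetical toy data; one Salary value is missing on purpose.
toy <- data.frame(Salary = c(100, 200, NA, 300),
                  Age    = c(30, 40, 50, 60))

# Without na.rm = TRUE, a single NA makes the whole summary NA.
withNA <- summarise(toy, meanSalary = mean(Salary))

# With na.rm = TRUE, missing values are dropped before averaging.
withoutNA <- summarise(toy, meanSalary = mean(Salary, na.rm = TRUE))

withNA$meanSalary     # NA
withoutNA$meanSalary  # 200
```

The same argument is accepted by min() and max(), so it can be added there too if a column such as Age might contain missing values.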

It is often useful to include the helper function n() within summarise(), which counts the number of observations (rows) in the data set. This is similar to the nrow() function that we saw in the bootcamp, but n() only works inside dplyr functions such as summarise().

summarise(employees,
          meanSalary = mean(Salary, na.rm = TRUE),
          sdSalary = sd(Salary, na.rm = TRUE),
          minAge = min(Age),
          maxAge = max(Age),
          nObs = n())

meanSalary	sdSalary	minAge	maxAge	nObs
158034.3	39677.02	25	65	908
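To make the relationship between n() and nrow() concrete, here is a small sketch on made-up data (toy, nSalary, and the values are hypothetical). It also illustrates that n() counts every row, including rows where a column is missing, so a separate count of non-missing values can be useful alongside it:

```r
library(dplyr)

# Hypothetical toy data with five rows; one Salary is missing.
toy <- data.frame(Age    = c(25, 31, 47, 52, 65),
                  Salary = c(100, NA, 300, 400, 500))

counts <- toy %>%
  summarise(nObs    = n(),                   # all rows
            nSalary = sum(!is.na(Salary)))   # rows with a Salary recorded

counts$nObs == nrow(toy)  # TRUE: n() agrees with nrow()
counts$nSalary            # 4

# Calling n() on its own, outside a dplyr function, raises an error.
nOutside <- tryCatch(n(), error = function(e) "error")
```

This sketch also writes the call with the pipe; toy %>% summarise(...) is equivalent to summarise(toy, ...).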

The summarise() function is useful for calculating summary statistics, but it becomes even more powerful when we combine it with group_by().

Syntax

dplyr::group_by(df, groupVar1, groupVar2, ...)

Imagine that we wanted to calculate summary statistics separately for each of the three offices (New York, Boston, and Detroit), rather than across the entire data set. To accomplish this, we can use the pipe to pass the data through group_by() first, then pass it through summarise(). Any variable(s) we specify in group_by() will be used to split the data into distinct groups, and summarise() will be applied to each of those groups separately. For example:

employees %>%
  group_by(office) %>%
  summarise(meanSalary = mean(Salary, na.rm = TRUE),
            sdSalary = sd(Salary, na.rm = TRUE),
            minAge = min(Age),
            maxAge = max(Age),
            nObs = n())

office	meanSalary	sdSalary	minAge	maxAge	nObs
Boston	157957.9	37388.57	25	65	294
Detroit	137587.2	38510.39	25	65	166
New York	165628.4	38978.08	25	65	448
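The grouping behaviour is easy to verify by hand on a data set small enough to check mentally. The sketch below uses made-up data (toy and its values are hypothetical) with two offices:

```r
library(dplyr)

# Hypothetical toy data: two Boston rows and three Detroit rows.
toy <- data.frame(office = c("Boston", "Boston", "Detroit", "Detroit", "Detroit"),
                  Salary = c(100, 200, 50, 60, 70))

byOffice <- toy %>%
  group_by(office) %>%
  summarise(meanSalary = mean(Salary),
            nObs = n())

# One row per office instead of one row for the whole data set.
byOffice$office      # "Boston" "Detroit"
byOffice$meanSalary  # 150 60
byOffice$nObs        # 2 3
```

Within a grouped summarise(), n() counts the rows in each group, which is why nObs differs between offices.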

From the output we can see that this calculates the summary statistics within each value of the office variable. We can also include more than one variable within group_by(). For example, imagine we wanted to calculate these summary statistics by gender within each office. All we would need to do is add Gender to group_by():

employees %>%
  group_by(office, Gender) %>%
  summarise(meanSalary = mean(Salary, na.rm = TRUE),
            sdSalary = sd(Salary, na.rm = TRUE),
            minAge = min(Age),
            maxAge = max(Age),
            nObs = n())

office	Gender	meanSalary	sdSalary	minAge	maxAge	nObs
Boston	Female	152778.1	34104.53	25	65	114
Boston	Male	161317.0	39106.56	25	65	180
Detroit	Female	133720.1	35552.39	25	65	69
Detroit	Male	140251.2	40401.41	25	64	97
New York	Female	160560.3	39787.98	25	65	220
New York	Male	170647.5	37584.82	25	65	228
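One detail worth knowing when grouping by more than one variable: in recent versions of dplyr (1.0.0 and later, as far as we are aware), summarise() removes only the last grouping variable by default, so the result above would still be grouped by office, and R may print a message saying so. The sketch below, on made-up data (toy and its values are hypothetical), shows the optional .groups argument of summarise(), which can be used to return an ungrouped result instead:

```r
library(dplyr)

# Hypothetical toy data: one row per office/gender combination.
toy <- data.frame(office = c("Boston", "Boston", "Detroit", "Detroit"),
                  Gender = c("Female", "Male", "Female", "Male"),
                  Salary = c(100, 200, 50, 70))

# Default: Gender is dropped from the grouping, office is kept.
stillGrouped <- toy %>%
  group_by(office, Gender) %>%
  summarise(meanSalary = mean(Salary))

# .groups = "drop" returns a result with no grouping at all.
ungrouped <- toy %>%
  group_by(office, Gender) %>%
  summarise(meanSalary = mean(Salary), .groups = "drop")

group_vars(stillGrouped)  # "office"
group_vars(ungrouped)     # character(0)
```

The summary values are identical either way; the difference only matters if the result is piped into further dplyr functions, which would otherwise continue to operate office by office.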

Data Science for Managers
