1.2. Summary Statistics

The concept of summary statistics should be familiar to most readers. These are the summary measures that all organizations use to track key business outcomes. While visualizations can indicate typical values, outliers, variations, trends, and associations in a data set, summary statistics provide unambiguous, numerical measures.

1.2.1. Quantitative Variables

As we explore a data set, it is often useful to calculate summary statistics that provide a preliminary overview of the variables we are working with. There are two main types of quantitative variables: continuous and discrete. A continuous variable is one that can take on any value within a certain numerical range. For example, in the employees data set, Salary is continuous because it can take on any value above $0. A discrete variable can take on only a set of distinct, separate values (e.g., 2, 4, 6, 8). The most basic summary statistics for quantitative variables are measures of central tendency, such as:

The mode - the most frequent value in the data.
The mean - the average value, defined as the sum of all observations divided by the total number of observations. In mathematical notation, we would write this as:

\[\bar{x} = \frac{1}{n} \sum^{n}_{i=1}{x_i}\]

Conventionally, we have a data set with $n$ observations, and $x_i$ represents the value of the $i^{th}$ observation in that data set. In the formula above, the expression $\sum^{n}_{i=1}{x_i}$ means “the sum of the $n$ values of $x$ in the data set.” To get the mean ($\bar{x}$), we divide that total by $n$.
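To make the formula concrete, the following minimal sketch applies it term by term. The salaries vector is hypothetical and is used only for illustration; it is not taken from the employees data set.

```r
# Hypothetical salaries, for illustration only (not from the employees data set)
salaries <- c(50000, 62000, 58000, 71000)

n <- length(salaries)        # number of observations, n
total <- sum(salaries)       # the sum of the n values of x
manual_mean <- total / n     # divide the total by n to get the mean

manual_mean
```

The result should agree with what R's built-in mean function returns for the same vector.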

As we’ve seen before, we can calculate the mean in R with the mean() function.

mean(employees$Salary, na.rm = TRUE)

[1] 156486
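Base R does not provide a function for the statistical mode described above (the built-in mode() function reports how an object is stored, not its most frequent value). One possible approach, sketched here with a hypothetical vector, is to count each value with table() and keep the value or values with the highest count.

```r
# Hypothetical values, for illustration only
values <- c(2, 4, 4, 6, 4, 2, 8)

# Count how often each distinct value occurs
counts <- table(values)

# Keep the value(s) with the highest count; names() returns character
# strings, so convert back to numeric
mode_value <- as.numeric(names(counts)[as.vector(counts) == max(counts)])

mode_value
```

If several values tie for the highest count, this sketch returns all of them.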

The median & quartiles - when the data are sorted:
- The median is the middle value (i.e., the 50th percentile, also called the second quartile),
- The first quartile is the 25th percentile, and
- The third quartile is the 75th percentile.

We can calculate the median in R with the median() function.

Syntax

median(x, na.rm = FALSE)

Required arguments
- x: The atomic vector whose values one would like to find the median of.
Optional arguments
- na.rm: If TRUE, the function will remove any missing values (NAs) in the atomic vector and find the median of the non-missing values. If FALSE, the function does not remove NAs and will return a value of NA if there is an NA in the atomic vector.

median(employees$Salary, na.rm = TRUE)

[1] 156289.5
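The short sketch below, which uses small hypothetical vectors, illustrates how median() behaves with an odd number of values, an even number of values, and missing values.

```r
# Hypothetical vectors, for illustration only
odd_values   <- c(7, 1, 5)       # sorted: 1, 5, 7
even_values  <- c(7, 1, 5, 3)    # sorted: 1, 3, 5, 7
with_missing <- c(7, 1, NA, 5)

median(odd_values)                    # the single middle value
median(even_values)                   # the average of the two middle values
median(with_missing)                  # NA, because na.rm defaults to FALSE
median(with_missing, na.rm = TRUE)    # median of the non-missing values
```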

We can calculate the first and third quartiles with quantile().

Syntax

quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = FALSE)

Required arguments
- x: The atomic vector of values where we would like to find the quantiles.
Optional arguments
- probs: An atomic vector with the percentiles (between 0 and 1) we would like to calculate.
- na.rm: If TRUE, the function will remove any missing values (NAs) in the atomic vector and find the percentiles of the non-missing values. If FALSE, the function does not remove NAs and will return a value of NA if there is an NA in the atomic vector.

quantile(employees$Salary, na.rm = TRUE)

      0%      25%      50%      75%     100% 
 29825.0 129693.8 156289.5 184742.2 266235.0 
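The probs argument can request only the percentiles of interest. The sketch below uses a hypothetical vector. Note that by default quantile() interpolates between neighboring observations (its type argument defaults to 7), so a returned percentile need not be a value that appears in the data.

```r
# Hypothetical values, for illustration only
values <- c(10, 20, 30, 40, 50)

# First and third quartiles only
quartiles <- quantile(values, probs = c(0.25, 0.75))
quartiles

# 10th and 90th percentiles; these are interpolated between observations
tails <- quantile(values, probs = c(0.1, 0.9))
tails
```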

In addition to measures of central tendency, we often also want to measure the dispersion, or spread, of a data set. We can do that with the following measures:

The interquartile range (IQR) - the third quartile (i.e., the 75th percentile) minus the first quartile (i.e., the 25th percentile).
The standard deviation & variance - variance and standard deviation are both measures of the spread of a data set. The minimum value of both measures is zero (which indicates no variation in the data), and the higher the values, the more spread out the data are. The variance is expressed in squared units, while the standard deviation is expressed in the original units of the data.
- Formally, the variance of a data set is written as:
\[s^2 = \frac{1}{n - 1}\sum^{n}_{i=1}{(x_i - \bar{x})^2}\]
- Although variance is an important concept in statistics, it does not provide a very intuitive understanding of the spread of a data set, because it is in squared units. Instead, we more commonly look at the standard deviation, which is the square root of the variance. This can be thought of as roughly the average distance observations in the data set fall from the mean. Formally, the standard deviation of a data set is written as:
\[s = \sqrt{\frac{1}{n - 1}\sum^{n}_{i=1}{(x_i - \bar{x})^2}}\]
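The interquartile range listed above can be computed either from the quartiles returned by quantile() or with the base R IQR() function. The sketch below uses a hypothetical vector to show that the two approaches agree.

```r
# Hypothetical values, for illustration only
values <- c(10, 20, 30, 40, 50)

# Third quartile minus first quartile
q <- quantile(values, probs = c(0.25, 0.75))
iqr_manual <- unname(q[2] - q[1])
iqr_manual

# The same quantity from the built-in function
iqr_builtin <- IQR(values)
iqr_builtin
```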

We can calculate standard deviation and variance with the sd() and var() functions, respectively.

Syntax

sd(x, na.rm = FALSE) & var(x, na.rm = FALSE)

Required arguments
- x: The atomic vector whose values one would like to find the standard deviation/variance of.
Optional arguments
- na.rm: If TRUE, the function will remove any missing values (NAs) in the atomic vector and find the standard deviation/variance of the non-missing values. If FALSE, the function does not remove NAs and will return a value of NA if there is an NA in the atomic vector.

sd(employees$Salary, na.rm = TRUE)
var(employees$Salary, na.rm = TRUE)

[1] 39479.84

[1] 1558657479
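As a check on the formulas for the variance and standard deviation, the sketch below computes both by hand for a small hypothetical vector and compares them with var() and sd().

```r
# Hypothetical values, for illustration only
x <- c(2, 4, 4, 6)
n <- length(x)

# Sum of squared deviations from the mean, divided by n - 1
manual_var <- sum((x - mean(x))^2) / (n - 1)

# The standard deviation is the square root of the variance
manual_sd <- sqrt(manual_var)

manual_var
var(x)
manual_sd
sd(x)
```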

1.2.2. Categorical Variables

Categorical variables can be summarized by calculating the count or proportion of observations that take on each value of the variable. For example, Degree in employees can only take on the values High School, Associate's, Bachelor's, Master's, and Ph.D. Categorical variables cannot be summarized by the mean, median, or standard deviation. Instead, these variables are often summarized using tables and bar plots. For categorical variables, the table() and prop.table() commands show the number and the proportion of observations in each category, respectively. Note that to use prop.table(), we need to apply table() first.

Syntax

table(x) & prop.table(table(x))

Required arguments
- x: The atomic vector of values.

table(employees$Division)

     Accounting       Corporate     Engineering Human Resources      Operations 
             63             103             236              97             287 
          Sales 
            214 

prop.table(table(employees$Division))

     Accounting       Corporate     Engineering Human Resources      Operations 
          0.063           0.103           0.236           0.097           0.287 
          Sales 
          0.214 
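Because prop.table() returns proportions between 0 and 1, a common follow-up step is to multiply by 100 and round to report percentages. The sketch below uses a short hypothetical vector rather than the employees data.

```r
# Hypothetical division labels, for illustration only
division <- c("Sales", "Sales", "Accounting", "Operations", "Sales")

counts <- table(division)       # number of observations per category
props  <- prop.table(counts)    # proportion of observations per category

counts
props

# Convert proportions to percentages, rounded to one decimal place
round(100 * props, 1)
```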

Two categorical variables can be summarized in a two-way table using the same table() and prop.table() commands shown above. For example:

table(employees$Division, employees$Degree)

                 
                  Associate's Bachelor's High School Master's Ph.D
  Accounting                0         31           0       32    0
  Corporate                 0         20           0       40   43
  Engineering               0         36           0       43  157
  Human Resources          35         30           0       32    0
  Operations              110         16         146       15    0
  Sales                    55         67          54       38    0

The prop.table() command has an optional second argument, margin, that calculates the proportion of observations by row (margin = 1) or column (margin = 2). Note that the term margin refers to the “margins” (i.e., the outer edges) of the table, where the sums of the rows and columns are often written. In the code chunk below we do not specify the margin argument in prop.table(), so each cell represents the proportion of all observations in the data set. For example, 5.4% of all employees work in Sales and have a high school diploma.

prop.table(table(employees$Division, employees$Degree))

                 
                  Associate's Bachelor's High School Master's  Ph.D
  Accounting            0.000      0.031       0.000    0.032 0.000
  Corporate             0.000      0.020       0.000    0.040 0.043
  Engineering           0.000      0.036       0.000    0.043 0.157
  Human Resources       0.035      0.030       0.000    0.032 0.000
  Operations            0.110      0.016       0.146    0.015 0.000
  Sales                 0.055      0.067       0.054    0.038 0.000
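To see the row and column totals that sit in the margins of a two-way table, one option is the base R addmargins() function. The sketch below uses two short hypothetical vectors in place of the employees columns.

```r
# Hypothetical data, for illustration only
division <- c("Sales", "Sales", "Accounting", "Accounting", "Sales")
degree   <- c("Bachelor's", "Master's", "Bachelor's", "Master's", "Master's")

two_way <- table(division, degree)

# Append a "Sum" row and a "Sum" column holding the marginal totals
with_totals <- addmargins(two_way)
with_totals
```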

If we set margin equal to 1, each cell represents the proportion of observations by row. For example, of all employees in Accounting, 49.2% have a Bachelor’s.

prop.table(table(employees$Division, employees$Degree), margin = 1)

                 
                  Associate's Bachelor's High School   Master's       Ph.D
  Accounting       0.00000000 0.49206349  0.00000000 0.50793651 0.00000000
  Corporate        0.00000000 0.19417476  0.00000000 0.38834951 0.41747573
  Engineering      0.00000000 0.15254237  0.00000000 0.18220339 0.66525424
  Human Resources  0.36082474 0.30927835  0.00000000 0.32989691 0.00000000
  Operations       0.38327526 0.05574913  0.50871080 0.05226481 0.00000000
  Sales            0.25700935 0.31308411  0.25233645 0.17757009 0.00000000

If we set margin equal to 2, each cell represents the proportion of observations by column. For example, of all employees with an Associate’s, 55.0% work in Operations.

prop.table(table(employees$Division, employees$Degree), margin = 2)

                 
                  Associate's Bachelor's High School Master's  Ph.D
  Accounting            0.000      0.155       0.000    0.160 0.000
  Corporate             0.000      0.100       0.000    0.200 0.215
  Engineering           0.000      0.180       0.000    0.215 0.785
  Human Resources       0.175      0.150       0.000    0.160 0.000
  Operations            0.550      0.080       0.730    0.075 0.000
  Sales                 0.275      0.335       0.270    0.190 0.000
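A quick way to confirm which margin was used is to check which totals equal 1: with margin = 1 each row sums to 1, and with margin = 2 each column sums to 1. The sketch below illustrates this with hypothetical data.

```r
# Hypothetical data, for illustration only
division <- c(rep("Accounting", 3), rep("Sales", 4))
degree   <- c("Bachelor's", "Master's", "Master's",
              "Bachelor's", "Bachelor's", "Bachelor's", "Master's")

two_way <- table(division, degree)

by_row <- prop.table(two_way, margin = 1)   # proportions within each division
by_col <- prop.table(two_way, margin = 2)   # proportions within each degree

rowSums(by_row)   # every row total is 1
colSums(by_col)   # every column total is 1

# The same cell answers two different questions
by_row["Accounting", "Bachelor's"]   # share of Accounting with a Bachelor's
by_col["Accounting", "Bachelor's"]   # share of Bachelor's holders in Accounting
```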

Data Science for Managers
