1.2. Summary Statistics
The concept of summary statistics should be familiar to most readers. These are the summary measures that all organizations use to track key business outcomes. While visualizations can indicate typical values, outliers, variations, trends, and associations in a data set, summary statistics provide unambiguous, numerical measures.
1.2.1. Quantitative Variables
As we explore a data set, it is often useful to calculate summary statistics that provide a preliminary overview of the variables we are working with. There are two main types of quantitative variables, continuous and discrete. A continuous variable is one that can take on any value within a certain numerical range. For example, in the employees data set, Salary is continuous because it can take on any value above $0. A discrete variable can take on only a discrete set of values (e.g., 2, 4, 6, 8).
The most basic summary statistics for quantitative variables are measures of central tendency, such as:
The mode - the most frequent value in the data.
The mean - the average value, defined as the sum of all observations divided by the total number of observations. In mathematical notation, we would write this as:
\[\bar{x} = \frac{1}{n}\sum^{n}_{i=1}{x_i}\]
Conventionally, we have a data set with \(n\) observations, and \(x_i\) represents the value of the \(i^{th}\) observation in that data set. In the formula above, the expression \(\sum^{n}_{i=1}{x_i}\) means “the sum of the \(n\) values of \(x\) in the data set.” To get the mean (\(\bar{x}\)), we divide that total by \(n\).
As we’ve seen before, we can calculate the mean in R with the mean() function.
mean(employees$Salary, na.rm = TRUE)
[1] 156486
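To connect the definition of the mean to the mean() function, the following minimal sketch computes the mean by hand and compares it with the built-in result. The vector x is a small hypothetical example, not data from the employees data set.

```r
# Hypothetical toy data; not taken from the employees data set
x <- c(52000, 61000, 58000, 75000, 64000)

n <- length(x)              # number of observations
mean_by_hand <- sum(x) / n  # sum of all observations divided by n

mean_by_hand
mean(x)  # the built-in function should give the same value
```

Both expressions should return the same number, because mean() implements the same sum-divided-by-count definition.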
The median & quartiles - when the data are sorted:
The median is the middle value (i.e., the 50th percentile, also called the second quartile),
The first quartile is the 25th percentile, and
The third quartile is the 75th percentile.
We can calculate the median in R with the median() function.
Syntax
median(x, na.rm = FALSE)
Required arguments
x: The atomic vector whose values one would like to find the median of.
Optional arguments
na.rm: If TRUE, the function will remove any missing values (NAs) in the atomic vector and find the median of the non-missing values. If FALSE, the function does not remove NAs and will return a value of NA if there is an NA in the atomic vector.
median(employees$Salary, na.rm = TRUE)
[1] 156289.5
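The effect of na.rm is easiest to see on a vector that actually contains a missing value. The sketch below uses a small hypothetical vector (an assumption for illustration, not part of the employees data set).

```r
# Hypothetical toy vector with one missing value
x <- c(10, 20, NA, 40, 50)

# na.rm defaults to FALSE, so the missing value propagates to the result
median_with_na <- median(x)

# With na.rm = TRUE the NA is dropped and the median of 10, 20, 40, 50 is used
median_without_na <- median(x, na.rm = TRUE)

median_with_na
median_without_na
```

Because four values remain after the NA is removed, the median is the average of the two middle values (20 and 40).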
We can calculate the first and third quartiles with quantile().
Syntax
quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = FALSE)
Required arguments
x: The atomic vector of values where we would like to find the quantiles.
Optional arguments
probs: An atomic vector with the percentiles (between 0 and 1) we would like to calculate.
na.rm: If TRUE, the function will remove any missing values (NAs) in the atomic vector and find the percentiles of the non-missing values. If FALSE, the function does not remove NAs and will return a value of NA if there is an NA in the atomic vector.
quantile(employees$Salary, na.rm = TRUE)
      0%      25%      50%      75%     100%
 29825.0 129693.8 156289.5 184742.2 266235.0
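Of the measures of central tendency listed in this section, the mode is the only one without a dedicated base R function; the built-in mode() reports an object's storage mode, not its most frequent value. One possible approach is to count the values with table() and keep the most frequent one(s). The helper name stat_mode is hypothetical, and the input vector is made up for illustration.

```r
# Hypothetical helper; stat_mode is not a base R function
stat_mode <- function(x) {
  counts <- table(x)  # frequency of each distinct value (NAs are dropped by default)
  # Keep the value(s) with the highest count; table() stores values as character names
  names(counts)[as.vector(counts == max(counts))]
}

stat_mode(c(2, 4, 4, 6, 8, 4, 2))
```

Note that this sketch returns the result as a character vector and returns every tied value when more than one value shares the highest count.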
In addition to measures of central tendency, we often also want to measure the dispersion, or spread, of a data set. We can do that with the following measures:
The interquartile range (IQR) - the difference between the first quartile (i.e., the 25th percentile) and the third quartile (i.e., the 75th percentile).
The standard deviation & variance - variance and standard deviation are both measures of the spread of a data set. The minimum value of both measures is zero (which indicates no variation in the data), and the higher the values the more spread out the data are. The variance is calculated in squared units, while the standard deviation is recorded in the base units.
Formally, the variance of a data set is written as:
\[s^2 = \frac{1}{n - 1}\sum^{n}_{i=1}{(x_i - \bar{x})^2}\]
Although variance is an important concept in statistics, it does not provide a very intuitive understanding of the spread of a data set, because it is in squared units. Instead, we more commonly look at the standard deviation, which is the square root of the variance. This can be thought of as roughly the average distance observations in the data set fall from the mean. Formally, the standard deviation of a data set is written as:
\[s = \sqrt{\frac{1}{n - 1}\sum^{n}_{i=1}{(x_i - \bar{x})^2}}\]
We can calculate standard deviation and variance with the sd() and var() functions, respectively.
Syntax
sd(x, na.rm = FALSE) & var(x, na.rm = FALSE)
Required arguments
x: The atomic vector whose values one would like to find the standard deviation/variance of.
Optional arguments
na.rm: If TRUE, the function will remove any missing values (NAs) in the atomic vector and find the standard deviation/variance of the non-missing values. If FALSE, the function does not remove NAs and will return a value of NA if there is an NA in the atomic vector.
sd(employees$Salary, na.rm = TRUE)
var(employees$Salary, na.rm = TRUE)
[1] 39479.84
[1] 1558657479
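The dispersion measures above can also be computed step by step from their definitions. The sketch below does this for a small hypothetical vector (not from the employees data set) and compares the results with var(), sd(), and base R's IQR() function.

```r
# Hypothetical toy data
x <- c(4, 8, 15, 16, 23, 42)

n <- length(x)
var_by_hand <- sum((x - mean(x))^2) / (n - 1)  # sample variance, n - 1 in the denominator
sd_by_hand <- sqrt(var_by_hand)                # standard deviation is the square root

# Interquartile range: third quartile minus first quartile
q <- quantile(x, probs = c(0.25, 0.75))
iqr_by_hand <- unname(q[2] - q[1])

# Each comparison should report TRUE
all.equal(var_by_hand, var(x))
all.equal(sd_by_hand, sd(x))
all.equal(iqr_by_hand, IQR(x))
```

The comparison uses all.equal() rather than == because floating-point results computed in different ways can differ by tiny rounding errors.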
1.2.2. Categorical Variables
Categorical variables can be summarized by calculating the count or proportion of observations that take on each value of the variable. For example, Degree in employees can only take on the values High School, Associate's, Bachelor's, Master's, and Ph.D. Categorical variables cannot be summarized by the mean, median, or standard deviation. Instead, these variables are often summarized using tables and bar plots. For categorical variables, the table() and prop.table() commands show the number and percentage (proportion) of observations in each category, respectively. Note that to use prop.table(), we need to apply table() first.
Syntax
table(x) & prop.table(table(x))
Required arguments
x: The atomic vector of values.
table(employees$Division)
     Accounting       Corporate     Engineering Human Resources      Operations
             63             103             236              97             287
          Sales
            214
prop.table(table(employees$Division))
     Accounting       Corporate     Engineering Human Resources      Operations
          0.063           0.103           0.236           0.097           0.287
          Sales
          0.214
Two categorical variables can be summarized in a two-way table using the same table() and prop.table() commands shown above. For example:
table(employees$Division, employees$Degree)
                Associate's Bachelor's High School Master's Ph.D
Accounting                0         31           0       32    0
Corporate                 0         20           0       40   43
Engineering               0         36           0       43  157
Human Resources          35         30           0       32    0
Operations              110         16         146       15    0
Sales                    55         67          54       38    0
The prop.table() command has an optional second argument, margin, that calculates the proportion of observations by row (margin = 1) or column (margin = 2). Note that the term margin refers to the “margins” (i.e., the outer edges) of the table, where the sums of the rows and columns are often written. In the code chunk below we do not specify the margin parameter in prop.table(), so each cell represents the proportion of all observations in the data set. For example, 5.4% of all employees work in Sales and have a high school diploma.
prop.table(table(employees$Division, employees$Degree))
                Associate's Bachelor's High School Master's  Ph.D
Accounting            0.000      0.031       0.000    0.032 0.000
Corporate             0.000      0.020       0.000    0.040 0.043
Engineering           0.000      0.036       0.000    0.043 0.157
Human Resources       0.035      0.030       0.000    0.032 0.000
Operations            0.110      0.016       0.146    0.015 0.000
Sales                 0.055      0.067       0.054    0.038 0.000
If we set margin equal to 1, each cell represents the proportion of observations by row. For example, of all employees in Accounting, 49.2% have a Bachelor’s.
prop.table(table(employees$Division, employees$Degree), margin = 1)
                Associate's Bachelor's High School   Master's       Ph.D
Accounting       0.00000000 0.49206349  0.00000000 0.50793651 0.00000000
Corporate        0.00000000 0.19417476  0.00000000 0.38834951 0.41747573
Engineering      0.00000000 0.15254237  0.00000000 0.18220339 0.66525424
Human Resources  0.36082474 0.30927835  0.00000000 0.32989691 0.00000000
Operations       0.38327526 0.05574913  0.50871080 0.05226481 0.00000000
Sales            0.25700935 0.31308411  0.25233645 0.17757009 0.00000000
If we set margin equal to 2, each cell represents the proportion of observations by column. For example, of all employees with an Associate’s, 55.0% work in Operations.
prop.table(table(employees$Division, employees$Degree), margin = 2)
                Associate's Bachelor's High School Master's  Ph.D
Accounting            0.000      0.155       0.000    0.160 0.000
Corporate             0.000      0.100       0.000    0.200 0.215
Engineering           0.000      0.180       0.000    0.215 0.785
Human Resources       0.175      0.150       0.000    0.160 0.000
Operations            0.550      0.080       0.730    0.075 0.000
Sales                 0.275      0.335       0.270    0.190 0.000
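A quick way to check which margin setting is in effect is to confirm what the proportions sum to. The following sketch uses two short hypothetical vectors (stand-ins for the Division and Degree columns, not the actual employees data) and verifies that the whole table, each row, or each column sums to 1 depending on margin.

```r
# Hypothetical toy data standing in for two categorical columns
division <- c("Sales", "Sales", "Sales", "Operations", "Operations", "Accounting")
degree   <- c("Bachelor's", "Master's", "Bachelor's", "Bachelor's", "High School", "Master's")

counts <- table(division, degree)

overall <- prop.table(counts)              # proportions of all observations
by_row  <- prop.table(counts, margin = 1)  # proportions within each row
by_col  <- prop.table(counts, margin = 2)  # proportions within each column

sum(overall)     # the whole table sums to 1
rowSums(by_row)  # every row sums to 1
colSums(by_col)  # every column sums to 1
```

In this toy example, two of the three Sales observations have a Bachelor's, so the Sales/Bachelor's cell of by_row is 2/3.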