1.4. Outliers
Although outliers are often visually obvious to those familiar with the data, there is no strict mathematical definition. We define an outlier as any value that lies an extreme distance from the other values in a data set, judged in the context of the problem. This implies that deciding what is and is not an outlier is somewhat subjective and requires domain understanding. Detecting and dealing with outliers is an important step in exploratory data analysis, as extreme values can have an outsized impact on summary statistics and statistical models.
One common rule of thumb for detecting outliers in quantitative variables is known as the 1.5 IQR rule. If you review the definition of a boxplot, you will see that points that fall beyond the boxplot’s “whiskers” are considered outliers. As previously stated, the whiskers are defined as Q1 - 1.5(IQR) and Q3 + 1.5(IQR), where IQR = Q3 - Q1 is the interquartile range. Note that the IQR rule only applies to quantitative variables, and there is no clear concept of outliers for categorical data.
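As a minimal sketch of the 1.5 IQR rule, the Python snippet below computes the two limits and flags any value outside them. The function names `iqr_fences` and `flag_outliers` and the sample data are hypothetical, chosen only for illustration. Quartiles can be computed in several ways; this sketch assumes the "inclusive" method from Python's standard `statistics` module, so other software may report slightly different limits.

```python
from statistics import quantiles


def iqr_fences(values):
    """Return the (lower, upper) limits Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    # Assumption: quartiles use the "inclusive" interpolation method;
    # other quartile conventions can shift the limits slightly.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def flag_outliers(values):
    """Return the values that fall outside the 1.5 IQR limits."""
    lower, upper = iqr_fences(values)
    return [v for v in values if v < lower or v > upper]


# Hypothetical sample: nine typical values and one extreme value.
data = [12, 14, 15, 15, 16, 17, 18, 19, 21, 48]

print(iqr_fences(data))     # limits for this sample
print(flag_outliers(data))  # values flagged by the rule
```

A value flagged this way is only a candidate outlier. Whether it is treated as one still depends on the context of the problem.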
Once an outlier has been identified, how should it be addressed? Removing any observation from a data set, particularly an outlier, can substantially change the results of an analysis. Therefore, it is important to think carefully about what data are removed, and to document carefully how the data are processed. The most straightforward outliers to deal with are data entry errors, which occur when a data point is recorded incorrectly. When the correct value can be recovered, these outliers can simply be corrected. For outliers that are not the result of a data entry error, there are several options:
Use statistical techniques that are robust to outliers;
Exclude the outlier if it can be shown that it is not part of the population of interest;
Perform the analysis with and without the outlier(s) and evaluate whether the conclusions are consistent.
The third approach is particularly useful if there are significant concerns about removing observations.
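The first and third approaches can be illustrated together with a short Python sketch. It summarizes a small hypothetical sample with and without a suspected outlier, using the mean (sensitive to extreme values) and the median (a robust alternative). The data, the `suspected` list, and the `summarize` helper are assumptions made for illustration only.

```python
from statistics import mean, median

# Hypothetical sample; 48 is the value suspected of being an outlier.
data = [12, 14, 15, 15, 16, 17, 18, 19, 21, 48]
suspected = [48]

# Keep every observation except the suspected outlier(s).
trimmed = [v for v in data if v not in suspected]


def summarize(values):
    """Return the mean and the median of the values."""
    return {"mean": mean(values), "median": median(values)}


with_outlier = summarize(data)
without_outlier = summarize(trimmed)

for name in ("mean", "median"):
    change = with_outlier[name] - without_outlier[name]
    print(f"{name}: with={with_outlier[name]:.2f}, "
          f"without={without_outlier[name]:.2f}, change={change:.2f}")
```

In this sample the mean moves by roughly three units when the extreme value is dropped, while the median moves by half a unit. If a conclusion holds under both versions of the data, the outlier is unlikely to be driving it.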