1. Exploring Data
Motivation
Familiarize yourself with the data and establish a common set of facts among all stakeholders. Answer basic descriptive questions and identify irregularities in the data.
Methods
Summary Statistics - Unambiguous, numerical measures of the data.
Visualization - Visual representations of data that tell a clear and compelling story.
Message
Present data-driven insights to stakeholders as clearly as possible. Complement domain-area expertise with quantitative evidence. Focus on producing insights that are actionable.
One of the fundamental pillars of data science is understanding data by visualizing it and computing basic descriptive summary statistics (e.g., average, standard deviation, maximum, and minimum). This collection of techniques is typically referred to as exploratory data analysis (EDA). Often, visualizing data is enough to answer basic descriptive questions (such as, which types of customers are buying different products?), devise more complex hypotheses about various relationships (such as, which types of customers are more likely to buy different products?), and identify irregularities (such as mistakes in the data collection or outlier data).
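As a minimal sketch of the summary statistics named above, base R provides each measure directly. The vector below is hypothetical and exists only for illustration; the name monthly_sales is an assumption, not part of the chapter's data.

```r
# Hypothetical monthly sales figures, for illustration only
monthly_sales <- c(52000, 61000, 58500, 75000, 49000)

# Basic descriptive summary statistics
sales_stats <- c(
  average = mean(monthly_sales),
  std_dev = sd(monthly_sales),   # sample standard deviation (divides by n - 1)
  minimum = min(monthly_sales),
  maximum = max(monthly_sales)
)

print(round(sales_stats))
```

Collecting the measures in one named vector keeps them together for printing or reporting; summary(monthly_sales) is a quicker alternative when quartiles are also of interest.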
Why is this important?
The main purpose of EDA is to look at the data before making any assumptions. It can help identify obvious errors, better understand patterns within the data, detect outliers or anomalous events, and find interesting relationships among the variables. Data scientists can use exploratory analysis to check that the results they produce are valid and applicable to the desired business outcomes and goals. EDA also helps stakeholders confirm that they are asking the right questions. EDA can help answer questions about standard deviations, categorical variables, and confidence intervals. Once EDA is complete and insights are drawn, those insights can then inform more sophisticated data analysis or modeling, including machine learning.
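To make the idea of spotting errors and outliers concrete, here is a hedged sketch in base R. The values are hypothetical, and the 1.5 × IQR rule used to flag outliers is one common convention among several, not a method prescribed by this chapter.

```r
# Hypothetical salaries with one missing value and one implausible entry
# (an extra digit, as a data-entry slip might produce)
salaries <- c(152000, 182343, 206770, 183407, 236240, NA, 2362400)

# Count missing values before computing anything else
n_missing <- sum(is.na(salaries))

# Flag values more than 1.5 * IQR outside the first or third quartile
q <- quantile(salaries, probs = c(0.25, 0.75), na.rm = TRUE)
iqr <- q[[2]] - q[[1]]
is_outlier <- !is.na(salaries) &
  (salaries < q[[1]] - 1.5 * iqr | salaries > q[[2]] + 1.5 * iqr)

flagged <- salaries[is_outlier]
print(n_missing)
print(flagged)
```

A flagged value is a prompt for investigation rather than proof of an error: it may be a genuine extreme observation, so it is usually worth checking against the data source before removing it.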
Descriptive statistics of key business metrics are aggregations of data that should form the information backbone of every enterprise. Sales, revenue, and customer churn are all examples of business metrics. Creating meaningful visualizations and analyzing descriptive statistics are the first important steps in addressing business problems with data.
How will this help me as a manager?
An understanding of EDA will help you:
Develop a deeper understanding of key business metrics;
Examine assumptions and hypotheses more rigorously;
Convince stakeholders of new insights through compelling visualizations.
In this chapter, we will explore EDA using the employee data introduced in the R Bootcamp. This data contains information on 1,000 employees at a software company, and is stored in a data frame called employees:
ID | Name | Gender | Age | Rating | Degree | Start_Date | Retired | Division | Salary |
---|---|---|---|---|---|---|---|---|---|
6881 | al-Rahimi, Tayyiba | Female | 51 | 10 | High School | 1990-02-23 | FALSE | Operations | 108804 |
2671 | Lewis, Austin | Male | 34 | 4 | Ph.D | 2007-02-23 | FALSE | Engineering | 182343 |
8925 | el-Jaffer, Manaal | Female | 50 | 10 | Master's | 1991-02-23 | FALSE | Engineering | 206770 |
2769 | Soto, Michael | Male | 52 | 10 | High School | 1987-02-23 | FALSE | Sales | 183407 |
2658 | al-Ebrahimi, Mamoon | Male | 55 | 8 | Ph.D | 1985-02-23 | FALSE | Corporate | 236240 |
1933 | Medina, Brandy | Female | 62 | 7 | Associate's | 1979-02-23 | TRUE | Sales | NA |
These variables are defined as follows:
ID: A unique ID for each employee.
Name: The name of each employee.
Gender: The gender of each employee.
Age: The age of each employee at the time the data were collected.
Rating: Each employee’s rating from one to ten on their last performance evaluation.
Degree: The highest degree attained by the employee.
Start_Date: The date the employee started with the company.
Retired: Whether or not the employee is retired (TRUE/FALSE).
Division: The division the employee works in.
Salary: The employee’s most recent yearly salary.
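For readers who want to experiment without the full data, the sketch below rebuilds only the six sample rows shown in the table. The name employees_sample and the column types (for example, storing Start_Date as a Date and Degree as plain text) are assumptions; the actual employees data frame may be typed differently.

```r
# Rebuild the six sample rows from the table; the full data has 1,000 rows
employees_sample <- data.frame(
  ID = c(6881, 2671, 8925, 2769, 2658, 1933),
  Name = c("al-Rahimi, Tayyiba", "Lewis, Austin", "el-Jaffer, Manaal",
           "Soto, Michael", "al-Ebrahimi, Mamoon", "Medina, Brandy"),
  Gender = c("Female", "Male", "Female", "Male", "Male", "Female"),
  Age = c(51, 34, 50, 52, 55, 62),
  Rating = c(10, 4, 10, 10, 8, 7),
  Degree = c("High School", "Ph.D", "Master's",
             "High School", "Ph.D", "Associate's"),
  Start_Date = as.Date(c("1990-02-23", "2007-02-23", "1991-02-23",
                         "1987-02-23", "1985-02-23", "1979-02-23")),
  Retired = c(FALSE, FALSE, FALSE, FALSE, FALSE, TRUE),
  Division = c("Operations", "Engineering", "Engineering",
               "Sales", "Corporate", "Sales"),
  Salary = c(108804, 182343, 206770, 183407, 236240, NA)
)

# Column types and a preview of each variable
str(employees_sample)

# Number of sampled employees in each division
division_counts <- table(employees_sample$Division)
print(division_counts)

# One sampled salary is NA, so missing values must be dropped explicitly;
# without na.rm = TRUE the mean itself would be NA
mean_salary <- mean(employees_sample$Salary, na.rm = TRUE)
print(mean_salary)
```

Running str() first is a cheap way to confirm that each variable was read with a sensible type before any statistics are computed on it.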