1. Exploring Data

Motivation

Familiarize yourself with the data and establish a common set of facts among all stakeholders. Answer basic descriptive questions and identify irregularities in the data.

Methods

Summary Statistics - Unambiguous, numerical measures of the data.
Visualization - Visual representations of data that tell a clear and compelling story.

Message

Present data-driven insights to stakeholders as clearly as possible. Complement domain-area expertise with quantitative evidence. Focus on producing insights that are actionable.

One of the fundamental pillars of data science is understanding the data by visualizing it and computing basic descriptive summary statistics (e.g., average, standard deviation, maximum, and minimum). This collection of techniques is typically referred to as exploratory data analysis (EDA). Often, visualizing data is enough to answer basic descriptive questions (such as, which types of customers are buying different products?), devise more complex hypotheses about various relationships (such as, which types of customers are more likely to buy different products?), and identify irregularities (such as mistakes in the data collection or outlier data).
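
To make the summary statistics named above concrete, here is a minimal sketch in Python using only the standard library. The values and the name `order_values` are hypothetical, invented purely for illustration. The last value is deliberately extreme, to show how a single outlier pulls the average well away from the median, which is the kind of irregularity EDA is meant to surface.

```python
import statistics

# Hypothetical order values in dollars (illustrative only, not from the text).
# The final value is an intentional outlier.
order_values = [20, 22, 19, 21, 23, 20, 250]

summary = {
    "mean": statistics.mean(order_values),
    "median": statistics.median(order_values),
    "std_dev": statistics.stdev(order_values),  # sample standard deviation
    "min": min(order_values),
    "max": max(order_values),
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Comparing the mean with the median is a quick, rough check: when the two differ sharply, as they do here, it is worth looking at the individual values before trusting the average.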

Why is this important?

The main purpose of EDA is to help look at data before making any assumptions. It can help identify obvious errors, as well as better understand patterns within the data, detect outliers or anomalous events, find interesting relations among the variables. Data scientists can use exploratory analysis to ensure the results they produce are valid and applicable to any desired business outcomes and goals. EDA also helps stakeholders by confirming they are asking the right questions. EDA can help answer questions about standard deviations, categorical variables, and confidence intervals. Once EDA is complete and insights are drawn, its features can then be used for more sophisticated data analysis or modeling, including machine learning.

IBM, Exploratory Data Analysis

Descriptive statistics of key business metrics, such as sales, revenue, and customer churn, are aggregations of data that should form the information backbone of every enterprise. Creating meaningful visualizations and analyzing descriptive statistics is the first important step in addressing business problems with data.

How will this help me as a manager?

An understanding of EDA will help you:

Develop a deeper understanding of key business metrics;
Examine assumptions and hypotheses more rigorously;
Convince stakeholders of new insights through compelling visualizations.

In this chapter, we will explore EDA using the employee data introduced in the R Bootcamp. This dataset contains information on 1,000 employees at a software company and is stored in a data frame called employees:

ID	Name	Gender	Age	Rating	Degree	Start_Date	Retired	Division	Salary
6881	al-Rahimi, Tayyiba	Female	51	10	High School	1990-02-23	FALSE	Operations	108804
2671	Lewis, Austin	Male	34	4	Ph.D	2007-02-23	FALSE	Engineering	182343
8925	el-Jaffer, Manaal	Female	50	10	Master's	1991-02-23	FALSE	Engineering	206770
2769	Soto, Michael	Male	52	10	High School	1987-02-23	FALSE	Sales	183407
2658	al-Ebrahimi, Mamoon	Male	55	8	Ph.D	1985-02-23	FALSE	Corporate	236240
1933	Medina, Brandy	Female	62	7	Associate's	1979-02-23	TRUE	Sales	NA

These variables are defined as follows:

ID: A unique ID for each employee.
Name: The name of each employee.
Gender: The gender of each employee.
Age: The age of each employee at the time the data were collected.
Rating: Each employee’s rating from one to ten on their last performance evaluation.
Degree: The highest degree attained by the employee.
Start_Date: The date the employee started with the company.
Retired: Whether or not the employee is retired (TRUE / FALSE).
Division: The division the employee works in.
Salary: The employee’s most recent yearly salary.
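
As a rough sketch of a first pass over this data, the Python snippet below (standard library only) works on just the six sample rows shown above, not the full 1,000-employee data set. It is a stand-in for the employees data frame, not the book's own code. The list-of-dictionaries layout, the use of `None` for the NA salary, and the names `missing_salary`, `salary_summary`, and `mean_by_division` are all assumptions made for illustration.

```python
import statistics
from collections import defaultdict

# The six sample rows from the table, reduced to the columns used here.
# None stands in for the NA salary (an assumption about how NA is encoded).
employees = [
    {"Name": "al-Rahimi, Tayyiba", "Division": "Operations", "Salary": 108804},
    {"Name": "Lewis, Austin", "Division": "Engineering", "Salary": 182343},
    {"Name": "el-Jaffer, Manaal", "Division": "Engineering", "Salary": 206770},
    {"Name": "Soto, Michael", "Division": "Sales", "Salary": 183407},
    {"Name": "al-Ebrahimi, Mamoon", "Division": "Corporate", "Salary": 236240},
    {"Name": "Medina, Brandy", "Division": "Sales", "Salary": None},
]

# Irregularity check: which employees have no recorded salary?
missing_salary = [e["Name"] for e in employees if e["Salary"] is None]

# Summary statistics over the recorded salaries only.
salaries = [e["Salary"] for e in employees if e["Salary"] is not None]
salary_summary = {
    "mean": statistics.mean(salaries),
    "std_dev": statistics.stdev(salaries),  # sample standard deviation
    "min": min(salaries),
    "max": max(salaries),
}

# Average salary per division, skipping missing values.
by_division = defaultdict(list)
for e in employees:
    if e["Salary"] is not None:
        by_division[e["Division"]].append(e["Salary"])
mean_by_division = {d: statistics.mean(v) for d, v in by_division.items()}

print("Missing salary:", missing_salary)
print("Salary summary:", salary_summary)
print("Mean salary by division:", mean_by_division)
```

Missing values are excluded before averaging rather than counted as zero, since counting them as zero would understate the average. With only six rows the results say nothing reliable about the company as a whole; the point is the sequence of steps: check for gaps, summarize, then compare groups.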

Data Science for Managers
