1. Exploring Data

Motivation

Familiarize yourself with the data and establish a common set of facts among all stakeholders. Answer basic descriptive questions and identify irregularities in the data.

Methods

  • Summary Statistics - Unambiguous, numerical measures of the data.

  • Visualization - Visual representations of data that tell a clear and compelling story.

Message

Present data-driven insights to stakeholders as clearly as possible. Complement domain-area expertise with quantitative evidence. Focus on producing insights that are actionable.

One of the fundamental pillars of data science is understanding the data by visualizing it and computing basic descriptive summary statistics (e.g., average, standard deviation, maximum, and minimum). This collection of techniques is typically referred to as exploratory data analysis (EDA). Often, visualizing data is enough to answer basic descriptive questions (such as which types of customers are buying different products?), devise more complex hypotheses about various relationships (such as which types of customers are more likely to buy different products?), and identify irregularities (such as mistakes in the data collection or outlier data).

Why is this important?

The main purpose of EDA is to help look at data before making any assumptions. It can help identify obvious errors, as well as better understand patterns within the data, detect outliers or anomalous events, find interesting relations among the variables. Data scientists can use exploratory analysis to ensure the results they produce are valid and applicable to any desired business outcomes and goals. EDA also helps stakeholders by confirming they are asking the right questions. EDA can help answer questions about standard deviations, categorical variables, and confidence intervals. Once EDA is complete and insights are drawn, its features can then be used for more sophisticated data analysis or modeling, including machine learning.

IBM, Exploratory Data Analysis

Descriptive statistics of key business metrics are aggregations of data that should form the information backbone of every enterprise. Sales, revenue, and customer churn are all examples of business metrics. Creating meaningful visualizations and analyzing descriptive statistics are the first important steps in addressing business problems with data.

How will this help me as a manager?

An understanding of EDA will help you:

  • Develop a deeper understanding of key business metrics;

  • Examine assumptions and hypotheses more rigorously;

  • Convince stakeholders of new insights through compelling visualizations.

In this chapter, we will explore EDA using the employee data introduced in the R Bootcamp. This data set contains information on 1,000 employees at a software company and is stored in a data frame called employees. The first six rows are shown below:

ID    Name                 Gender  Age  Rating  Degree       Start_Date  Retired  Division     Salary
6881  al-Rahimi, Tayyiba   Female  51   10      High School  1990-02-23  FALSE    Operations   108804
2671  Lewis, Austin        Male    34   4       Ph.D         2007-02-23  FALSE    Engineering  182343
8925  el-Jaffer, Manaal    Female  50   10      Master's     1991-02-23  FALSE    Engineering  206770
2769  Soto, Michael        Male    52   10      High School  1987-02-23  FALSE    Sales        183407
2658  al-Ebrahimi, Mamoon  Male    55   8       Ph.D         1985-02-23  FALSE    Corporate    236240
1933  Medina, Brandy       Female  62   7       Associate's  1979-02-23  TRUE     Sales        NA

These variables are defined as follows:

  • ID: A unique ID for each employee.

  • Name: The name of each employee.

  • Gender: The gender of each employee.

  • Age: The age of each employee at the time the data were collected.

  • Rating: Each employee’s rating from one to ten on their last performance evaluation.

  • Degree: The highest degree attained by the employee.

  • Start_Date: The date the employee started with the company.

  • Retired: Whether or not the employee is retired (TRUE / FALSE).

  • Division: The division the employee works in.

  • Salary: The employee’s most recent yearly salary.
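
As a first taste of the summary statistics mentioned earlier, the sketch below rebuilds only the six rows shown above and computes a few basic measures in base R. The name employees_sample is a hypothetical stand-in chosen for this illustration so that the full employees data frame is left untouched, and the results describe these six rows only, not the 1,000-employee data set. Note that Salary is NA for the retired employee, so the calculations pass na.rm = TRUE to skip missing values.

```r
# Illustrative sample: the six rows printed above, NOT the full data set.
# "employees_sample" is a hypothetical name used only for this sketch.
employees_sample <- data.frame(
  ID = c(6881, 2671, 8925, 2769, 2658, 1933),
  Name = c("al-Rahimi, Tayyiba", "Lewis, Austin", "el-Jaffer, Manaal",
           "Soto, Michael", "al-Ebrahimi, Mamoon", "Medina, Brandy"),
  Gender = c("Female", "Male", "Female", "Male", "Male", "Female"),
  Age = c(51, 34, 50, 52, 55, 62),
  Rating = c(10, 4, 10, 10, 8, 7),
  Degree = c("High School", "Ph.D", "Master's",
             "High School", "Ph.D", "Associate's"),
  Start_Date = as.Date(c("1990-02-23", "2007-02-23", "1991-02-23",
                         "1987-02-23", "1985-02-23", "1979-02-23")),
  Retired = c(FALSE, FALSE, FALSE, FALSE, FALSE, TRUE),
  Division = c("Operations", "Engineering", "Engineering",
               "Sales", "Corporate", "Sales"),
  Salary = c(108804, 182343, 206770, 183407, 236240, NA)
)

# Check the structure: column names, types, and the first few values.
str(employees_sample)

# Basic summary statistics for Salary, ignoring the missing value.
salary_stats <- c(
  mean = mean(employees_sample$Salary, na.rm = TRUE),
  sd   = sd(employees_sample$Salary, na.rm = TRUE),
  min  = min(employees_sample$Salary, na.rm = TRUE),
  max  = max(employees_sample$Salary, na.rm = TRUE)
)
print(salary_stats)

# Count how many salaries are missing, and how many employees per division.
missing_salaries <- sum(is.na(employees_sample$Salary))
division_counts <- table(employees_sample$Division)
print(missing_salaries)
print(division_counts)
```

Counting the missing values before summarizing is a small habit worth keeping: without na.rm = TRUE, mean() and sd() return NA whenever a single value is missing, which is itself a useful signal of an irregularity in the data.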