In [1]:
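library(tidyverse)
# Read the raw employee data; Salary and Start_Date come in as character columns.
employees <- read_csv("../_build/data/employee_data.csv")
# Convert the salary strings to numeric values.
employees$Salary <- parse_number(employees$Salary)
# Convert the start dates from month/day/year text to Date values.
employees$Start_Date <- parse_date(employees$Start_Date, format = "%m/%d/%Y")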

Parsed with column specification:
cols(
  ID = col_double(),
  Name = col_character(),
  Gender = col_character(),
  Age = col_double(),
  Rating = col_double(),
  Degree = col_character(),
  Start_Date = col_character(),
  Retired = col_logical(),
  Division = col_character(),
  Salary = col_character()
)


# Exploring Data

````{panels}
:column: col-4
:card: border-2
**Motivation**
^^^
Familiarize yourself with the data and establish a common set of facts among all stakeholders. Answer basic descriptive questions and identify irregularities in the data.
---
**Methods**
^^^
+ Summary Statistics - Unambiguous, numerical measures of the data.
+ Visualization - Visual representations of data that tell a clear and compelling story.
---
**Message**
^^^
Present data-driven insights to stakeholders as clearly as possible. Complement domain-area expertise with quantitative evidence. Focus on producing insights that are actionable. 
````

A fundamental step in data science is understanding the data by visualizing it and computing basic descriptive summary statistics (*e.g.*, average, standard deviation, maximum, and minimum). This collection of techniques is typically referred to as **exploratory data analysis (EDA)**. Often, visualizing data is enough to answer basic descriptive questions (such as, which types of customers are buying different products?), devise more complex hypotheses about relationships in the data (such as, which types of customers are more likely to buy different products?), and identify irregularities (such as mistakes in data collection or outliers).
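
As a minimal sketch of the summary statistics mentioned above, the following base R code computes an average, standard deviation, minimum, and maximum, and counts missing values as a simple irregularity check. The `salaries` vector and its values are hypothetical and are not taken from any data set in this chapter.

```r
# Hypothetical yearly salaries (illustrative values only); NA marks a missing entry.
salaries <- c(52000, 61000, 58500, 75000, NA, 64000)

# Basic descriptive statistics; na.rm = TRUE skips the missing entry.
salary_summary <- c(
  mean = mean(salaries, na.rm = TRUE),
  sd   = sd(salaries, na.rm = TRUE),
  min  = min(salaries, na.rm = TRUE),
  max  = max(salaries, na.rm = TRUE)
)
print(salary_summary)

# A simple irregularity check: how many values are missing?
n_missing <- sum(is.na(salaries))
print(n_missing)
```

Without `na.rm = TRUE`, each of these functions would return `NA` as soon as one value is missing, which is itself a useful early signal that the data need a closer look.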

```{admonition} Why is this important?
The main purpose of EDA is to help look at data before making any assumptions. It can help identify obvious errors, as well as better understand patterns within the data, detect outliers or anomalous events, find interesting relations among the variables. Data scientists can use exploratory analysis to ensure the results they produce are valid and applicable to any desired business outcomes and goals. EDA also helps stakeholders by confirming they are asking the right questions. EDA can help answer questions about standard deviations, categorical variables, and confidence intervals. Once EDA is complete and insights are drawn, its features can then be used for more sophisticated data analysis or modeling, including machine learning.

*IBM, [Exploratory Data Analysis](https://www.ibm.com/cloud/learn/exploratory-data-analysis)*
```

Descriptive statistics of key business metrics, such as sales, revenue, and customer churn, are aggregations of data that should form the information backbone of every enterprise. Creating meaningful visualizations and analyzing descriptive statistics is the first important step in addressing business problems with data.


```{admonition} How will this help me as a manager?
An understanding of EDA will help you: 

+ Develop a deeper understanding of key business metrics;
+ Examine assumptions and hypotheses more rigorously; 
+ Convince stakeholders of new insights through compelling visualizations.
```

In this chapter, we will explore EDA using the employee data introduced in the [R Bootcamp](../00_bootcamp/02_dataframes/dataframes.html#data-frames). These data contain information on 1,000 employees at a software company and are stored in a data frame called `employees`:

In [2]:
head(employees)

| ID | Name | Gender | Age | Rating | Degree | Start_Date | Retired | Division | Salary |
|---|---|---|---|---|---|---|---|---|---|
| 6881 | al-Rahimi, Tayyiba | Female | 51 | 10 | High School | 1990-02-23 | False | Operations | 108804.0 |
| 2671 | Lewis, Austin | Male | 34 | 4 | Ph.D | 2007-02-23 | False | Engineering | 182343.0 |
| 8925 | el-Jaffer, Manaal | Female | 50 | 10 | Master's | 1991-02-23 | False | Engineering | 206770.0 |
| 2769 | Soto, Michael | Male | 52 | 10 | High School | 1987-02-23 | False | Sales | 183407.0 |
| 2658 | al-Ebrahimi, Mamoon | Male | 55 | 8 | Ph.D | 1985-02-23 | False | Corporate | 236240.0 |
| 1933 | Medina, Brandy | Female | 62 | 7 | Associate's | 1979-02-23 | True | Sales | |


These variables are defined as follows:

+ `ID`: A unique ID for each employee.
+ `Name`: The name of each employee.
+ `Gender`: The gender of each employee.
+ `Age`: The age of each employee at the time the data were collected.
+ `Rating`: Each employee's rating from one to ten on their last performance evaluation.
+ `Degree`: The highest degree attained by the employee.
+ `Start_Date`: The date the employee started with the company.
+ `Retired`: Whether or not the employee is retired (`TRUE` / `FALSE`).
+ `Division`: The division the employee works in.
+ `Salary`: The employee's most recent yearly salary.
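
To make these definitions concrete, here is a small sketch that summarizes salary by division. It assumes the `dplyr` package is installed and uses a hand-typed copy of a few columns from the six rows shown by `head(employees)`, so the results describe only those six rows and not the full data set. The names `sample_employees` and `division_summary` are illustrative.

```r
library(dplyr)

# Hand-typed copy of selected columns from the six rows printed by head(employees).
# This is NOT the full 1,000-row data set.
sample_employees <- data.frame(
  ID       = c(6881, 2671, 8925, 2769, 2658, 1933),
  Rating   = c(10, 4, 10, 10, 8, 7),
  Retired  = c(FALSE, FALSE, FALSE, FALSE, FALSE, TRUE),
  Division = c("Operations", "Engineering", "Engineering",
               "Sales", "Corporate", "Sales"),
  Salary   = c(108804, 182343, 206770, 183407, 236240, NA)
)

# One row per division: head count, mean salary, and number of missing salaries.
division_summary <- sample_employees %>%
  group_by(Division) %>%
  summarise(
    n           = n(),
    mean_salary = mean(Salary, na.rm = TRUE),
    n_missing   = sum(is.na(Salary))
  )
print(division_summary)
```

Reporting `n_missing` next to the mean shows how many salaries each average is actually based on; in this sample, the one missing salary belongs to the retired employee.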