Data Frame Basics

Now that our employee data has been read into a data frame, we can begin exploring the data! We will start by exploring the dimensions of the data set. We can determine the number of rows (or observations) in our data set with the nrow() function:

Syntax

nrow(x)

Required arguments
- x: A data frame.

nrow(employees)

[1] 1000

Similarly, we can use ncol() to determine the number of columns:

Syntax

ncol(x)

Required arguments
- x: A data frame.

ncol(employees)

[1] 10

The dim() function returns the full dimensions of the data (i.e., both the number of rows and columns):

Syntax

dim(x)

Required arguments
- x: A data frame.

dim(employees)

[1] 1000   10
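
To see how these three functions relate, here is a minimal sketch on a small made-up data frame. The name toy and its values are hypothetical and are not part of the employee data.

```r
# A small, made-up data frame (hypothetical; not the employees data)
toy <- data.frame(
  ID = c(101, 102, 103, 104),
  Age = c(34, 51, 29, 45),
  Division = c("Sales", "Operations", "Sales", "Engineering")
)

n_rows <- nrow(toy)  # number of rows (observations)
n_cols <- ncol(toy)  # number of columns (variables)
dims <- dim(toy)     # integer vector: rows first, then columns

print(dims)

# dim() carries the same information as nrow() and ncol() together
identical(dims, c(n_rows, n_cols))
```

Because dim() always lists rows before columns, dim(x)[1] gives the same value as nrow(x) and dim(x)[2] gives the same value as ncol(x).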

We can view the first and last few rows of the data set with the head() and tail() functions, respectively:

Syntax

head(x)
tail(x)

Required arguments
- x: A data frame.

head(employees)

ID	Name	Gender	Age	Rating	Degree	Start_Date	Retired	Division	Salary
6881	al-Rahimi, Tayyiba	Female	51	10	High School	2/23/1990	FALSE	Operations	$108,804
2671	Lewis, Austin	Male	34	4	Ph.D	2/23/2007	FALSE	Engineering	$182,343
8925	el-Jaffer, Manaal	Female	50	10	Master's	2/23/1991	FALSE	Engineering	$206,770
2769	Soto, Michael	Male	52	10	High School	2/23/1987	FALSE	Sales	$183,407
2658	al-Ebrahimi, Mamoon	Male	55	8	Ph.D	2/23/1985	FALSE	Corporate	$236,240
1933	Medina, Brandy	Female	62	7	Associate's	2/23/1979	TRUE	Sales	NA

tail(employees)

ID	Name	Gender	Age	Rating	Degree	Start_Date	Retired	Division	Salary
6681	Bruns, Austin	Male	41	8	Ph.D	2/23/2002	FALSE	Engineering	$188,656
2031	Martinez, Caleb	Male	57	8	Ph.D	2/23/1984	FALSE	Engineering	$218,430
2066	Gonzales, Alicia	Female	32	2	High School	2/23/2008	FALSE	Operations	$84,032
3239	Larson, Trusten	Male	37	5	Bachelor's	2/23/2002	FALSE	Human Resources	$149,789
3717	Levy-Minter, Quintin	Male	53	10	Bachelor's	2/23/1989	FALSE	Operations	$172,703
4209	Dena, Gianna	Female	49	6	Master's	2/23/1991	FALSE	Accounting	$185,445
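
Both functions return six rows by default, which is why each output above has six rows. They also accept an optional argument, n, that controls how many rows come back. A brief sketch, using a made-up data frame (toy and its values are hypothetical):

```r
# A small, made-up data frame with ten rows (hypothetical)
toy <- data.frame(
  ID = 1:10,
  Score = seq(10, 100, by = 10)
)

first_three <- head(toy, n = 3)  # rows 1 to 3
last_two <- tail(toy, n = 2)     # rows 9 and 10

print(first_three)
print(last_two)
```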

The str() function gives a quick view of the structure of the data. It shows the number of observations, the number of variables, the type of each variable, and the first few values of each variable.

Syntax

str(x)

Required arguments
- x: A data frame.

str(employees)

Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame':	1000 obs. of  10 variables:
 $ ID        : num  6881 2671 8925 2769 2658 ...
 $ Name      : chr  "al-Rahimi, Tayyiba" "Lewis, Austin" "el-Jaffer, Manaal" "Soto, Michael" ...
 $ Gender    : chr  "Female" "Male" "Female" "Male" ...
 $ Age       : num  51 34 50 52 55 62 47 43 27 30 ...
 $ Rating    : num  10 4 10 10 8 7 8 8 7 6 ...
 $ Degree    : chr  "High School" "Ph.D" "Master's" "High School" ...
 $ Start_Date: chr  "2/23/1990" "2/23/2007" "2/23/1991" "2/23/1987" ...
 $ Retired   : logi  FALSE FALSE FALSE FALSE FALSE TRUE ...
 $ Division  : chr  "Operations" "Engineering" "Engineering" "Sales" ...
 $ Salary    : chr  "$108,804" "$182,343" "$206,770" "$183,407" ...
 - attr(*, "spec")=
  .. cols(
  ..   ID = col_double(),
  ..   Name = col_character(),
  ..   Gender = col_character(),
  ..   Age = col_double(),
  ..   Rating = col_double(),
  ..   Degree = col_character(),
  ..   Start_Date = col_character(),
  ..   Retired = col_logical(),
  ..   Division = col_character(),
  ..   Salary = col_character()
  .. )
NULL

After reading in a data set, it is best practice to check the dimensions of the data and explore its structure using the functions shown in this section. This will help uncover any immediate problems with the data.
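
One way to turn these checks into a habit is sketched below. The data frame toy is made up for illustration, and the expected column count of 3 is an assumption standing in for whatever your data source documents.

```r
# A small, made-up data frame (hypothetical; mimics a few employee columns)
toy <- data.frame(
  ID = c(1, 2, 3),
  Age = c(51, 34, 50),
  Salary = c("$108,804", "$182,343", NA),
  stringsAsFactors = FALSE
)

# Do the dimensions match what we expect? (3 is an assumed value)
stopifnot(ncol(toy) == 3)

# What type is each column?
col_types <- sapply(toy, class)
print(col_types)

# How many missing values does each column contain?
missing_counts <- colSums(is.na(toy))
print(missing_counts)
```

In this sketch Salary is reported as character rather than numeric. The str() output for employees shows the same pattern: Salary is stored as chr because the values contain dollar signs and commas, so it would need cleaning before any arithmetic.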

So far, we have seen some functions that can be applied to an entire data frame. However, we often want to work with an individual column in a data frame. For example, we may be interested in calculating the average Age of all employees in the data set. We can access specific columns of a data frame using the $ operator, which takes the general form:

Syntax

dataFrameName$variableName

Writing employees$Age returns an atomic vector containing the ages of the 1,000 employees in the data frame. As you may recall from the previous chapter, there are many functions we can apply to atomic vectors in R. Because employees$Age is an atomic vector, we can apply those functions here to explore the Age variable. For example, to calculate the mean, minimum, and maximum Age, we could write:

mean(employees$Age)
min(employees$Age)
max(employees$Age)

[1] 45.53

[1] 25

[1] 65
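
The same pattern works on any data frame. The sketch below uses a made-up data frame (toy and its values are hypothetical) and also shows the na.rm argument, which mean(), min(), and max() accept for skipping missing values. Whether the employees data needs it depends on whether the column in question contains any NA values.

```r
# A small, made-up data frame with one missing age (hypothetical)
toy <- data.frame(
  Name = c("A", "B", "C", "D"),
  Age = c(30, 40, NA, 50)
)

ages <- toy$Age   # extract one column as an atomic vector
is.atomic(ages)   # TRUE

mean(ages)                            # NA, because one age is missing
avg_age <- mean(ages, na.rm = TRUE)   # 40, ignoring the missing value
min_age <- min(ages, na.rm = TRUE)    # 30
max_age <- max(ages, na.rm = TRUE)    # 50
```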

Data Science for Managers
