Data Frame Basics

Now that our employee data has been read into a data frame, we can begin exploring it. We will start with the dimensions of the data set. The nrow() function returns the number of rows (or observations) in the data set:

Syntax

nrow(x)

  • Required arguments

    • x: A data frame.

nrow(employees)
[1] 1000

Similarly, we can use ncol() to determine the number of columns:

Syntax

ncol(x)

  • Required arguments

    • x: A data frame.

ncol(employees)
[1] 10

The dim() function returns the full dimensions of the data (i.e., both the number of rows and columns):

Syntax

dim(x)

  • Required arguments

    • x: A data frame.

dim(employees)
[1] 1000   10
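
To see how these three functions relate, here is a minimal sketch using a small made-up data frame (the name toy and its columns are hypothetical and are not part of the employee data). dim() returns the row count followed by the column count, so its two elements should match nrow() and ncol():

```r
# Hypothetical toy data frame, used only for illustration
toy <- data.frame(
  ID  = c(1, 2, 3, 4),
  Age = c(34, 51, 29, 45)
)

n_rows <- nrow(toy)   # number of rows (observations)
n_cols <- ncol(toy)   # number of columns (variables)
dims   <- dim(toy)    # vector of length 2: rows first, then columns

# The two elements of dim() agree with nrow() and ncol()
dims[1] == n_rows
dims[2] == n_cols
```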

We can view the first and last few rows of the data set with the head() and tail() functions, respectively:

Syntax

head(x) & tail(x)

  • Required arguments

    • x: A data frame.

head(employees)
ID    Name                  Gender  Age  Rating  Degree       Start_Date  Retired  Division         Salary
6881  al-Rahimi, Tayyiba    Female  51   10      High School  2/23/1990   FALSE    Operations       $108,804
2671  Lewis, Austin         Male    34   4       Ph.D         2/23/2007   FALSE    Engineering      $182,343
8925  el-Jaffer, Manaal     Female  50   10      Master's     2/23/1991   FALSE    Engineering      $206,770
2769  Soto, Michael         Male    52   10      High School  2/23/1987   FALSE    Sales            $183,407
2658  al-Ebrahimi, Mamoon   Male    55   8       Ph.D         2/23/1985   FALSE    Corporate        $236,240
1933  Medina, Brandy        Female  62   7       Associate's  2/23/1979   TRUE     Sales            NA

tail(employees)
ID    Name                  Gender  Age  Rating  Degree       Start_Date  Retired  Division         Salary
6681  Bruns, Austin         Male    41   8       Ph.D         2/23/2002   FALSE    Engineering      $188,656
2031  Martinez, Caleb       Male    57   8       Ph.D         2/23/1984   FALSE    Engineering      $218,430
2066  Gonzales, Alicia      Female  32   2       High School  2/23/2008   FALSE    Operations       $84,032
3239  Larson, Trusten       Male    37   5       Bachelor's   2/23/2002   FALSE    Human Resources  $149,789
3717  Levy-Minter, Quintin  Male    53   10      Bachelor's   2/23/1989   FALSE    Operations       $172,703
4209  Dena, Gianna          Female  49   6       Master's     2/23/1991   FALSE    Accounting       $185,445
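
By default, head() and tail() return six rows. Both functions also accept an optional n argument that controls how many rows are returned. The sketch below uses a hypothetical ten-row data frame named toy rather than the employee data:

```r
# Hypothetical toy data frame with ten rows, used only for illustration
toy <- data.frame(
  ID   = 1:10,
  Name = letters[1:10]
)

first_three <- head(toy, n = 3)   # first 3 rows instead of the default 6
last_two    <- tail(toy, n = 2)   # last 2 rows

first_three
last_two
```

Choosing a small n is handy for wide data sets, where six full rows can be hard to read at once.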

The str() function gives a quick view of the structure of the data. It shows the number of observations, the number of variables, the type of each variable, and the first few values of each variable.

Syntax

str(x)

  • Required arguments

    • x: A data frame.

str(employees)
Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame':	1000 obs. of  10 variables:
 $ ID        : num  6881 2671 8925 2769 2658 ...
 $ Name      : chr  "al-Rahimi, Tayyiba" "Lewis, Austin" "el-Jaffer, Manaal" "Soto, Michael" ...
 $ Gender    : chr  "Female" "Male" "Female" "Male" ...
 $ Age       : num  51 34 50 52 55 62 47 43 27 30 ...
 $ Rating    : num  10 4 10 10 8 7 8 8 7 6 ...
 $ Degree    : chr  "High School" "Ph.D" "Master's" "High School" ...
 $ Start_Date: chr  "2/23/1990" "2/23/2007" "2/23/1991" "2/23/1987" ...
 $ Retired   : logi  FALSE FALSE FALSE FALSE FALSE TRUE ...
 $ Division  : chr  "Operations" "Engineering" "Engineering" "Sales" ...
 $ Salary    : chr  "$108,804" "$182,343" "$206,770" "$183,407" ...
 - attr(*, "spec")=
  .. cols(
  ..   ID = col_double(),
  ..   Name = col_character(),
  ..   Gender = col_character(),
  ..   Age = col_double(),
  ..   Rating = col_double(),
  ..   Degree = col_character(),
  ..   Start_Date = col_character(),
  ..   Retired = col_logical(),
  ..   Division = col_character(),
  ..   Salary = col_character()
  .. )

After reading in a data set, it is best practice to check the dimensions of the data and explore its structure using the functions shown in this section. This will help uncover any immediate problems with the data.
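
One way to make this check routine is to compare the dimensions against what we expect and stop early if they differ. The helper below is only a sketch: check_shape() is a hypothetical function written for this example, not a built-in R function, and toy stands in for a freshly read data set.

```r
# Hypothetical helper: stop with an error if a data frame
# does not have the expected number of rows and columns
check_shape <- function(df, expected_rows, expected_cols) {
  stopifnot(
    is.data.frame(df),
    nrow(df) == expected_rows,
    ncol(df) == expected_cols
  )
  invisible(TRUE)
}

# Hypothetical toy data frame, used only for illustration
toy <- data.frame(ID = c(1, 2, 3), Retired = c(FALSE, TRUE, FALSE))

check_shape(toy, expected_rows = 3, expected_cols = 2)  # passes silently
str(toy)                                                # then inspect the column types
```

If the row or column count is wrong, stopifnot() raises an error, which flags the problem before any analysis begins.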

So far, we have seen some functions that can be applied to an entire data frame. However, we often want to work with an individual column in a data frame. For example, we may be interested in calculating the average Age of all employees in the data set. We can access specific columns of a data frame using the $ operator, which takes the general form:

Syntax

dataFrameName$variableName

If we write employees$Age, we get an atomic vector containing the ages of the 1,000 employees in the data frame. Recall from the previous chapter that there are many functions we can apply to atomic vectors in R. Because employees$Age is an atomic vector, we can apply those functions here to explore the Age variable. For example, to calculate the mean, minimum, and maximum Age, we could write:

mean(employees$Age)
min(employees$Age)
max(employees$Age)
[1] 45.53
[1] 25
[1] 65
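
The same pattern can be tried on a small made-up data frame (toy and its values are hypothetical). The sketch also shows what happens when a column contains a missing value (NA), as the Salary column does in the output above: mean() returns NA unless we ask it to drop missing values with na.rm = TRUE.

```r
# Hypothetical toy data frame with one missing age, used only for illustration
toy <- data.frame(
  Name = c("A", "B", "C", "D"),
  Age  = c(30, 40, NA, 50)
)

ages <- toy$Age              # extract one column with the $ operator
is.atomic(ages)              # TRUE: the column is an atomic vector
length(ages) == nrow(toy)    # TRUE: one value per row

mean(ages)                   # NA, because one age is missing
mean(ages, na.rm = TRUE)     # 40, the mean of 30, 40, and 50
```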