Data Frame Basics
Now that our employee data has been read into a data frame, we can begin exploring it! We will start with the dimensions of the data set. The nrow() function returns the number of rows (or observations) in the data set:
Syntax
nrow(x)
Required arguments
x: A data frame.
nrow(employees)
[1] 1000
Similarly, we can use ncol() to determine the number of columns:
Syntax
ncol(x)
Required arguments
x: A data frame.
ncol(employees)
[1] 10
The dim() function returns the full dimensions of the data (i.e., the number of rows followed by the number of columns):
Syntax
dim(x)
Required arguments
x: A data frame.
dim(employees)
[1] 1000 10
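To make the relationship between these three functions concrete, here is a minimal sketch on a small made-up data frame. The name toy and its contents are hypothetical and are not part of the employees data.

```r
# Hypothetical toy data frame (not the employees data): 3 rows, 2 columns
toy <- data.frame(
  ID  = c(1, 2, 3),
  Age = c(34, 51, 47)
)

n_rows <- nrow(toy)  # number of rows
n_cols <- ncol(toy)  # number of columns
dims   <- dim(toy)   # integer vector: rows first, then columns

# dim() packs both counts into one vector, so its elements
# agree with nrow() and ncol()
dims[1] == n_rows
dims[2] == n_cols
```

Because dim() always lists rows before columns, dim(x)[1] and dim(x)[2] can stand in for nrow(x) and ncol(x).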
We can view the first and last few rows of the data set with the head() and tail() functions, respectively:
Syntax
head(x)
tail(x)
Required arguments
x: A data frame.
head(employees)
ID | Name | Gender | Age | Rating | Degree | Start_Date | Retired | Division | Salary |
---|---|---|---|---|---|---|---|---|---|
6881 | al-Rahimi, Tayyiba | Female | 51 | 10 | High School | 2/23/1990 | FALSE | Operations | $108,804 |
2671 | Lewis, Austin | Male | 34 | 4 | Ph.D | 2/23/2007 | FALSE | Engineering | $182,343 |
8925 | el-Jaffer, Manaal | Female | 50 | 10 | Master's | 2/23/1991 | FALSE | Engineering | $206,770 |
2769 | Soto, Michael | Male | 52 | 10 | High School | 2/23/1987 | FALSE | Sales | $183,407 |
2658 | al-Ebrahimi, Mamoon | Male | 55 | 8 | Ph.D | 2/23/1985 | FALSE | Corporate | $236,240 |
1933 | Medina, Brandy | Female | 62 | 7 | Associate's | 2/23/1979 | TRUE | Sales | NA |
tail(employees)
ID | Name | Gender | Age | Rating | Degree | Start_Date | Retired | Division | Salary |
---|---|---|---|---|---|---|---|---|---|
6681 | Bruns, Austin | Male | 41 | 8 | Ph.D | 2/23/2002 | FALSE | Engineering | $188,656 |
2031 | Martinez, Caleb | Male | 57 | 8 | Ph.D | 2/23/1984 | FALSE | Engineering | $218,430 |
2066 | Gonzales, Alicia | Female | 32 | 2 | High School | 2/23/2008 | FALSE | Operations | $84,032 |
3239 | Larson, Trusten | Male | 37 | 5 | Bachelor's | 2/23/2002 | FALSE | Human Resources | $149,789 |
3717 | Levy-Minter, Quintin | Male | 53 | 10 | Bachelor's | 2/23/1989 | FALSE | Operations | $172,703 |
4209 | Dena, Gianna | Female | 49 | 6 | Master's | 2/23/1991 | FALSE | Accounting | $185,445 |
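By default, head() and tail() each return six rows, as in the output above. Both functions also accept an optional n argument that controls how many rows are returned. This argument is not covered in the text above, so the following is only a brief sketch on a hypothetical data frame named toy.

```r
# Hypothetical toy data frame with 10 rows (not the employees data)
toy <- data.frame(
  ID    = 1:10,
  Score = seq(10, 100, by = 10)
)

first_three <- head(toy, n = 3)  # first 3 rows instead of the default 6
last_two    <- tail(toy, n = 2)  # last 2 rows instead of the default 6

first_three
last_two
```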
The str() function gives a quick view of the structure of the data. It shows the number of observations, the number of variables, the type of each variable, and the first few values of each variable.
Syntax
str(x)
Required arguments
x: A data frame.
str(employees)
Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 1000 obs. of 10 variables:
$ ID : num 6881 2671 8925 2769 2658 ...
$ Name : chr "al-Rahimi, Tayyiba" "Lewis, Austin" "el-Jaffer, Manaal" "Soto, Michael" ...
$ Gender : chr "Female" "Male" "Female" "Male" ...
$ Age : num 51 34 50 52 55 62 47 43 27 30 ...
$ Rating : num 10 4 10 10 8 7 8 8 7 6 ...
$ Degree : chr "High School" "Ph.D" "Master's" "High School" ...
$ Start_Date: chr "2/23/1990" "2/23/2007" "2/23/1991" "2/23/1987" ...
$ Retired : logi FALSE FALSE FALSE FALSE FALSE TRUE ...
$ Division : chr "Operations" "Engineering" "Engineering" "Sales" ...
$ Salary : chr "$108,804" "$182,343" "$206,770" "$183,407" ...
- attr(*, "spec")=
.. cols(
.. ID = col_double(),
.. Name = col_character(),
.. Gender = col_character(),
.. Age = col_double(),
.. Rating = col_double(),
.. Degree = col_character(),
.. Start_Date = col_character(),
.. Retired = col_logical(),
.. Division = col_character(),
.. Salary = col_character()
.. )
NULL
After reading in a data set, it is best practice to check the dimensions of the data and explore its structure using the functions shown in this section. This will help uncover any immediate problems with the data.
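As one possible way to run such a check, the sketch below lists the class of every column and flags the ones stored as character. In the str() output above, Salary is stored as chr because of its dollar signs and commas, which is the kind of problem this can surface. The data frame toy is a hypothetical three-row copy of a few employees columns, and using sapply() this way is an assumption about how one might do the check, not a step from the text.

```r
# Hypothetical three-row data frame mimicking a few employees columns
toy <- data.frame(
  ID     = c(6881, 2671, 8925),
  Age    = c(51, 34, 50),
  Salary = c("$108,804", "$182,343", "$206,770"),
  stringsAsFactors = FALSE
)

# Class of every column, returned as a named character vector
col_types <- sapply(toy, class)
col_types

# Names of the columns stored as character; a numeric-looking
# column showing up here may need cleaning before analysis
chr_cols <- names(col_types)[col_types == "character"]
chr_cols
```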
So far, we have seen functions that can be applied to an entire data frame. However, we often want to work with an individual column in a data frame. For example, we may be interested in calculating the average Age of all employees in the data set. We can access a specific column of a data frame using the $ operator, which takes the general form:
Syntax
dataFrameName$variableName
If we write employees$Age, we will get an atomic vector containing the ages of the 1,000 employees in the data frame. Recall from the previous chapter that there are many different functions we can apply to atomic vectors in R. Because employees$Age is an atomic vector, we can apply those functions here to explore the Age variable. For example, to calculate the mean, minimum, and maximum Age, we could write:
mean(employees$Age)
min(employees$Age)
max(employees$Age)
[1] 45.53
[1] 25
[1] 65
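The same pattern can be tried on a small made-up data frame. The sketch below uses a hypothetical toy data frame with one missing age, to show that the $ operator returns an atomic vector and that summary functions return NA when a value is missing unless na.rm = TRUE is supplied. The missing value is an assumption for illustration: the Age output above gives no sign of missing values, although the Salary column does contain an NA.

```r
# Hypothetical toy data frame with one missing age (not the employees data)
toy <- data.frame(
  Name = c("A", "B", "C", "D"),
  Age  = c(51, 34, NA, 52),
  stringsAsFactors = FALSE
)

ages <- toy$Age   # extract one column as an atomic vector
is.atomic(ages)   # TRUE

# With a missing value present, mean() returns NA by default
mean(ages)

# na.rm = TRUE drops missing values before computing the summary
mean_age <- mean(ages, na.rm = TRUE)
min_age  <- min(ages, na.rm = TRUE)
max_age  <- max(ages, na.rm = TRUE)
```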