Reading in Data¶
The tidyverse contains a variety of functions for reading in data from different file types. Our employee data is saved in a comma-separated values (.csv) file called employee_data.csv
. We can read the data from this file into a data frame using the read_csv() function, which uses the following syntax:
Syntax
tidyverse::read_csv(file, col_names=TRUE, skip=0, …)
Required arguments
file
: The file path of the file you would like to read in. Note that the path must be surrounded in quotation marks.
Optional arguments
col_names
: When this argument isTRUE
, the first row of the file is assumed to contain the column names of the data set. When it isFALSE
, the first row is assumed to contain data, and column names are generated automatically (X1
,X2
,X3
, etc.)skip
: The number of rows at the top of the file to skip when reading in the data. This is useful if the first few rows of your data file have text you want to ignore.
Below we use this function to read our data from the csv file into a data frame called employees
. In this case, our data file is stored at the path "../../_build/data/employee_data.csv"
.
employees <- read_csv("https://dsm.business/data/employee_data.csv")
Parsed with column specification:
cols(
ID = col_double(),
Name = col_character(),
Gender = col_character(),
Age = col_double(),
Rating = col_double(),
Degree = col_character(),
Start_Date = col_character(),
Retired = col_logical(),
Division = col_character(),
Salary = col_double()
)
By default, the function prints out the assumed data type of each column. For example, the function assumes that ID
is a double
(a sub-class of the numeric
type), Name
is a character
, Retired
is a logical
, etc. These seem like the correct choices. However, you may also notice that the function got some of the types wrong. For example, Start_Date
, Degree
, and Salary
were all read in as character
s instead of as a Date
, factor
, and numeric
, respectively. We will see how to correct this in a subsequent section.
If our data were in an Excel (.xlsx) file instead of a csv, we could read it in using the read_excel()
function from the readxl
package (Wickham and Bryan 2019). This function uses the following syntax:
Syntax
readxl::read_excel(file, sheet = 1, col_names = TRUE, range = NULL, skip = 0 …)
Required arguments
file
: The file path of the file you would like to read in. Note that the path must be surrounded in quotation marks.
Optional arguments
sheet
: The sheet within the Excel workbook that contains the data you want to read in. You can either specify the sheet number (the default issheet = 1
), or the name of the sheet as a string (e.g.,sheet = "Sheet Name"
).col_names
: When this argument isTRUE
, the first row of the file is assumed to contain the column names of the data set. When it isFALSE
, the first row is assumed to contain data, and column names are generated automatically (X1
,X2
,X3
, etc.)range
: The range of cells in the sheet that contains the data you would like to read in. For example, if your only wanted to read in the data in the rangeB4:F100
, you would setrange = "B4:F100"
.skip
: The number of rows at the top of the file to skip when reading in the data. This is useful if the first few rows of your data file have text you want to ignore.
Although .csv and .xlsx are very common file types, you will likely encounter data stored in other file types (tab-separated files (.tab), text files (.txt), etc.) Here we will not show the functions to read in data from all of these different file types. However, they will generally be of a similar form as the read_csv()
and read_excel()
functions shown above. Always start by Googling to find the appropriate function for the type of file you need to read into R.