Fixing Variable Types
As we saw in a previous section, R does not always correctly guess the appropriate type for our columns. This is a very common occurrence, and fixing column types is a tedious but important step in preparing a data frame for analysis.
Fixing Numeric Variables
When the read_csv() function read in the data, it assumed that the Salary column was a character instead of a numeric. This is because the values include dollar signs ($) and commas (,) in addition to periods (.), so R cannot read them as numbers and stores them as character strings. Fortunately, this is easy to correct with the parse_number() function from readr, one of the core tidyverse packages, which uses the following syntax:
Syntax
readr::parse_number(x, locale = default_locale(), ...)
Required arguments
x
: An atomic vector with values you would like to convert to a numeric.
Optional arguments
locale
: This is used to control the parsing convention for numbers. By default, the function assumes that periods (.) are used for decimal marks and commas (,) are used for grouping (e.g., numbers are written as $1,500.25). You can explicitly change the characters used for decimal marks and grouping by setting decimal_mark and grouping_mark, respectively. For example, if numbers are written in the European convention (e.g., €1.500,25), you could set locale=locale(grouping_mark=".", decimal_mark=",").
Let’s first try applying this function to a single value to see how it works:
parse_number("$1,500.25")
[1] 1500.25
If our data is recorded in a different format, we can explicitly set the decimal mark and grouping characters in the locale argument so that the data is converted properly:
parse_number("€1.500,25", locale=locale(grouping_mark=".", decimal_mark=","))
[1] 1500.25
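Because parse_number() works on whole vectors and drops non-numeric characters around the first number it finds, the same call handles a mix of formats. The short sketch below illustrates this; the values in x are made up for illustration and are not taken from the employees data.

```r
library(readr)

# Hypothetical values with different kinds of surrounding text
x <- c("$1,500.25", "75%", "USD 2,000")

# Each element is parsed independently; symbols and text around the number are dropped
parsed <- parse_number(x)
parsed
```

Every element comes back as a plain number, which is why the function is a good fit for columns that mix currency symbols, units, and grouping marks.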
To convert the entire Salary column to a numeric, we can apply parse_number() to the entire column, and then store the parsed values back into the Salary column:
employees$Salary <- parse_number(employees$Salary)
Now if we view the class of Salary, it will show numeric:
class(employees$Salary)
[1] "numeric"
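Checking the class confirms the type, but it does not tell us whether every value was parsed successfully. As a rough sketch, we can compare the raw and parsed values to find entries that were present in the raw data but became NA. The vector salary_raw below is a hypothetical stand-in for the original Salary column, and its values are invented for illustration.

```r
library(readr)

# Hypothetical raw salaries: two valid values, one missing, one unparseable
salary_raw <- c("$108,804.00", "$182,343.50", NA, "unknown")

# parse_number() warns about entries it cannot parse and returns NA for them
salary_num <- suppressWarnings(parse_number(salary_raw))

# Entries that had a value before parsing but are NA afterwards
failed <- is.na(salary_num) & !is.na(salary_raw)
salary_raw[failed]
```

Values that were already missing stay NA and are not flagged, so anything listed here is worth inspecting by hand before continuing with the analysis.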
Finally, if we view the first few rows of our data frame with head(), we’ll see that Salary no longer contains dollar signs or commas and is displayed as a plain number:
head(employees)
ID | Name | Gender | Age | Rating | Degree | Start_Date | Retired | Division | Salary |
---|---|---|---|---|---|---|---|---|---|
6881 | al-Rahimi, Tayyiba | Female | 51 | 10 | High School | 2/23/1990 | FALSE | Operations | 108804 |
2671 | Lewis, Austin | Male | 34 | 4 | Ph.D | 2/23/2007 | FALSE | Engineering | 182343 |
8925 | el-Jaffer, Manaal | Female | 50 | 10 | Master's | 2/23/1991 | FALSE | Engineering | 206770 |
2769 | Soto, Michael | Male | 52 | 10 | High School | 2/23/1987 | FALSE | Sales | 183407 |
2658 | al-Ebrahimi, Mamoon | Male | 55 | 8 | Ph.D | 2/23/1985 | FALSE | Corporate | 236240 |
1933 | Medina, Brandy | Female | 62 | 7 | Associate's | 2/23/1979 | TRUE | Sales | NA |
Fixing Date Variables
As you might expect, the tidyverse also has a parse_date() function that we can use to convert the Start_Date column to a Date. This function uses the following syntax:
Syntax
readr::parse_date(x, format="", ...)
Required arguments
x
: An atomic vector with values you would like to convert to a Date.
Optional arguments
format
: The format of the date.
Because dates can be recorded in a variety of ways, R has a set of symbols that can be used to represent different date formats:
Symbol | Meaning | Example |
---|---|---|
%d | day as a number | 01-31 |
%a | abbreviated weekday | Mon |
%A | unabbreviated weekday | Monday |
%m | month as a number | 01-12 |
%b | abbreviated month | Jan |
%B | unabbreviated month | January |
%y | 2-digit year | 07 |
%Y | 4-digit year | 2007 |
Source: here.
Below we see some examples of the parse_date() function applied to dates of different formats:
parse_date("25-06-99", format="%d-%m-%y")
[1] "1999-06-25"
parse_date("January 12, 2021", format="%B %d, %Y")
[1] "2021-01-12"
parse_date("08/18/95", format="%m/%d/%y")
[1] "1995-08-18"
parse_date("12Feb2003", format="%d%b%Y")
[1] "2003-02-12"
Now we will use the parse_date() function to convert the entire Start_Date column to a Date. This column is coded as month/day/year, so the format of our date is %m/%d/%Y.
employees$Start_Date <- parse_date(employees$Start_Date, format = "%m/%d/%Y")
Now if we view the class of Start_Date, it will show Date:
class(employees$Start_Date)
[1] "Date"
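Once the column is a Date, it sorts chronologically and supports arithmetic, which is not possible with character values. The sketch below shows this on a small hypothetical sample of month/day/year strings, and also shows what happens when the format does not match the data. The vector start_raw and the variable names are assumptions made for illustration.

```r
library(readr)

# Hypothetical sample of Start_Date values in month/day/year form
start_raw <- c("2/23/1990", "2/23/2007", "2/23/1991")

start <- parse_date(start_raw, format = "%m/%d/%Y")

# Dates can be compared and subtracted
earliest  <- min(start)
span_days <- as.numeric(max(start) - min(start))

# Swapping day and month makes 23 the month, which is invalid,
# so every value becomes NA (readr also raises a parsing warning)
mismatch <- suppressWarnings(parse_date(start_raw, format = "%d/%m/%Y"))
sum(is.na(mismatch))
```

Counting NA values after parsing is a quick way to catch a format string that does not match how the dates were recorded.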