Fixing Variable Types
As we saw in a previous section, R does not always correctly guess the appropriate type for our columns. This is a very common occurrence, and fixing column types is a tedious but important step in preparing a data frame for analysis.
Fixing Numeric Variables
When the read_csv() function read in the data, it stored the Salary column as a character instead of a numeric. This is because the values include dollar signs ($) and commas (,), so R cannot read them as numbers and treats them as text. Fortunately, it is very easy to correct this using the parse_number() function from the tidyverse (it is part of the readr package), which uses the following syntax:
Syntax
readr::parse_number(x, locale = default_locale(), ...)
Required arguments
x: A character vector with values you would like to convert to a numeric.
Optional arguments
locale: This is used to control the parsing convention for numbers. By default, the function assumes that periods (.) are used for decimal marks and commas (,) are used for grouping (e.g., numbers are written as $1,500.25). You can explicitly change the characters that are used for decimal marks and groupings by setting decimal_mark and grouping_mark, respectively. For example, if numbers are written in the European convention (e.g., numbers are written as €1.500,25), you could set locale = locale(grouping_mark = ".", decimal_mark = ",").
Let’s first try applying this function to a single value to see how it works:
parse_number("$1,500.25")
[1] 1500.25
If our data is recorded in a different format, we can explicitly set the decimal mark and grouping characters in the locale argument so that the data is converted properly:
parse_number("€1.500,25", locale=locale(grouping_mark=".", decimal_mark=","))
[1] 1500.25
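The same call works on a whole character vector, not just a single value. The sketch below uses made-up salary strings (they are not taken from the employees data) to show that parse_number() drops any non-numeric text around the first number it finds and leaves missing values as NA:

```r
library(readr)

# Hypothetical salary strings, invented for illustration
salaries <- c("$108,804.00", "$182,343.50", "USD 95,000", NA)

# Each element is parsed independently; NA stays NA
parsed <- parse_number(salaries)
parsed

# The result is an ordinary numeric vector
class(parsed)
```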
To convert the entire Salary column to a numeric, we can apply parse_number() to the entire column, and then store the parsed values back into the Salary column:
employees$Salary <- parse_number(employees$Salary)
Now if we view the class of Salary, it will show numeric:
class(employees$Salary)
[1] "numeric"
Finally, if we view the first few rows of our data frame with head(), we’ll see that Salary no longer contains dollar signs, decimals, or commas:
head(employees)
| ID | Name | Gender | Age | Rating | Degree | Start_Date | Retired | Division | Salary |
|---|---|---|---|---|---|---|---|---|---|
| 6881 | al-Rahimi, Tayyiba | Female | 51 | 10 | High School | 2/23/1990 | FALSE | Operations | 108804 |
| 2671 | Lewis, Austin | Male | 34 | 4 | Ph.D | 2/23/2007 | FALSE | Engineering | 182343 |
| 8925 | el-Jaffer, Manaal | Female | 50 | 10 | Master's | 2/23/1991 | FALSE | Engineering | 206770 |
| 2769 | Soto, Michael | Male | 52 | 10 | High School | 2/23/1987 | FALSE | Sales | 183407 |
| 2658 | al-Ebrahimi, Mamoon | Male | 55 | 8 | Ph.D | 2/23/1985 | FALSE | Corporate | 236240 |
| 1933 | Medina, Brandy | Female | 62 | 7 | Associate's | 2/23/1979 | TRUE | Sales | NA |
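The NA in the last row shown above is a reminder that a parsed column can contain missing values, either because the original value was missing or because it could not be parsed. A minimal sketch of how one might check for this, using made-up values rather than the real employees data:

```r
library(readr)

# Made-up salary strings; "not reported" contains no number at all
raw_salary <- c("$236,240.00", "not reported", NA)

# parse_number() returns NA (with a warning) for values it cannot parse
salary <- parse_number(raw_salary)

# Count the missing values after parsing
sum(is.na(salary))

# problems() lists the values that failed to parse, if any
problems(salary)
```

Values that were already NA in the input are not reported as parsing failures, so comparing the two counts helps separate genuinely missing data from badly formatted data.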
Fixing Date Variables
As you might expect, the tidyverse also has a parse_date() function that we can use to convert the Start_Date column to a Date. This function uses the following syntax:
Syntax
readr::parse_date(x, format = "", ...)
Required arguments
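x: A character vector with values you would like to convert to a Date.
Optional arguments
format: A string describing how the dates are written, built from the symbols listed below. If it is left as the default (""), parse_date() falls back on the date format of the locale, which by default expects year-month-day dates such as 2021-01-12.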
Because dates can be recorded in a variety of ways, R has a set of symbols that can be used to represent different date formats:
| Symbol | Meaning | Example |
|---|---|---|
| %d | day as a number | 01-31 |
| %a | abbreviated weekday | Mon |
| %A | unabbreviated weekday | Monday |
| %m | month as a number | 01-12 |
| %b | abbreviated month | Jan |
| %B | unabbreviated month | January |
| %y | 2-digit year | 07 |
| %Y | 4-digit year | 2007 |
Source: here.
Below we see some examples of the parse_date() function applied to dates of different formats:
parse_date("25-06-99", format="%d-%m-%y")
[1] "1999-06-25"
parse_date("January 12, 2021", format="%B %d, %Y")
[1] "2021-01-12"
parse_date("08/18/95", format="%m/%d/%y")
[1] "1995-08-18"
parse_date("12Feb2003", format="%d%b%Y")
[1] "2003-02-12"
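If a value does not match the format string, parse_date() does not stop with an error. It returns NA for that value and issues a warning. The sketch below uses three made-up dates, the last of which is written day/month/year and therefore does not fit the month/day/year format:

```r
library(readr)

# Hypothetical dates; the third is day/month/year, unlike the others
dates <- c("2/23/1990", "2/23/2007", "23/2/1991")

# The first two parse; the third becomes NA because 23 is not a valid month
parsed <- parse_date(dates, format = "%m/%d/%Y")
parsed

# Flag the values that could not be converted
dates[is.na(parsed)]
```

Checking for unexpected NA values after parsing is a quick way to catch a wrong format string or inconsistently recorded dates.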
Now we will use the parse_date() function to convert the entire Start_Date column to a Date. This column is coded as month/day/year with a 4-digit year, so the format of our date is %m/%d/%Y.
employees$Start_Date <- parse_date(employees$Start_Date, format = "%m/%d/%Y")
Now if we view the class of Start_Date, it will show Date:
class(employees$Start_Date)
[1] "Date"
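Both fixes from this section can be tried end to end on a small stand-in data frame. The sketch below is only an illustration: employees_demo and its values are made up, and only the two columns discussed here are included.

```r
library(readr)

# Small made-up stand-in for the employees data frame
employees_demo <- data.frame(
  Salary = c("$108,804.00", "$182,343.00"),
  Start_Date = c("2/23/1990", "2/23/2007"),
  stringsAsFactors = FALSE
)

# Convert the character columns to numeric and Date
employees_demo$Salary <- parse_number(employees_demo$Salary)
employees_demo$Start_Date <- parse_date(employees_demo$Start_Date, format = "%m/%d/%Y")

# Confirm the new column types
sapply(employees_demo, class)

# With proper types, arithmetic works: number of days between the two start dates
days_between <- diff(employees_demo$Start_Date)
days_between
```

Once the columns have the correct types, numeric summaries and date arithmetic behave as expected, which is not possible while the values are stored as text.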