2.3. (§) Visualization with ggplot2

Note

This section is optional, and will not be covered in the DSM course.

The tidyverse includes ggplot2, a popular package for visualizing data. In the previous chapter, the section Visualization shows how to create visualizations using only base R functions. These functions generally have simple syntax but are harder to customize, which makes them difficult to use when creating advanced visualizations. For this reason, many users choose ggplot2 over the base R plotting functions. In this section, we will use ggplot2 to recreate all of the visualizations from Visualization.

Visualizations made with ggplot2 begin with the ggplot() function, which specifies the data frame and the variables we want to visualize. Additional components of the plot, such as the type of plot to draw, are then added using the + operator (see the examples below).
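To make the two-step pattern concrete, here is a minimal sketch. It assumes ggplot2 is installed and uses a small made-up data frame called toy (a hypothetical stand-in for employees, with invented values) so that it runs on its own.

```r
library(ggplot2)

# Hypothetical stand-in for the employees data frame (values are made up)
toy <- data.frame(
  Salary = c(52000, 61000, 58000, 75000, 69000, 83000),
  Age    = c(25, 31, 29, 42, 38, 50)
)

# Step 1: ggplot() records the data and the aesthetic mappings; no layer yet
p <- ggplot(toy, aes(x = Age, y = Salary))

# Step 2: each + adds a component, here a layer of points and axis labels
p <- p + geom_point() + labs(x = "Age (years)", y = "Salary")

# The plot is drawn when the object is printed
print(p)
```

Because the plot is an ordinary R object, it can be stored in a variable and extended later with further + calls.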

2.3.1. Quantitative variables

2.3.1.1. Histogram

First, let’s create a histogram of a single quantitative variable, Salary. Within ggplot(), the first argument we specify is the name of the data frame (employees). The second argument sets the “aesthetic mappings” of the plot using the aes() function; this describes how the variables in the data set should be mapped onto different properties of the plot. Here we are only working with a single variable (Salary), so within aes() we simply specify x = Salary. We will see more complicated calls to aes() in later examples.

To create a histogram, we combine our call to ggplot() with + geom_histogram():

ggplot(employees, aes(x = Salary)) + geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
../_images/ggplot_3_1.png
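The message printed above tells us that ggplot2 fell back to its default of 30 bins. As a sketch of how we might choose the bins ourselves, the example below uses made-up salaries (the toy data frame is hypothetical) and the binwidth and bins arguments of geom_histogram(). The bin width of 10000 is only an illustrative choice.

```r
library(ggplot2)

# Hypothetical salaries standing in for employees$Salary
toy <- data.frame(Salary = c(41000, 48000, 52000, 55000, 58000, 61000,
                             63000, 67000, 72000, 78000, 85000, 97000))

# binwidth is given in the units of the x variable: each bar spans 10,000
p_width <- ggplot(toy, aes(x = Salary)) + geom_histogram(binwidth = 10000)

# Alternatively, request a number of bins and let ggplot2 choose the width
p_bins <- ggplot(toy, aes(x = Salary)) + geom_histogram(bins = 8)

print(p_width)
```

Setting either argument explicitly also silences the message about bins = 30.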

2.3.1.2. Boxplot

To create a boxplot, we simply change geom_histogram() to geom_boxplot(). Note that unlike the histogram, we are now plotting Salary on the y-axis, so we change the aesthetic mapping from aes(x = Salary) to aes(y = Salary).

ggplot(employees, aes(y = Salary)) + geom_boxplot()
../_images/ggplot_6_0.png
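To see what the box actually represents, we can inspect the values ggplot2 computes for the layer. The sketch below uses made-up salaries (the toy data frame is hypothetical) and the layer_data() helper from ggplot2, then compares the result with the median and quartiles computed directly.

```r
library(ggplot2)

# Hypothetical salaries standing in for employees$Salary
toy <- data.frame(Salary = c(41000, 48000, 52000, 55000, 58000,
                             61000, 63000, 67000, 72000, 120000))

p_box <- ggplot(toy, aes(y = Salary)) + geom_boxplot()

# layer_data() returns the values computed for the boxplot layer:
# lower and upper are the edges of the box, middle is the line inside it
box_stats <- layer_data(p_box)
box_stats[, c("ymin", "lower", "middle", "upper", "ymax")]

# For comparison: the median and the first and third quartiles
median(toy$Salary)
quantile(toy$Salary, c(0.25, 0.75))
```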

2.3.1.3. Side-by-side boxplot

Now imagine we wanted to compare the distribution of a quantitative variable over the values of a categorical variable. For example, we may want to visualize how Salary differs by Degree. To do this, we set y = Salary and x = Degree within our call to aes(), which indicates that Salary should be treated as the y-variable and Degree should be treated as the x-variable:

ggplot(employees, aes(y = Salary, x = Degree)) + geom_boxplot()
../_images/ggplot_9_0.png
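One detail worth knowing is that the order of the boxes follows the order of the categories. The sketch below assumes a made-up data frame with three hypothetical degree labels (the actual values of Degree in employees may differ) and shows one way to control the order by converting the column to a factor with explicit levels.

```r
library(ggplot2)

# Hypothetical stand-in for employees (values and degree labels are made up)
toy <- data.frame(
  Salary = c(48000, 52000, 55000, 61000, 66000, 70000, 74000, 82000, 90000),
  Degree = rep(c("High School", "Bachelor's", "Master's"), each = 3)
)

# A character column is placed on the x-axis in alphabetical order;
# a factor with explicit levels sets the order instead
toy$Degree <- factor(toy$Degree,
                     levels = c("High School", "Bachelor's", "Master's"))

# One box is drawn per level of Degree
p_side <- ggplot(toy, aes(x = Degree, y = Salary)) + geom_boxplot()

print(p_side)
```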

2.3.1.4. Scatter plot

Finally, imagine we wanted to create a scatter plot depicting the relationship between two quantitative variables, Salary and Age. To do this, we set y = Salary and x = Age within our call to aes(), which indicates that Age should be plotted on the x-axis and Salary should be plotted on the y-axis. To create a scatter plot, we then use geom_point():

ggplot(employees, aes(y = Salary, x = Age)) + geom_point()
../_images/ggplot_12_0.png
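As an illustration of a slightly richer call to aes(), the sketch below maps a third variable onto the color of the points. It is not part of the original set of plots; the toy data frame and its degree labels are made up for the example.

```r
library(ggplot2)

# Hypothetical stand-in for employees (values and degree labels are made up)
toy <- data.frame(
  Age    = c(25, 31, 29, 42, 38, 50),
  Salary = c(52000, 61000, 58000, 75000, 69000, 83000),
  Degree = c("Bachelor's", "Master's", "Bachelor's",
             "Master's", "Bachelor's", "Master's")
)

# color = Degree maps a third variable onto point color and adds a legend
p_scatter <- ggplot(toy, aes(x = Age, y = Salary, color = Degree)) +
  geom_point()

print(p_scatter)
```

Each distinct value of Degree receives its own color, so the plot shows three variables at once.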

2.3.2. Categorical variables

2.3.2.1. Bar plot

We can create a bar plot of a categorical variable in ggplot2 using geom_bar(), which counts the number of rows in each category:

ggplot(employees, aes(x = Division)) + geom_bar()
../_images/ggplot_15_0.png
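Since geom_bar() does the counting for us, it can help to see the same plot built from counts we compute ourselves. The sketch below uses made-up division names (the toy data frame is hypothetical), tabulates them with table(), and draws the precomputed counts with geom_col().

```r
library(ggplot2)

# Hypothetical stand-in for employees$Division
toy <- data.frame(Division = c("Sales", "Sales", "Engineering",
                               "Operations", "Engineering", "Sales"))

# geom_bar() counts the rows in each category
p_bar <- ggplot(toy, aes(x = Division)) + geom_bar()

# The same counts computed by hand; the result has columns Division and Freq
counts <- as.data.frame(table(Division = toy$Division))

# geom_col() draws bar heights taken directly from a column of the data
p_col <- ggplot(counts, aes(x = Division, y = Freq)) + geom_col()

print(p_bar)
```

The two plots show the same bar heights; geom_col() is the more natural choice when the data already contain one row per category.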