7.4. Building Logistic Regression Models
In practice, we typically build logistic regression models so that we can make predictions about observations in the population of interest. Therefore, we want to do more than just fit logistic regression models to the data at hand; we also need to determine how accurate those models are at making predictions. To accomplish this, data scientists split the available data into distinct sets. A large portion of the data is used to build models, while a smaller portion of the data is reserved to determine the accuracy of those models. This process will be described in detail in the section Partitioning Data, but for now, we will simply split our data into two sets: a training set for building models and a validation set for estimating the accuracy of those models.
So far we have built our logistic regression models on the full data set stored in the data frame `deposit`. Here we randomly split `deposit` into two data frames: `train`, which contains 80% of the observations from `deposit`, and `validate`, which contains the remaining 20%. We will build our logistic regression models on `train`, and then calculate the predictive accuracy of those models on `validate`.
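The mechanics of splitting are covered in the section Partitioning Data, but a minimal sketch of one way to create this split is shown below (the `set.seed()` value and the `trainIndex` name are illustrative, not the book's exact code):

# One possible way to create the 80/20 split (illustrative sketch)
set.seed(123)                                       # make the random split reproducible
trainIndex <- sample(nrow(deposit), size = round(0.8 * nrow(deposit)))
train <- deposit[trainIndex, ]                      # 80% of the observations
validate <- deposit[-trainIndex, ]                  # the remaining 20%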
First, let's build a simple logistic regression model on `train`:
model1 <- glm(subscription ~ duration, data = train, family = "binomial")
Next, we will use `predict()` to apply this model to the `validate` set. We will store the predictions in a new column in `validate` called `model1Prob`:
validate$model1Prob <- predict(model1, validate, type = "response")
head(validate)
age | marital | education | default | housing | loan | contact | duration | campaign | previous | poutcome | subscription | model1Prob |
---|---|---|---|---|---|---|---|---|---|---|---|---|
35 | single | tertiary | no | yes | no | cellular | 185 | 1 | 1 | failure | 0 | 0.07061358 |
36 | married | tertiary | no | yes | no | cellular | 341 | 1 | 2 | other | 0 | 0.11420667 |
39 | married | secondary | no | yes | no | cellular | 151 | 2 | 0 | unknown | 0 | 0.06341384 |
44 | single | secondary | no | no | no | unknown | 109 | 2 | 0 | unknown | 0 | 0.05546522 |
44 | married | secondary | no | no | no | cellular | 125 | 2 | 0 | unknown | 0 | 0.05837620 |
55 | married | primary | no | yes | no | unknown | 247 | 1 | 0 | unknown | 0 | 0.08571404 |
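As a side note, `type = "response"` instructs `predict()` to return probabilities rather than log-odds. We could recover the same probabilities ourselves by requesting the log-odds and applying the logistic function, as in this sketch:

# predict() with type = "link" returns the linear predictor (log-odds)
logOdds <- predict(model1, validate, type = "link")
# Applying the logistic function converts log-odds back to probabilities,
# matching the values in model1Prob above
head(1 / (1 + exp(-logOdds)))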
Now let's build a multiple logistic regression model on `train` using all of the available variables, and determine that model's predictions in `validate`. Note that if we want to include all possible \(X\) variables in the model, we do not need to type out each variable name; instead, we can use the period character (`.`) after the tilde (`~`) to indicate that we want to use every variable in the data frame:
# Build model
model2 <- glm(subscription ~ ., data = train, family = "binomial")
# Apply model to validate set
validate$model2Prob <- predict(model2, validate, type = "response")
# Output first few observations
head(validate)
age | marital | education | default | housing | loan | contact | duration | campaign | previous | poutcome | subscription | model1Prob | model2Prob |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
35 | single | tertiary | no | yes | no | cellular | 185 | 1 | 1 | failure | 0 | 0.07061358 | 0.11232778 |
36 | married | tertiary | no | yes | no | cellular | 341 | 1 | 2 | other | 0 | 0.11420667 | 0.22667670 |
39 | married | secondary | no | yes | no | cellular | 151 | 2 | 0 | unknown | 0 | 0.06341384 | 0.03885416 |
44 | single | secondary | no | no | no | unknown | 109 | 2 | 0 | unknown | 0 | 0.05546522 | 0.02570542 |
44 | married | secondary | no | no | no | cellular | 125 | 2 | 0 | unknown | 0 | 0.05837620 | 0.06058859 |
55 | married | primary | no | yes | no | unknown | 247 | 1 | 0 | unknown | 0 | 0.08571404 | 0.02003328 |
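As an aside, the period shorthand can also be combined with a minus sign to exclude specific variables. For example, the following hypothetical model would use every variable except `duration`:

# Hypothetical model: every variable except duration
model2b <- glm(subscription ~ . - duration, data = train, family = "binomial")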
Now that we have applied our models to the `validate` set, how do we compare their performance? We will compare them using a metric known as log loss.
7.4.1. Log Loss
Classification models are often scored with the log loss metric, which is defined as:

\[logloss = -\frac{1}{n}\sum_{i=1}^{n}\left[y_ilog(\hat{p_i}) + (1-y_i)log(1-\hat{p_i})\right]\]

where
- \(n\) is the number of observations in the data set;
- \(y_i\) is the observed realization of observation \(i\); in this context it equals 1 if the person made a deposit and 0 if not; and
- \(\hat{p_i}\) is the predicted probability that observation \(i\) will make a deposit according to the model.
To help understand log loss, let \(logloss_i = y_ilog(\hat{p_i}) + (1-y_i)log(1-\hat{p_i})\) denote the contribution of a single observation \(i\), and think through the following scenarios (the worked examples use base-10 logarithms):
**Great predictions**

- \(y_i\) = 1 & \(\hat{p_i}\) = 0.99: the observation actually made a deposit, and the model estimated the probability of making a deposit to be 99%.
\[logloss_i = 1log(0.99) + (1-1)log(1-0.99) = log(0.99) \approx \mathbf{-0.004}\]
- \(y_i\) = 0 & \(\hat{p_i}\) = 0.01: the observation did not make a deposit, and the model estimated the probability of making a deposit to be 1%.
\[logloss_i = 0log(0.01) + (1-0)log(1-0.01) = log(0.99) \approx \mathbf{-0.004}\]

**Terrible predictions**

- \(y_i\) = 0 & \(\hat{p_i}\) = 0.99: the observation did not make a deposit, but the model estimated the probability of making a deposit to be 99%.
\[logloss_i = 0log(0.99) + (1-0)log(1-0.99) = log(0.01) = \mathbf{-2}\]
- \(y_i\) = 1 & \(\hat{p_i}\) = 0.01: the observation actually made a deposit, but the model estimated the probability of making a deposit to be only 1%.
\[logloss_i = 1log(0.01) + (1-1)log(1-0.01) = log(0.01) = \mathbf{-2}\]
From these examples, it is clear that the absolute value of the log loss is small when the model is close to the truth and large when the model is far off. Now consider the following:
**Good (not great) predictions**

- \(y_i\) = 1 & \(\hat{p_i}\) = 0.66: the observation actually made a deposit, and the model estimated the probability of making a deposit to be 66%.
\[logloss_i = 1log(0.66) + (1-1)log(1-0.66) = log(0.66) \approx \mathbf{-0.1805}\]
- \(y_i\) = 0 & \(\hat{p_i}\) = 0.33: the observation did not make a deposit, and the model estimated the probability of making a deposit to be 33%.
\[logloss_i = 0log(0.33) + (1-0)log(1-0.33) = log(0.67) \approx \mathbf{-0.1739}\]

**Bad (not terrible) predictions**

- \(y_i\) = 0 & \(\hat{p_i}\) = 0.66: the observation did not make a deposit, but the model estimated the probability of making a deposit to be 66%.
\[logloss_i = 0log(0.66) + (1-0)log(1-0.66) = log(0.34) \approx \mathbf{-0.4685}\]
- \(y_i\) = 1 & \(\hat{p_i}\) = 0.33: the observation actually made a deposit, but the model estimated the probability of making a deposit to be only 33%.
\[logloss_i = 1log(0.33) + (1-1)log(1-0.33) = log(0.33) \approx \mathbf{-0.4815}\]
Notice that as our predictions go from terrible to bad to good to great, the log loss decreases in absolute value (i.e., it gets closer to zero). A perfect classifier would have a log loss of precisely zero. Less ideal classifiers have progressively larger values of log loss.
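As a quick check, the sketch below reproduces the per-observation values from the eight scenarios above. (The worked examples use base-10 logarithms, so we use `log10()`; note that R's `log()` defaults to the natural logarithm.)

# Outcomes and predicted probabilities from the eight scenarios above
y <- c(1, 0, 0, 1, 1, 0, 0, 1)
p <- c(0.99, 0.01, 0.99, 0.01, 0.66, 0.33, 0.66, 0.33)
# Per-observation log loss terms (base-10 logs to match the examples)
y * log10(p) + (1 - y) * log10(1 - p)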
We can calculate log loss in R using the `LogLoss()` function from the `MLmetrics` package, which uses the following syntax:
Syntax
MLmetrics::LogLoss(y_pred, y_true)
Required arguments
- `y_pred`: An atomic vector with the model's predicted probabilities.
- `y_true`: An atomic vector with the true labels, represented numerically as 0 / 1.
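Before applying `LogLoss()` to our models, here is a quick toy check with hypothetical values; it shows that the function returns the negative mean of the per-observation terms from the formula above, computed with natural logarithms:

# Hypothetical labels and predicted probabilities
y <- c(1, 0, 1, 0)
p <- c(0.90, 0.20, 0.70, 0.40)
MLmetrics::LogLoss(y_pred = p, y_true = y)
# The same value, computed directly from the log loss formula
-mean(y * log(p) + (1 - y) * log(1 - p))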
The predictions for `model1` are stored in the column `validate$model1Prob`, and the true values are stored in `validate$subscription`. The cell below uses `LogLoss()` to calculate the log loss of the first model:
LogLoss(validate$model1Prob, validate$subscription)
Applying the same code to the second model:
LogLoss(validate$model2Prob, validate$subscription)
Because our second model has a lower log loss on the validation set, we conclude that it is superior at predicting `subscription`.