7.4. Building Logistic Regression Models

In practice, we typically build logistic regression models so that we can make predictions about observations in the population of interest. Therefore, we want to do more than just fit logistic regression models to the data at hand; we also need to determine how accurate those models are at making predictions. To accomplish this, data scientists split the available data into distinct sets. A large portion of the data is used to build models, while a smaller portion of the data is reserved to determine the accuracy of those models. This process will be described in detail in the section Partitioning Data, but for now, we will simply split our data into two sets: a training set for building models and a validation set for estimating the accuracy of those models.

So far we have built our logistic regression models on the full data set stored in the data frame deposit. Here we randomly split deposit into two data frames: train, which contains 80% of the observations from deposit, and validate, which contains 20% of the observations from deposit. Now we will build our logistic regression models on train, and then calculate the predictive accuracy of those models on validate.
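
The exact mechanics of partitioning are covered later, but as a rough illustration, the following sketch shows one way such an 80/20 split could be produced, assuming deposit is already loaded (the seed value is arbitrary and included only for reproducibility):

# Illustrative sketch of an 80/20 random split (assumes `deposit` is already loaded)
set.seed(1)                                                    # arbitrary seed for reproducibility
trainIndex <- sample(nrow(deposit), size = round(0.8 * nrow(deposit)))
train      <- deposit[trainIndex, ]                            # 80% of the observations
validate   <- deposit[-trainIndex, ]                           # remaining 20%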

First, let’s build a simple logistic regression on train:

model1 <- glm(subscription ~ duration, data = train, family = "binomial")
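
If you want to examine the fitted model before making predictions, summary() works on this model just as it does on any glm object:

# Inspect the estimated intercept and the coefficient on duration
summary(model1)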

Next, we will use predict() to apply this model to the validate set. We will store the predictions in a new column in validate called model1Prob:

validate$model1Prob <- predict(model1, validate, type = "response")
head(validate)
age marital education default housing loan  contact duration campaign previous poutcome subscription model1Prob
 35  single  tertiary      no     yes   no cellular      185        1        1  failure            0 0.07061358
 36 married  tertiary      no     yes   no cellular      341        1        2    other            0 0.11420667
 39 married secondary      no     yes   no cellular      151        2        0  unknown            0 0.06341384
 44  single secondary      no      no   no  unknown      109        2        0  unknown            0 0.05546522
 44 married secondary      no      no   no cellular      125        2        0  unknown            0 0.05837620
 55 married   primary      no     yes   no  unknown      247        1        0  unknown            0 0.08571404

Now let’s build a multiple logistic regression model on train using all of the available variables, and determine that model’s predictions in validate. Note that if we want to include all possible \(X\) variables in the model, we do not need to type out each variable name; instead we can use the period character (.) after the tilde (~) to indicate we want to use every variable in the data frame:

# Build model
model2 <- glm(subscription ~ ., data = train, family = "binomial")

# Apply model to validate set
validate$model2Prob <- predict(model2, validate, type = "response")

# Output first few observations
head(validate)
age marital education default housing loan  contact duration campaign previous poutcome subscription model1Prob model2Prob
 35  single  tertiary      no     yes   no cellular      185        1        1  failure            0 0.07061358 0.11232778
 36 married  tertiary      no     yes   no cellular      341        1        2    other            0 0.11420667 0.22667670
 39 married secondary      no     yes   no cellular      151        2        0  unknown            0 0.06341384 0.03885416
 44  single secondary      no      no   no  unknown      109        2        0  unknown            0 0.05546522 0.02570542
 44 married secondary      no      no   no cellular      125        2        0  unknown            0 0.05837620 0.06058859
 55 married   primary      no     yes   no  unknown      247        1        0  unknown            0 0.08571404 0.02003328

Now that we have applied both models to the validate set, how do we compare their performance? We will do so using a metric known as log loss.

7.4.1. Log Loss

Classification models are often scored with the log loss metric, which is defined as:

\[Log\ Loss = -\frac{1}{n}\sum^{n}_{i=1}\left[y_i\log(\hat{p_i}) + (1 - y_i)\log(1 - \hat{p_i})\right]\]

where

  • \(n\) is the number of observations in the data set;

  • \(y_i\) is the observed realization of observation \(i\); in this context it equals 1 if the person made a deposit and 0 if not;

  • \(\hat{p_i}\) is the predicted probability that observation \(i\) will make a deposit according to the model; and

  • \(\log\) denotes the natural logarithm.
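
To make the formula concrete, here is a minimal sketch of how it could be computed directly in R; the function name logLossManual is ours, purely for illustration, and later in this section we will use a prepackaged function instead:

# Direct translation of the log loss formula (illustrative sketch)
# y: numeric vector of 0/1 outcomes; p: vector of predicted probabilities
# Note: real implementations also guard against log(0) by clipping p away from 0 and 1
logLossManual <- function(y, p) {
  -mean(y * log(p) + (1 - y) * log(1 - p))
}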

To help understand log loss, think through the following scenarios:

  • Great predictions

    • \(y_i\) = 1 & \(\hat{p_i}\) = 0.99: the observation actually made a deposit, and the model estimated the probability of making a deposit to be 99%.

    \[logloss_i = -[1\log(0.99) + (1-1)\log(1-0.99)] = -\log(0.99) \approx \mathbf{0.010}\]
    • \(y_i\) = 0 & \(\hat{p_i}\) = 0.01: the observation did not make a deposit, and the model estimated the probability of making a deposit to be 1%.

    \[logloss_i = -[0\log(0.01) + (1-0)\log(1-0.01)] = -\log(0.99) \approx \mathbf{0.010}\]
  • Terrible predictions

    • \(y_i\) = 0 & \(\hat{p_i}\) = 0.99: the observation did not make a deposit, but the model estimated the probability of making a deposit to be 99%.

    \[logloss_i = -[0\log(0.99) + (1-0)\log(1-0.99)] = -\log(0.01) \approx \mathbf{4.605}\]
    • \(y_i\) = 1 & \(\hat{p_i}\) = 0.01: the observation actually made a deposit, but the model estimated the probability of making a deposit to be only 1%.

    \[logloss_i = -[1\log(0.01) + (1-1)\log(1-0.01)] = -\log(0.01) \approx \mathbf{4.605}\]

From these examples, it is clear that the log loss for an observation is small when the model's prediction is close to the truth and large when it is far off. Now consider the following:

  • Good (not great) predictions

    • \(y_i\) = 1 & \(\hat{p_i}\) = 0.66: the observation actually made a deposit, and the model estimated the probability of making a deposit to be 66%.

    \[logloss_i = -[1\log(0.66) + (1-1)\log(1-0.66)] = -\log(0.66) \approx \mathbf{0.416}\]
    • \(y_i\) = 0 & \(\hat{p_i}\) = 0.33: the observation did not make a deposit, and the model estimated the probability of making a deposit to be 33%.

    \[logloss_i = -[0\log(0.33) + (1-0)\log(1-0.33)] = -\log(0.67) \approx \mathbf{0.400}\]
  • Bad (not terrible) predictions

    • \(y_i\) = 0 & \(\hat{p_i}\) = 0.66: the observation did not make a deposit, but the model estimated the probability of making a deposit to be 66%.

    \[logloss_i = -[0\log(0.66) + (1-0)\log(1-0.66)] = -\log(0.34) \approx \mathbf{1.079}\]
    • \(y_i\) = 1 & \(\hat{p_i}\) = 0.33: the observation actually made a deposit, but the model estimated the probability of making a deposit to be only 33%.

    \[logloss_i = -[1\log(0.33) + (1-1)\log(1-0.33)] = -\log(0.33) \approx \mathbf{1.109}\]

Notice that as our predictions go from terrible to bad to good to great, the log loss moves steadily closer to zero. A perfect classifier would have a log loss of precisely zero. Less ideal classifiers have progressively larger values of log loss.
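
The cell below reproduces two of these calculations directly in R (recall that R's log() is the natural logarithm):

# Per-observation log loss for a great and a terrible prediction
-(1 * log(0.99) + (1 - 1) * log(1 - 0.99))   # y = 1, p = 0.99: ~0.010
-(0 * log(0.99) + (1 - 0) * log(1 - 0.99))   # y = 0, p = 0.99: ~4.605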

We can calculate log loss in R using the LogLoss() function from the MLmetrics package, which uses the following syntax:

Syntax

MLmetrics::LogLoss(y_pred, y_true)

  • Required arguments

    • y_pred: An atomic vector with the model’s predicted probabilities.

    • y_true: An atomic vector with the true labels, represented numerically as 0 / 1.
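
As a quick illustration of the syntax, this toy example (our own, not from the original data) scores two predictions, one for an observation with \(y_i = 1\) and one with \(y_i = 0\):

# Toy example: two observations, both predicted fairly well
MLmetrics::LogLoss(y_pred = c(0.9, 0.1), y_true = c(1, 0))   # ~0.105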

The predictions for model1 are stored in the column validate$model1Prob, and the true values are stored in validate$subscription. The cell below uses LogLoss() to calculate the log loss of the first model:

# Load the package that provides LogLoss()
library(MLmetrics)

LogLoss(validate$model1Prob, validate$subscription)
0.285120689258714

Applying the same code to the second model:

LogLoss(validate$model2Prob, validate$subscription)
0.251850851930342

Because our second model has a lower log loss on the validation set, we conclude it is superior at predicting subscription.