7.4. Building Logistic Regression Models

In practice, we typically build logistic regression models so that we can make predictions about observations in the population of interest. Therefore, we want to do more than just fit logistic regression models to the data at hand; we also need to determine how accurate those models are at making predictions. To accomplish this, data scientists split the available data into distinct sets. A large portion of the data is used to build models, while a smaller portion of the data is reserved to determine the accuracy of those models. This process will be described in detail in the section Partitioning Data, but for now, we will simply split our data into two sets: a training set for building models and a validation set for estimating the accuracy of those models.

So far we have built our logistic regression models on the full data set stored in the data frame deposit. Here we randomly split deposit into two data frames: train, which contains 80% of the observations from deposit, and validate, which contains 20% of the observations from deposit. Now we will build our logistic regression models on train, and then calculate the predictive accuracy of those models on validate.
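
The exact mechanics of partitioning are covered later, but as a rough illustration, the following sketch shows one way such an 80/20 split could be produced, assuming deposit is already loaded (the seed value is arbitrary and included only for reproducibility):

# Illustrative sketch of an 80/20 random split (assumes `deposit` is already loaded)
set.seed(1)                                                    # arbitrary seed for reproducibility
trainIndex <- sample(nrow(deposit), size = round(0.8 * nrow(deposit)))
train      <- deposit[trainIndex, ]                            # 80% of the observations
validate   <- deposit[-trainIndex, ]                           # remaining 20%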

First, let’s build a simple logistic regression on train:

model1 <- glm(subscription ~ duration, data = train, family = "binomial")
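
If you want to examine the fitted model before making predictions, summary() works on this model just as it does on any glm object:

# Inspect the estimated intercept and the coefficient on duration
summary(model1)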

Next, we will use predict() to apply this model to the validate set. We will store the predictions in a new column in validate called model1Prob:

validate$model1Prob <- predict(model1, validate, type = "response")
head(validate)
age marital education default housing loan  contact duration campaign previous poutcome subscription model1Prob
 35  single  tertiary      no     yes   no cellular      185        1        1  failure            0 0.07061358
 36 married  tertiary      no     yes   no cellular      341        1        2    other            0 0.11420667
 39 married secondary      no     yes   no cellular      151        2        0  unknown            0 0.06341384
 44  single secondary      no      no   no  unknown      109        2        0  unknown            0 0.05546522
 44 married secondary      no      no   no cellular      125        2        0  unknown            0 0.05837620
 55 married   primary      no     yes   no  unknown      247        1        0  unknown            0 0.08571404

Now let’s build a multiple logistic regression model on train using all of the available variables, and determine that model’s predictions in validate. Note that if we want to include all possible \(X\) variables in the model, we do not need to type out each variable name; instead we can use the period character (.) after the tilde (~) to indicate we want to use every variable in the data frame:

# Build model
model2 <- glm(subscription ~ ., data = train, family = "binomial")

# Apply model to validate set
validate$model2Prob <- predict(model2, validate, type = "response")

# Output first few observations
head(validate)
age marital education default housing loan  contact duration campaign previous poutcome subscription model1Prob model2Prob
 35  single  tertiary      no     yes   no cellular      185        1        1  failure            0 0.07061358 0.11232778
 36 married  tertiary      no     yes   no cellular      341        1        2    other            0 0.11420667 0.22667670
 39 married secondary      no     yes   no cellular      151        2        0  unknown            0 0.06341384 0.03885416
 44  single secondary      no      no   no  unknown      109        2        0  unknown            0 0.05546522 0.02570542
 44 married secondary      no      no   no cellular      125        2        0  unknown            0 0.05837620 0.06058859
 55 married   primary      no     yes   no  unknown      247        1        0  unknown            0 0.08571404 0.02003328

Now that we have applied both models to the validate set, how do we compare their performance? We will do so using a metric known as log loss.

7.4.1. Log Loss

Classification models are often scored with the log loss metric, which is defined as:

\[Log\ Loss = -\frac{1}{n}\sum^{n}_{i=1}\left[y_i\log(\hat{p_i}) + (1 - y_i)\log(1 - \hat{p_i})\right]\]

where

  • \(n\) is the number of observations in the data set;

  • \(y_i\) is the observed realization of observation \(i\); in this context it equals 1 if the person made a deposit and 0 if not;

  • \(\hat{p_i}\) is the predicted probability that observation \(i\) will make a deposit according to the model; and

  • \(\log\) denotes the natural logarithm.
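
To make the formula concrete, here is a minimal sketch of how it could be computed directly in R; the function name logLossManual is ours, purely for illustration, and later in this section we will use a prepackaged function instead:

# Direct translation of the log loss formula (illustrative sketch)
# y: numeric vector of 0/1 outcomes; p: vector of predicted probabilities
# Note: real implementations also guard against log(0) by clipping p away from 0 and 1
logLossManual <- function(y, p) {
  -mean(y * log(p) + (1 - y) * log(1 - p))
}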

To help understand log loss, think through the following scenarios:

  • Great predictions

    • \(y_i\) = 1 & \(\hat{p_i}\) = 0.99: the observation actually made a deposit, and the model estimated the probability of making a deposit to be 99%.

    \[logloss_i = -[1\log(0.99) + (1-1)\log(1-0.99)] = -\log(0.99) \approx \mathbf{0.010}\]
    • \(y_i\) = 0 & \(\hat{p_i}\) = 0.01: the observation did not make a deposit, and the model estimated the probability of making a deposit to be 1%.

    \[logloss_i = -[0\log(0.01) + (1-0)\log(1-0.01)] = -\log(0.99) \approx \mathbf{0.010}\]
  • Terrible predictions

    • \(y_i\) = 0 & \(\hat{p_i}\) = 0.99: the observation did not make a deposit, but the model estimated the probability of making a deposit to be 99%.

    \[logloss_i = -[0\log(0.99) + (1-0)\log(1-0.99)] = -\log(0.01) \approx \mathbf{4.605}\]
    • \(y_i\) = 1 & \(\hat{p_i}\) = 0.01: the observation actually made a deposit, but the model estimated the probability of making a deposit to be only 1%.

    \[logloss_i = -[1\log(0.01) + (1-1)\log(1-0.01)] = -\log(0.01) \approx \mathbf{4.605}\]

From these examples, it is clear that the log loss for an observation is small when the model's prediction is close to the truth and large when it is far off. Now consider the following:

  • Good (not great) predictions

    • \(y_i\) = 1 & \(\hat{p_i}\) = 0.66: the observation actually made a deposit, and the model estimated the probability of making a deposit to be 66%.

    \[logloss_i = -[1\log(0.66) + (1-1)\log(1-0.66)] = -\log(0.66) \approx \mathbf{0.416}\]
    • \(y_i\) = 0 & \(\hat{p_i}\) = 0.33: the observation did not make a deposit, and the model estimated the probability of making a deposit to be 33%.

    \[logloss_i = -[0\log(0.33) + (1-0)\log(1-0.33)] = -\log(0.67) \approx \mathbf{0.400}\]
  • Bad (not terrible) predictions

    • \(y_i\) = 0 & \(\hat{p_i}\) = 0.66: the observation did not make a deposit, but the model estimated the probability of making a deposit to be 66%.

    \[logloss_i = -[0\log(0.66) + (1-0)\log(1-0.66)] = -\log(0.34) \approx \mathbf{1.079}\]
    • \(y_i\) = 1 & \(\hat{p_i}\) = 0.33: the observation actually made a deposit, but the model estimated the probability of making a deposit to be only 33%.

    \[logloss_i = -[1\log(0.33) + (1-1)\log(1-0.33)] = -\log(0.33) \approx \mathbf{1.109}\]

Notice that as our predictions go from terrible to bad to good to great, the log loss moves steadily closer to zero. A perfect classifier would have a log loss of precisely zero. Less ideal classifiers have progressively larger values of log loss.
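
The cell below reproduces two of these calculations directly in R (recall that R's log() is the natural logarithm):

# Per-observation log loss for a great and a terrible prediction
-(1 * log(0.99) + (1 - 1) * log(1 - 0.99))   # y = 1, p = 0.99: ~0.010
-(0 * log(0.99) + (1 - 0) * log(1 - 0.99))   # y = 0, p = 0.99: ~4.605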

We can calculate log loss in R using the LogLoss() function from the MLmetrics package, which uses the following syntax:

Syntax

MLmetrics::LogLoss(y_pred, y_true)

  • Required arguments

    • y_pred: An atomic vector with the model’s predicted probabilities.

    • y_true: An atomic vector with the true labels, represented numerically as 0 / 1.
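
As a quick illustration of the syntax, this toy example (our own, not from the original data) scores two predictions, one for an observation with \(y_i = 1\) and one with \(y_i = 0\):

# Toy example: two observations, both predicted fairly well
MLmetrics::LogLoss(y_pred = c(0.9, 0.1), y_true = c(1, 0))   # ~0.105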

The predictions for model1 are stored in the column validate$model1Prob, and the true values are stored in validate$subscription. The cell below uses LogLoss() to calculate the log loss of the first model:

# Load the package that provides LogLoss()
library(MLmetrics)

LogLoss(validate$model1Prob, validate$subscription)
0.285120689258714

Applying the same code to the second model:

LogLoss(validate$model2Prob, validate$subscription)
0.251850851930342

Because our second model has a lower log loss on the validation set, we conclude it is superior at predicting subscription.