{ "cells": [ { "cell_type": "code", "execution_count": 1, "id": "e989fbf9", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Registered S3 methods overwritten by 'ggplot2':\n", " method from \n", " [.quosures rlang\n", " c.quosures rlang\n", " print.quosures rlang\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Registered S3 method overwritten by 'rvest':\n", " method from\n", " read_xml.response xml2\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "-- Attaching packages --------------------------------------- tidyverse 1.2.1 --\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "v ggplot2 3.1.1 v purrr 0.3.2 \n", "v tibble 2.1.1 v dplyr 0.8.0.1\n", "v tidyr 0.8.3 v stringr 1.4.0 \n", "v readr 1.3.1 v forcats 0.4.0 \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "-- Conflicts ------------------------------------------ tidyverse_conflicts() --\n", "x dplyr::filter() masks stats::filter()\n", "x dplyr::lag() masks stats::lag()\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Parsed with column specification:\n", "cols(\n", " age = col_double(),\n", " marital = col_character(),\n", " education = col_character(),\n", " default = col_character(),\n", " housing = col_character(),\n", " loan = col_character(),\n", " contact = col_character(),\n", " duration = col_double(),\n", " campaign = col_double(),\n", " previous = col_double(),\n", " poutcome = col_character(),\n", " subscription = col_double()\n", ")\n" ] } ], "source": [ "library(tidyverse)\n", "library(ggplot2)\n", "deposit <- read_csv(\"../_build/data/marketing.csv\")" ] }, { "cell_type": "markdown", "id": "16f6c8c4", "metadata": {}, "source": [ "# Why Not Linear Regression?\n", "\n", "At this point, a natural question might be why one cannot use linear regression to model categorical outcomes. To understand why, let's look at an example. 
We have a data set from a telephone marketing campaign run by a Portuguese banking institution. Suppose we would like to model the likelihood that a call recipient will make a term deposit at the bank. The data is saved in a data frame called `deposit`, and the first few observations are shown below:" ] }, { "cell_type": "code", "execution_count": 2, "id": "36636052", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
age | marital | education | default | housing | loan | contact | duration | campaign | previous | poutcome | subscription |
---|---|---|---|---|---|---|---|---|---|---|---|
30 | married | primary | no | no | no | cellular | 79 | 1 | 0 | unknown | 0 |
33 | married | secondary | no | yes | yes | cellular | 220 | 1 | 4 | failure | 0 |
35 | single | tertiary | no | yes | no | cellular | 185 | 1 | 1 | failure | 0 |
30 | married | tertiary | no | yes | yes | unknown | 199 | 4 | 0 | unknown | 0 |
59 | married | secondary | no | yes | no | unknown | 226 | 1 | 0 | unknown | 0 |
35 | single | tertiary | no | no | no | cellular | 141 | 2 | 3 | failure | 0 |