6. Introduction to Machine Learning¶
Machine learning is a field at the intersection of computer science and statistics which studies algorithms that learn from patterns observed in data. Sometimes the goal is to build a model that can predict unknown outcomes, while other times machine learning is used to explore the natural patterns or characteristics of a data set. The field is generally organized into three main approaches:
Supervised learning, where the goal is to build a model that can predict uncertain outcomes. Prediction is a process that uses historical data to forecast future or unknown events (e.g., using January’s sales data to determine which customers are likely to return in February). The growth in predictive machine learning models is in large part responsible for the increased uptake of data science in so many industry sectors. Today, predictive models affect most aspects of our everyday life, from Netflix’s recommendation algorithm to Google’s search engine, or even to food placement on grocery store shelves. Mastering these methods will allow one to develop forecasts that apply across various business problems and industries. The majority of the material covered in this book concerns supervised machine learning.
Unsupervised learning, where the goal is to observe any patterns or structures that might emerge from a data set. For example, an online marketing company may be interested in grouping consumers into different segments based on those consumers’ characteristics, such as their demographics, shopping behaviors, geographic location, etc. Unsupervised machine learning could be used to cluster customers into different groups based on their similarities across these features.
Reinforcement learning, where a computer program operates in a dynamic environment and repeatedly receives feedback on its performance at some target task. For example, self-driving cars are trained using reinforcement learning. Reinforcement learning is not covered in the current edition.
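To make the first two approaches concrete, the sketch below contrasts them on the same invented data (the customer spend values, labels, and function names are hypothetical, chosen only for illustration). The supervised model uses known labels to predict a category for a new customer; the unsupervised model finds groups in the data without any labels.

```python
# Toy 1-D data: annual spend (in $100s) for six customers.
spend = [1.0, 1.2, 0.9, 7.8, 8.1, 8.4]

# --- Supervised: labels are known, so we learn a rule that predicts them. ---
labels = ["casual", "casual", "casual", "loyal", "loyal", "loyal"]

def predict_label(x, spend, labels):
    """1-nearest-neighbor: return the label of the closest training point."""
    nearest = min(range(len(spend)), key=lambda i: abs(spend[i] - x))
    return labels[nearest]

print(predict_label(7.5, spend, labels))  # closest point is 7.8 -> "loyal"

# --- Unsupervised: no labels; look for structure in the data itself. ---
def two_means(xs, iters=10):
    """Tiny 1-D k-means with k=2: split the points into two clusters."""
    c0, c1 = min(xs), max(xs)  # initial centroids
    for _ in range(iters):
        g0 = [x for x in xs if abs(x - c0) <= abs(x - c1)]
        g1 = [x for x in xs if abs(x - c0) > abs(x - c1)]
        c0, c1 = sum(g0) / len(g0), sum(g1) / len(g1)
    return sorted([c0, c1])

print(two_means(spend))  # two cluster centers emerge, roughly 1.03 and 8.1
```

Note that the clustering routine never sees the labels; the two customer segments fall out of the spend values alone, which is exactly what the marketing example above describes.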
Supervised learning is generally divided into two main types of prediction problems: classification and regression. Many machine learning algorithms can be used for either classification or regression, although some can only be used for one or the other. In this section, we will focus on classification problems, although most of the algorithms we will cover (decision trees, random forests, and neural networks) can be used for both.
In classification problems the outcome of interest is discrete, meaning one is trying to predict the category an observation belongs to. The following are some examples of classification problems:
A spam detection system that classifies incoming emails as spam or not spam.
A bank using an algorithm to predict whether potential borrowers will default on or repay their loan.
A sorting machine on an assembly line that identifies apples, pears, and oranges and sorts them into different bins.
An HR department that conducts an employee engagement survey and uses the data to predict who will stay and who will leave in the upcoming year.
In regression problems the outcome of interest is continuous, so the goal is to make a prediction as close to the true value as possible. The following are some examples of regression problems:
A paint company that wants to predict its total revenue in the upcoming quarter.
An investor trying to predict the stock price of a company over the next six months.
An online marketing company that wants to predict the number of shares a sponsored post will receive.
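The distinction between the two problem types is simply the type of the outcome being predicted. The sketch below illustrates this with the sponsored-post example (the budgets, labels, and share counts are invented): the same features and the same nearest-neighbor logic support either a discrete prediction (classification) or a continuous one (regression), depending on which target we attach to the training data.

```python
# Hypothetical training data: each post is described by one feature,
# its promotion budget (in $1000s).
budgets = [1.0, 2.0, 3.0, 8.0, 9.0, 10.0]

# Classification target: a discrete category for each post.
went_viral = ["no", "no", "no", "yes", "yes", "yes"]

# Regression target: a continuous value (number of shares) for each post.
shares = [120.0, 150.0, 180.0, 900.0, 1000.0, 1100.0]

def nearest(x):
    """Index of the training post with the closest budget."""
    return min(range(len(budgets)), key=lambda i: abs(budgets[i] - x))

# Classification: predict a category for a new post with a budget of 8.5.
print(went_viral[nearest(8.5)])  # discrete outcome: "yes"

# Regression: predict a number for the same post.
print(shares[nearest(8.5)])      # continuous outcome: 900.0
```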
6.1. To Explain or To Predict?¶
The chapter Linear Regression demonstrates how one can model a continuous outcome variable (\(Y\)) as a function of one or more independent variables (\(X\)). In particular, it is shown that linear regression can be used both to estimate the relationship between \(X\) and \(Y\), and to predict the value of \(Y\) for uncertain observations. Recall that with linear regression, we assume the true relationship between \(Y\) and \(X\) is of the following form:

\[ Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_qX_q + \epsilon \tag{6.1} \]
By applying the method of least squares to our sample data, we can estimate the values of the parameters in the equation above, producing the following estimated regression equation:

\[ \hat{Y} = b_0 + b_1X_1 + b_2X_2 + \dots + b_qX_q \tag{6.2} \]
The parameters of this model (\(b_0\), \(b_1\), etc.) provide estimates of the true, underlying relationship between the variables of interest. Additionally, we can use the model to predict the value of \(Y\) for new observations; by plugging the appropriate values of \(X\) into the estimated regression equation, we will get a prediction for \(Y\). Therefore, linear regression can be used for two main purposes: explaining the relationship between \(X\) and \(Y\), and predicting uncertain values of \(Y\).
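Both uses can be sketched with simple least squares in a few lines of Python. The data below are invented for illustration; for a single predictor, the slope estimate is the covariance of \(X\) and \(Y\) divided by the variance of \(X\).

```python
# Invented sample data: advertising spend (X) and revenue (Y).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares estimates of the slope (b1) and intercept (b0).
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar

# Use 1: explanation -- b1 estimates how Y changes per unit change in X.
print(f"estimated slope b1 = {b1:.2f}, intercept b0 = {b0:.2f}")

# Use 2: prediction -- plug a new X into the estimated regression equation.
x_new = 6.0
y_hat = b0 + b1 * x_new
print(f"predicted Y at X = {x_new}: {y_hat:.2f}")
```

The same fitted coefficients serve both purposes: read off `b1` to describe the relationship, or evaluate the estimated equation at a new `x` to make a prediction.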
The distinction between these two objectives was highlighted in a 2001 article by statistician Leo Breiman, titled Statistical Modeling: The Two Cultures. In the article, Breiman explains that the field of statistics has traditionally been characterized by a “data modeling culture” in which the emphasis was placed on models which, “[assume] that the data are generated by a given stochastic data model.” In other words, the traditional approach in statistics was to assume an underlying data generating process, such as the one shown in equation (6.1), and then estimate the parameters of that process using sample data (as in equation (6.2)). Such models can be used to make predictions, but their primary purpose is to estimate the underlying relationships between the variables of interest.
In contrast, Breiman describes the “algorithmic modeling culture” that characterizes the newer field of machine learning. Under this approach, less importance is placed on estimating the true, underlying relationship between \(X\) and \(Y\). Instead, the goal is simply to discover a procedure that can accurately predict \(Y\) as a function of \(X\), regardless of whether that process correctly describes the relationship between \(X\) and \(Y\). The key distinction is that these methods do not assume an underlying model (such as equation (6.1)) that defines how the data were generated; instead, they produce a set of rules (or an “algorithm”) that predicts \(Y\) based on the values of \(X\). Consequently, this type of model does not always allow us to infer much about the structural relationship between our variables. Instead, we simply use it to make predictions about new observations. The decision tree model covered later in this note is an example of this approach.
The choice of which modeling approach to take is highly context dependent. In a 2019 article in the Harvard Data Science Review, Nathan Sanders emphasized the importance of adopting a balanced perspective on prediction and inference in industry settings. As an example from the entertainment industry, he describes how both approaches are necessary for modeling the box office returns of feature films. For the purpose of financial modeling, the primary objective is to forecast the revenue of an upcoming film as accurately as possible. Here less emphasis is placed on understanding how different characteristics of the film relate to revenue; all that matters is that one can accurately predict the film’s ticket sales. In other contexts, studios need to take a more inferential approach to understand how the characteristics of a film relate to box office success. For example, by understanding the relationship between the marketing budget of a film and ticket sales, firms can adjust their marketing strategy to optimize revenue.
As we have seen, there is not always a clean separation between the two approaches outlined in Breiman’s article. Although linear regression falls primarily under the “data modeling culture”, it can also be used to make predictions. The same is true of the logistic regression model, which is covered in the next section. The decision tree model covered in a later section falls more clearly under the “algorithmic modeling culture”.