Machine learning exists at the intersection of traditional
mathematics and statistics with software engineering and computer science.
In this book, we will describe several tools from traditional statistics
that allow you to make sense of the world. Statistics has almost always
been concerned with learning something interpretable from data, whereas
machine learning has been concerned with turning data into something
practical and usable. This contrast makes it easier to understand the term
machine learning: Machine learning is concerned with
teaching computers something about the world, so that
they can use that knowledge to perform other tasks. In contrast,
statistics is more concerned with developing tools for teaching
humans something about the world, so that they can
think more clearly about the world in order to make better decisions.

In machine learning, the learning occurs by
extracting as much information from the data as possible (or reasonable)
through algorithms that parse the basic structure of the data and
distinguish the signal from the noise. After they have found the signal,
or pattern, the algorithms simply decide that
everything else that's left over is noise. For that reason, machine learning techniques are also referred
to as pattern recognition algorithms. We can train
our machines to learn about how data is generated in a given context,
which allows us to use these algorithms to automate many useful tasks.
This is where the term training set comes
from, referring to the set of data used to build a machine learning
process. The notion of observing data, learning from it, and then
automating some process of recognition is at the heart of machine learning
and forms the primary arc of this book. Two particularly important types
of patterns constitute the core problems we'll provide you with tools to
solve: the problem of classification and the problem of regression, which
will be introduced over the course of this book.

Whenever you work with data, it's helpful to imagine breaking up
your analysis into two completely separate parts: exploration and
confirmation. The distinction between exploratory data analysis and
confirmatory data analysis comes down to us from the famous John Tukey, who emphasized the importance of designing simple tools
for practical data analysis. In Tukey's mind, the exploratory steps in
data analysis involve using summary tables and basic visualizations to
search for hidden patterns in your data. In this chapter, we describe
some of the basic tools that R provides for summarizing your data
numerically, and then we teach you how to make sense of the results.
After that, we show you some of the tools that exist in R for
visualizing your data, and at the same time, we give you a whirlwind
tour of the basic visual patterns that you should keep an eye out for in
your data.
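To give a flavor of those numerical summaries, here is a minimal sketch using R's built-in faithful data set as a stand-in for your own data; the specific calls are common summary tools of our choosing, not a prescribed workflow:

    data(faithful)            # eruption data bundled with R
    summary(faithful)         # min, quartiles, median, mean, max per column
    mean(faithful$eruptions)  # a single column's average...
    sd(faithful$eruptions)    # ...and its standard deviation
    quantile(faithful$eruptions, probs = c(0.1, 0.9))  # tail cutoffs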
But before you start searching through your first data set,
we should warn you about a real danger that's present whenever you
explore data: you're likely to find patterns that aren't really there.
The human mind is designed to find patterns in the world and will do so
even when those patterns are just quirks of chance. You don't need a
degree in statistics to know that we human beings will easily find
shapes in clouds after looking at them for only a few seconds. And
plenty of people have convinced themselves that they've discovered
hidden messages in run-of-the-mill texts like Shakespeare's plays.
Because humans are vulnerable to discovering patterns that won't stand
up to careful scrutiny, the exploratory step in data analysis can't
exist in isolation; it needs to be accompanied by a confirmatory step.
Think of confirmatory data analysis as a sort of mental hygiene routine
that we use to clean off our beliefs about the world after we've gone
slogging through the messy and sometimes lawless world of exploratory
data analysis.

At the very end of Chapter 2, we
quickly presented an example of classification. We used heights and
weights to predict whether a person was a man or a woman. With our
example graph, we were able to draw a line that split the data into two
groups: one group where we would predict male and another group where
we would predict female. This line was called a separating hyperplane,
but from now on we'll use the term decision boundary, because we'll be
working with data that can't be classified properly using only straight
lines. For example, imagine that your data looked like the data set
shown in Example 3-1.
This plot might depict people who are at risk for a certain
ailment and those who are not. Above and below the black horizontal
lines we might predict that a person is at risk, but inside we would
predict good health. These black lines are thus our decision boundary.
Suppose that the blue dots represent healthy people and the red dots
represent people who suffer from a disease. If that were the case, the
two black lines would work quite well as a decision boundary for
classifying people as either healthy or sick.
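A minimal sketch of data with this shape, using invented variable names and thresholds (the band between -1 and 1 plays the role of the region between the black lines):

    set.seed(1)
    x <- runif(500, 0, 10)
    y <- runif(500, -3, 3)
    at.risk <- abs(y) > 1   # outside the band: predicted at risk
    plot(x, y, col = ifelse(at.risk, "red", "blue"),
         pch = 19, xlab = "X", ylab = "Y")
    abline(h = c(-1, 1), lwd = 2)  # the two-line decision boundary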
In Chapter 3 we
discussed in detail the concept of binary
classification, that is, placing items into one of two types
or classes. In many cases, we will be satisfied with an approach that
can make such a distinction. But what if the items in one class are not
created equally and we want to rank the items within a class? In short,
what if we want to say that one email is the most spammy and another is
the second most spammy, or we want to distinguish among them in some
other meaningful way? Suppose we not only wanted to filter spam from our
email, but we also wanted to place more important messages at the top
of the queue. This is a very common problem in machine learning, and it
will be the focus of this chapter.
Generating rules for ranking a list of items is an increasingly
common task in machine learning, yet you may not have thought of it in
these terms. More likely, you have heard of something like a
recommendation system, which implicitly produces a
ranking of products. Even if you have not heard of a recommendation
system, it's almost certain that you have used or interacted with a
recommendation system at some point. Some of the most successful
e-commerce websites have benefited from leveraging data on their users to
generate recommendations for other products their users might be interested in.

In the abstract, regression is a very simple concept: you
want to predict one set of numbers given another set of numbers. For
example, actuaries might want to predict how long a person will live
given their smoking habits, while meteorologists might want to predict
the next day's temperature given the previous day's temperature. In
general, we'll call the numbers you're given inputs and the numbers you
want to predict outputs. You'll also sometimes hear
people refer to the inputs as predictors or features.
What makes regression different from classification is that the
outputs are really numbers. In classification problems like those we
described in Chapter 3, you might
use numbers as a dummy code for a categorical distinction so that 0
represents ham and 1 represents spam. But these numbers are just
symbols; we're not exploiting the numberness of 0 or 1 when we use
dummy variables. In regression, the essential fact about the outputs is
that they really are numbers: you want to predict things like
temperatures, which could be 50 degrees or 71 degrees. Because you're
predicting numbers, you want to be able to make strong statements about
the relationship between the inputs and the outputs: you might want to
say, for example, that when the number of packs of cigarettes a person
smokes per day doubles, their predicted life span gets cut in half.
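For a concrete flavor, here is a minimal sketch of this kind of regression using R's lm, with made-up numbers standing in for real actuarial data:

    set.seed(1)
    packs <- runif(100, 0, 3)                        # hypothetical packs smoked per day
    lifespan <- 80 - 7 * packs + rnorm(100, sd = 5)  # invented relationship
    fit <- lm(lifespan ~ packs)                      # fit a straight line
    coef(fit)                                        # intercept and slope
    predict(fit, data.frame(packs = 2))              # predicted lifespan at 2 packs/day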
While we told you the truth in Chapter 5 when we said that linear
regression assumes that the relationship between two variables is a
straight line, it turns out you can also use linear regression to
capture relationships that aren't well-described by a straight line. To
show you what we mean, imagine that you have the data shown in panel A
of Figure 6-1.

Figure 6-1. Modeling nonlinear data: (A) visualizing nonlinear
relationships; (B) nonlinear relationships and linear regression; (C)
structured residuals; (D) results from a generalized additive model

It's obvious from looking at this scatterplot that the
relationship between X and Y isn't well-described by a straight line.
Indeed, plotting the regression line shows us exactly what will go wrong
if we try to use a line to capture the pattern in this data; panel B of
Figure 6-1 shows the result.
We can see that we make systematic errors in our predictions if we
use a straight line: we overpredict y at small and large values of x,
and we underpredict y at medium values of x. This is easiest to see in
a residuals plot, as shown in panel C of Figure 6-1. In this plot, you
can see all of the structure of the original data set, as none of the
structure is captured by the default linear regression model.
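A minimal sketch of how such structured residuals show up, using a simulated sine-wave data set of our own invention (not necessarily the data behind Figure 6-1):

    set.seed(1)
    x <- seq(0, 1, by = 0.01)
    y <- sin(2 * pi * x) + rnorm(length(x), sd = 0.1)  # clearly nonlinear data
    fit <- lm(y ~ x)              # force a straight line through it
    plot(x, residuals(fit))       # the leftover structure is plain to see
    abline(h = 0, lty = 2)        # residuals should hover around zero, but don't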
So far we've treated most of the algorithms in this book as
partial black boxes, in that we've focused on understanding the inputs
you're expected to use and the outputs you'll get. Essentially, we've
treated machine learning algorithms as a library of functions for
performing prediction tasks.
In this chapter, we're going to examine some of the techniques
that are used to implement the most basic machine learning algorithms.
As a starting point, we're going to put together a function for fitting
simple linear regression models with only one predictor. That example
will allow us to motivate the idea of viewing the act of fitting a model
to data as an optimization problem. An optimization problem is one in
which we have a machine with some knobs that we can turn to change the
machines settings and a way to measure how well the machine is
performing with the current settings. We want to find the best possible
settings, which will be those that maximize some simple measure of how
well the machine is performing. That point will be called the optimum.
Reaching it will be called optimization.
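Here is a minimal sketch of that idea under our own assumptions (simulated data, an invented squared.error name), using R's optim to turn the knobs for us:

    set.seed(1)
    x <- runif(100, 0, 10)
    y <- 3 + 2 * x + rnorm(100)       # true settings: intercept 3, slope 2

    # The measure of how well the machine performs at a given setting:
    # here, the sum of squared prediction errors (smaller is better).
    squared.error <- function(params) {
      predictions <- params[1] + params[2] * x
      sum((y - predictions) ^ 2)
    }

    # optim searches the settings, starting from c(0, 0).
    optim(c(0, 0), squared.error)$par  # lands near the optimum: ~3 and ~2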
So far, all of our work with data has been based on
prediction tasks: we've tried to classify emails or web page views where
we had a training set of examples for which we knew the correct answer.
As we mentioned early on in this book, learning from data when we have a
training sample with the correct answer is called supervised learning:
we find structure in our data using a signal that tells us whether or
not we're doing a good job of discovering real patterns.
But often we want to find structure without having any answers
available to us about how well we're doing; we call this unsupervised
learning. For example, we might want to perform dimensionality
reduction, which happens when we shrink a table with a huge number of
columns into a table with a small number of columns. If you have too
many columns to deal with, this dimensionality reduction goes a long way
toward making your data set comprehensible. Although you clearly lose
information when you replace many columns with a single column, the
gains in understanding are often valuable, especially when you're
exploring a new data set.
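For instance, principal components analysis, one standard dimensionality reduction technique (our choice of example, not anything prescribed by this passage), can collapse redundant columns into one. A minimal sketch on invented data:

    set.seed(1)
    # Three columns that mostly move together, i.e., redundant information.
    driver <- rnorm(100)
    wide <- data.frame(a = driver + rnorm(100, sd = 0.1),
                       b = driver + rnorm(100, sd = 0.1),
                       c = driver + rnorm(100, sd = 0.1))
    pca <- prcomp(wide, scale. = TRUE)
    summary(pca)          # the first component explains most of the variance
    narrow <- pca$x[, 1]  # one column standing in for three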
There are many situations where we might want to know how
similar the members of a group of people are to one another. For
instance, suppose that we were a brand marketing company that had just
completed a research survey on a potential new brand. In the survey, we
showed a group of people several features of the brand and asked them to
rank the brand on each of these features using a five-point scale. We
also collected a bunch of socioeconomic data from the subjects, such as
age, gender, race, what zip code they live in, and their approximate
income.

From this survey, we want to understand how the brand appeals
across all of these socioeconomic variables. Most importantly, we want
to know whether the brand has broad appeal. An alternative way of
thinking about this problem is that we want to see whether individuals
who like most of the brand features have diverse socioeconomic features. A
useful means of doing this would be to visualize how the survey
respondents cluster. We could then use various visual cues to indicate
their memberships in different socioeconomic categories. That is, we
would want to see a large amount of mixing between genders, as well as
among races and economic strata.

In the last chapter, we saw how we could use simple
correlational techniques to create a measure of similarity between the
members of Congress based on their voting records. In this chapter,
we're going to talk about how you can use those same sorts of similarity
metrics to recommend items to a website's users.
The algorithm we'll use is called k-nearest
neighbors. It's arguably the most intuitive of all the machine learning
algorithms that we present in this book. Indeed, the simplest form of
k-nearest neighbors is the sort of algorithm most
people would spontaneously invent if asked to make recommendations using
similarity data: they'd recommend the song that's closest to the songs a
user already likes, but not yet in that list. That intuition is
essentially a 1-nearest neighbor algorithm. The full
k-nearest neighbor algorithm amounts to a
generalization of this intuition where you draw on more than one data
point before making a recommendation.
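Here is a minimal from-scratch sketch of that generalization, applied to classification; the function name knn.predict and the toy data are our own inventions (packages such as class ship production versions):

    # Predict a label for `query` by majority vote among its k closest rows.
    knn.predict <- function(train, labels, query, k = 5) {
      diffs <- sweep(train, 2, query)             # subtract query from each row
      dists <- sqrt(rowSums(diffs ^ 2))           # Euclidean distances
      neighbors <- order(dists)[1:k]              # indices of the k nearest
      names(which.max(table(labels[neighbors])))  # most common label wins
    }

    # Toy usage: two obvious clusters.
    set.seed(1)
    train <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
                   matrix(rnorm(20, mean = 5), ncol = 2))
    labels <- rep(c("low", "high"), each = 10)
    knn.predict(train, labels, query = c(4.8, 5.1), k = 3)  # "high"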
Social networks are everywhere. According to Wikipedia, there are
over 200 active social networking websites on the Internet, excluding
dating sites. As you can see from Figure 11-1, according to
Google Trends there has been a steady and constant rise in global
interest in social networks since 2005. This is perfectly reasonable:
the desire for social interaction is a fundamental part of human nature,
and it should come as no surprise that this innate social nature would
manifest in our technologies. But the mapping and modeling of social
networks is by no means news.
In the mathematics community, an example of social network
analysis at work is the calculation of a person's Erdős number, which
measures her distance from the mathematician Paul Erdős. Erdős was
arguably the most prolific mathematician of the 20th century and
published over 1,500 papers during his career. Many of these papers were
coauthored, and Erdős numbers measure a mathematician's distance from
the circle of coauthors that Erdős enlisted. If a mathematician
coauthored a paper with Erdős, then she would have an Erdős number of
one, i.e., her distance to Erdős in the network of 20th-century
mathematics is one. If another author collaborated with one of Erdős's
coauthors but not with Erdős directly, then that author would have an
Erdős number of two, and so on. This metric has been used, though rarely
seriously, as a crude measure of a person's prominence in mathematics.
Erdős numbers allow us to quickly summarize the massive network of
mathematicians orbiting around Paul Erdős.
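Computing an Erdős number is just a shortest-path calculation on the coauthorship graph. A minimal sketch, assuming the igraph package is installed and using a made-up toy network:

    library(igraph)
    # A tiny invented coauthorship network; an edge means "wrote a paper together."
    g <- graph_from_literal(Erdos - Alice, Erdos - Bob,
                            Alice - Carol, Carol - Dave)
    distances(g, v = "Erdos")
    # Alice and Bob get Erdős number 1, Carol gets 2, Dave gets 3.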
In Chapter 3, we
introduced the idea of decision boundaries and noted that data sets in
which the decision boundary isn't linear pose a problem for simple
classification algorithms. In Chapter 6, we showed you how to
perform logistic regression, a classification algorithm that works by
constructing a linear decision boundary. And in both chapters, we
promised to describe a technique called the kernel trick that could be
used to solve problems with nonlinear decision boundaries. Let's deliver
on that promise by introducing a new classification algorithm called the
support vector machine (SVM for short), which allows you to use multiple
different kernels to find nonlinear decision boundaries. We'll use an
SVM to classify points from a data set with a nonlinear decision
boundary. Specifically, we'll work with the data set shown in Figure 12-1.
Looking at this data set, it should be clear that the points from
Class 0 are on the periphery, whereas points from Class 1 are in the
center of the plot. This sort of nonlinear decision boundary can't be
discovered using a simple classification algorithm like the logistic
regression algorithm we described in Chapter 6. Let's demonstrate that
by trying to use logistic regression via the glm function. We'll then
look into the reason why logistic regression fails.
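A minimal sketch of that demonstration on simulated ring-shaped data, our own stand-in for the data set in Figure 12-1, assuming the e1071 package for the SVM:

    library(e1071)
    set.seed(1)
    # Class 1 in the center, Class 0 on the periphery.
    x1 <- rnorm(200)
    x2 <- rnorm(200)
    y <- factor(ifelse(x1 ^ 2 + x2 ^ 2 < 1, 1, 0))
    df <- data.frame(x1, x2, y)

    # Logistic regression draws a single straight line, which can't separate
    # a ring: accuracy is no better than always predicting the majority class.
    logit.fit <- glm(y ~ x1 + x2, data = df, family = binomial)
    preds <- ifelse(predict(logit.fit) > 0, "1", "0")
    mean(preds == df$y)

    # An SVM with a radial kernel bends the boundary around the center.
    svm.fit <- svm(y ~ x1 + x2, data = df, kernel = "radial")
    mean(predict(svm.fit) == df$y)  # substantially higher accuracy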