Medium 9781449303716

Machine Learning for Hackers

If you’re an experienced programmer interested in crunching data, this book will get you started with machine learning—a toolkit of algorithms that enables computers to train themselves to automate useful tasks. Authors Drew Conway and John Myles White help you understand machine learning and statistics tools through a series of hands-on case studies, instead of a traditional math-heavy presentation.

Each chapter focuses on a specific problem in machine learning, such as classification, prediction, optimization, and recommendation. Using the R programming language, you’ll learn how to analyze sample datasets and write simple machine learning algorithms. Machine Learning for Hackers is ideal for programmers from any background, including business, government, and academic research.

  • Develop a naïve Bayesian classifier to determine if an email is spam, based only on its text
  • Use linear regression to predict the number of page views for the top 1,000 websites
  • Learn optimization techniques by attempting to break a simple letter cipher
  • Compare and contrast U.S. Senators statistically, based on their voting records
  • Build a “whom to follow” recommendation system from Twitter data

List price: $42.99

Your Price: $34.39

You Save: 20%

 

13 Slices

1. Using R

Machine learning exists at the intersection of traditional mathematics and statistics with software engineering and computer science. In this book, we will describe several tools from traditional statistics that allow you to make sense of the world. Statistics has almost always been concerned with learning something interpretable from data, whereas machine learning has been concerned with turning data into something practical and usable. This contrast makes it easier to understand the term machine learning: Machine learning is concerned with teaching computers something about the world, so that they can use that knowledge to perform other tasks. In contrast, statistics is more concerned with developing tools for teaching humans something about the world, so that they can think more clearly about the world in order to make better decisions.

In machine learning, the learning occurs by extracting as much information from the data as possible (or reasonable) through algorithms that parse the basic structure of the data and distinguish the signal from the noise. After they have found the signal, or pattern, the algorithms simply decide that everything else that’s left over is noise. For that reason, machine learning techniques are also referred to as pattern recognition algorithms. We can train our machines to learn about how data is generated in a given context, which allows us to use these algorithms to automate many useful tasks. This is where the term training set comes from, referring to the set of data used to build a machine learning process. The notion of observing data, learning from it, and then automating some process of recognition is at the heart of machine learning and forms the primary arc of this book. Two particularly important types of patterns constitute the core problems we’ll provide you with tools to solve: the problem of classification and the problem of regression, which will be introduced over the course of this book.

 

2. Data Exploration

Whenever you work with data, it’s helpful to imagine breaking up your analysis into two completely separate parts: exploration and confirmation. The distinction between exploratory data analysis and confirmatory data analysis comes down to us from the famous John Tukey,[6] who emphasized the importance of designing simple tools for practical data analysis. In Tukey’s mind, the exploratory steps in data analysis involve using summary tables and basic visualizations to search for hidden patterns in your data. In this chapter, we describe some of the basic tools that R provides for summarizing your data numerically, and then we teach you how to make sense of the results. After that, we show you some of the tools that exist in R for visualizing your data, and at the same time, we give you a whirlwind tour of the basic visual patterns that you should keep an eye out for in any visualization.

But before you start searching through your first data set, we should warn you about a real danger that’s present whenever you explore data: you’re likely to find patterns that aren’t really there. The human mind is designed to find patterns in the world and will do so even when those patterns are just quirks of chance. You don’t need a degree in statistics to know that we human beings will easily find shapes in clouds after looking at them for only a few seconds. And plenty of people have convinced themselves that they’ve discovered hidden messages in run-of-the-mill texts like Shakespeare’s plays. Because humans are vulnerable to discovering patterns that won’t stand up to careful scrutiny, the exploratory step in data analysis can’t exist in isolation; it needs to be accompanied by a confirmatory step. Think of confirmatory data analysis as a sort of mental hygiene routine that we use to clean off our beliefs about the world after we’ve gone slogging through the messy and sometimes lawless world of exploratory data visualization.
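The danger of finding patterns that are not really there can be made concrete with a small simulation. The sketch below is written in Python purely for illustration (the book itself works in R), and every name and parameter in it is hypothetical. It fills a table with pure noise and then searches all pairs of columns for the strongest correlation, which is the kind of open-ended search that exploration encourages.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def strongest_chance_correlation(n_columns=40, n_rows=20, seed=1):
    """Search unrelated noise columns for the most correlated pair."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    columns = [[rng.gauss(0, 1) for _ in range(n_rows)]
               for _ in range(n_columns)]
    best = 0.0
    for i in range(n_columns):
        for j in range(i + 1, n_columns):
            r = pearson(columns[i], columns[j])
            if abs(r) > abs(best):
                best = r
    return best

best = strongest_chance_correlation()
print(f"strongest correlation among pure-noise columns: {best:.2f}")
```

With 40 columns there are 780 pairs to compare, so a sizable correlation tends to turn up even though no column has anything to do with any other. A confirmatory step, such as checking the same pair on fresh data, is what would expose it as a quirk of chance.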

 

3. Classification: Spam Filtering

At the very end of Chapter 2, we quickly presented an example of classification. We used heights and weights to predict whether a person was a man or a woman. With our example graph, we were able to draw a line that split the data into two groups: one group where we would predict male and another group where we would predict female. This line was called a separating hyperplane, but from now on we’ll use the term decision boundary, because we’ll be working with data that can’t be classified properly using only straight lines. For example, imagine that your data looked like the data set shown in Example 3-1.

This plot might depict people who are at risk for a certain ailment and those who are not. Above and below the black horizontal lines we might predict that a person is at risk, but inside we would predict good health. These black lines are thus our decision boundary. Suppose that the blue dots represent healthy people and the red dots represent people who suffer from a disease. If that were the case, the two black lines would work quite well as a decision boundary for classifying people as either healthy or sick.
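A decision boundary made of two horizontal lines amounts to a simple rule: predict "at risk" outside the band and "healthy" inside it. The following Python sketch illustrates that rule (the book itself uses R). The thresholds and the data points are invented for illustration and are not taken from the book's plot. It also checks that no single cutoff does as well on this toy data.

```python
def predict_at_risk(value, lower=2.0, upper=8.0):
    """Two-line decision boundary: at risk outside [lower, upper]."""
    return value < lower or value > upper

# Hypothetical (vertical position, is_sick) pairs.
points = [(0.5, True), (1.5, True), (3.0, False), (5.0, False),
          (7.0, False), (8.5, True), (9.5, True)]

def accuracy(predict, points):
    """Fraction of points whose prediction matches the label."""
    return sum(predict(y) == sick for y, sick in points) / len(points)

def best_single_threshold_accuracy(points):
    """Try every single cutoff, in both orientations, and keep the best."""
    best = 0.0
    for threshold, _ in points:
        for sick_above in (True, False):
            def rule(y):
                return (y > threshold) if sick_above else (y <= threshold)
            best = max(best, accuracy(rule, points))
    return best

band_accuracy = accuracy(predict_at_risk, points)
single_accuracy = best_single_threshold_accuracy(points)
print(band_accuracy, single_accuracy)
```

On this toy data the band classifies every point correctly, while the best single cutoff always misclassifies the sick group on the far side of the band.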

 

4. Ranking: Priority Inbox

In Chapter 3 we discussed in detail the concept of binary classification, that is, placing items into one of two types or classes. In many cases, we will be satisfied with an approach that can make such a distinction. But what if the items in one class are not created equally and we want to rank the items within a class? In short, what if we want to say that one email is the most spammy and another is the second most spammy, or we want to distinguish among them in some other meaningful way? Suppose we not only wanted to filter spam from our email, but we also wanted to place more important messages at the top of the queue. This is a very common problem in machine learning, and it will be the focus of this chapter.
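The difference between classifying and ranking can be sketched in a few lines. The Python below is illustrative only (the book works in R), and the subjects, scores, and threshold are all made up. It assumes each message already has a numeric spam score.

```python
# Hypothetical spam scores in [0, 1]; higher means "more spammy".
scores = {
    "meeting notes": 0.02,
    "newsletter": 0.35,
    "cheap pills": 0.88,
    "lottery win": 0.97,
}

def classify(score, threshold=0.5):
    """Binary classification: collapse the score into one of two classes."""
    return "spam" if score >= threshold else "ham"

def rank(scores):
    """Ranking: order the items from most to least spammy."""
    return sorted(scores, key=scores.get, reverse=True)

labels = {subject: classify(score) for subject, score in scores.items()}
ranking = rank(scores)
print(labels)
print(ranking)
```

Classification gives "cheap pills" and "lottery win" the same label, whereas ranking keeps the information that one of them scores higher than the other.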

Generating rules for ranking a list of items is an increasingly common task in machine learning, yet you may not have thought of it in these terms. More likely, you have heard of something like a recommendation system, which implicitly produces a ranking of products. Even if you have not heard of a recommendation system, it’s almost certain that you have used or interacted with one at some point. Some of the most successful ecommerce websites have benefited from leveraging data on their users to generate recommendations for other products their users might be interested in.

 

5. Regression: Predicting Page Views

In the abstract, regression is a very simple concept: you want to predict one set of numbers given another set of numbers. For example, actuaries might want to predict how long a person will live given their smoking habits, while meteorologists might want to predict the next day’s temperature given the previous day’s temperature. In general, we’ll call the numbers you’re given inputs and the numbers you want to predict outputs. You’ll also sometimes hear people refer to the inputs as predictors or features.

What makes regression different from classification is that the outputs are really numbers. In classification problems like those we described in Chapter 3, you might use numbers as a dummy code for a categorical distinction so that 0 represents ham and 1 represents spam. But these numbers are just symbols; we’re not exploiting the numberness of 0 or 1 when we use dummy variables. In regression, the essential fact about the outputs is that they really are numbers: you want to predict things like temperatures, which could be 50 degrees or 71 degrees. Because you’re predicting numbers, you want to be able to make strong statements about the relationship between the inputs and the outputs: you might want to say, for example, that when the number of packs of cigarettes a person smokes per day doubles, their predicted life span gets cut in half.

 

6. Regularization: Text Regression

While we told you the truth in Chapter 5 when we said that linear regression assumes that the relationship between two variables is a straight line, it turns out you can also use linear regression to capture relationships that aren’t well-described by a straight line. To show you what we mean, imagine that you have the data shown in panel A of Figure 6-1.

Figure 6-1. Modeling nonlinear data: (A) visualizing nonlinear relationships; (B) nonlinear relationships and linear regression; (C) structured residuals; (D) results from a generalized additive model

It’s obvious from looking at this scatterplot that the relationship between X and Y isn’t well-described by a straight line. Indeed, plotting the regression line shows us exactly what will go wrong if we try to use a line to capture the pattern in this data; panel B of Figure 6-1 shows the result.

We can see that we make systematic errors in our predictions if we use a straight line: at small and large values of x, we overpredict y, and we underpredict y for medium values of x. This is easiest to see in a residuals plot, as shown in panel C of Figure 6-1. In this plot, you can see all of the structure of the original data set, as none of the structure is captured by the default linear regression model.
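The pattern of overpredicting at the extremes and underpredicting in the middle can be reproduced on a small invented data set. The Python sketch below is illustrative only: the book uses R, and the curve used here (y = 12x - x^2, with no noise) is an assumption rather than the data behind Figure 6-1. It fits an ordinary least-squares line and prints the residuals, defined as observed minus predicted.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical hump-shaped data: y rises and then falls as x grows.
xs = list(range(11))
ys = [12 * x - x ** 2 for x in xs]

intercept, slope = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

for x, r in zip(xs, residuals):
    print(f"x={x:2d}  residual={r:6.1f}")
```

A negative residual means the line overpredicts and a positive one means it underpredicts. Here the residuals are negative at both ends and positive in the middle, so the curve that the straight line missed shows up intact in the residuals.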

 

7. Optimization: Breaking Codes

So far we’ve treated most of the algorithms in this book as partial black boxes, in that we’ve focused on understanding the inputs you’re expected to use and the outputs you’ll get. Essentially, we’ve treated machine learning algorithms as a library of functions for performing prediction tasks.

In this chapter, we’re going to examine some of the techniques that are used to implement the most basic machine learning algorithms. As a starting point, we’re going to put together a function for fitting simple linear regression models with only one predictor. That example will allow us to motivate the idea of viewing the act of fitting a model to data as an optimization problem. An optimization problem is one in which we have a machine with some knobs that we can turn to change the machine’s settings and a way to measure how well the machine is performing with the current settings. We want to find the best possible settings, which will be those that maximize some simple measure of how well the machine is performing. That point will be called the optimum. Reaching it will be called optimization.
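To make the knobs-and-measure picture concrete, here is a minimal Python sketch (the book itself works in R). It treats the intercept and slope of a one-predictor linear regression as the two knobs, uses squared error as the measure (lower is better, so maximizing performance means minimizing error), and turns the knobs by plain gradient descent. Gradient descent, the step size, the iteration count, and the data are all choices made for this illustration, not necessarily the approach the chapter takes.

```python
def squared_error(intercept, slope, xs, ys):
    """How badly the current knob settings fit the data (lower is better)."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

def fit_by_gradient_descent(xs, ys, step=0.05, iterations=2000):
    """Repeatedly nudge both knobs downhill on the mean squared error."""
    intercept, slope = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        errors = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
        grad_intercept = -2 * sum(errors) / n
        grad_slope = -2 * sum(e * x for e, x in zip(errors, xs)) / n
        intercept -= step * grad_intercept
        slope -= step * grad_slope
    return intercept, slope

# Hypothetical data generated exactly from y = 1 + 2x.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]

start_error = squared_error(0.0, 0.0, xs, ys)
intercept, slope = fit_by_gradient_descent(xs, ys)
final_error = squared_error(intercept, slope, xs, ys)
print(intercept, slope, start_error, final_error)
```

The step size matters: too large a step makes the updates diverge on this data, and too small a step needs many more iterations to reach the optimum.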

 

8. PCA: Building a Market Index

So far, all of our work with data has been based on prediction tasks: we’ve tried to classify emails or web page views where we had a training set of examples for which we knew the correct answer. As we mentioned early on in this book, learning from data when we have a training sample with the correct answer is called supervised learning: we find structure in our data using a signal that tells us whether or not we’re doing a good job of discovering real patterns.

But often we want to find structure without having any answers available to us about how well we’re doing; we call this unsupervised learning. For example, we might want to perform dimensionality reduction, which happens when we shrink a table with a huge number of columns into a table with a small number of columns. If you have too many columns to deal with, this dimensionality reduction goes a long way toward making your data set comprehensible. Although you clearly lose information when you replace many columns with a single column, the gains in understanding are often valuable, especially when you’re exploring a new data set.
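As a rough illustration of replacing many columns with a single column, the Python sketch below (the book itself uses R) computes the first principal component of a small table by power iteration on its covariance matrix, then projects each row onto it. The three-column table is invented, and power iteration is just one simple way to obtain the component. It is used here to keep the example free of external libraries.

```python
def first_principal_component(rows, iterations=200):
    """Return (direction, scores) for the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix of the columns.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeated multiplication converges on the direction
    # of greatest variance when the start vector is not orthogonal to it.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # One number per row: the row's position along that direction.
    scores = [sum(r[j] * v[j] for j in range(d)) for r in centered]
    return v, scores

# Hypothetical table whose three columns rise and fall together.
table = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.1],
    [3.0, 6.0, 8.9],
    [4.0, 8.0, 12.2],
    [5.0, 10.0, 15.0],
]

direction, scores = first_principal_component(table)
print(direction)
print(scores)
```

Each row of three numbers becomes a single score. Because the columns move together, that one score preserves the ordering of the rows, though the small wiggles in the third column are lost.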

 

9. MDS: Visually Exploring US Senator Similarity

There are many situations where we might want to know how similar the members of a group of people are to one another. For instance, suppose that we were a brand marketing company that had just completed a research survey on a potential new brand. In the survey, we showed a group of people several features of the brand and asked them to rank the brand on each of these features using a five-point scale. We also collected a bunch of socioeconomic data from the subjects, such as age, gender, race, what zip code they live in, and their approximate annual income.

From this survey, we want to understand how the brand appeals across all of these socioeconomic variables. Most importantly, we want to know whether the brand has broad appeal. An alternative way of thinking about this problem is that we want to see whether individuals who like most of the brand features have diverse socioeconomic features. A useful means of doing this would be to visualize how the survey respondents cluster. We could then use various visual cues to indicate their memberships in different socioeconomic categories. That is, we would want to see a large amount of mixing between genders, as well as among races and economic strata.

 

10. kNN: Recommendation Systems

In the last chapter, we saw how we could use simple correlational techniques to create a measure of similarity between the members of Congress based on their voting records. In this chapter, we’re going to talk about how you can use those same sort of similarity metrics to recommend items to a website’s users.

The algorithm we’ll use is called k-nearest neighbors. It’s arguably the most intuitive of all the machine learning algorithms that we present in this book. Indeed, the simplest form of k-nearest neighbors is the sort of algorithm most people would spontaneously invent if asked to make recommendations using similarity data: they’d recommend the song that’s closest to the songs a user already likes, but not yet in that list. That intuition is essentially a 1-nearest neighbor algorithm. The full k-nearest neighbor algorithm amounts to a generalization of this intuition where you draw on more than one data point before making a recommendation.
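The intuition of recommending the closest song that is not yet in the list can be sketched as follows. This Python example is illustrative only: the book works in R, and the song names, the two-number feature vectors, and the use of Euclidean distance are assumptions. With k=1 it follows the 1-nearest-neighbor intuition directly. For larger k it averages each candidate's distance to its k closest liked songs, which is one possible generalization and not necessarily the one the chapter develops.

```python
import math

def recommend(liked, catalog, k=1):
    """Return the unliked song closest to the user's liked songs."""
    best_song, best_score = None, float("inf")
    for song, features in catalog.items():
        if song in liked:
            continue  # only recommend songs the user does not already have
        distances = sorted(math.dist(features, catalog[other])
                           for other in liked)
        nearest = distances[:k]
        score = sum(nearest) / len(nearest)
        if score < best_score:
            best_song, best_score = song, score
    return best_song

# Hypothetical songs described by two made-up numeric features.
catalog = {
    "song_a": (1.0, 1.0),
    "song_b": (1.2, 0.9),
    "song_c": (5.0, 5.0),
    "song_d": (1.1, 1.3),
    "song_e": (9.0, 9.0),
}
liked = {"song_a", "song_b"}

print(recommend(liked, catalog, k=1))
print(recommend(liked, catalog, k=2))
```

Drawing on more than one liked song makes the recommendation less sensitive to a single unusual item in the user's list.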

 

11. Analyzing Social Graphs

Social networks are everywhere. According to Wikipedia, there are over 200 active social networking websites on the Internet, excluding dating sites. As you can see from Figure 11-1, according to Google Trends there has been a steady and constant rise in global interest in social networks since 2005. This is perfectly reasonable: the desire for social interaction is a fundamental part of human nature, and it should come as no surprise that this innate social nature would manifest in our technologies. But the mapping and modeling of social networks is by no means news.

In the mathematics community, an example of social network analysis at work is the calculation of a person’s Erdős number, which measures her distance from the prolific mathematician Paul Erdős. Erdős was arguably the most prolific mathematician of the 20th century and published over 1,500 papers during his career. Many of these papers were coauthored, and Erdős numbers measure a mathematician’s distance from the circle of coauthors that Erdős enlisted. If a mathematician coauthored with Erdős on a paper, then she would have an Erdős number of one, i.e., her distance to Erdős in the network of 20th-century mathematics is one. If another author collaborated with one of Erdős’s coauthors but not with Erdős directly, then that author would have an Erdős number of two, and so on. This metric has been used, though rarely seriously, as a crude measure of a person’s prominence in mathematics. Erdős numbers allow us to quickly summarize the massive network of mathematicians orbiting around Paul Erdős.
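The definition above is a shortest-path distance in a coauthorship graph, which a breadth-first search computes directly. The Python sketch below is illustrative (the book itself uses R), and the coauthor names and links are entirely made up.

```python
from collections import deque

def erdos_numbers(coauthors, source="Erdős"):
    """Breadth-first search outward from the source author."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        person = queue.popleft()
        for other in coauthors.get(person, ()):
            if other not in distances:  # first visit is the shortest path
                distances[other] = distances[person] + 1
                queue.append(other)
    return distances

# Hypothetical coauthorship links (each link listed in both directions).
coauthors = {
    "Erdős": ["Author A", "Author B"],
    "Author A": ["Erdős", "Author C"],
    "Author B": ["Erdős"],
    "Author C": ["Author A", "Author D"],
    "Author D": ["Author C"],
    "Author E": [],
}

numbers = erdos_numbers(coauthors)
print(numbers)
```

An author with no chain of coauthors leading back to the source, such as "Author E" here, never enters the result, which matches the usual convention of an undefined or infinite Erdős number.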

 

12. Model Comparison

In Chapter 3, we introduced the idea of decision boundaries and noted that problems in which the decision boundary isn’t linear pose a problem for simple classification algorithms. In Chapter 6, we showed you how to perform logistic regression, a classification algorithm that works by constructing a linear decision boundary. And in both chapters, we promised to describe a technique called the kernel trick that could be used to solve problems with nonlinear decision boundaries. Let’s deliver on that promise by introducing a new classification algorithm called the support vector machine (SVM for short), which allows you to use multiple different kernels to find nonlinear decision boundaries. We’ll use an SVM to classify points from a data set with a nonlinear decision boundary. Specifically, we’ll work with the data set shown in Figure 12-1.

Looking at this data set, it should be clear that the points from Class 0 are on the periphery, whereas points from Class 1 are in the center of the plot. This sort of nonlinear decision boundary can’t be discovered using a simple classification algorithm like the logistic regression algorithm we described in Chapter 6. Let’s demonstrate that by trying to use logistic regression through the glm function. We’ll then look into the reason why logistic regression fails.
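A toy version of this situation shows why a linear boundary struggles. The Python sketch below is illustrative only: the book works in R, and this code neither fits a logistic regression nor trains an SVM. It places hypothetical Class 1 points on a small circle and Class 0 points on a larger one, searches a grid of straight-line boundaries by brute force, and compares the best of them with a rule based on squared distance from the center.

```python
import math

def ring(radius, label, n=8):
    """n evenly spaced points on a circle, each tagged with a class label."""
    return [((radius * math.cos(2 * math.pi * k / n),
              radius * math.sin(2 * math.pi * k / n)), label)
            for k in range(n)]

# Hypothetical data: Class 1 in the center, Class 0 on the periphery.
data = ring(1.0, 1) + ring(3.0, 0)

def accuracy(predict, data):
    """Fraction of points classified correctly."""
    return sum(predict(p) == label for p, label in data) / len(data)

def best_linear_accuracy(data, n_angles=72, n_offsets=81):
    """Best accuracy over a grid of straight-line decision boundaries."""
    best = 0.0
    for a in range(n_angles):
        theta = 2 * math.pi * a / n_angles
        wx, wy = math.cos(theta), math.sin(theta)
        for o in range(n_offsets):
            c = -4.0 + 8.0 * o / (n_offsets - 1)
            acc = accuracy(
                lambda p: 1 if wx * p[0] + wy * p[1] < c else 0, data)
            best = max(best, acc)
    return best

def radial_rule(point, threshold=4.0):
    """Nonlinear boundary: Class 1 when squared distance is small."""
    return 1 if point[0] ** 2 + point[1] ** 2 < threshold else 0

linear_best = best_linear_accuracy(data)
radial = accuracy(radial_rule, data)
print(linear_best, radial)
```

No straight line can separate the classes because the central points lie inside the region spanned by the peripheral ones, yet a single threshold on the derived feature x^2 + y^2 separates them perfectly. Kernels can be thought of as a way to get the benefit of such derived features without computing them explicitly.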

 

Works Cited

[Adl10] Joseph Adler. R in a Nutshell. O’Reilly Media, 2010.

[Abb92] Edwin A. Abbott. Flatland: A Romance of Many Dimensions. Dover Publications, 1992.

[Bis06] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer; 1st ed. 2006. Corr.; 2nd printing ed. 2007.

[GH06] Andrew Gelman and Jennifer Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, 2006.

[HTF09] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer, 2009.

[JMR09] Owen Jones, Robert Maillardet, and Andrew Robinson. Introduction to Scientific Programming and Simulation Using R. Chapman and Hall, 2009.

[Seg07] Toby Segaran. Programming Collective Intelligence: Building Smart Web 2.0 Applications. O’Reilly Media, 2007.

[Spe08] Phil Spector. Data Manipulation with R. Springer, 2008.

[Wic09] Hadley Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer, 2009.

[Wil05] Leland Wilkinson. The Grammar of Graphics. Springer, 2005.

 

Details

Format name: ePub
Encrypted: No
Sku: 9781449330538
Isbn: 9781449330538
File size: 1 KB
Printing: Not Allowed
Copying: Not Allowed
Read aloud: No

Format name: ePub
Encrypted: No
Printing: Allowed
Copying: Allowed
Read aloud: Allowed
Sku: In metadata
Isbn: In metadata
File size: In metadata