18 Slices
Medium 9781449303716

11. Analyzing Social Graphs

Drew Conway O'Reilly Media ePub

Social networks are everywhere. According to Wikipedia, there are over 200 active social networking websites on the Internet, excluding dating sites. As you can see from Figure 11-1, according to Google Trends there has been a steady rise in global interest in "social networks" since 2005. This is perfectly reasonable: the desire for social interaction is a fundamental part of human nature, and it should come as no surprise that this innate social nature would manifest in our technologies. But the mapping and modeling of social networks is by no means news.

In the mathematics community, an example of social network analysis at work is the calculation of a person's Erdős number, which measures her distance from the mathematician Paul Erdős. Erdős was arguably the most prolific mathematician of the 20th century and published over 1,500 papers during his career. Many of these papers were coauthored, and Erdős numbers measure a mathematician's distance from the circle of coauthors that Erdős enlisted. If a mathematician coauthored a paper with Erdős, then she would have an Erdős number of one, i.e., her distance to Erdős in the network of 20th-century mathematics is one. If another author collaborated with one of Erdős's coauthors but not with Erdős directly, then that author would have an Erdős number of two, and so on. This metric has been used, though rarely seriously, as a crude measure of a person's prominence in mathematics. Erdős numbers allow us to quickly summarize the massive network of mathematicians orbiting around Paul Erdős.
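The distance described above is the shortest-path length in a coauthorship graph, which a breadth-first search computes directly. The following is a minimal sketch in Python, not code from the book: the function name `erdos_numbers` and the toy network (authors "A" through "E") are hypothetical and exist only for illustration.

```python
from collections import deque


def erdos_numbers(coauthors, root="Erdős"):
    """Return each reachable author's shortest-path distance from `root`.

    `coauthors` maps an author to the list of people they have
    coauthored with (an undirected graph stored as adjacency lists).
    """
    distances = {root: 0}
    queue = deque([root])
    while queue:
        person = queue.popleft()
        for other in coauthors.get(person, ()):
            if other not in distances:  # first visit is the shortest path
                distances[other] = distances[person] + 1
                queue.append(other)
    return distances


# Hypothetical toy network, invented for this example.
coauthors = {
    "Erdős": ["A", "B"],
    "A": ["Erdős", "C"],
    "B": ["Erdős"],
    "C": ["A", "D"],
    "D": ["C"],
    "E": [],  # never connected to Erdős, so no Erdős number
}

numbers = erdos_numbers(coauthors)
print(numbers)
```

Authors with no chain of coauthors leading back to the root simply do not appear in the result, which corresponds to having no finite Erdős number.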

Medium 9781449303716

6. Regularization: Text Regression

Drew Conway O'Reilly Media ePub

While we told you the truth in Chapter 5 when we said that linear regression assumes that the relationship between two variables is a straight line, it turns out you can also use linear regression to capture relationships that aren't well-described by a straight line. To show you what we mean, imagine that you have the data shown in panel A of Figure 6-1.

Figure 6-1. Modeling nonlinear data: (A) visualizing nonlinear relationships; (B) nonlinear relationships and linear regression; (C) structured residuals; (D) results from a generalized additive model

It's obvious from looking at this scatterplot that the relationship between X and Y isn't well-described by a straight line. Indeed, plotting the regression line shows us exactly what will go wrong if we try to use a line to capture the pattern in this data; panel B of Figure 6-1 shows the result.

We can see that we make systematic errors in our predictions if we use a straight line: at small and large values of x, we overpredict y, and we underpredict y for medium values of x. This is easiest to see in a residuals plot, as shown in panel C of Figure 6-1. In this plot, you can see all of the structure of the original data set, as none of the structure is captured by the default linear regression model.
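The pattern of structured residuals can be reproduced with a few lines of code. The sketch below is an illustration in Python rather than the book's own example: the helper `fit_line` and the synthetic, noise-free data y = 1 - x² are assumptions chosen so that the curve bends the same way the text describes.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Synthetic inverted-U data (an assumption made for this sketch).
xs = [k / 10 for k in range(-30, 31)]
ys = [1 - x ** 2 for x in xs]

intercept, slope = fit_line(xs, ys)

# Residual = observed y minus predicted y.
# A negative residual means the line overpredicts at that point.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

print("residual at smallest x:", residuals[0])
print("residual at middle x:  ", residuals[len(xs) // 2])
print("residual at largest x: ", residuals[-1])
```

Because the data are symmetric around zero, the fitted line is essentially flat, so the residuals are negative at both ends (overprediction) and positive in the middle (underprediction): the curve's entire shape is left in the residuals.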

Medium 9781449314309

1. Using R

Drew Conway O'Reilly Media ePub

Machine learning exists at the intersection of traditional mathematics and statistics with software engineering and computer science. In this book, we will describe several tools from traditional statistics that allow you to make sense of the world. Statistics has almost always been concerned with learning something interpretable from data, while machine learning has been concerned with turning data into something practical and usable. This contrast makes it easier to understand the term "machine learning": machine learning is concerned with teaching computers something about the world, so that they can use that knowledge to perform other tasks, while statistics is more concerned with developing tools for teaching humans something about the world, so that they can think more clearly about it in order to make better decisions.

In machine learning, the learning occurs by extracting as much information from the data as possible (or reasonable) through algorithms that parse the basic structure of the data and distinguish the signal from the noise. After they have found the signal, or pattern, the algorithms simply decide that everything else that's left over is noise. For that reason, machine learning techniques are also referred to as pattern recognition algorithms. We can "train" our machines to learn about how data is generated in a given context, which allows us to use these algorithms to automate many useful tasks. This is where the term "training set" comes from, referring to the set of data used to build a machine learning process. The notion of observing data, learning from it, and then automating some process of recognition is at the heart of machine learning, and forms the primary arc of this book.

Medium 9781449303716

Works Cited

Drew Conway O'Reilly Media ePub

[Adl10] Joseph Adler. R in a Nutshell. O'Reilly Media, 2010.

[Abb92] Edwin A. Abbott. Flatland: A Romance of Many Dimensions. Dover Publications, 1992.

[Bis06] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer; 1st ed. 2006. Corr. 2nd printing ed. 2007.

[GH06] Andrew Gelman and Jennifer Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, 2006.

[HTF09] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer, 2009.

[JMR09] Owen Jones, Robert Maillardet, and Andrew Robinson. Introduction to Scientific Programming and Simulation Using R. Chapman and Hall, 2009.

[Seg07] Toby Segaran. Programming Collective Intelligence: Building Smart Web 2.0 Applications. O'Reilly Media, 2007.

[Spe08] Phil Spector. Data Manipulation with R. Springer, 2008.

[Wic09] Hadley Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer, 2009.

[Wil05] Leland Wilkinson. The Grammar of Graphics. Springer, 2005.

Medium 9781449303716

5. Regression: Predicting Page Views

Drew Conway O'Reilly Media ePub

In the abstract, regression is a very simple concept: you want to predict one set of numbers given another set of numbers. For example, actuaries might want to predict how long a person will live given their smoking habits, while meteorologists might want to predict the next day's temperature given the previous day's temperature. In general, we'll call the numbers you're given inputs and the numbers you want to predict outputs. You'll also sometimes hear people refer to the inputs as predictors or features.

What makes regression different from classification is that the outputs are really numbers. In classification problems like those we described in Chapter 3, you might use numbers as a dummy code for a categorical distinction so that 0 represents ham and 1 represents spam. But these numbers are just symbols; we're not exploiting the "numberness" of 0 or 1 when we use dummy variables. In regression, the essential fact about the outputs is that they really are numbers: you want to predict things like temperatures, which could be 50 degrees or 71 degrees. Because you're predicting numbers, you want to be able to make strong statements about the relationship between the inputs and the outputs: you might want to say, for example, that when the number of packs of cigarettes a person smokes per day doubles, their predicted life span gets cut in half.
