Medium 9780596801700

R in a Nutshell


Why learn R? Because it's rapidly becoming the standard for developing statistical software. R in a Nutshell provides a quick and practical way to learn this increasingly popular open source language and environment. You'll not only learn how to program in R, but also how to find the right user-contributed R packages for statistical modeling, visualization, and bioinformatics.

The author introduces you to the R environment, including the R graphical user interface and console, and takes you through the fundamentals of the object-oriented R language. Then, through a variety of practical examples from medicine, business, and sports, you'll learn how you can use this remarkable tool to solve your own data analysis problems.

  • Understand the basics of the language, including the nature of R objects
  • Learn how to write R functions and build your own packages
  • Work with data through visualization, statistical analysis, and other methods
  • Explore the wealth of packages contributed by the R community
  • Become familiar with the lattice graphics package for high-level data visualization
  • Learn about bioinformatics packages provided by Bioconductor

"I am excited about this book. R in a Nutshell is a great introduction to R, as well as a comprehensive reference for using R in data analytics and visualization. Adler provides 'real world' examples, practical advice, and scripts, making it accessible to anyone working with data, not just professional statisticians."

List price: $35.99

Your Price: $28.79

You Save: 20%



1. Getting and Installing R


Today, R is maintained by a team of developers around the world. Usually, there is an official release of R twice a year, in April and in October. I used version 2.9.2 in this book. (Actually, it was 2.8.1 when I started writing the book and was updated three times while I was writing. I installed the updates, but they didn't change very much content.)

R hasn't changed that much in the past few years: usually there are some bug fixes, some optimizations, and a few new functions in each release. There have been some changes to the language, but most of these are related to somewhat obscure features that won't affect most users. (For example, the type of NA values in incompletely initialized arrays was changed in R 2.5.) Don't worry about using the exact version of R that I used in this book; any results you get should be very similar to the results shown in this book. If there are any changes to R that affect the examples in this book, I'll try to add them to the official errata online.


2. The R User Interface


If you're reading this book, you probably have a problem that you would like to solve in R. You might want to:

  • Check the statistical significance of experimental results
  • Plot some data to help understand it better
  • Analyze some genome data

The R system is a software environment for statistical computing and graphics. It includes many different components. In this book, I'll use the term "R" to refer to a few different things:

  • A computer language
  • The interpreter that executes code written in R
  • A system for plotting computer graphics described using the R language
  • The Windows, Mac OS, or Linux application that includes the interpreter, graphics system, standard packages, and user interface

This chapter contains a short description of the R user interface and the R console, and describes how R varies on different platforms. If you've never used an interactive language, this chapter will explain some basic things you will need to know in order to work with R. We'll take a quick look at the R graphical user interface (GUI) on each platform and then talk about the most important part: the R console.


3. A Short R Tutorial


Let's get started using R. When you enter an expression into the R console and press the Enter key, R will evaluate that expression and display the results (if there are any). If the statement results in a value, R will print that value. For example, you can use R to do simple math:
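As a minimal sketch of that kind of console arithmetic (the specific expressions are illustrative choices, not necessarily the ones used in the book):

```r
# Each expression is evaluated and its value is printed with a leading [1].
1 + 2 + 3      # prints [1] 6
1 + 2 * 3      # multiplication binds tighter than addition: [1] 7
(1 + 2) * 3    # parentheses change the order of evaluation: [1] 9
```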

The interactive R interpreter will automatically print an object returned by an expression entered into the R console. Notice the funny [1] that accompanies each returned value. In R, any number that you enter in the console is interpreted as a vector. A vector is an ordered collection of numbers. The [1] means that the index of the first item displayed in the row is 1. In each of these cases, there is also only one element in the vector.

You can construct longer vectors using the c(...) function. (c stands for "combine.") For example, a call such as c(0, 1, 1, 2, 3, 5, 8) creates a vector that contains the first seven elements of the Fibonacci sequence. As an example of a vector that spans multiple lines, let's use the sequence operator to produce a vector with every integer between 1 and 50:
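A short sketch of both constructions, assuming the Fibonacci sequence is taken to start at 0; the variable names fib and one_to_fifty are only for illustration:

```r
# Combine individual values into one vector with c().
fib <- c(0, 1, 1, 2, 3, 5, 8)
fib

# The sequence operator builds every integer from 1 to 50.
one_to_fifty <- 1:50
one_to_fifty   # printed over several rows; each row starts with the index
               # of its first element in square brackets
```

How many values appear on each printed row depends on the width of your console, so the bracketed indexes at the start of each row may differ from one session to another.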


4. R Packages


A package is a related set of functions, help files, and data files that have been bundled together. Packages in R are similar to modules in Perl, libraries in C/C++, and classes in Java.

Typically, all of the functions in the package are related: for example, the stats package contains functions for doing statistical analysis. To use a package, you need to load it into R (see "Loading Packages" for directions on loading packages).

R offers an enormous number of packages: packages that display graphics, packages for performing statistical tests, and packages for trying the latest machine learning techniques. There are also packages designed for a wide variety of industries and applications: packages for analyzing microarray data, packages for modeling credit risks, and packages for social sciences.

Some of these packages are included with R: you just have to tell R that you want to use them. Other packages are available from public package repositories. You can even make your own packages. This chapter explains how to use packages.
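As a rough sketch of what working with packages looks like at the console (the name somePackage is a placeholder, not a real package, and the install step is left commented out because it needs network access):

```r
# List the packages currently attached to the session.
search()

# Attach a package that ships with R; library() raises an error if the
# package is not installed.
library(stats)

# requireNamespace() checks whether a package is available without attaching it.
has_lattice <- requireNamespace("lattice", quietly = TRUE)
has_lattice

# Installing from a public repository (placeholder package name):
# install.packages("somePackage")
```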


5. An Overview of the R Language


Learning a computer language is a lot like learning a spoken language (only much simpler). If you're just visiting a foreign country, you might learn enough phrases to get by without really understanding how the language is structured. Similarly, if you're just trying to do a couple of simple things with R (like drawing some charts), you can probably learn enough from examples to get by.

However, if you want to learn a new spoken language really well, you have to learn about syntax and grammar: verb conjugation, proper articles, sentence structure, and so on. The same is true with R: if you want to learn how to program effectively in R, you'll have to learn more about the syntax and grammar.

This chapter gives an overview of the R language, designed to help you understand R code and write your own. I'll assume that you've spent a little time looking at R syntax (maybe from reading Chapter 3). Here's a quick overview of how R works.


6. R Syntax


Let's start by looking at constants. Constants are the basic building blocks for data objects in R: numbers, character values, and symbols.

Numbers are interpreted literally in R:

You may specify values in hexadecimal notation by prefixing them with 0x:

By default, numbers in R expressions are interpreted as double-precision floating-point numbers, even when you enter simple integers:
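A minimal sketch of these three points; the particular literals are arbitrary examples:

```r
# Numbers are read literally, including scientific notation.
1.1
2^10       # 1024
1e3        # 1000

# Hexadecimal literals start with 0x.
0x1        # 1
0xFFFF     # 65535

# Numeric literals are stored as doubles, even when they look like integers.
typeof(1)  # "double"
class(1)   # "numeric"
```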

If you really want an integer, you can use the sequence notation or the as function to obtain an integer:

The sequence operator a:b will return a vector of integers between a and b. To combine an arbitrary set of numbers into a vector, use the c function:
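The following sketch shows a few ways to obtain integers and to combine values. The L suffix is a further option that is not mentioned in the text above:

```r
# The sequence operator returns an integer vector.
typeof(1:5)               # "integer"

# Explicit conversion with as() or as.integer().
typeof(as(1, "integer"))  # "integer"
typeof(as.integer(1))     # "integer"

# An L suffix marks an integer literal.
typeof(1L)                # "integer"

# c() combines an arbitrary set of numbers into a vector.
values <- c(1, 2, 3, 5, 8, 13)
values
```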

R allows a lot of flexibility when entering numbers. However, there is a limit to the size and precision of numbers that R can represent:
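To make the limits concrete, here is a small sketch that assumes the usual IEEE 754 double-precision arithmetic and 32-bit integers found on common platforms:

```r
# Largest value an integer can hold.
.Machine$integer.max     # 2147483647

# Doubles overflow to Inf once they exceed the representable range.
2^1023                   # still finite
2^1024                   # Inf

# Above 2^53, a double can no longer represent every whole number.
(2^53 + 1) == 2^53       # TRUE
```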

In practice, this is rarely a problem. Most R users will load data from other sources on a computer (like a database) that also can't represent very large numbers.


7. R Objects


Table 7-1 shows all of the built-in object types. I introduced these objects in Chapter 3, so they should seem familiar. I classified the object types into a few categories, to make them easier to understand.

  • Basic vectors: These are vectors containing a single type of value: integers, floating-point numbers, complex numbers, text, logical values, or raw data.
  • Compound objects: These objects are containers for the basic vectors: lists, pairlists, S4 objects, and environments. Each of these objects has unique properties (described below), but each of them contains a number of named objects.
  • Special objects: These objects serve a special purpose in R programming: any, NULL, and ... (the ellipsis). Each of these means something important in a specific context, but you would never create an object of these types.
  • R language: These are objects that represent R code; they can be evaluated to return other objects.
  • Functions: Functions are the workhorses of R; they take arguments as inputs and return objects as outputs. Sometimes, they may modify objects in the environment or cause side effects outside the R environment like plotting graphics, saving files, or sending data over the network.
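One quick way to see these categories at the console is the typeof function. The sketch below picks one or two representative objects per category; it is an illustration, not a reproduction of the table:

```r
# Vectors holding a single type of value.
typeof(1L)               # "integer"
typeof(1.5)              # "double"
typeof(1i)               # "complex"
typeof("text")           # "character"
typeof(TRUE)             # "logical"
typeof(as.raw(1))        # "raw"

# Containers.
typeof(list(1, "a"))     # "list"
typeof(globalenv())      # "environment"

# A special object.
typeof(NULL)             # "NULL"

# Objects that represent R code.
typeof(quote(x))         # "symbol"
typeof(quote(x + 1))     # "language"

# Functions.
typeof(function(x) x)    # "closure"
typeof(sum)              # "builtin"
```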


8. Symbols and Environments


When you define a variable in R, you are actually assigning a symbol to a value in an environment. For example, when you enter the statement x <- 1 on the R console, it assigns the symbol x to a vector object of length 1 with the constant (double) value 1 in the global environment. When the R interpreter evaluates an expression, it evaluates all symbols. If you compose an object from a set of symbols, R will resolve the symbols at the time that the object is constructed:

It is possible to delay evaluation of an expression so that symbols are not evaluated immediately:
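A small sketch of the difference between immediate and delayed evaluation. It uses quote and eval as one way to delay evaluation, and the variable names are illustrative:

```r
x <- 1
y <- 2

# The symbols x and y are resolved when the vector is built.
v <- c(x, y)
x <- 10
v                  # still 1 2; later changes to x do not affect v

# quote() captures the expression without evaluating its symbols.
delayed <- quote(c(x, y))
x <- 100
eval(delayed)      # 100 2; the symbols are resolved at eval() time
```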

It is also possible to create a promise object in R to delay evaluation of a variable until it is (first) needed. You can create a promise object through the delayedAssign function:

Promise objects are used within packages to make objects available to users without loading them into memory. Unfortunately, it is not possible to determine if an object is a promise object, nor is it possible to figure out the environment in which it was created.
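As a sketch of how delayedAssign behaves, the counter below (a hypothetical helper variable named evaluations) records how many times the delayed expression actually runs:

```r
evaluations <- 0

# The expression is stored as a promise and is not evaluated yet.
delayedAssign("lazy_value", {
  evaluations <- evaluations + 1
  42
})
evaluations   # 0: nothing has been evaluated so far

lazy_value    # first use forces the promise and returns 42
lazy_value    # later uses reuse the stored result
evaluations   # 1: the expression ran only once
```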


9. Functions


A function definition in R includes the names of arguments. Optionally, it may include default values. If you specify a default value for an argument, then the argument is considered optional:

If you do not specify a default value for an argument, and you do not specify a value when calling the function, you will get an error if the function attempts to use the argument:[24]

In a function call, you may override the default value:
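The three cases above can be sketched as follows; add_numbers and needs_both are hypothetical functions written for this illustration:

```r
# y has a default value, so it is optional.
add_numbers <- function(x, y = 10) {
  x + y
}
add_numbers(1)          # uses the default: 11
add_numbers(1, y = 2)   # overrides the default: 3

# y has no default; using it without supplying a value is an error.
needs_both <- function(x, y) {
  x + y
}
result <- tryCatch(needs_both(1), error = function(e) conditionMessage(e))
result                  # the error message, captured as a string
```

Because R evaluates arguments lazily, the error appears only when the function body actually tries to use the missing argument.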

In R, it is often convenient to specify a variable-length argument list. You might want to pass extra arguments to another function, or you may want to write a function that accepts a variable number of arguments. To do this in R, you specify an ellipsis (...) in the arguments to the function.[25]

As an example, let's create a function that prints the first argument and then passes all the other arguments to the summary function. To do this, we will create a function that takes one argument: x. The arguments specification also includes an ellipsis to indicate that the function takes other arguments. We can then call the summary function with the ellipsis as its argument:
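A minimal sketch of such a function; the name print_and_summarize and the sample data are assumptions made for this illustration:

```r
# Print the first argument, then forward everything else to summary().
print_and_summarize <- function(x, ...) {
  print(x)
  summary(...)
}

s <- print_and_summarize("Summary of c(1, 2, 3, 4):", c(1, 2, 3, 4))
s
```

Inside the function, the ellipsis stands for whatever extra arguments the caller supplied, so any argument that summary accepts can be passed through unchanged.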


10. Object-Oriented Programming


The R system includes some support for object-oriented programming (OOP). OOP has become the preferred paradigm for organizing computer software; it's used in almost every modern programming language (Java, C#, Ruby, and Objective-C, among others) and in quite a few old ones (Smalltalk, C++). It's easy to understand why: OOP methods lead to code that is faster to write, easier to maintain, and less likely to contain errors. Many R packages are written using OOP mechanisms.

If all you plan to do with R is to load some data, build some statistical models, and plot some charts, you can probably skim this chapter. On the other hand, if you want to write your own code for loading data, building statistical models, and plotting charts, you probably should read this chapter more carefully.

R includes two different mechanisms for object-oriented programming. As you may recall, the R language is derived from the S language. S's object-oriented programming system evolved over time. Around 1990, S version 3 (thus S3) introduced class attributes that allowed single-argument methods. Many R functions (such as the statistical modeling software) were implemented using S3 methods, so S3 methods are still around today. In S version 4 (hence S4), formal classes and methods were introduced that allowed multiple arguments, more abstract types, and more sophisticated inheritance. Many new packages were implemented using S4 methods (and you can find S4 implementations of many key statistical procedures as well). In particular, formal classes are used extensively in Bioconductor.
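To give a feel for the two mechanisms, here is a small sketch. The class and generic names (account, describe, Point, distance) are made up for this illustration:

```r
# S3: a class attribute plus functions named generic.class;
# dispatch looks only at the first argument.
describe <- function(obj, ...) UseMethod("describe")
describe.default <- function(obj, ...) "a generic object"
describe.account <- function(obj, ...) paste("account owned by", obj$owner)

acct <- structure(list(owner = "Ann"), class = "account")
describe(acct)     # "account owned by Ann"
describe(42)       # "a generic object"

# S4: formal class definitions, with methods that can dispatch
# on more than one argument.
setClass("Point", representation(x = "numeric", y = "numeric"))
setGeneric("distance", function(a, b) standardGeneric("distance"))
setMethod("distance", signature("Point", "Point"), function(a, b) {
  sqrt((a@x - b@x)^2 + (a@y - b@y)^2)
})

d <- distance(new("Point", x = 0, y = 0), new("Point", x = 3, y = 4))
d                  # 5
```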


11. High-Performance R


When possible, try to use built-in functions for mathematical computations instead of writing R code to perform those computations. Many common math functions are included as native functions in R. In most cases, these functions are implemented as calls to external math libraries. As an obvious example, if you want to multiply two matrices together, you should probably use the %*% operator and not write your own matrix multiplication code in R.
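As a sketch of the comparison, the hand-written slow_multiply function below (a hypothetical helper) gives the same answer as the built-in operator, but does its work in interpreted R loops:

```r
a <- matrix(1:6, nrow = 2)   # 2 x 3
b <- matrix(1:6, nrow = 3)   # 3 x 2

# Built-in matrix multiplication.
product <- a %*% b
product

# A hand-written equivalent, shown only for comparison.
slow_multiply <- function(a, b) {
  result <- matrix(0, nrow = nrow(a), ncol = ncol(b))
  for (i in seq_len(nrow(a))) {
    for (j in seq_len(ncol(b))) {
      result[i, j] <- sum(a[i, ] * b[, j])
    }
  }
  result
}
all.equal(slow_multiply(a, b), product)   # TRUE
```

On matrices this small the difference in speed is negligible; the gap grows with the size of the inputs.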

Often, it is possible to use built-in functions by transforming a problem. For instance, let's consider a problem from queueing theory. Queueing theory is the study of systems where customers arrive, wait in a queue for service, are served, and then leave. As an example, picture a cafeteria with a single cashier. After customers select their food, they proceed to the cashier for payment. If there is no line, they pay the cashier and then leave. If there is a line, they wait in the line until the cashier is free. If we suppose that customers arrive according to a Poisson process and that the time required for the cashier to finish each transaction is given by an exponential distribution, then this is called an M/M/1 queue. (This means memoryless arrivals, memoryless service time, and one server.)


12. Saving, Loading, and Editing Data


If you are entering a small number of observations, entering the data directly into R might be a good approach. There are a couple of different ways to enter data into R.

Many of the examples in Parts I and II show how to create new objects directly on the R console. If you are entering a small amount of data, this might be a good approach.

As we have seen before, to create a vector, use the c function:

It's often convenient to put these vectors together into a data frame. To create a data frame, use the data.frame function to combine the vectors:
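A sketch of both steps with five observations. The player names and salary figures are invented placeholders, not the data used in the book:

```r
# Enter each column as a vector (placeholder values).
player <- c("Player A", "Player B", "Player C", "Player D", "Player E")
salary <- c(18000000, 15100000, 14730000, 14300000, 14200000)

# Combine the vectors into a data frame; each vector becomes a column.
top.5.salaries <- data.frame(player, salary)
top.5.salaries
```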

Entering data using individual statements can be awkward for more than a handful of observations. (That's why my example above only included five observations.) Luckily, R provides a nice GUI for editing tabular data: the data editor.

To edit an object with the data editor, use the edit function. The edit function will open the data editor and return the edited object. For example, to edit the top.5.salaries data frame, you would use the following command:
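A sketch of that call. The small data frame built here is a stand-in with invented values so the example runs on its own, and the call is wrapped in interactive() because the data editor needs a live session:

```r
# Placeholder data frame standing in for top.5.salaries.
top.5.salaries <- data.frame(
  player = c("Player A", "Player B"),
  salary = c(18000000, 15100000)
)

# edit() returns the edited copy, so assign the result back to keep the changes.
if (interactive()) {
  top.5.salaries <- edit(top.5.salaries)
}
```

Without the assignment, the edited data frame would be printed and then discarded, leaving the original object unchanged.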


13. Preparing Data


Back in my freshman year of college, I was planning to be a biochemist. I spent hours and hours of time in the lab: mixing chemicals in test tubes, putting samples in different machines, and analyzing the results. Over time, I grew frustrated because I found myself spending weeks in the lab doing manual work and just a few minutes planning experiments or analyzing results. After a year, I gave up on chemistry and became a computer scientist, thinking that I would spend less time on preparation and testing and more time on analysis.

Unfortunately for me, I chose to do data mining work professionally. Everyone loves building models, drawing charts, and playing with cool algorithms. Unfortunately, most of the time you spend on data analysis projects is spent on preparing data for analysis. I'd estimate that 80% of the effort on a typical project is spent on finding, cleaning, and preparing data for analysis. Less than 5% of the effort is devoted to analysis. (The rest of the time is spent on writing up what you did.)


14. Graphics


R includes tools for drawing most common types of charts, including bar charts, pie charts, line charts, and scatter plots. Additionally, R can also draw some less familiar charts like quantile-quantile (Q-Q) plots, mosaic plots, and contour plots. The following table shows many of the charts included in the graphics package.

You can show R graphics on the screen or save them in many different formats. "Graphics Devices" explains how to choose output methods. R gives you an enormous amount of control over graphics. You can control almost every aspect of a chart. "Customizing Charts" explains how to tweak the output of R to look the way you want. This section shows how to use many common types of R charts.

To show how to use scatter plots, we will look at cases of cancer in 2008 and toxic waste releases by state in 2006. Data on new cancer cases (and deaths from cancer) are tabulated by the American Cancer Society; information on toxic chemicals released into the environment is tabulated by the U.S. Environmental Protection Agency (EPA).[35]


15. Lattice Graphics


In the early 1990s, Richard Becker and William Cleveland (two researchers at Bell Labs) built a revolutionary new system for displaying data called Trellis graphics. (More information about the Trellis software is available online.) Cleveland devised a number of novel plots for visualizing data based on research into how users visualize information.[41]

The lattice package is an implementation of Trellis graphics in R.[42] You may notice that some functions still contain the Trellis name. The lattice package includes many types of charts that will be familiar to most readers such as scatter plots, bar charts, and histograms. But it also includes some plots that you may not have seen before such as dot plots, strip plots, and quantile-quantile plots. This chapter will show you how to use different types of charts, familiar and unfamiliar, in the lattice package.


16. Analyzing Data


R includes a variety of functions for calculating summary statistics.

To calculate the mean of a vector, use the mean function. You can calculate minima with the min function, or maxima with the max function. As an example, let's use the dow30 data set that we created in "An extended example." This data set is also available in the nutshell package:

For each of these functions, the argument na.rm specifies how NA values are treated. By default, if any value in the vector is NA, then the value NA is returned. Specify na.rm=TRUE to ignore missing values:

Optionally, you can also remove outliers when using the mean function. To do this, use the trim argument to specify the fraction of observations to filter:

To calculate the minimum and maximum at the same time, use the range function. This returns a vector with the minimum and maximum value:

Another useful function is quantile. This function can be used to return the values at different percentiles (specified by the probs argument):
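The functions described in this section can be sketched on a small made-up vector (used here in place of the dow30 data so that the example runs on its own):

```r
x <- c(1, 2, 3, 4, 100, NA)

mean(x)                            # NA, because one value is missing
mean(x, na.rm = TRUE)              # 22
min(x, na.rm = TRUE)               # 1
max(x, na.rm = TRUE)               # 100

# trim = 0.2 drops the lowest and highest 20% of the observations
# before averaging, which removes 1 and 100 here.
mean(x, trim = 0.2, na.rm = TRUE)  # 3

# Minimum and maximum in one call.
range(x, na.rm = TRUE)             # 1 100

# Values at the 25th, 50th, and 75th percentiles.
quantile(x, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
```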


17. Probability Distributions


As an example, we'll start with the normal distribution. As you may remember from statistics classes, the probability density function for the normal distribution is:
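For reference, the standard form of this density, with mean $\mu$ and standard deviation $\sigma$ (the quantities that R's functions call mean and sd), is:

$$
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
$$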

To find the probability density at a given value, use the dnorm function:

The arguments to this function are fairly intuitive: x specifies the value at which to evaluate the density, mean specifies the mean of the distribution, sd specifies the standard deviation, and log specifies whether to return the raw density (log=FALSE) or the logarithm of the density (log=TRUE). As an example, you can plot the normal distribution with the following command:
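A short sketch of dnorm and of one way to draw the curve. The plotting range of -3 to 3 and the title are assumptions chosen for this illustration:

```r
# Density of the standard normal distribution at 0.
dnorm(0)                                   # about 0.3989423

# The same value on the log scale.
dnorm(0, mean = 0, sd = 1, log = TRUE)     # about -0.9189385

# Density of a normal distribution with mean 10 and standard deviation 2.
dnorm(10, mean = 10, sd = 2)               # half of dnorm(0)

# Plot the standard normal density between -3 and 3.
plot(dnorm, -3, 3, main = "Normal Distribution")
```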

The plot is shown in Figure 17-1.

Figure 17-1. Normal distribution

The distribution function for the normal distribution is pnorm:

You can use the distribution function to tell you the probability that a randomly selected value from the distribution is less than or equal to q. Specifically, it returns p = Pr(x ≤ q). The value q is specified by the argument q, the mean by mean, and the standard deviation by sd. If you would like the raw value p, then specify log.p=FALSE; if you would like log(p), then specify log.p=TRUE. By default, lower.tail=TRUE, so this function returns Pr(x ≤ q); if you would prefer Pr(x > q), then specify lower.tail=FALSE. Here are a few examples of pnorm:
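A few illustrative calls, using values chosen for this sketch:

```r
# Half of the standard normal distribution lies at or below 0.
pnorm(0)                          # 0.5

# Probability of a value at or below one standard deviation above the mean.
pnorm(1, mean = 0, sd = 1)        # about 0.8413447

# Roughly 97.5% of the distribution lies at or below 1.96.
pnorm(1.96)                       # about 0.9750021

# Upper tail: Pr(x > q).
pnorm(1.96, lower.tail = FALSE)   # about 0.0249979

# Log of the probability.
pnorm(0, log.p = TRUE)            # log(0.5), about -0.6931472
```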


18. Statistical Tests


Many data problems boil down to statistical tests. For example, you might want to answer a question like:

  • Does this new drug work better than a placebo?
  • Does the new web site design lead to significantly more sales than the old design?
  • Can this new investment strategy yield higher returns than an index fund?

To answer questions like these, you would formulate a hypothesis, design an experiment, collect data, and use a tool like R to analyze the data. This chapter focuses on the tools available in R for answering these questions.

To be helpful, I've tried to include enough description of different statistical methods to help remind you when to use each method (in addition to how to find them in R). However, because this is a Nutshell book, I can't describe where these formulas come from, or when they're safe to use. R is a good substitute for expensive, proprietary statistics software packages. However, R in a Nutshell isn't a good substitute for a good statistics course or a good statistics book.

