R is more than just a "stats package".
R is an object-oriented programming language.
A common question is, "Can R do this?".
The answer is always, "Yes, but can you do it?".
- Cost
- Power
- Open
- Maintainers/Package Creators
"SAS/SPSS/Stata can do chi-square, is there really a difference?"
"I am not a programmer."
"R is really hard/R has a huge learning curve."
"I don't have time to learn."
## 
## Can one be a good data analyst without being a half-good programmer? The
## short answer to that is, 'No.' The long answer to that is, 'No.'
##    -- Frank Harrell
##       1999 S-PLUS User Conference, New Orleans (October 1999)
## 
## Actually, I see it as part of my job to inflict R on people who are
## perfectly happy to have never heard of it. Happiness doesn't equal
## proficient and efficient. In some cases the proficiency of a person serves
## a greater good than their momentary happiness.
##    -- Patrick Burns
##       R-help (April 2005)
Of course not!
Although R can do nearly anything, some things are easier to do outside of R.
Just remember that you need to try to make your work reproducible.
You can use any number of packages for your research, but R is very close to a "one-stop-shop".
The Comprehensive R Archive Network is the "official" package repository for R.
CRAN Task Views let you browse the packages associated with a particular topic.
| Task View Examples | Example Packages |
|---|---|
| Econometrics | wbstats & plm |
| Finance | quantmod & urca |
| Machine Learning | rpart & caret |
| Natural Language Processing | tm & koRpus |
| Psychometrics | lavaan & mirt |
| Spatial | sp & rgdal |
| Time Series | zoo & forecast |
From CRAN:
install.packages(c("devtools", "dplyr"))
From GitHub:
devtools::install_github("saberry/qualtricsR")
Everything in R is an object.
You must create an object and you can then call on the object.
numList = 1:5

numList
## [1] 1 2 3 4 5
numList * 5
## [1] 5 10 15 20 25
R has many different kinds of objects:
- Item
  - Numeric
  - Character
  - Factor/ordered
- Data
  - Data frame
  - Matrix
  - List
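As a minimal sketch (the object names here are arbitrary), each of these kinds of objects can be created directly:

```r
# atomic vectors
nums  = c(1.5, 2, 3)                      # numeric
chars = c("a", "b", "c")                  # character
fct   = factor(c("low", "high", "low"),
               levels = c("low", "high"),
               ordered = TRUE)            # ordered factor

# data structures
df  = data.frame(nums, chars)   # data frame: columns can differ in type
mat = matrix(1:6, nrow = 2)     # matrix: a single type only
lst = list(nums, chars, df)     # list: can hold anything, even other objects
```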
Because R creates objects, each object can be manipulated through an index.
Like many other languages, an object's index is generally accessed using []:
numList[1:3]
## [1] 1 2 3
numList[1:3] * 5
## [1] 5 10 15
For named objects, we can use the $:
head(mtcars$mpg)
## [1] 21.0 21.0 22.8 21.4 18.7 18.1
Just like matrix algebra and dimensional lumber – obj[rows, columns]
mtcars[1, ]
##           mpg cyl disp  hp drat   wt  qsec vs am gear carb
## Mazda RX4  21   6  160 110  3.9 2.62 16.46  0  1    4    4
head(mtcars[, 1])
## [1] 21.0 21.0 22.8 21.4 18.7 18.1
mtcars[1, 1]
## [1] 21
Like any other language (or program, for that matter), R has the ability to use operators:
mtcars$mpg[mtcars$cyl == 6 | mtcars$cyl == 8 & mtcars$hp >= 146]
##  [1] 21.0 21.0 21.4 18.7 18.1 14.3 19.2 17.8 16.4 17.3 15.2 10.4 10.4 14.7
## [15] 15.5 15.2 13.3 19.2 15.8 19.7 15.0
And math functions:
sqrt((2 + 2)^2 * (7 / (2 - 1))) * pi
## [1] 33.24749
Even with all of the packages that R has, base R is still extremely powerful by itself.
str(numList)
## int [1:5] 1 2 3 4 5
summary(numList)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1       2       3       3       4       5
mean(numList)
## [1] 3
cor(mtcars$mpg, mtcars$wt)
## [1] -0.8676594
lm(mpg ~ wt, data = mtcars)
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Coefficients:
## (Intercept)           wt  
##      37.285       -5.344
plot(mtcars$wt, mtcars$mpg, pch = 19)
R allows you to combine functions:
plot(mtcars$wt, mtcars$mpg, pch = 19)
lines(lowess(mtcars$wt, mtcars$mpg), col = "#FF6600", lwd = 2)
abline(lm(mpg ~ wt, data = mtcars), col = "#0099ff", lwd = 2)
We saw a glimpse of what base R has to offer in terms of data manipulation.
As powerful as the indexing approach may be, it can often be messy and slightly confusing to someone who may be interested in using your code (or the future you).
### NICE R DATA ###

# numeric indexes; not conducive to readability or reproducibility
newData = mtcars[, 1:4]

# explicitly by name; fine if only a handful; not pretty
newData = mtcars[, c('mpg', 'cyl', 'disp', 'hp')]

### MEAN REAL DATA ###

# two step with grep (searching with regular expressions)
cols = c('ID', paste0('X', 1:10), 'var1', 'var2',
         grep("^Merc[0-9]+", colnames(oldData), value = TRUE))
newData = oldData[, cols]

# or via subset
newData = subset(oldData, select = cols)
What if you also want observations where Z is Yes, Q is No, and only the last 50 of those results, ordered by var1 (descending)?
# three operations and overwriting or creating new objects if we want clarity
newData = newData[oldData$Z == 'Yes' & oldData$Q == 'No', ]
newData = tail(newData, 50)
newData = newData[order(newData$var1, decreasing = TRUE), ]
And this is for fairly straightforward operations.
The dplyr package was created to make data manipulation easier.
newData = oldData %>% 
  filter(Z == 'Yes', Q == 'No') %>% 
  select(num_range('X', 1:10), contains('var'), starts_with('Merc')) %>% 
  tail(50) %>% 
  arrange(desc(var1))
mtcars %>% 
  filter(am == 0) %>%   # Automatic transmission
  select(mpg, cyl, hp, wt) %>% 
  mutate(rawWeight = wt * 1000) %>% 
  group_by(cyl) %>% 
  summarize_all(funs(mean))
## Source: local data frame [3 x 5]
## 
##     cyl    mpg        hp       wt rawWeight
##   <dbl>  <dbl>     <dbl>    <dbl>     <dbl>
## 1     4 22.900  84.66667 2.935000  2935.000
## 2     6 19.125 115.25000 3.388750  3388.750
## 3     8 15.050 194.16667 4.104083  4104.083
x = c(1, 2, NA, NA, 5, 6, NA, 8, NA, NA)
y = c(NA, NA, 3, 4, NA, NA, NA, NA, NA, NA)
z = c(NA, NA, NA, NA, NA, NA, 7, NA, 9, 10)

coalesce(x, y, z)
## [1] 1 2 3 4 5 6 7 8 9 10
In the previous snippet, you hopefully noticed the %>%.
It is included in dplyr, but it originates in magrittr.
It is pronounced "pipe" and is functionally equivalent to the Unix |.
Old-school R:
ceiling(mean(abs(sample(-100:100, 50))))
Piping:
-100:100 %>% sample(50) %>% abs %>% mean %>% ceiling
Both are valid, but the piped version is a bit easier on human eyes and easier to write.
We have only really seen the tip of the iceberg with regard to what R has to offer.
Do take some time to look through the CRAN Task Views.
The RBloggers website always has new and neat stuff.
Daily and weekly trending repositories on GitHub are also enlightening.
## 
## If you think you can learn all of R, you are wrong. For the foreseeable
## future you will not even be able to keep up with the new additions.
##    -- Patrick Burns (Inferno-ish R)
##       CambR User Group Meeting, Cambridge (May 2012)
Before RStudio, there were other options besides using the console.
In scripting with RStudio, you are getting:
- Code completion (use tab to autocomplete anything)
- Code highlighting
- Code diagnostics/warnings
- Code snippets (tab for apply and loops)
- Easily accessible help files (F1 on any function)
- Code tidying (Ctrl + Shift + A)
- More shortcuts than you can learn (Alt + Shift + K)
- Automatic pairing of closures (or...ruining your typing)
In addition to R scripts, RStudio offers an array of file types:
Any language file has the code highlighted and diagnosed.
With an assortment of htmlwidgets packages, we can create a wide variety of output.
library(dygraphs); library(tidyr)

as.data.frame(groupTotals) %>% 
  select(year, perp, tot) %>% 
  group_by(year, perp) %>% 
  summarize(tot = sum(tot)) %>% 
  arrange(year) %>% 
  spread(perp, tot) %>% 
  dygraph() %>% 
  dyLegend(show = "onmouseover") %>% 
  dyHighlight(highlightSeriesBackgroundAlpha = .2, hideOnMouseOut = FALSE) %>% 
  dyCSS("~/R/conflictData/dygraphLegend.css")
library(plotly)

plot_ly(economics, x = date, y = uempmed) %>% 
  add_trace(y = fitted(loess(uempmed ~ as.numeric(date))), x = date) %>% 
  layout(title = "Median duration of unemployment (in weeks)",
         showlegend = FALSE) %>% 
  dplyr::filter(uempmed == max(uempmed)) %>% 
  layout(annotations = list(x = date, y = uempmed,
                            text = "Peak", showarrow = T))
library(ggplot2)

p = ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(text = paste("Transmission:", as.factor(am))), size = 2) +
  geom_smooth(aes(colour = as.ordered(cyl), fill = as.ordered(cyl)),
              show.legend = FALSE) +
  facet_grid(. ~ cyl) +
  scale_color_brewer(palette = "Dark2") +
  scale_fill_brewer(palette = "Dark2") +
  # scale_colour_discrete(name = "Cylinders") +
  lazerhawk::theme_trueMinimal()

ggplotly(p)
DT::datatable(head(mtcars1), filter = "top")
RStudio has built-in capacity to use knitr and rmarkdown.
These packages, in conjunction with bits and pieces of \(\LaTeX\), allow you to create reproducible documents.
lmSum = summary(lm(mpg ~ wt, data = mtcars))

if (lmSum$coefficients[2, 4] < .05) {
  paste("Weight's coefficient of", round(lmSum$coefficients[2], 3),
        "is significant", sep = " ")
} else {
  paste("Weight's coefficient of", round(lmSum$coefficients[2], 3),
        "is not significant", sep = " ")
}
## [1] "Weight's coefficient of -5.344 is significant"
A good man once said:
You, my dear sir, are but a mere artless base-court boar-pig and I bid you a good day.
paste(sample(c('artless', 'bawdy', 'beslubbering', 'bootless'), 1),
      sample(c('base-court', 'bat-fowling', 'beef-witted', 'beetle-headed'), 1),
      sample(c('apple-john', 'baggage', 'barnacle', 'bladder', 'boar-pig'), 1))
## [1] "artless beetle-headed bladder"
Documents can be written out as PDF, HTML, or even Word.
The HTML documents can be pages or presentations.
With the exception of minor changes, it really does not take too much effort to switch between outputs.
There are also a handful of templates for you to use.
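Switching between those outputs is mostly a matter of the YAML header at the top of the .Rmd file; a minimal sketch (the title is made up):

```yaml
---
title: "My Reproducible Report"
output: html_document   # swap in pdf_document or word_document
---
```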
RStudio lets you create a project with version control.
- Cloning from existing repositories
- Creating a new repository and "pushing" it up
Version control can be through Git or Subversion.
There is nothing wrong with using the previously noted install.packages() function.
RStudio has a menu-driven package installation function.
You are also given a list of all of your packages and which ones are currently loaded.
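The same information is available from the console; a quick sketch:

```r
# names of every installed package
head(rownames(installed.packages()))

# packages (and other environments) currently attached
search()
```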
One of the most common ways to read data into R is through read.csv().
As always, RStudio tries to make things easier.
The most recent version of RStudio comes with functionality for readr and haven.
These packages (along with readxl for Excel files) allow for faster data reading, with more sensible defaults, from csv, Excel, SAS, SPSS, and Stata files.
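A quick sketch of readr in action; the example writes a small temporary csv so it is self-contained, and the haven calls are commented out because their file names are hypothetical:

```r
library(readr)

# write a small csv so the example runs anywhere
tmp = tempfile(fileext = ".csv")
write_csv(data.frame(id = 1:3, score = c(10, 20, 30)), tmp)

# faster than read.csv, and strings are never silently converted to factors
myData = read_csv(tmp)
nrow(myData)

# haven covers the other statistical packages (file names hypothetical):
# library(haven)
# spssData  = read_sav("myFile.sav")       # SPSS
# stataData = read_dta("myFile.dta")       # Stata
# sasData   = read_sas("myFile.sas7bdat")  # SAS
```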
Addins are relatively new to RStudio.
They are essentially functions that you can call interactively.
They are especially useful if you find yourself typing the same lines of text repeatedly.
lazerhawk::insertSlide()
Although our time was brief, I hope you can see the benefits of using R/RStudio for your research.
R only continues to develop, improve, and grow!
By using R, you will also find growth!
## 
## It [the effort of learning how to use R] is the price paid, just as the
## dollars or euros for a commercial package would be. For that price, I've
## learnt a great deal - and not only about R. And I shall remember it when I
## next have to find a heavyweight solution for a big problem presented by a
## small charitable client with an invisible budget. It's a huge,
## awe-inspiring package - easier to perceive as such because the power is
## not hidden beneath a cosmetic veneer.
##    -- Felix Grant (in an article about free statistics software)
##       Scientific Computing World (November 2004)