R is more than just a "stats package".
R is an object-oriented programming language.
A common question is, "Can R do this?".
The answer is always, "Yes, but can you do it?".
- Cost
- Power
- Open
- Maintainers/Package Creators
"SAS/SPSS/Stata can do chi-square, is there really a difference?"
"I am not a programmer."
"R is really hard/R has a huge learning curve."
"I don't have time to learn."
## 
## Can one be a good data analyst without being a half-good programmer? The
## short answer to that is, 'No.' The long answer to that is, 'No.'
##    -- Frank Harrell
##       1999 S-PLUS User Conference, New Orleans (October 1999)
## 
## Actually, I see it as part of my job to inflict R on people who are
## perfectly happy to have never heard of it. Happiness doesn't equal
## proficient and efficient. In some cases the proficiency of a person serves
## a greater good than their momentary happiness.
##    -- Patrick Burns
##       R-help (April 2005)
Of course not!
Although R can do nearly anything, some things are easier to do outside of R.
Just remember that you need to try to make your work reproducible.
You can use any number of packages for your research, but R is very close to a "one-stop-shop".
The Comprehensive R Archive Network is the "official" package repository for R.
CRAN Task Views let you browse the packages associated with a particular topic.
| Task View Examples | Example Packages |
|---|---|
| Econometrics | wbstats & plm |
| Finance | quantmod & urca |
| Machine Learning | rpart & caret |
| Natural Language Processing | tm & koRpus |
| Psychometrics | lavaan & mirt |
| Spatial | sp & rgdal |
| Time Series | zoo & forecast |
From CRAN:
install.packages(c("devtools", "dplyr"))
From GitHub:
devtools::install_github("saberry/qualtricsR")
Everything in R is an object.
You must create an object and you can then call on the object.
numList = 1:5

numList
## [1] 1 2 3 4 5
numList * 5
## [1] 5 10 15 20 25
R has many different kinds of objects:
- Item
  - Numeric
  - Character
  - Factor/ordered
- Data
  - Data frame
  - Matrix
  - List
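As a minimal sketch (the object names here are arbitrary), each of these kinds of objects can be created directly:

```r
# atomic vectors
nums  = c(1.5, 2, 3)                      # numeric
chars = c("a", "b", "c")                  # character
fct   = factor(c("low", "high", "low"),
               levels = c("low", "high"),
               ordered = TRUE)            # ordered factor

# data structures
df  = data.frame(nums, chars)   # data frame: columns can differ in type
mat = matrix(1:6, nrow = 2)     # matrix: a single type only
lst = list(nums, chars, df)     # list: can hold anything, even other objects
```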
Because R creates objects, each object can be manipulated through an index.
Like many other languages, an object's index is generally accessed using []:
numList[1:3]
## [1] 1 2 3
numList[1:3] * 5
## [1] 5 10 15
For named objects, we can use the $:
head(mtcars$mpg)
## [1] 21.0 21.0 22.8 21.4 18.7 18.1
Just like matrix algebra and dimensional lumber – obj[rows, columns]
mtcars[1, ]
##           mpg cyl disp  hp drat   wt  qsec vs am gear carb
## Mazda RX4  21   6  160 110  3.9 2.62 16.46  0  1    4    4
head(mtcars[, 1])
## [1] 21.0 21.0 22.8 21.4 18.7 18.1
mtcars[1, 1]
## [1] 21
Like any other language (or program, for that matter), R has the ability to use operators:
mtcars$mpg[mtcars$cyl == 6 | mtcars$cyl == 8 & mtcars$hp >= 146]
##  [1] 21.0 21.0 21.4 18.7 18.1 14.3 19.2 17.8 16.4 17.3 15.2 10.4 10.4 14.7
## [15] 15.5 15.2 13.3 19.2 15.8 19.7 15.0
And math functions:
sqrt((2 + 2)^2 * (7 / (2 - 1))) * pi
## [1] 33.24749
Even with all of the packages that R has, base R is still extremely powerful by itself.
str(numList)
## int [1:5] 1 2 3 4 5
summary(numList)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1       2       3       3       4       5
mean(numList)
## [1] 3
cor(mtcars$mpg, mtcars$wt)
## [1] -0.8676594
lm(mpg ~ wt, data = mtcars)
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Coefficients:
## (Intercept)           wt  
##      37.285       -5.344
plot(mtcars$wt, mtcars$mpg, pch = 19)
R allows you to combine functions:
plot(mtcars$wt, mtcars$mpg, pch = 19)
lines(lowess(mtcars$wt, mtcars$mpg), col = "#FF6600", lwd = 2)
abline(lm(mpg ~ wt, data = mtcars), col = "#0099ff", lwd = 2)
We saw a glimpse of what base R has to offer in terms of data manipulation.
As powerful as the indexing approach may be, it can often be messy and slightly confusing to someone who may be interested in using your code (or the future you).
### NICE R DATA ###

# numeric indexes; not conducive to readability or reproducibility
newData = mtcars[, 1:4]

# explicitly by name; fine if only a handful; not pretty
newData = mtcars[, c('mpg', 'cyl', 'disp', 'hp')]

### MEAN REAL DATA ###

# two step with grep (searching with regular expressions)
cols = c('ID', paste0('X', 1:10), 'var1', 'var2',
         grep("^Merc[0-9]+", colnames(oldData), value = TRUE))
newData = oldData[, cols]

# or via subset
newData = subset(oldData, select = cols)
What if you also want observations where Z is Yes, Q is No, and only the last 50 of those results, ordered by var1 (descending)?
# three operations and overwriting or creating new objects if we want clarity
newData = newData[oldData$Z == 'Yes' & oldData$Q == 'No', ]
newData = tail(newData, 50)
newData = newData[order(newData$var1, decreasing = TRUE), ]
And this is for fairly straightforward operations.
The dplyr package was created to make data manipulation easier.
newData = oldData %>% 
  filter(Z == 'Yes', Q == 'No') %>% 
  select(num_range('X', 1:10), contains('var'), starts_with('Merc')) %>% 
  tail(50) %>% 
  arrange(desc(var1))
mtcars %>% 
  filter(am == 0) %>%   # Automatic transmission
  select(mpg, cyl, hp, wt) %>% 
  mutate(rawWeight = wt * 1000) %>% 
  group_by(cyl) %>% 
  summarize_all(funs(mean))
## Source: local data frame [3 x 5]
## 
##     cyl    mpg        hp       wt rawWeight
##   <dbl>  <dbl>     <dbl>    <dbl>     <dbl>
## 1     4 22.900  84.66667 2.935000  2935.000
## 2     6 19.125 115.25000 3.388750  3388.750
## 3     8 15.050 194.16667 4.104083  4104.083
x = c(1, 2, NA, NA, 5, 6, NA, 8, NA, NA)
y = c(NA, NA, 3, 4, NA, NA, NA, NA, NA, NA)
z = c(NA, NA, NA, NA, NA, NA, 7, NA, 9, 10)

coalesce(x, y, z)
## [1] 1 2 3 4 5 6 7 8 9 10
In the previous snippet, you hopefully noticed the %>%.
It is included in dplyr, but it originates in magrittr.
It is pronounced "pipe" and is functionally equivalent to the Unix |.
Old-school R:
ceiling(mean(abs(sample(-100:100, 50))))
Piping:
-100:100 %>% sample(50) %>% abs %>% mean %>% ceiling
Both are valid, but the piped version is a bit easier on human eyes and easier to write.
We have only really seen the tip of the iceberg with regard to what R has to offer.
Do take some time to look through the CRAN Task Views.
The RBloggers website always has new and neat stuff.
Daily and weekly trending repositories on GitHub are also enlightening.
## 
## If you think you can learn all of R, you are wrong. For the foreseeable
## future you will not even be able to keep up with the new additions.
##    -- Patrick Burns (Inferno-ish R)
##       CambR User Group Meeting, Cambridge (May 2012)
Before RStudio, there were other options besides using the console.
In scripting with RStudio, you are getting:
- Code completion (use tab to autocomplete anything)
- Code highlighting
- Code diagnostics/warnings
- Code snippets (tab for apply and loops)
- Easily accessible help files (F1 on any function)
- Code tidying (Ctrl + Shift + A)
- More shortcuts than you can learn (Alt + Shift + K)
- Automatic pairing of closures (or...ruining your typing)
In addition to R scripts, RStudio offers an array of file types:
Any language file has the code highlighted and diagnosed.
With an assortment of htmlwidgets packages, we can create a wide variety of output.
library(dygraphs); library(tidyr)

as.data.frame(groupTotals) %>% 
  select(year, perp, tot) %>% 
  group_by(year, perp) %>% 
  summarize(tot = sum(tot)) %>% 
  arrange(year) %>% 
  spread(perp, tot) %>% 
  dygraph() %>% 
  dyLegend(show = "onmouseover") %>% 
  dyHighlight(highlightSeriesBackgroundAlpha = .2, hideOnMouseOut = FALSE) %>% 
  dyCSS("~/R/conflictData/dygraphLegend.css")
library(plotly)

plot_ly(economics, x = date, y = uempmed) %>% 
  add_trace(y = fitted(loess(uempmed ~ as.numeric(date))), x = date) %>% 
  layout(title = "Median duration of unemployment (in weeks)",
         showlegend = FALSE) %>% 
  dplyr::filter(uempmed == max(uempmed)) %>% 
  layout(annotations = list(x = date, y = uempmed,
                            text = "Peak", showarrow = T))
library(ggplot2)

p = ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(text = paste("Transmission:", as.factor(am))), size = 2) +
  geom_smooth(aes(colour = as.ordered(cyl), fill = as.ordered(cyl)),
              show.legend = FALSE) +
  facet_grid(. ~ cyl) +
  scale_color_brewer(palette = "Dark2") +
  scale_fill_brewer(palette = "Dark2") +
  # scale_colour_discrete(name = "Cylinders") +
  lazerhawk::theme_trueMinimal()

ggplotly(p)
DT::datatable(head(mtcars1), filter = "top")
RStudio has built-in capacity to use knitr and rmarkdown.
These packages, in conjunction with bits and pieces of \(\LaTeX\), allow you to create reproducible documents.
lmSum = summary(lm(mpg ~ wt, data = mtcars))

if (lmSum$coefficients[2, 4] < .05) {
  paste("Weight's coefficient of", round(lmSum$coefficients[2], 3),
        "is significant", sep = " ")
} else {
  paste("Weight's coefficient of", round(lmSum$coefficients[2], 3),
        "is not significant", sep = " ")
}
## [1] "Weight's coefficient of -5.344 is significant"
A good man once said:
You, my dear sir, are but a mere artless base-court boar-pig and I bid you a good day.
paste(sample(c('artless', 'bawdy', 'beslubbering', 'bootless'), 1),
      sample(c('base-court', 'bat-fowling', 'beef-witted', 'beetle-headed'), 1),
      sample(c('apple-john', 'baggage', 'barnacle', 'bladder', 'boar-pig'), 1))
## [1] "artless beetle-headed bladder"
Documents can be written out as PDF, HTML, or even Word.
The HTML documents can be pages or presentations.
With the exception of minor changes, it really does not take too much effort to switch between outputs.
There are also a handful of templates for you to use.
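Switching between those outputs is mostly a matter of the YAML header at the top of the .Rmd file; a minimal sketch (the title is made up):

```yaml
---
title: "My Reproducible Report"
output: html_document   # swap in pdf_document or word_document
---
```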
RStudio lets you create a project with version control.
- Cloning from existing repositories
- Creating a new repository and "pushing" it up
Version control can be through Git or Subversion.
There is nothing wrong with using the previously noted install.packages() function.
RStudio has a menu-driven package installation function.
You are also given a list of all of your packages and which ones are currently loaded.
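The same information is available from the console; a quick sketch:

```r
# names of every installed package
head(rownames(installed.packages()))

# packages (and other environments) currently attached
search()
```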
One of the most common ways to read data into R is through read.csv().
As always, RStudio tries to make things easier.
The most recent version of RStudio comes with functionality for readr and haven.
These packages (along with readxl for Excel files) allow for faster data reading, with more sensible defaults, from csv, Excel, SAS, SPSS, and Stata files.
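A quick sketch of readr in action; the example writes a small temporary csv so it is self-contained, and the haven calls are commented out because their file names are hypothetical:

```r
library(readr)

# write a small csv so the example runs anywhere
tmp = tempfile(fileext = ".csv")
write_csv(data.frame(id = 1:3, score = c(10, 20, 30)), tmp)

# faster than read.csv, and strings are never silently converted to factors
myData = read_csv(tmp)
nrow(myData)

# haven covers the other statistical packages (file names hypothetical):
# library(haven)
# spssData  = read_sav("myFile.sav")       # SPSS
# stataData = read_dta("myFile.dta")       # Stata
# sasData   = read_sas("myFile.sas7bdat")  # SAS
```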
Addins are relatively new to RStudio.
They are essentially functions that you can call interactively.
They are especially useful if you find yourself typing the same lines of text repeatedly.
lazerhawk::insertSlide()
Although our time was brief, I hope you can see the benefits of using R/RStudio for your research.
R only continues to develop, improve, and grow!
By using R, you will also find growth!
## 
## It [the effort of learning how to use R] is the price paid, just as the
## dollars or euros for a commercial package would be. For that price, I've
## learnt a great deal - and not only about R. And I shall remember it when I
## next have to find a heavyweight solution for a big problem presented by a
## small charitable client with an invisible budget. It's a huge,
## awe-inspiring package - easier to perceive as such because the power is
## not hidden beneath a cosmetic veneer.
##    -- Felix Grant (in an article about free statistics software)
##       Scientific Computing World (November 2004)