You could get by only using the base and stats packages.
They are both powerful enough to do practically anything that you would ever need to do.
\(Power + Flexibility = Complexity\)
This was a common line for subsetting data:
mtcars[which(mtcars$mpg >= mean(mtcars$mpg) & mtcars$cyl == 6), ]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
varList = c("country",
paste("iso", 2:3, sep = ""),
grep(".*sp.*", colnames(who), value = TRUE))
newDat = who[, colnames(who) %in% varList]
The base approach is not intuitive.
That is where dplyr enters the arena.
It is arguably one of the most popular R packages.
Intuitive names are the first improvement:
arrange, between, coalesce, distinct, filter, group_by, left_join, mutate, select, summarize, tally, top_n
Just given the function names, we can probably guess what each of these does.
From this:
mtcars[which(mtcars$mpg >= mean(mtcars$mpg) & mtcars$cyl == 6), ]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
To this:
filter(mtcars, mpg >= mean(mpg) & cyl == 6)
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## 3 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
varList = c("country",
paste("iso", 2:3, sep = ""),
grep(".*sp.*", colnames(who),
value = TRUE))
newDat = who[, colnames(who) %in% varList]
select(who, country, num_range("iso", 2:3), contains("sp"))
dplyr also provides several different joins.
left_join – x rows and all columns
right_join – y rows and all columns
inner_join – all matching rows and all columns
full_join – all rows and all columns
anti_join – x rows with no match in y
df1 = data.frame(ticks = c("AAPL", "A", "BFG"),
revenue = c(10, 30, 50))
df2 = data.frame(tick = c("AAPL", "A", "N"),
expends = c(5, 15, 25))
inner_join(df1, df2, by = c("ticks" = "tick"))
## ticks revenue expends
## 1 AAPL 10 5
## 2 A 30 15
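To see how the other joins differ, here is a quick sketch on those same tiny frames (redefined so the chunk stands alone):

```r
# left_join keeps every row of df1, filling unmatched columns
# with NA; anti_join keeps only the df1 rows with no match in df2.
library(dplyr)

df1 = data.frame(ticks = c("AAPL", "A", "BFG"),
                 revenue = c(10, 30, 50))
df2 = data.frame(tick = c("AAPL", "A", "N"),
                 expends = c(5, 15, 25))

left_join(df1, df2, by = c("ticks" = "tick"))
## the BFG row is kept, with expends = NA

anti_join(df1, df2, by = c("ticks" = "tick"))
## only the BFG row survives
```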
Reshaping data can be a perilous excursion.
Plotting functions and many analyses need long data.
R has packages for reshaping (e.g., reshape2).
As good as those are, tidyr makes reshaping nearly effortless.
It also has very clear functions:
gather, spread, unite, separate
head(gather(mtcars, key = variable, value = value))
## variable value
## 1 mpg 21.0
## 2 mpg 21.0
## 3 mpg 22.8
## 4 mpg 21.4
## 5 mpg 18.7
## 6 mpg 18.1
head(gather(mtcars, key = variable, value = value, -cyl))
## cyl variable value
## 1 6 mpg 21.0
## 2 6 mpg 21.0
## 3 4 mpg 22.8
## 4 6 mpg 21.4
## 5 8 mpg 18.7
## 6 6 mpg 18.1
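gather has an inverse, spread, which moves a key column out into new column headers. A minimal sketch on a made-up long frame (the data here is invented for the demo):

```r
# spread() takes long data wide: each distinct value of the key
# column becomes its own column.
library(tidyr)

longDat = data.frame(name = c("A", "A", "B", "B"),
                     variable = c("x", "y", "x", "y"),
                     value = c(1, 2, 3, 4))

spread(longDat, key = variable, value = value)
##   name x y
## 1    A 1 2
## 2    B 3 4
```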
(df = data.frame(executiveName = c("Jon Crutchfield", "Father Mark", "Roger Huang")))
## executiveName
## 1 Jon Crutchfield
## 2 Father Mark
## 3 Roger Huang
separate(df, executiveName, c("firstName", "lastName"))
## firstName lastName
## 1 Jon Crutchfield
## 2 Father Mark
## 3 Roger Huang
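unite reverses separate, pasting columns back together. Starting from the split names above (redefined so the chunk stands alone):

```r
# unite() collapses several columns into one, joined by sep.
library(tidyr)

df = data.frame(firstName = c("Jon", "Father", "Roger"),
                lastName = c("Crutchfield", "Mark", "Huang"))

unite(df, executiveName, firstName, lastName, sep = " ")
##     executiveName
## 1 Jon Crutchfield
## 2     Father Mark
## 3     Roger Huang
```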
The famous pipe. How people reproached me for it! And yet, could you stuff my pipe? No, it’s just a representation, is it not? So if I had written on my picture ‘This is a pipe’, I’d have been lying! – René Magritte
The magrittr package contains a number of different pipes.
%$%, %<>%, %>%, %T>%
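A quick sketch of three of them: %>% pipes the left-hand side into the next call, %<>% pipes and assigns the result back, and %$% exposes the columns of a data frame by name.

```r
library(magrittr)

# %>% pipes the left-hand side into the next call
c(1, 4, 9) %>% sqrt()
## [1] 1 2 3

# %<>% pipes and then overwrites the left-hand side
x = c(1, 4, 9)
x %<>% sqrt()
x
## [1] 1 2 3

# %$% exposes columns by name, like with()
mtcars %$% cor(mpg, wt)
```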
R has some mysterious functions.
magrittr gives names to some of these rather archaic functions.
`[[` = extract2
`[[`(mtcars, 1)
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4
extract2(mtcars, 1)
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4
Using these three packages together gives us ultimate power.
We can pipe functions together until we get what we need.
I want years greater than 2000, only columns with “sn”, averaged by country, and taken from wide to long.
who1 = who[who$year > 2000, grep("country|.*sn.*", colnames(who))]
who1 = na.omit(who1)
colList = list(grep(".*sn.*", colnames(who1), value = TRUE))
res = lapply(colList,
function(x) aggregate(who1[x], by = list(who1$country), FUN = mean))
res = as.data.frame(res)
row.names(res) = res$Group.1
res$Group.1 = NULL
res = reshape(res, idvar = "country", ids = row.names(res),
times = names(res), timevar = "var",
varying = list(1:14),
direction = "long")
head(res)
## var new_sn_m014
## Albania.new_sn_m014 new_sn_m014 7.0000
## American Samoa.new_sn_m014 new_sn_m014 0.0000
## Andorra.new_sn_m014 new_sn_m014 0.0000
## Anguilla.new_sn_m014 new_sn_m014 0.0000
## Antigua and Barbuda.new_sn_m014 new_sn_m014 0.0000
## Argentina.new_sn_m014 new_sn_m014 350.8571
## country
## Albania.new_sn_m014 Albania
## American Samoa.new_sn_m014 American Samoa
## Andorra.new_sn_m014 Andorra
## Anguilla.new_sn_m014 Anguilla
## Antigua and Barbuda.new_sn_m014 Antigua and Barbuda
## Argentina.new_sn_m014 Argentina
who1 = who %>%
filter(year > 2000) %>%
select(country, contains("sn")) %>%
na.omit() %>%
group_by(country) %>%
summarize_all(funs(mean(., na.rm = TRUE))) %>%
gather(key = variable, value = value, -country)
head(who1)
## # A tibble: 6 × 3
## country variable value
## <chr> <chr> <dbl>
## 1 Albania new_sn_m014 7.0000
## 2 American Samoa new_sn_m014 0.0000
## 3 Andorra new_sn_m014 0.0000
## 4 Anguilla new_sn_m014 0.0000
## 5 Antigua and Barbuda new_sn_m014 0.0000
## 6 Argentina new_sn_m014 350.8571
The data.table package is a pretty far departure from dplyr.
Everything works with i, j, by indexing.
\(data.table[i, j, by] = data.table[which, what, by]\)
While dplyr is clear and concise, data.table is slightly confusing.
What it lacks in elegance, it makes up for in speed!
whoDT = data.table(who, key = "country")
dtTest = whoDT[, mean(new_sp_m014, na.rm = TRUE), by = country]
dplyrTest = who %>%
group_by(country) %>%
summarize(mean(new_sp_m014, na.rm = TRUE))
microbenchmark::microbenchmark(who %>%
group_by(country) %>%
summarize(mean(new_sp_m014, na.rm = TRUE)),
whoDT[, mean(new_sp_m014, na.rm = TRUE), by = country],
unit = "s", times = 500)
## Unit: seconds
## expr
## who %>% group_by(country) %>% summarize(mean(new_sp_m014, na.rm = TRUE))
## whoDT[, mean(new_sp_m014, na.rm = TRUE), by = country]
## min lq mean median uq
## 0.004150377 0.0046395400 0.0057052357 0.0053575105 0.006462634
## 0.000493370 0.0006326375 0.0008423986 0.0007740085 0.000954139
## max neval cld
## 0.010680166 500 b
## 0.003660012 500 a
There are a few different convenience functions within data.table.
These two are incredibly helpful.
.I = seq_len(nrow(x))
.GRP = group counter
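A minimal sketch of both helpers on a made-up table (the data here is invented for the demo): .I returns the original row numbers of each group, and .GRP numbers the groups in order of appearance.

```r
library(data.table)

dt = data.table(team = c("a", "a", "b"), score = c(10, 20, 30))
dt[, .(rowNum = .I, groupNum = .GRP), by = team]
##    team rowNum groupNum
## 1:    a      1        1
## 2:    a      2        1
## 3:    b      3        2
```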
Just like the pipes we previously saw, data.table lets us continuously chain statements together.
whoDT[, mean(new_sp_m014, na.rm = TRUE), by = country][, log(V1), by = country]
## country V1
## 1: Afghanistan 4.868015
## 2: Albania 0.000000
## 3: Algeria 4.532599
## 4: American Samoa -1.386294
## 5: Andorra -Inf
## ---
## 215: Wallis and Futuna Islands -Inf
## 216: West Bank and Gaza Strip -1.203973
## 217: Yemen 4.180352
## 218: Zambia 5.508713
## 219: Zimbabwe 5.068332
It can get ugly (and fast)!
mrktingName = mrktingName[, c("Ticker.Symbol", "Fiscal.Year",
                              "howLongPerson", "personID") :=
                            list(Ticker.Symbol, Fiscal.Year,
                                 seq(.N), .GRP),
                          by = list(name)][
  , numSeq := seq(min(Fiscal.Year),
                  max(Fiscal.Year)),
  by = list(personID)][
  , runningTotal := ifelse(Fiscal.Year == numSeq, seq(.N), 1L),
  by = list(Ticker.Symbol, personID)]
You can enjoy all flavors at once!
who %>%
select(country, contains("sp")) %>%
data.table(key = "country") %>%
.[, c("epM25Mean", "epF25Mean") :=
list(mean(new_sp_m2534, na.rm = TRUE), mean(new_sp_f2534, na.rm = TRUE)),
by = country] %>%
filter(epM25Mean > 0 & epF25Mean > 0) %>%
.[, plot(.$epM25Mean, .$epF25Mean)]
## data frame with 0 columns and 6892 rows
stringr eases string pain.
Regular expressions can be tricky unless you use them regularly.
stringr has a variety of functions to replace grep and gsub.
What would the following line do?
^\\s+|\\s+$
x = " Happy Halloween! "
gsub("^\\s+|\\s+$", "", x)
## [1] "Happy Halloween!"
str_trim(x)
## [1] "Happy Halloween!"
people = c("Seth Berry", "George Jiang", NA, "Andres Martinez")
nchar(people)
## [1] 10 12 NA 15
str_count(people)
## [1] 10 12 NA 15
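Beyond trimming and counting, stringr also mirrors grepl and gsub with str_detect and str_replace_all. A quick sketch on a made-up vector:

```r
library(stringr)

fruit = c("banana", "pear")

# base vs. stringr: detect a pattern
grepl("an", fruit)
## [1]  TRUE FALSE
str_detect(fruit, "an")
## [1]  TRUE FALSE

# base vs. stringr: replace every match
gsub("a", "o", fruit)
## [1] "bonono" "peor"
str_replace_all(fruit, "a", "o")
## [1] "bonono" "peor"
```

The stringr versions always take the string first and the pattern second, which makes them far easier to pipe.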