Practical Data Manipulation

base & stats

You could get by only using the base and stats packages.

They are both powerful enough to do practically anything that you would ever need to do.

The Downside

\(Power + Flexibility = Complexity\)

This was a common line for subsetting data:

mtcars[which(mtcars$mpg >= mean(mtcars$mpg) & mtcars$cyl == 6), ]
##                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1

Subsetting Variables

varList = c("country", 
            paste("iso", 2:3, sep = ""),
            grep(".*sp.*", colnames(who), value = TRUE))

newDat = who[, colnames(who) %in% varList]

The Present

The base approach is not intuitive.

That is where dplyr enters the arena.

It is arguably one of the most popular R packages.

The Grammar Of Data Manipulation

Intuitive names are the first improvement:

arrange, between, coalesce, distinct, filter, group_by, left_join, mutate, select, summarize, tally, top_n

Just given the function names, we can probably guess what each of these does.

Translating Old To New

From this:

mtcars[which(mtcars$mpg >= mean(mtcars$mpg) & mtcars$cyl == 6), ]
##                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1

To this:

filter(mtcars, mpg >= mean(mpg) & cyl == 6)
##    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## 3 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1

Wrangling Made Easy

varList = c("country", 
            paste("iso", 2:3, sep = ""),
            grep(".*sp.*", colnames(who), 
                 value = TRUE))

newDat = who[, colnames(who) %in% varList]
select(who, country, num_range("iso", 2:3), contains("sp"))

SQL-like Joins

dplyr also provides several different joins.

left_join – all rows from x, all columns from x and y

right_join – all rows from y, all columns from x and y

inner_join – only the rows with matches in both x and y, all columns

full_join – all rows from both x and y, all columns

anti_join – the rows from x with no match in y, keeping only x's columns

Joining Data

df1 = data.frame(ticks = c("AAPL", "A", "BFG"), 
                 revenue = c(10, 30, 50))

df2 = data.frame(tick = c("AAPL", "A", "N"), 
                 expends = c(5, 15, 25))

inner_join(df1, df2, by = c("ticks" = "tick"))
##   ticks revenue expends
## 1  AAPL      10       5
## 2     A      30      15
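The other joins follow the same pattern. A minimal sketch on the same two frames, with the expected behavior noted in comments rather than shown as output:

# keeps every row of df1; BFG gets NA for expends
left_join(df1, df2, by = c("ticks" = "tick"))

# keeps only the df1 rows with no match in df2 (just BFG), and only df1's columns
anti_join(df1, df2, by = c("ticks" = "tick"))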

tidyr

Reshaping data can be a perilous excursion.

Plotting functions and many analyses need long data.

R has packages for reshaping (e.g., reshape2).

As good as those are, tidyr makes reshaping almost too easy.

It also has very clear functions:

gather, spread, unite, separate

Wide To Long

head(gather(mtcars, key = variable, value = value))
##   variable value
## 1      mpg  21.0
## 2      mpg  21.0
## 3      mpg  22.8
## 4      mpg  21.4
## 5      mpg  18.7
## 6      mpg  18.1

Wide To Long – Grouped

head(gather(mtcars, key = variable, value = value, -cyl))
##   cyl variable value
## 1   6      mpg  21.0
## 2   6      mpg  21.0
## 3   4      mpg  22.8
## 4   6      mpg  21.4
## 5   8      mpg  18.7
## 6   6      mpg  18.1

Separate Columns

(df = data.frame(executiveName = c("Jon Crutchfield", "Father Mark", "Roger Huang")))
##     executiveName
## 1 Jon Crutchfield
## 2     Father Mark
## 3     Roger Huang
separate(df, executiveName, c("firstName", "lastName"))
##   firstName    lastName
## 1       Jon Crutchfield
## 2    Father        Mark
## 3     Roger       Huang
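unite() goes the other direction. A minimal sketch that splits and then reassembles the name (splitDf is just an illustrative name):

splitDf = separate(df, executiveName, c("firstName", "lastName"))

# unite() pastes the pieces back into a single column
unite(splitDf, "executiveName", firstName, lastName, sep = " ")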

magrittr

The famous pipe. How people reproached me for it! And yet, could you stuff my pipe? No, it’s just a representation, is it not? So if I had written on my picture ‘This is a pipe’, I’d have been lying!

Pipes

The magrittr package contains a number of different pipes.

%$%, %<>%, %>%, %T>%

  • The most popular one is likely %>%.
  • It is functionally equivalent to the Unix | (a quick sketch follows).
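A minimal sketch of the difference, assuming dplyr is loaded:

# without the pipe: calls are nested and read inside out
head(filter(mtcars, cyl == 6))

# with the pipe: each result flows left to right into the next call
mtcars %>% 
  filter(cyl == 6) %>% 
  head()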

Aliases

R has some mysterious functions.

magrittr gives readable names to some of these rather archaic functions.

`[[` = extract2

Action Sequence

`[[`(mtcars, 1)
##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4
extract2(mtcars, 1)
##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4

Can’t Stop The Flow

Using these three packages together gives us ultimate power.

We can pipe functions together until we get what we need.

  • You can also pipe into other functions (e.g., lm), as sketched below.
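A minimal sketch, with the . standing in for whatever came down the pipe:

mtcars %>% 
  filter(cyl != 8) %>% 
  lm(mpg ~ wt, data = .)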

Days Of Yore

I want years greater than 2000, only columns with “sn”, averaged by country, and taken from wide to long.

who1 = who[who$year > 2000, grep("country|.*sn.*", colnames(who))]

who1 = na.omit(who1)

colList = list(grep(".*sn.*", colnames(who1), value = TRUE))

res = lapply(colList, 
             function(x) aggregate(who1[x], by = list(who1$country), FUN = mean))

And It Don’t Stop…

res = as.data.frame(res)

row.names(res) = res$Group.1

res$Group.1 = NULL

res = reshape(res, idvar = "country", ids = row.names(res), 
              times = names(res), timevar = "var",
              varying = list(1:14), 
              direction = "long")

head(res)
##                                         var new_sn_m014
## Albania.new_sn_m014             new_sn_m014      7.0000
## American Samoa.new_sn_m014      new_sn_m014      0.0000
## Andorra.new_sn_m014             new_sn_m014      0.0000
## Anguilla.new_sn_m014            new_sn_m014      0.0000
## Antigua and Barbuda.new_sn_m014 new_sn_m014      0.0000
## Argentina.new_sn_m014           new_sn_m014    350.8571
##                                             country
## Albania.new_sn_m014                         Albania
## American Samoa.new_sn_m014           American Samoa
## Andorra.new_sn_m014                         Andorra
## Anguilla.new_sn_m014                       Anguilla
## Antigua and Barbuda.new_sn_m014 Antigua and Barbuda
## Argentina.new_sn_m014                     Argentina

There Has To Be A Better Way

who1 = who %>% 
  filter(year > 2000) %>%
  select(country, contains("sn")) %>%
  na.omit() %>%
  group_by(country) %>% 
  summarize_all(funs(mean(., na.rm = TRUE))) %>% 
  gather(key = variable, value = value, -country)

head(who1)
## # A tibble: 6 × 3
##               country    variable    value
##                 <chr>       <chr>    <dbl>
## 1             Albania new_sn_m014   7.0000
## 2      American Samoa new_sn_m014   0.0000
## 3             Andorra new_sn_m014   0.0000
## 4            Anguilla new_sn_m014   0.0000
## 5 Antigua and Barbuda new_sn_m014   0.0000
## 6           Argentina new_sn_m014 350.8571

data.table

The data.table package is a pretty far departure from dplyr.

Everything works with i, j, by indexing.

\[data.table[i, j, by] = data.table[which, what, by]\]
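A minimal sketch of that template, using mtcars purely for illustration:

library(data.table)

mtcarsDT = as.data.table(mtcars)

# i = which rows, j = what to compute, by = the grouping variable
mtcarsDT[mpg > 20, mean(hp), by = cyl]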

While dplyr is clear and concise, data.table can be confusing at first.

What it lacks in elegance, it makes up for in speed!

That Is Fast

whoDT = data.table(who, key = "country")

dtTest = whoDT[, mean(new_sp_m014, na.rm = TRUE), by = country]
dplyrTest = who %>% 
  group_by(country) %>% 
  summarize(mean(new_sp_m014, na.rm = TRUE))
microbenchmark::microbenchmark(who %>% 
  group_by(country) %>% 
  summarize(mean(new_sp_m014, na.rm = TRUE)), 
  whoDT[, mean(new_sp_m014, na.rm = TRUE), by = country], 
  unit = "s", times = 500)
## Unit: seconds
##                                                                      expr
##  who %>% group_by(country) %>% summarize(mean(new_sp_m014, na.rm = TRUE))
##                    whoDT[, mean(new_sp_m014, na.rm = TRUE), by = country]
##          min           lq         mean       median          uq
##  0.004150377 0.0046395400 0.0057052357 0.0053575105 0.006462634
##  0.000493370 0.0006326375 0.0008423986 0.0007740085 0.000954139
##          max neval cld
##  0.010680166   500   b
##  0.003660012   500  a

Convenience Functions

There are a few different convenience functions within data.table.

These two are incredibly helpful.

.I = seq_len(nrow(x))

.GRP = group counter
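A minimal sketch of both on a throwaway table (the names are purely illustrative):

dt = data.table(group = c("a", "a", "b"), value = 1:3)

# .I returns the original row numbers, .GRP numbers each group
dt[, list(rowNumber = .I, groupNumber = .GRP), by = group]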

Chaining Indices

Just like the pipes we previously saw, data.table lets us continuously chain statements together.

whoDT[, mean(new_sp_m014, na.rm = TRUE), by = country][, log(V1), by = country]
##                        country        V1
##   1:               Afghanistan  4.868015
##   2:                   Albania  0.000000
##   3:                   Algeria  4.532599
##   4:            American Samoa -1.386294
##   5:                   Andorra      -Inf
##  ---                                    
## 215: Wallis and Futuna Islands      -Inf
## 216:  West Bank and Gaza Strip -1.203973
## 217:                     Yemen  4.180352
## 218:                    Zambia  5.508713
## 219:                  Zimbabwe  5.068332

It can get ugly (and fast)!

A Real Example

mrktingName = mrktingName[, c("Ticker.Symbol", "Fiscal.Year",
                              "howLongPerson", "personID") :=
                            list(Ticker.Symbol, Fiscal.Year,
                                 seq(.N), .GRP),
                          by = list(name)][
  , numSeq := seq(min(Fiscal.Year), 
                  max(Fiscal.Year)), by = list(personID)][
  , runningTotal := ifelse(Fiscal.Year == numSeq, seq(.N), 1L), 
    by = list(Ticker.Symbol, personID)]

What Do R And Neapolitan Ice Cream Have In Common?

You can enjoy all flavors at once!

And It Is Just As Tasty…

who %>% 
  select(country, contains("sp")) %>% 
  data.table(key = "country") %>% 
  .[, c("epM25Mean", "epF25Mean") := 
      list(mean(new_sp_m2534, na.rm = TRUE), mean(new_sp_f2534, na.rm = TRUE)), 
    by = country] %>% 
  filter(epM25Mean > 0 & epF25Mean > 0) %>%
  .[, plot(.$epM25Mean, .$epF25Mean)]

## data frame with 0 columns and 6892 rows

stringr

stringr eases string pain.

Unless you use regular expressions regularly, they can be tricky.

stringr has a variety of functions to replace grep and gsub.
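For example, a quick sketch of the usual swaps (str_subset for grep with value = TRUE, str_replace_all for gsub), on made-up strings:

drinks = c("hot coffee", "iced tea", "hot chocolate")

# base R
grep("hot", drinks, value = TRUE)
gsub("hot", "cold", drinks)

# stringr equivalents
str_subset(drinks, "hot")
str_replace_all(drinks, "hot", "cold")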

Audience Participation Time!

What would the following line do?

^\\s+|\\s+$

Trimming Strings

x = " Happy Halloween!   "

gsub("^\\s+|\\s+$", "", x)
## [1] "Happy Halloween!"
str_trim(x)
## [1] "Happy Halloween!"

Counting Characters

people = c("Seth Berry", "George Jiang", NA, "Andres Martinez")
nchar(people)
## [1] 10 12 NA 15
str_count(people)
## [1] 10 12 NA 15

A Huge Thanks To This Bloke

And A Huge Thanks To You All