Data Wrangling in R

An Introduction

Author

Matteo Ploner

1 Introduction

1.1 Data manipulation, visualization and reporting

  • This course illustrates techniques for data manipulation, visualization and reporting using R and R Markdown
    • Reference to the following sources is made during the course
      • Chang, Winston. 2012. R Graphics Cookbook: Practical Recipes for Visualizing Data. ” O’Reilly Media, Inc.”.
      • Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. ” O’Reilly Media, Inc.”.
      • W. N. Venables, W.N., Smith D. M., and R Core Team. 2019. An Introduction to R.
  • Many useful resources can be found online

1.2 R

  • From https://cran.r-project.org/
    • R is an integrated suite of software facilities for data manipulation, calculation and graphical display.
    • R can be regarded as an implementation of the S language which was developed at Bell Laboratories by Rick Becker, John Chambers and Allan Wilks, and also forms the basis of the of the S-PLUS systems.

R version 3.6.2 (2019-12-12) – “Dark and Stormy Night” Copyright (C) 2019 The R Foundation for Statistical Computing Platform: x86_64-apple-darwin15.6.0 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to redistribute it under certain conditions. Type ‘license()’ or ‘licence()’ for distribution details.

Natural language support but running in an English locale

R is a collaborative project with many contributors. Type ‘contributors()’ for more information and ‘citation()’ on how to cite R or R packages in publications.

Type ‘demo()’ for some demos, ‘help()’ for on-line help, or ‘help.start()’ for an HTML browser interface to help. Type ‘q()’ to quit R.

1.3 RStudio

  • From https://www.rstudio.com/products/RStudio/
    • RStudio is an integrated development environment (IDE) for R
    • RStudio is available in open source and commercial editions and runs on the desktop (Windows, Mac, and Linux) or in a browser connected to RStudio Server or RStudio Server Pro (Debian/Ubuntu, RedHat/CentOS, and SUSE Linux).

1.4 Visual Studio Code

2 Tidy data

2.1 Your data in good shape

  • Data Wrangling is “the art of getting your data into R in a useful form for visualization and modeling(Wickham and Grolemund 2016)

  • We rely on the tidyverse library

    • The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
library(tidyverse)

2.2 The Workflow

  • Wickham and Grolemund (2016) present a description of the workflow in data management

2.3 Tidy data

  • See Wickham (2014) for a description of good practices in organizing your data 1.

  • 3 rules to have tidy data

    • Each variable forms a column.
    • Each observation forms a row.
    • Each type of observational unit forms a table.


Source: Wickham and Grolemund (2016)

  • Data that are not tidy are untidy!

2.4 Example of tidy data

Name Height(mt) Weight(kg)
Venonat 1.24 46.59
Ledyba 1.20 14.80
Wingull 0.49 7.05
Treeko 0.51 3.90
Zubat 0.90 8.66
  • Each variable forms a column.
  • Each observation forms a row.
  • Each type of observational unit forms a table.

2.5 Example of untidy data

Name Key Value
Ledyba Height(mt) 1.20
Ledyba Weight(kg) 14.80
Treeko Height(mt) 0.51
Treeko Weight(kg) 3.90
Venonat Height(mt) 1.24
Venonat Weight(kg) 46.59
Wingull Height(mt) 0.49
Wingull Weight(kg) 7.05
Zubat Height(mt) 0.90
Zubat Weight(kg) 8.66

3 Data wrangling

3.1 Coding Style

  • We rely on tidyverse syntax style
    • The code we are going to write relies on pipes: %>%
    • We use %>% to emphasize a sequence of action
      • From R > 4.1.0 the pipe is natively available: |>
    • . is a placeholder
1:10 %>%
sum(.) %>%
.^2
[1] 3025
  • Without pipes
sum(1:10)^2
[1] 3025

3.2 Data transformation

  • We are performing a series of data transformation
    • Import
    • Mutate
    • Select
    • Rename
    • Arrange
    • Filter
    • Join
    • Gather
    • Spread
    • Summarise

3.3 Import

  • The first important step is to import your data in the R workspace
  • R can import several data formats
    • My suggestion is to work with comma-separated (.csv) data
    • In the tidyverse library, the following function is available
read_csv("PATH_TO_YOUR_FILE/FILENAME.csv")

3.4 Tibbles

  • Data are imported in a tibble(Wickham and Grolemund 2016)
  • A special form of data frames in tidyverse
  • Refined print method that shows only the first 10 rows, and all the columns that fit on screen.
tibble(
  Col1= 1:3,
  Col2=4:6,
  Col3=7:9,
  Col4=10:12
)
# A tibble: 3 × 4
   Col1  Col2  Col3  Col4
  <int> <int> <int> <int>
1     1     4     7    10
2     2     5     8    11
3     3     6     9    12

3.5 Mutate

  • Adds new variables and preserves existing ones
tibble(Sequence=1:5) %>%
mutate(Sequence_rev=rev(Sequence))
# A tibble: 5 × 2
  Sequence Sequence_rev
     <int>        <int>
1        1            5
2        2            4
3        3            3
4        4            2
5        5            1
  • You can also overwrite old variables
tibble(Sequence=1:5) %>%
mutate(Sequence=rev(Sequence))
# A tibble: 5 × 1
  Sequence
     <int>
1        5
2        4
3        3
4        2
5        1

3.6 Select

  • We unselect one of the variables (e.g., z)
tibble(x=runif(5,0,1),y=runif(5,0,2),z=runif(5,-1,1)) %>%
select(-z)
# A tibble: 5 × 2
       x     y
   <dbl> <dbl>
1 0.797  1.93 
2 0.227  1.17 
3 0.937  0.347
4 0.476  0.762
5 0.0733 0.107
  • We explicitly select the variables of interest
tibble(x=runif(5,0,1),y=runif(5,0,2),z=runif(5,-1,1)) %>%
select(x,y)
# A tibble: 5 × 2
      x     y
  <dbl> <dbl>
1 0.112 1.85 
2 0.585 1.65 
3 0.268 0.474
4 0.577 0.448
5 0.224 0.889

3.7 Rename

  • Rename variables
    • Make names more self-explanatory
tmp <- tibble(x=runif(5,1,4),y=runif(5,1,2),z=runif(5,1,20))
tmp
# A tibble: 5 × 3
      x     y     z
  <dbl> <dbl> <dbl>
1  1.64  1.33  5.48
2  3.93  1.66 19.2 
3  2.62  1.48  2.49
4  2.30  1.09  1.29
5  2.75  1.77 15.8 
  • We may want to change the name of variables, to make them more self-explanatory
tmp <- tmp %>% rename(Length=x,Width=y,Height=z)
tmp
# A tibble: 5 × 3
  Length Width Height
   <dbl> <dbl>  <dbl>
1   1.64  1.33   5.48
2   3.93  1.66  19.2 
3   2.62  1.48   2.49
4   2.30  1.09   1.29
5   2.75  1.77  15.8 

3.8 Arrange

# A tibble: 6 × 3
  Type      x     y
  <chr> <dbl> <dbl>
1 A      2.39  1.05
2 B      3.22  3.65
3 C      2.18  2.03
4 A      2.60  1.89
5 B      1.71  2.91
6 C      3.69  1.83
  • Sort your data according to values in one columns
    • Important to get a first impression on rankings
tmp %>% arrange(Type)
# A tibble: 6 × 3
  Type      x     y
  <chr> <dbl> <dbl>
1 A      2.39  1.05
2 A      2.60  1.89
3 B      3.22  3.65
4 B      1.71  2.91
5 C      2.18  2.03
6 C      3.69  1.83
  • Possible to hierarchically sort according to several columns
    • Sort first according to Type and then according to x
tmp %>% arrange(Type,x)
# A tibble: 6 × 3
  Type      x     y
  <chr> <dbl> <dbl>
1 A      2.39  1.05
2 A      2.60  1.89
3 B      1.71  2.91
4 B      3.22  3.65
5 C      2.18  2.03
6 C      3.69  1.83

3.9 Filter

  • To choose rows/cases where conditions are true
tmp <- tibble(x=runif(5,1,4),y=runif(5,1,2),z=runif(5,1,20))
tmp
# A tibble: 5 × 3
      x     y     z
  <dbl> <dbl> <dbl>
1  2.06  1.27  7.80
2  3.29  1.94  4.96
3  1.11  1.61  4.10
4  3.55  1.63 19.2 
5  2.24  1.52 17.5 
tmp %>% filter(x>2 & y<2)
# A tibble: 4 × 3
      x     y     z
  <dbl> <dbl> <dbl>
1  2.06  1.27  7.80
2  3.29  1.94  4.96
3  3.55  1.63 19.2 
4  2.24  1.52 17.5 

3.10 Join

  • We may want to put two different data frames in relation
    • The two data frames contain a common column
  • Values \(x\) and \(y\) for Obs are in table tmp.1
  • Values \(x\) and \(y\) for Obs are in table tmp.2
tmp.1 <- tibble(x=runif(5,1,4),y=runif(5,1,2),Obs=3:7)
tmp.1
# A tibble: 5 × 3
      x     y   Obs
  <dbl> <dbl> <int>
1  3.14  1.86     3
2  3.26  1.32     4
3  2.94  1.29     5
4  3.26  1.78     6
5  2.93  1.03     7
tmp.2 <- tibble(z=runif(5,1,8),w=runif(5,0,2),Obs=1:5)
tmp.2
# A tibble: 5 × 3
      z     w   Obs
  <dbl> <dbl> <int>
1  1.49 0.399     1
2  1.24 0.705     2
3  2.45 1.92      3
4  2.21 0.533     4
5  6.32 0.752     5

3.11 Join: full

full_join(tmp.1,tmp.2,by="Obs")
# A tibble: 7 × 5
      x     y   Obs     z      w
  <dbl> <dbl> <int> <dbl>  <dbl>
1  3.14  1.86     3  2.45  1.92 
2  3.26  1.32     4  2.21  0.533
3  2.94  1.29     5  6.32  0.752
4  3.26  1.78     6 NA    NA    
5  2.93  1.03     7 NA    NA    
6 NA    NA        1  1.49  0.399
7 NA    NA        2  1.24  0.705
  • ! Not all Obs are in both tables

3.12 Join: partial

  • We may want to have all values of the “left” data frame (tmp.1)
left_join(tmp.1,tmp.2,by="Obs")
# A tibble: 5 × 5
      x     y   Obs     z      w
  <dbl> <dbl> <int> <dbl>  <dbl>
1  3.14  1.86     3  2.45  1.92 
2  3.26  1.32     4  2.21  0.533
3  2.94  1.29     5  6.32  0.752
4  3.26  1.78     6 NA    NA    
5  2.93  1.03     7 NA    NA    
  • We may want to have all values of the “right” data frame (tmp.2)
right_join(tmp.1,tmp.2,by="Obs")
# A tibble: 5 × 5
      x     y   Obs     z     w
  <dbl> <dbl> <int> <dbl> <dbl>
1  3.14  1.86     3  2.45 1.92 
2  3.26  1.32     4  2.21 0.533
3  2.94  1.29     5  6.32 0.752
4 NA    NA        1  1.49 0.399
5 NA    NA        2  1.24 0.705

3.13 Gather

  • Pivot longer takes multiple columns and collapses into key-value pairs, duplicating all other columns as needed
tmp.1
# A tibble: 5 × 3
      x     y   Obs
  <dbl> <dbl> <int>
1  3.14  1.86     3
2  3.26  1.32     4
3  2.94  1.29     5
4  3.26  1.78     6
5  2.93  1.03     7
tmp.2 <-
tmp.1 %>%
pivot_longer(names_to="Key",values_to="Value",c(x,y))
tmp.2
# A tibble: 10 × 3
     Obs Key   Value
   <int> <chr> <dbl>
 1     3 x      3.14
 2     3 y      1.86
 3     4 x      3.26
 4     4 y      1.32
 5     5 x      2.94
 6     5 y      1.29
 7     6 x      3.26
 8     6 y      1.78
 9     7 x      2.93
10     7 y      1.03

3.14 Spread

  • pivot_wider spreads a key-value pair across multiple columns (inverse of gather)
tmp.1 <- tmp.2 %>%
pivot_wider(names_from="Key",values_from="Value")
tmp.1
# A tibble: 5 × 3
    Obs     x     y
  <int> <dbl> <dbl>
1     3  3.14  1.86
2     4  3.26  1.32
3     5  2.94  1.29
4     6  3.26  1.78
5     7  2.93  1.03

3.15 Summarise

# A tibble: 6 × 3
  Type      x     y
  <chr> <dbl> <dbl>
1 A      3.12  2.93
2 B      2.70  3.32
3 C      1.27  1.03
4 A      3.50  2.41
5 B      3.96  3.48
6 C      3.60  1.54
  • Apply functions to your variables
    • Mean, SD, …
tmp %>% summarise_at("x",list(~mean(.),~sd(.)))
# A tibble: 1 × 2
   mean    sd
  <dbl> <dbl>
1  3.03 0.960
  • Very powerful in combination to group_by
    • “Cluster” the application of functions
tmp %>% group_by(Type) %>%  summarise_at("x",list(~n(),~mean(.),~sd(.)))
# A tibble: 3 × 4
  Type      n  mean    sd
  <chr> <int> <dbl> <dbl>
1 A         2  3.31 0.267
2 B         2  3.33 0.892
3 C         2  2.44 1.64 

4 An application

4.1 The data

  • We are going to use the diamonds data set
    • It is available in the ggplot2 library see here
    • It contains the prices and other attributes of almost 54,000 diamonds
diamonds %>% head() %>% kable()
carat cut color clarity depth table price x y z
0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48

4.2 Transformation

  • Create a unique ID for each diamond
diamonds <- diamonds |>
    mutate(ID = 1:n())
  • Convert price from USD to EUR
    • The exchange rate today is 0.91
diamonds <- diamonds |>
    mutate(price_EUR = price * 0.91)

4.3 Filter and select

  • Filter only diamonds with carat > 1
diamonds_sel <-
diamonds |>
    filter(carat > 1) 
  • Select only price_EUR, carat, cut, color, clarity
diamonds_sel <- 
diamonds_sel |>
    select(price_EUR, carat, cut, color, clarity) 

diamonds_sel |> head() |> kable()
price_EUR carat cut color clarity
2524.34 1.17 Very Good J I1
2530.71 1.01 Premium F I1
2537.08 1.01 Fair E I1
2537.08 1.01 Premium H SI2
2537.99 1.05 Very Good J SI2
2537.99 1.05 Fair J SI2

4.4 Analyze

  • Count diamonds by color
diamonds_sel |>
    count(color) |>
    arrange(desc(n)) |>
    kable()
color n
G 3719
H 3637
I 2680
F 2601
E 1892
J 1652
D 1321
  • Relationship between cut and price_EUR
diamonds_sel |>
    group_by(cut) |>
    summarise(mean_price = mean(price_EUR)) |>
    arrange(desc(mean_price)) |>
    kable()
cut mean_price
Ideal 7893.546
Premium 7723.396
Very Good 7589.899
Good 7055.777
Fair 6531.849

4.5 Visualize

  • Relationship between carat and price_EUR
diamonds_sel |>
    ggplot(aes(x = carat, y = price_EUR)) +
    geom_point(pch=1,color="grey50", alpha=.5)+
    geom_smooth()+
    theme_bw()+
    labs(x="Carat",y="Price (EUR)")

5 Appendix

5.1 References

References

Wickham, Hadley. 2014. “Tidy Data.” The Journal of Statistical Software 59. http://www.jstatsoft.org/v59/i10/.
Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. " O’Reilly Media, Inc.".

Footnotes

  1. A condensed online version available here.↩︎