Data Wrangling in R

An Introduction

Author

Matteo Ploner

1 Introduction

1.1 Data manipulation, visualization and reporting

This course illustrates techniques for data manipulation, visualization and reporting using R and R Markdown
- Reference to the following sources is made during the course
  - Chang, Winston. 2012. R Graphics Cookbook: Practical Recipes for Visualizing Data. ” O’Reilly Media, Inc.”.
  - Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. ” O’Reilly Media, Inc.”.
  - W. N. Venables, W.N., Smith D. M., and R Core Team. 2019. An Introduction to R.
Many useful resources can be found online
- e.g., stackoverflow

1.2 R

From https://cran.r-project.org/
- R is an integrated suite of software facilities for data manipulation, calculation and graphical display.
- R can be regarded as an implementation of the S language which was developed at Bell Laboratories by Rick Becker, John Chambers and Allan Wilks, and also forms the basis of the of the S-PLUS systems.

R is free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to redistribute it under certain conditions. Type ‘license()’ or ‘licence()’ for distribution details.

Natural language support but running in an English locale

R is a collaborative project with many contributors. Type ‘contributors()’ for more information and ‘citation()’ on how to cite R or R packages in publications.

Type ‘demo()’ for some demos, ‘help()’ for on-line help, or ‘help.start()’ for an HTML browser interface to help. Type ‘q()’ to quit R.

1.3 RStudio

From https://www.rstudio.com/products/RStudio/
- RStudio is an integrated development environment (IDE) for R
- RStudio is available in open source and commercial editions and runs on the desktop (Windows, Mac, and Linux) or in a browser connected to RStudio Server or RStudio Server Pro (Debian/Ubuntu, RedHat/CentOS, and SUSE Linux).

1.4 Visual Studio Code

Alternatively, you can use Visual Studio Code with the R extension
- Copilot is a new AI pair programmer that helps you write better code, faster. https://copilot.github.com/ [indeed, this line was written by copilot ]

2 Tidy data

2.1 Your data in good shape

Data Wrangling is “the art of getting your data into R in a useful form for visualization and modeling” (Wickham and Grolemund 2016)
We rely on the tidyverse library
- The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

library(tidyverse)

On-line resources about tidyverse

2.2 The Workflow

Wickham and Grolemund (2016) present a description of the workflow in data management

2.3 Tidy data

See Wickham (2014) for a description of good practices in organizing your data ¹.
3 rules to have tidy data
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.

Source: Wickham and Grolemund (2016)

Data that are not tidy are untidy!

2.4 Example of tidy data

Name	Height(mt)	Weight(kg)
Venonat	1.24	46.59
Ledyba	1.20	14.80
Wingull	0.49	7.05
Treeko	0.51	3.90
Zubat	0.90	8.66

Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.

2.5 Example of untidy data

Name	Key	Value
Ledyba	Height(mt)	1.20
Ledyba	Weight(kg)	14.80
Treeko	Height(mt)	0.51
Treeko	Weight(kg)	3.90
Venonat	Height(mt)	1.24
Venonat	Weight(kg)	46.59
Wingull	Height(mt)	0.49
Wingull	Weight(kg)	7.05
Zubat	Height(mt)	0.90
Zubat	Weight(kg)	8.66

3 Data wrangling

3.1 Coding Style

We rely on tidyverse syntax style
- The code we are going to write relies on pipes: %>%
- We use %>% to emphasize a sequence of action
  - From R > 4.1.0 the pipe is natively available: |>
- . is a placeholder

1:10 %>%
sum(.) %>%
.^2

[1] 3025

Without pipes

sum(1:10)^2

[1] 3025

3.2 Data transformation

We are performing a series of data transformation
- Import
- Mutate
- Select
- Rename
- Arrange
- Filter
- Join
- Gather
- Spread
- Summarise

3.3 Import

The first important step is to import your data in the R workspace
R can import several data formats
- My suggestion is to work with comma-separated (.csv) data
- In the tidyverse library, the following function is available

read_csv("PATH_TO_YOUR_FILE/FILENAME.csv")

Max Roser founder and director of Our World in Data

3.4 Tibbles

Data are imported in a tibble(Wickham and Grolemund 2016)
A special form of data frames in tidyverse
Refined print method that shows only the first 10 rows, and all the columns that fit on screen.

tibble(
  Col1= 1:3,
  Col2=4:6,
  Col3=7:9,
  Col4=10:12
)

# A tibble: 3 × 4
   Col1  Col2  Col3  Col4
  <int> <int> <int> <int>
1     1     4     7    10
2     2     5     8    11
3     3     6     9    12

3.5 Mutate

Adds new variables and preserves existing ones

tibble(Sequence=1:5) %>%
mutate(Sequence_rev=rev(Sequence))

# A tibble: 5 × 2
  Sequence Sequence_rev
     <int>        <int>
1        1            5
2        2            4
3        3            3
4        4            2
5        5            1

You can also overwrite old variables

tibble(Sequence=1:5) %>%
mutate(Sequence=rev(Sequence))

# A tibble: 5 × 1
  Sequence
     <int>
1        5
2        4
3        3
4        2
5        1

3.6 Select

We unselect one of the variables (e.g., z)

tibble(x=runif(5,0,1),y=runif(5,0,2),z=runif(5,-1,1)) %>%
select(-z)

# A tibble: 5 × 2
       x     y
   <dbl> <dbl>
1 0.797  1.93 
2 0.227  1.17 
3 0.937  0.347
4 0.476  0.762
5 0.0733 0.107

We explicitly select the variables of interest

tibble(x=runif(5,0,1),y=runif(5,0,2),z=runif(5,-1,1)) %>%
select(x,y)

# A tibble: 5 × 2
      x     y
  <dbl> <dbl>
1 0.112 1.85 
2 0.585 1.65 
3 0.268 0.474
4 0.577 0.448
5 0.224 0.889

3.7 Rename

Rename variables
- Make names more self-explanatory

tmp <- tibble(x=runif(5,1,4),y=runif(5,1,2),z=runif(5,1,20))
tmp

# A tibble: 5 × 3
      x     y     z
  <dbl> <dbl> <dbl>
1  1.64  1.33  5.48
2  3.93  1.66 19.2 
3  2.62  1.48  2.49
4  2.30  1.09  1.29
5  2.75  1.77 15.8

We may want to change the name of variables, to make them more self-explanatory

tmp <- tmp %>% rename(Length=x,Width=y,Height=z)
tmp

# A tibble: 5 × 3
  Length Width Height
   <dbl> <dbl>  <dbl>
1   1.64  1.33   5.48
2   3.93  1.66  19.2 
3   2.62  1.48   2.49
4   2.30  1.09   1.29
5   2.75  1.77  15.8

3.8 Arrange

# A tibble: 6 × 3
  Type      x     y
  <chr> <dbl> <dbl>
1 A      2.39  1.05
2 B      3.22  3.65
3 C      2.18  2.03
4 A      2.60  1.89
5 B      1.71  2.91
6 C      3.69  1.83

Sort your data according to values in one columns
- Important to get a first impression on rankings

tmp %>% arrange(Type)

# A tibble: 6 × 3
  Type      x     y
  <chr> <dbl> <dbl>
1 A      2.39  1.05
2 A      2.60  1.89
3 B      3.22  3.65
4 B      1.71  2.91
5 C      2.18  2.03
6 C      3.69  1.83

Possible to hierarchically sort according to several columns
- Sort first according to Type and then according to x

tmp %>% arrange(Type,x)

# A tibble: 6 × 3
  Type      x     y
  <chr> <dbl> <dbl>
1 A      2.39  1.05
2 A      2.60  1.89
3 B      1.71  2.91
4 B      3.22  3.65
5 C      2.18  2.03
6 C      3.69  1.83

3.9 Filter

To choose rows/cases where conditions are true

tmp <- tibble(x=runif(5,1,4),y=runif(5,1,2),z=runif(5,1,20))
tmp

# A tibble: 5 × 3
      x     y     z
  <dbl> <dbl> <dbl>
1  2.06  1.27  7.80
2  3.29  1.94  4.96
3  1.11  1.61  4.10
4  3.55  1.63 19.2 
5  2.24  1.52 17.5

tmp %>% filter(x>2 & y<2)

# A tibble: 4 × 3
      x     y     z
  <dbl> <dbl> <dbl>
1  2.06  1.27  7.80
2  3.29  1.94  4.96
3  3.55  1.63 19.2 
4  2.24  1.52 17.5

3.10 Join

We may want to put two different data frames in relation
- The two data frames contain a common column
Values \(x\) and \(y\) for Obs are in table tmp.1
Values \(x\) and \(y\) for Obs are in table tmp.2

tmp.1 <- tibble(x=runif(5,1,4),y=runif(5,1,2),Obs=3:7)
tmp.1

# A tibble: 5 × 3
      x     y   Obs
  <dbl> <dbl> <int>
1  3.14  1.86     3
2  3.26  1.32     4
3  2.94  1.29     5
4  3.26  1.78     6
5  2.93  1.03     7

tmp.2 <- tibble(z=runif(5,1,8),w=runif(5,0,2),Obs=1:5)
tmp.2

# A tibble: 5 × 3
      z     w   Obs
  <dbl> <dbl> <int>
1  1.49 0.399     1
2  1.24 0.705     2
3  2.45 1.92      3
4  2.21 0.533     4
5  6.32 0.752     5

3.11 Join: full

full_join(tmp.1,tmp.2,by="Obs")

# A tibble: 7 × 5
      x     y   Obs     z      w
  <dbl> <dbl> <int> <dbl>  <dbl>
1  3.14  1.86     3  2.45  1.92 
2  3.26  1.32     4  2.21  0.533
3  2.94  1.29     5  6.32  0.752
4  3.26  1.78     6 NA    NA    
5  2.93  1.03     7 NA    NA    
6 NA    NA        1  1.49  0.399
7 NA    NA        2  1.24  0.705

! Not all Obs are in both tables

3.12 Join: partial

We may want to have all values of the “left” data frame (tmp.1)

left_join(tmp.1,tmp.2,by="Obs")

# A tibble: 5 × 5
      x     y   Obs     z      w
  <dbl> <dbl> <int> <dbl>  <dbl>
1  3.14  1.86     3  2.45  1.92 
2  3.26  1.32     4  2.21  0.533
3  2.94  1.29     5  6.32  0.752
4  3.26  1.78     6 NA    NA    
5  2.93  1.03     7 NA    NA

We may want to have all values of the “right” data frame (tmp.2)

right_join(tmp.1,tmp.2,by="Obs")

# A tibble: 5 × 5
      x     y   Obs     z     w
  <dbl> <dbl> <int> <dbl> <dbl>
1  3.14  1.86     3  2.45 1.92 
2  3.26  1.32     4  2.21 0.533
3  2.94  1.29     5  6.32 0.752
4 NA    NA        1  1.49 0.399
5 NA    NA        2  1.24 0.705

3.13 Gather

Pivot longer takes multiple columns and collapses into key-value pairs, duplicating all other columns as needed

tmp.1

# A tibble: 5 × 3
      x     y   Obs
  <dbl> <dbl> <int>
1  3.14  1.86     3
2  3.26  1.32     4
3  2.94  1.29     5
4  3.26  1.78     6
5  2.93  1.03     7

tmp.2 <-
tmp.1 %>%
pivot_longer(names_to="Key",values_to="Value",c(x,y))
tmp.2

# A tibble: 10 × 3
     Obs Key   Value
   <int> <chr> <dbl>
 1     3 x      3.14
 2     3 y      1.86
 3     4 x      3.26
 4     4 y      1.32
 5     5 x      2.94
 6     5 y      1.29
 7     6 x      3.26
 8     6 y      1.78
 9     7 x      2.93
10     7 y      1.03

3.14 Spread

pivot_wider spreads a key-value pair across multiple columns (inverse of gather)

tmp.1 <- tmp.2 %>%
pivot_wider(names_from="Key",values_from="Value")
tmp.1

# A tibble: 5 × 3
    Obs     x     y
  <int> <dbl> <dbl>
1     3  3.14  1.86
2     4  3.26  1.32
3     5  2.94  1.29
4     6  3.26  1.78
5     7  2.93  1.03

3.15 Summarise

# A tibble: 6 × 3
  Type      x     y
  <chr> <dbl> <dbl>
1 A      3.12  2.93
2 B      2.70  3.32
3 C      1.27  1.03
4 A      3.50  2.41
5 B      3.96  3.48
6 C      3.60  1.54

Apply functions to your variables
- Mean, SD, …

tmp %>% summarise_at("x",list(~mean(.),~sd(.)))

# A tibble: 1 × 2
   mean    sd
  <dbl> <dbl>
1  3.03 0.960

Very powerful in combination to group_by
- “Cluster” the application of functions

tmp %>% group_by(Type) %>%  summarise_at("x",list(~n(),~mean(.),~sd(.)))

# A tibble: 3 × 4
  Type      n  mean    sd
  <chr> <int> <dbl> <dbl>
1 A         2  3.31 0.267
2 B         2  3.33 0.892
3 C         2  2.44 1.64

4 An application

4.1 The data

We are going to use the diamonds data set
- It is available in the ggplot2 library see here
- It contains the prices and other attributes of almost 54,000 diamonds

diamonds %>% head() %>% kable()

carat	cut	color	clarity	depth	table	price	x	y	z
0.23	Ideal	E	SI2	61.5	55	326	3.95	3.98	2.43
0.21	Premium	E	SI1	59.8	61	326	3.89	3.84	2.31
0.23	Good	E	VS1	56.9	65	327	4.05	4.07	2.31
0.29	Premium	I	VS2	62.4	58	334	4.20	4.23	2.63
0.31	Good	J	SI2	63.3	58	335	4.34	4.35	2.75
0.24	Very Good	J	VVS2	62.8	57	336	3.94	3.96	2.48

4.2 Transformation

Create a unique ID for each diamond

diamonds <- diamonds |>
    mutate(ID = 1:n())

Convert price from USD to EUR
- The exchange rate today is 0.91

diamonds <- diamonds |>
    mutate(price_EUR = price * 0.91)

4.3 Filter and select

Filter only diamonds with carat > 1

diamonds_sel <-
diamonds |>
    filter(carat > 1)

Select only price_EUR, carat, cut, color, clarity

diamonds_sel <- 
diamonds_sel |>
    select(price_EUR, carat, cut, color, clarity) 

diamonds_sel |> head() |> kable()

price_EUR	carat	cut	color	clarity
2524.34	1.17	Very Good	J	I1
2530.71	1.01	Premium	F	I1
2537.08	1.01	Fair	E	I1
2537.08	1.01	Premium	H	SI2
2537.99	1.05	Very Good	J	SI2
2537.99	1.05	Fair	J	SI2

4.4 Analyze

Count diamonds by color

diamonds_sel |>
    count(color) |>
    arrange(desc(n)) |>
    kable()

color	n
G	3719
H	3637
I	2680
F	2601
E	1892
J	1652
D	1321

Relationship between cut and price_EUR

diamonds_sel |>
    group_by(cut) |>
    summarise(mean_price = mean(price_EUR)) |>
    arrange(desc(mean_price)) |>
    kable()

cut	mean_price
Ideal	7893.546
Premium	7723.396
Very Good	7589.899
Good	7055.777
Fair	6531.849

4.5 Visualize

Relationship between carat and price_EUR

diamonds_sel |>
    ggplot(aes(x = carat, y = price_EUR)) +
    geom_point(pch=1,color="grey50", alpha=.5)+
    geom_smooth()+
    theme_bw()+
    labs(x="Carat",y="Price (EUR)")

5 Appendix

5.1 References

References

Wickham, Hadley. 2014. “Tidy Data.” The Journal of Statistical Software 59. http://www.jstatsoft.org/v59/i10/.

Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. " O’Reilly Media, Inc.".

Footnotes

A condensed online version available here.↩︎