library(tidyverse)
Data Wrangling in R
An Introduction
1 Introduction
1.1 Data manipulation, visualization and reporting
- This course illustrates techniques for data manipulation, visualization and reporting using R and R Markdown
- Reference to the following sources is made during the course
- Chang, Winston. 2012. R Graphics Cookbook: Practical Recipes for Visualizing Data. ” O’Reilly Media, Inc.”.
- Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. ” O’Reilly Media, Inc.”.
- W. N. Venables, W.N., Smith D. M., and R Core Team. 2019. An Introduction to R.
- Reference to the following sources is made during the course
- Many useful resources can be found online
- e.g., stackoverflow
1.2 R
- From https://cran.r-project.org/
- R is an integrated suite of software facilities for data manipulation, calculation and graphical display.
- R can be regarded as an implementation of the S language which was developed at Bell Laboratories by Rick Becker, John Chambers and Allan Wilks, and also forms the basis of the of the S-PLUS systems.
R version 3.6.2 (2019-12-12) – “Dark and Stormy Night” Copyright (C) 2019 The R Foundation for Statistical Computing Platform: x86_64-apple-darwin15.6.0 (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to redistribute it under certain conditions. Type ‘license()’ or ‘licence()’ for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors. Type ‘contributors()’ for more information and ‘citation()’ on how to cite R or R packages in publications.
Type ‘demo()’ for some demos, ‘help()’ for on-line help, or ‘help.start()’ for an HTML browser interface to help. Type ‘q()’ to quit R.
1.3 RStudio
- From https://www.rstudio.com/products/RStudio/
- RStudio is an integrated development environment (IDE) for R
- RStudio is available in open source and commercial editions and runs on the desktop (Windows, Mac, and Linux) or in a browser connected to RStudio Server or RStudio Server Pro (Debian/Ubuntu, RedHat/CentOS, and SUSE Linux).
1.4 Visual Studio Code
- Alternatively, you can use Visual Studio Code with the R extension
- Copilot is a new AI pair programmer that helps you write better code, faster. https://copilot.github.com/ [indeed, this line was written by copilot ]
2 Tidy data
2.1 Your data in good shape
Data Wrangling is “the art of getting your data into R in a useful form for visualization and modeling” (Wickham and Grolemund 2016)
We rely on the tidyverse library
- The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
- On-line resources about tidyverse
2.2 The Workflow
- Wickham and Grolemund (2016) present a description of the workflow in data management
2.3 Tidy data
See Wickham (2014) for a description of good practices in organizing your data 1.
3 rules to have tidy data
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
Source: Wickham and Grolemund (2016)
- Data that are not tidy are untidy!
2.4 Example of tidy data
Name | Height(mt) | Weight(kg) |
---|---|---|
Venonat | 1.24 | 46.59 |
Ledyba | 1.20 | 14.80 |
Wingull | 0.49 | 7.05 |
Treeko | 0.51 | 3.90 |
Zubat | 0.90 | 8.66 |
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
2.5 Example of untidy data
Name | Key | Value |
---|---|---|
Ledyba | Height(mt) | 1.20 |
Ledyba | Weight(kg) | 14.80 |
Treeko | Height(mt) | 0.51 |
Treeko | Weight(kg) | 3.90 |
Venonat | Height(mt) | 1.24 |
Venonat | Weight(kg) | 46.59 |
Wingull | Height(mt) | 0.49 |
Wingull | Weight(kg) | 7.05 |
Zubat | Height(mt) | 0.90 |
Zubat | Weight(kg) | 8.66 |
3 Data wrangling
3.1 Coding Style
- We rely on tidyverse syntax style
- The code we are going to write relies on pipes: %>%
- We use %>% to emphasize a sequence of action
- From R > 4.1.0 the pipe is natively available:
|>
- From R > 4.1.0 the pipe is natively available:
- . is a placeholder
1:10 %>%
sum(.) %>%
^2 .
[1] 3025
- Without pipes
sum(1:10)^2
[1] 3025
3.2 Data transformation
- We are performing a series of data transformation
- Import
- Mutate
- Select
- Rename
- Arrange
- Filter
- Join
- Gather
- Spread
- Summarise
3.3 Import
- The first important step is to import your data in the R workspace
- R can import several data formats
- My suggestion is to work with comma-separated (.csv) data
- In the tidyverse library, the following function is available
read_csv("PATH_TO_YOUR_FILE/FILENAME.csv")
- Max Roser founder and director of Our World in Data
3.4 Tibbles
- Data are imported in a tibble(Wickham and Grolemund 2016)
- A special form of data frames in tidyverse
- Refined print method that shows only the first 10 rows, and all the columns that fit on screen.
tibble(
Col1= 1:3,
Col2=4:6,
Col3=7:9,
Col4=10:12
)
# A tibble: 3 × 4
Col1 Col2 Col3 Col4
<int> <int> <int> <int>
1 1 4 7 10
2 2 5 8 11
3 3 6 9 12
3.5 Mutate
- Adds new variables and preserves existing ones
tibble(Sequence=1:5) %>%
mutate(Sequence_rev=rev(Sequence))
# A tibble: 5 × 2
Sequence Sequence_rev
<int> <int>
1 1 5
2 2 4
3 3 3
4 4 2
5 5 1
- You can also overwrite old variables
tibble(Sequence=1:5) %>%
mutate(Sequence=rev(Sequence))
# A tibble: 5 × 1
Sequence
<int>
1 5
2 4
3 3
4 2
5 1
3.6 Select
- We unselect one of the variables (e.g., z)
tibble(x=runif(5,0,1),y=runif(5,0,2),z=runif(5,-1,1)) %>%
select(-z)
# A tibble: 5 × 2
x y
<dbl> <dbl>
1 0.797 1.93
2 0.227 1.17
3 0.937 0.347
4 0.476 0.762
5 0.0733 0.107
- We explicitly select the variables of interest
tibble(x=runif(5,0,1),y=runif(5,0,2),z=runif(5,-1,1)) %>%
select(x,y)
# A tibble: 5 × 2
x y
<dbl> <dbl>
1 0.112 1.85
2 0.585 1.65
3 0.268 0.474
4 0.577 0.448
5 0.224 0.889
3.7 Rename
- Rename variables
- Make names more self-explanatory
<- tibble(x=runif(5,1,4),y=runif(5,1,2),z=runif(5,1,20))
tmp tmp
# A tibble: 5 × 3
x y z
<dbl> <dbl> <dbl>
1 1.64 1.33 5.48
2 3.93 1.66 19.2
3 2.62 1.48 2.49
4 2.30 1.09 1.29
5 2.75 1.77 15.8
- We may want to change the name of variables, to make them more self-explanatory
<- tmp %>% rename(Length=x,Width=y,Height=z)
tmp tmp
# A tibble: 5 × 3
Length Width Height
<dbl> <dbl> <dbl>
1 1.64 1.33 5.48
2 3.93 1.66 19.2
3 2.62 1.48 2.49
4 2.30 1.09 1.29
5 2.75 1.77 15.8
3.8 Arrange
# A tibble: 6 × 3
Type x y
<chr> <dbl> <dbl>
1 A 2.39 1.05
2 B 3.22 3.65
3 C 2.18 2.03
4 A 2.60 1.89
5 B 1.71 2.91
6 C 3.69 1.83
- Sort your data according to values in one columns
- Important to get a first impression on rankings
%>% arrange(Type) tmp
# A tibble: 6 × 3
Type x y
<chr> <dbl> <dbl>
1 A 2.39 1.05
2 A 2.60 1.89
3 B 3.22 3.65
4 B 1.71 2.91
5 C 2.18 2.03
6 C 3.69 1.83
- Possible to hierarchically sort according to several columns
- Sort first according to Type and then according to x
%>% arrange(Type,x) tmp
# A tibble: 6 × 3
Type x y
<chr> <dbl> <dbl>
1 A 2.39 1.05
2 A 2.60 1.89
3 B 1.71 2.91
4 B 3.22 3.65
5 C 2.18 2.03
6 C 3.69 1.83
3.9 Filter
- To choose rows/cases where conditions are true
<- tibble(x=runif(5,1,4),y=runif(5,1,2),z=runif(5,1,20))
tmp tmp
# A tibble: 5 × 3
x y z
<dbl> <dbl> <dbl>
1 2.06 1.27 7.80
2 3.29 1.94 4.96
3 1.11 1.61 4.10
4 3.55 1.63 19.2
5 2.24 1.52 17.5
%>% filter(x>2 & y<2) tmp
# A tibble: 4 × 3
x y z
<dbl> <dbl> <dbl>
1 2.06 1.27 7.80
2 3.29 1.94 4.96
3 3.55 1.63 19.2
4 2.24 1.52 17.5
3.10 Join
- We may want to put two different data frames in relation
- The two data frames contain a common column
- Values \(x\) and \(y\) for Obs are in table tmp.1
- Values \(x\) and \(y\) for Obs are in table tmp.2
.1 <- tibble(x=runif(5,1,4),y=runif(5,1,2),Obs=3:7)
tmp.1 tmp
# A tibble: 5 × 3
x y Obs
<dbl> <dbl> <int>
1 3.14 1.86 3
2 3.26 1.32 4
3 2.94 1.29 5
4 3.26 1.78 6
5 2.93 1.03 7
.2 <- tibble(z=runif(5,1,8),w=runif(5,0,2),Obs=1:5)
tmp.2 tmp
# A tibble: 5 × 3
z w Obs
<dbl> <dbl> <int>
1 1.49 0.399 1
2 1.24 0.705 2
3 2.45 1.92 3
4 2.21 0.533 4
5 6.32 0.752 5
3.11 Join: full
full_join(tmp.1,tmp.2,by="Obs")
# A tibble: 7 × 5
x y Obs z w
<dbl> <dbl> <int> <dbl> <dbl>
1 3.14 1.86 3 2.45 1.92
2 3.26 1.32 4 2.21 0.533
3 2.94 1.29 5 6.32 0.752
4 3.26 1.78 6 NA NA
5 2.93 1.03 7 NA NA
6 NA NA 1 1.49 0.399
7 NA NA 2 1.24 0.705
- ! Not all Obs are in both tables
3.12 Join: partial
- We may want to have all values of the “left” data frame (tmp.1)
left_join(tmp.1,tmp.2,by="Obs")
# A tibble: 5 × 5
x y Obs z w
<dbl> <dbl> <int> <dbl> <dbl>
1 3.14 1.86 3 2.45 1.92
2 3.26 1.32 4 2.21 0.533
3 2.94 1.29 5 6.32 0.752
4 3.26 1.78 6 NA NA
5 2.93 1.03 7 NA NA
- We may want to have all values of the “right” data frame (tmp.2)
right_join(tmp.1,tmp.2,by="Obs")
# A tibble: 5 × 5
x y Obs z w
<dbl> <dbl> <int> <dbl> <dbl>
1 3.14 1.86 3 2.45 1.92
2 3.26 1.32 4 2.21 0.533
3 2.94 1.29 5 6.32 0.752
4 NA NA 1 1.49 0.399
5 NA NA 2 1.24 0.705
3.13 Gather
- Pivot longer takes multiple columns and collapses into key-value pairs, duplicating all other columns as needed
.1 tmp
# A tibble: 5 × 3
x y Obs
<dbl> <dbl> <int>
1 3.14 1.86 3
2 3.26 1.32 4
3 2.94 1.29 5
4 3.26 1.78 6
5 2.93 1.03 7
.2 <-
tmp.1 %>%
tmppivot_longer(names_to="Key",values_to="Value",c(x,y))
.2 tmp
# A tibble: 10 × 3
Obs Key Value
<int> <chr> <dbl>
1 3 x 3.14
2 3 y 1.86
3 4 x 3.26
4 4 y 1.32
5 5 x 2.94
6 5 y 1.29
7 6 x 3.26
8 6 y 1.78
9 7 x 2.93
10 7 y 1.03
3.14 Spread
- pivot_wider spreads a key-value pair across multiple columns (inverse of gather)
.1 <- tmp.2 %>%
tmppivot_wider(names_from="Key",values_from="Value")
.1 tmp
# A tibble: 5 × 3
Obs x y
<int> <dbl> <dbl>
1 3 3.14 1.86
2 4 3.26 1.32
3 5 2.94 1.29
4 6 3.26 1.78
5 7 2.93 1.03
3.15 Summarise
# A tibble: 6 × 3
Type x y
<chr> <dbl> <dbl>
1 A 3.12 2.93
2 B 2.70 3.32
3 C 1.27 1.03
4 A 3.50 2.41
5 B 3.96 3.48
6 C 3.60 1.54
- Apply functions to your variables
- Mean, SD, …
%>% summarise_at("x",list(~mean(.),~sd(.))) tmp
# A tibble: 1 × 2
mean sd
<dbl> <dbl>
1 3.03 0.960
- Very powerful in combination to
group_by
- “Cluster” the application of functions
%>% group_by(Type) %>% summarise_at("x",list(~n(),~mean(.),~sd(.))) tmp
# A tibble: 3 × 4
Type n mean sd
<chr> <int> <dbl> <dbl>
1 A 2 3.31 0.267
2 B 2 3.33 0.892
3 C 2 2.44 1.64
4 An application
4.1 The data
- We are going to use the diamonds data set
- It is available in the ggplot2 library see here
- It contains the prices and other attributes of almost 54,000 diamonds
%>% head() %>% kable() diamonds
carat | cut | color | clarity | depth | table | price | x | y | z |
---|---|---|---|---|---|---|---|---|---|
0.23 | Ideal | E | SI2 | 61.5 | 55 | 326 | 3.95 | 3.98 | 2.43 |
0.21 | Premium | E | SI1 | 59.8 | 61 | 326 | 3.89 | 3.84 | 2.31 |
0.23 | Good | E | VS1 | 56.9 | 65 | 327 | 4.05 | 4.07 | 2.31 |
0.29 | Premium | I | VS2 | 62.4 | 58 | 334 | 4.20 | 4.23 | 2.63 |
0.31 | Good | J | SI2 | 63.3 | 58 | 335 | 4.34 | 4.35 | 2.75 |
0.24 | Very Good | J | VVS2 | 62.8 | 57 | 336 | 3.94 | 3.96 | 2.48 |
4.2 Transformation
- Create a unique ID for each diamond
<- diamonds |>
diamonds mutate(ID = 1:n())
- Convert price from USD to EUR
- The exchange rate today is 0.91
<- diamonds |>
diamonds mutate(price_EUR = price * 0.91)
4.3 Filter and select
- Filter only diamonds with
carat
> 1
<-
diamonds_sel |>
diamonds filter(carat > 1)
- Select only
price_EUR
,carat
,cut
,color
,clarity
<-
diamonds_sel |>
diamonds_sel select(price_EUR, carat, cut, color, clarity)
|> head() |> kable() diamonds_sel
price_EUR | carat | cut | color | clarity |
---|---|---|---|---|
2524.34 | 1.17 | Very Good | J | I1 |
2530.71 | 1.01 | Premium | F | I1 |
2537.08 | 1.01 | Fair | E | I1 |
2537.08 | 1.01 | Premium | H | SI2 |
2537.99 | 1.05 | Very Good | J | SI2 |
2537.99 | 1.05 | Fair | J | SI2 |
4.4 Analyze
- Count diamonds by
color
|>
diamonds_sel count(color) |>
arrange(desc(n)) |>
kable()
color | n |
---|---|
G | 3719 |
H | 3637 |
I | 2680 |
F | 2601 |
E | 1892 |
J | 1652 |
D | 1321 |
- Relationship between
cut
andprice_EUR
|>
diamonds_sel group_by(cut) |>
summarise(mean_price = mean(price_EUR)) |>
arrange(desc(mean_price)) |>
kable()
cut | mean_price |
---|---|
Ideal | 7893.546 |
Premium | 7723.396 |
Very Good | 7589.899 |
Good | 7055.777 |
Fair | 6531.849 |
4.5 Visualize
- Relationship between
carat
andprice_EUR
|>
diamonds_sel ggplot(aes(x = carat, y = price_EUR)) +
geom_point(pch=1,color="grey50", alpha=.5)+
geom_smooth()+
theme_bw()+
labs(x="Carat",y="Price (EUR)")