Introduction

Data Wrangling

Data Wrangling is “the art of getting your data into R in a useful form for visualization and modeling” (Wickham and Grolemund 2016)
We rely on the tidyverse library
- The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

library(tidyverse)

On-line resources about tidyverse

Tidy data

See Wickham (2014) for a description of good practices in organizing your data.¹
3 rules to have tidy data
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.

Source: Wickham and Grolemund (2016)

Data that are not tidy are untidy!

Example of tidy data

  # A tibble: 5 x 3
    Name    `Height(mt)` `Weight(kg)`
    <chr>          <dbl>        <dbl>
  1 Venonat         1.24        46.6 
  2 Ledyba          1.2         14.8 
  3 Wingull         0.49         7.05
  4 Treeko          0.51         3.9 
  5 Zubat           0.9          8.66

Example of untidy data

  # A tibble: 10 x 3
     Name    Key        Value
     <chr>   <chr>      <dbl>
   1 Ledyba  Height(mt)  1.2 
   2 Ledyba  Weight(kg) 14.8 
   3 Treeko  Height(mt)  0.51
   4 Treeko  Weight(kg)  3.9 
   5 Venonat Height(mt)  1.24
   6 Venonat Weight(kg) 46.6 
   7 Wingull Height(mt)  0.49
   8 Wingull Weight(kg)  7.05
   9 Zubat   Height(mt)  0.9 
  10 Zubat   Weight(kg)  8.66

Data wrangling

Coding Style

The code we are going to write relies on pipes: %>%
We use %>% to emphasize a sequence of action
- . is a placeholder

1:10 %>%
sum(.) %>%
.^2

  [1] 3025

Without pipes

sum(1:10)^2

  [1] 3025

Mutate

Adds new variables and preserves existing ones

tibble(Sequence=1:5) %>%
mutate(Sequence_rev=rev(Sequence))

  # A tibble: 5 x 2
    Sequence Sequence_rev
       <int>        <int>
  1        1            5
  2        2            4
  3        3            3
  4        4            2
  5        5            1

You can also overwrite old variables

tibble(Sequence=1:5) %>%
mutate(Sequence=rev(Sequence))

  # A tibble: 5 x 1
    Sequence
       <int>
  1        5
  2        4
  3        3
  4        2
  5        1

Select

Choose only a few variables of the data frame
We not select one of the variables (e.g., z)

tibble(x=runif(5,0,1),y=runif(5,0,2),z=runif(5,-1,1)) %>%
select(-z)

  # A tibble: 5 x 2
         x     y
     <dbl> <dbl>
  1 0.171  0.961
  2 0.545  1.77 
  3 0.548  1.29 
  4 0.0527 0.354
  5 0.654  0.927

We explicitly select the variables of interest

tibble(x=runif(5,0,1),y=runif(5,0,2),z=runif(5,-1,1)) %>%
select(x,y)

  # A tibble: 5 x 2
         x     y
     <dbl> <dbl>
  1 0.849  0.775
  2 0.584  0.140
  3 0.876  0.442
  4 0.792  0.421
  5 0.0208 0.669

Rename

We may want to change the name of variables, to make them more self-explanatory

tmp <- tibble(x=runif(5,1,4),y=runif(5,1,2),z=runif(5,1,20))
tmp

  # A tibble: 5 x 3
        x     y     z
    <dbl> <dbl> <dbl>
  1  1.76  1.95 12.1 
  2  3.14  1.55 18.8 
  3  1.68  1.54 15.3 
  4  3.61  1.33  1.92
  5  2.15  1.41 17.5

tmp <- tmp %>% rename(Length=x,Width=y,Height=z)
tmp

  # A tibble: 5 x 3
    Length Width Height
     <dbl> <dbl>  <dbl>
  1   1.76  1.95  12.1 
  2   3.14  1.55  18.8 
  3   1.68  1.54  15.3 
  4   3.61  1.33   1.92
  5   2.15  1.41  17.5

Arrange

Sort you data according to values in one columns
- Important to get a first impression on rankings
  - You can hierarchically sort according to several columns
According to length

tmp <- tibble(Type=c(LETTERS[1:5],LETTERS[1:5]),x=runif(10,1,4),y=runif(10,1,4))
tmp %>% arrange(Type)

  # A tibble: 10 x 3
     Type      x     y
     <chr> <dbl> <dbl>
   1 A      3.73  1.18
   2 A      3.33  3.24
   3 B      3.63  3.39
   4 B      3.62  3.65
   5 C      1.51  2.30
   6 C      1.10  1.24
   7 D      2.55  2.27
   8 D      1.18  2.01
   9 E      2.02  3.05
  10 E      1.63  3.70

tmp %>% arrange(Type,x)

  # A tibble: 10 x 3
     Type      x     y
     <chr> <dbl> <dbl>
   1 A      3.33  3.24
   2 A      3.73  1.18
   3 B      3.62  3.65
   4 B      3.63  3.39
   5 C      1.10  1.24
   6 C      1.51  2.30
   7 D      1.18  2.01
   8 D      2.55  2.27
   9 E      1.63  3.70
  10 E      2.02  3.05

Filter

To choose rows/cases where conditions are true

tmp <- tibble(x=runif(5,1,4),y=runif(5,1,2),z=runif(5,1,20))
tmp

  # A tibble: 5 x 3
        x     y     z
    <dbl> <dbl> <dbl>
  1  2.57  1.79 13.8 
  2  1.99  1.42 14.4 
  3  2.20  1.36 11.3 
  4  1.47  1.16  2.18
  5  1.42  1.53  8.17

tmp %>% filter(x>2 & y<2)

  # A tibble: 2 x 3
        x     y     z
    <dbl> <dbl> <dbl>
  1  2.57  1.79  13.8
  2  2.20  1.36  11.3

Join: full

We may want to put two different data frames in relation
- The two data frames contain a common column
Values \(x\) and \(y\) for Obs are in table tmp.1
Values \(x\) and \(y\) for Obs are in table tmp.2

tmp.1 <- tibble(x=runif(5,1,4),y=runif(5,1,2),Obs=3:7)
tmp.1

  # A tibble: 5 x 3
        x     y   Obs
    <dbl> <dbl> <int>
  1  1.72  1.12     3
  2  1.36  1.46     4
  3  3.83  1.62     5
  4  3.27  1.86     6
  5  1.60  1.49     7

tmp.2 <- tibble(z=runif(5,1,8),w=runif(5,0,2),Obs=1:5)
tmp.2

  # A tibble: 5 x 3
        z     w   Obs
    <dbl> <dbl> <int>
  1  2.62 0.230     1
  2  2.74 1.94      2
  3  3.65 1.59      3
  4  7.92 1.06      4
  5  2.10 1.33      5

Join: full (ii)

full_join(tmp.1,tmp.2,by="Obs")

  # A tibble: 7 x 5
        x     y   Obs     z      w
    <dbl> <dbl> <int> <dbl>  <dbl>
  1  1.72  1.12     3  3.65  1.59 
  2  1.36  1.46     4  7.92  1.06 
  3  3.83  1.62     5  2.10  1.33 
  4  3.27  1.86     6 NA    NA    
  5  1.60  1.49     7 NA    NA    
  6 NA    NA        1  2.62  0.230
  7 NA    NA        2  2.74  1.94

! Not all Obs are in both tables

Join: partial

We may want to have all values of the “left” data frame (tmp.1)

left_join(tmp.1,tmp.2,by="Obs")

  # A tibble: 5 x 5
        x     y   Obs     z     w
    <dbl> <dbl> <int> <dbl> <dbl>
  1  1.72  1.12     3  3.65  1.59
  2  1.36  1.46     4  7.92  1.06
  3  3.83  1.62     5  2.10  1.33
  4  3.27  1.86     6 NA    NA   
  5  1.60  1.49     7 NA    NA

We may want to have all values of the “right” data frame (tmp.2)

right_join(tmp.1,tmp.2,by="Obs")

  # A tibble: 5 x 5
        x     y   Obs     z     w
    <dbl> <dbl> <int> <dbl> <dbl>
  1 NA    NA        1  2.62 0.230
  2 NA    NA        2  2.74 1.94 
  3  1.72  1.12     3  3.65 1.59 
  4  1.36  1.46     4  7.92 1.06 
  5  3.83  1.62     5  2.10 1.33

Gathering

Gather takes multiple columns and collapses into key-value pairs, duplicating all other columns as needed

tmp.1

  # A tibble: 5 x 3
        x     y   Obs
    <dbl> <dbl> <int>
  1  1.72  1.12     3
  2  1.36  1.46     4
  3  3.83  1.62     5
  4  3.27  1.86     6
  5  1.60  1.49     7

tmp.1<- tmp.1 %>%
gather("Key","Value",c(x,y))
tmp.1

  # A tibble: 10 x 3
       Obs Key   Value
     <int> <chr> <dbl>
   1     3 x      1.72
   2     4 x      1.36
   3     5 x      3.83
   4     6 x      3.27
   5     7 x      1.60
   6     3 y      1.12
   7     4 y      1.46
   8     5 y      1.62
   9     6 y      1.86
  10     7 y      1.49

Spread

spread spreads a key-value pair across multiple columns (inverse of gather)

tmp.1 %>%
spread("Key","Value")

  # A tibble: 5 x 3
      Obs     x     y
    <int> <dbl> <dbl>
  1     3  1.72  1.12
  2     4  1.36  1.46
  3     5  3.83  1.62
  4     6  3.27  1.86
  5     7  1.60  1.49

An Application to Experimental Data

The Experiment

We are going to work with a dataframe containing the outcomes of the following experiment
Participants play an Ultimatum Game
- Two roles: Proposer and Responder
- The Proposer receives an endowment of € 10 and decides how to divide it between herself and the Responder
- The Responder decides whether to accept the offer or not
- The consequence of Responder choice is manipulated in two experimental treatments
  - Voice+Punishment
    - When the offer is rejected, both the Proposer and the Responder earn 0
    - When the offer is accepted, the division made by the Proposer is implemented and payoffs computed accordingly
  - Voice
    - When the offer is rejected, the division made by the Proposer is implemented and payoffs computed accordingly
    - When the offer is accepted, the division made by the Proposer is implemented and payoffs computed accordingly

Data frame

Session	Treatment	Subject	Role	Group	Offer	Reject
1	Voice+Punishment	1	Proposer	1	2	NA
1	Voice+Punishment	2	Proposer	2	3	NA
1	Voice+Punishment	3	Proposer	3	5	NA
1	Voice+Punishment	4	Proposer	4	2	NA
1	Voice+Punishment	5	Proposer	5	2	NA
1	Voice+Punishment	6	Proposer	6	3	NA
1	Voice+Punishment	7	Proposer	7	0	NA
1	Voice+Punishment	8	Proposer	8	1	NA
1	Voice+Punishment	9	Proposer	9	4	NA
1	Voice+Punishment	10	Proposer	10	2	NA
1	Voice+Punishment	11	Proposer	11	2	NA
1	Voice+Punishment	12	Proposer	12	0	NA
1	Voice+Punishment	13	Proposer	13	2	NA
1	Voice+Punishment	14	Proposer	14	0	NA
1	Voice+Punishment	15	Proposer	15	2	NA
1	Voice+Punishment	16	Proposer	16	2	NA
1	Voice+Punishment	17	Proposer	17	2	NA
1	Voice+Punishment	18	Proposer	18	1	NA
1	Voice+Punishment	19	Proposer	19	0	NA
1	Voice+Punishment	20	Proposer	20	1	NA
1	Voice+Punishment	21	Responder	1	NA	1
1	Voice+Punishment	22	Responder	2	NA	0
1	Voice+Punishment	23	Responder	3	NA	0
1	Voice+Punishment	24	Responder	4	NA	0
1	Voice+Punishment	25	Responder	5	NA	0
1	Voice+Punishment	26	Responder	6	NA	0
1	Voice+Punishment	27	Responder	7	NA	1
1	Voice+Punishment	28	Responder	8	NA	1
1	Voice+Punishment	29	Responder	9	NA	0
1	Voice+Punishment	30	Responder	10	NA	0
1	Voice+Punishment	31	Responder	11	NA	1
1	Voice+Punishment	32	Responder	12	NA	0
1	Voice+Punishment	33	Responder	13	NA	0
1	Voice+Punishment	34	Responder	14	NA	1
1	Voice+Punishment	35	Responder	15	NA	1
1	Voice+Punishment	36	Responder	16	NA	0
1	Voice+Punishment	37	Responder	17	NA	1
1	Voice+Punishment	38	Responder	18	NA	0
1	Voice+Punishment	39	Responder	19	NA	1
1	Voice+Punishment	40	Responder	20	NA	1
2	Voice+Punishment	1	Proposer	1	1	NA
2	Voice+Punishment	2	Proposer	2	2	NA
2	Voice+Punishment	3	Proposer	3	1	NA
2	Voice+Punishment	4	Proposer	4	0	NA
2	Voice+Punishment	5	Proposer	5	5	NA
2	Voice+Punishment	6	Proposer	6	0	NA
2	Voice+Punishment	7	Proposer	7	3	NA
2	Voice+Punishment	8	Proposer	8	2	NA
2	Voice+Punishment	9	Proposer	9	3	NA
2	Voice+Punishment	10	Proposer	10	3	NA
2	Voice+Punishment	11	Proposer	11	2	NA
2	Voice+Punishment	12	Proposer	12	0	NA
2	Voice+Punishment	13	Proposer	13	5	NA
2	Voice+Punishment	14	Proposer	14	4	NA
2	Voice+Punishment	15	Proposer	15	5	NA
2	Voice+Punishment	16	Proposer	16	2	NA
2	Voice+Punishment	17	Proposer	17	2	NA
2	Voice+Punishment	18	Proposer	18	2	NA
2	Voice+Punishment	19	Proposer	19	2	NA
2	Voice+Punishment	20	Proposer	20	2	NA
2	Voice+Punishment	21	Responder	1	NA	0
2	Voice+Punishment	22	Responder	2	NA	0
2	Voice+Punishment	23	Responder	3	NA	1
2	Voice+Punishment	24	Responder	4	NA	1
2	Voice+Punishment	25	Responder	5	NA	0
2	Voice+Punishment	26	Responder	6	NA	1
2	Voice+Punishment	27	Responder	7	NA	0
2	Voice+Punishment	28	Responder	8	NA	1
2	Voice+Punishment	29	Responder	9	NA	0
2	Voice+Punishment	30	Responder	10	NA	0
2	Voice+Punishment	31	Responder	11	NA	1
2	Voice+Punishment	32	Responder	12	NA	0
2	Voice+Punishment	33	Responder	13	NA	0
2	Voice+Punishment	34	Responder	14	NA	0
2	Voice+Punishment	35	Responder	15	NA	0
2	Voice+Punishment	36	Responder	16	NA	1
2	Voice+Punishment	37	Responder	17	NA	1
2	Voice+Punishment	38	Responder	18	NA	1
2	Voice+Punishment	39	Responder	19	NA	1
2	Voice+Punishment	40	Responder	20	NA	1
3	Voice	1	Proposer	1	3	NA
3	Voice	2	Proposer	2	2	NA
3	Voice	3	Proposer	3	0	NA
3	Voice	4	Proposer	4	0	NA
3	Voice	5	Proposer	5	0	NA
3	Voice	6	Proposer	6	0	NA
3	Voice	7	Proposer	7	3	NA
3	Voice	8	Proposer	8	1	NA
3	Voice	9	Proposer	9	0	NA
3	Voice	10	Proposer	10	1	NA
3	Voice	11	Proposer	11	1	NA
3	Voice	12	Proposer	12	0	NA
3	Voice	13	Proposer	13	0	NA
3	Voice	14	Proposer	14	0	NA
3	Voice	15	Proposer	15	0	NA
3	Voice	16	Proposer	16	2	NA
3	Voice	17	Proposer	17	0	NA
3	Voice	18	Proposer	18	1	NA
3	Voice	19	Proposer	19	0	NA
3	Voice	20	Proposer	20	0	NA
3	Voice	21	Responder	1	NA	1
3	Voice	22	Responder	2	NA	1
3	Voice	23	Responder	3	NA	1
3	Voice	24	Responder	4	NA	1
3	Voice	25	Responder	5	NA	1
3	Voice	26	Responder	6	NA	1
3	Voice	27	Responder	7	NA	1
3	Voice	28	Responder	8	NA	1
3	Voice	29	Responder	9	NA	1
3	Voice	30	Responder	10	NA	1
3	Voice	31	Responder	11	NA	1
3	Voice	32	Responder	12	NA	1
3	Voice	33	Responder	13	NA	1
3	Voice	34	Responder	14	NA	1
3	Voice	35	Responder	15	NA	1
3	Voice	36	Responder	16	NA	1
3	Voice	37	Responder	17	NA	1
3	Voice	38	Responder	18	NA	1
3	Voice	39	Responder	19	NA	1
3	Voice	40	Responder	20	NA	1
4	Voice	1	Proposer	1	0	NA
4	Voice	2	Proposer	2	3	NA
4	Voice	3	Proposer	3	2	NA
4	Voice	4	Proposer	4	2	NA
4	Voice	5	Proposer	5	0	NA
4	Voice	6	Proposer	6	0	NA
4	Voice	7	Proposer	7	1	NA
4	Voice	8	Proposer	8	1	NA
4	Voice	9	Proposer	9	0	NA
4	Voice	10	Proposer	10	0	NA
4	Voice	11	Proposer	11	0	NA
4	Voice	12	Proposer	12	0	NA
4	Voice	13	Proposer	13	0	NA
4	Voice	14	Proposer	14	2	NA
4	Voice	15	Proposer	15	0	NA
4	Voice	16	Proposer	16	0	NA
4	Voice	17	Proposer	17	0	NA
4	Voice	18	Proposer	18	0	NA
4	Voice	19	Proposer	19	0	NA
4	Voice	20	Proposer	20	0	NA
4	Voice	21	Responder	1	NA	1
4	Voice	22	Responder	2	NA	1
4	Voice	23	Responder	3	NA	0
4	Voice	24	Responder	4	NA	1
4	Voice	25	Responder	5	NA	0
4	Voice	26	Responder	6	NA	1
4	Voice	27	Responder	7	NA	1
4	Voice	28	Responder	8	NA	1
4	Voice	29	Responder	9	NA	1
4	Voice	30	Responder	10	NA	1
4	Voice	31	Responder	11	NA	1
4	Voice	32	Responder	12	NA	1
4	Voice	33	Responder	13	NA	1
4	Voice	34	Responder	14	NA	0
4	Voice	35	Responder	15	NA	1
4	Voice	36	Responder	16	NA	1
4	Voice	37	Responder	17	NA	1
4	Voice	38	Responder	18	NA	1
4	Voice	39	Responder	19	NA	1
4	Voice	40	Responder	20	NA	1

Mutate

In each Session, Subject and Group are unique identifiers
- Not across sessions!
We need a unique identifier for Subjects and Groups across sessions

d.UG <- d.UG %>%
mutate(Subject=Subject+100*Session, Group=Group+100*Session)
d.UG

  # A tibble: 160 x 7
     Session Treatment        Subject Role     Group Offer Reject
       <dbl> <fct>              <dbl> <fct>    <dbl> <int>  <int>
   1       1 Voice+Punishment     101 Proposer   101     2     NA
   2       1 Voice+Punishment     102 Proposer   102     3     NA
   3       1 Voice+Punishment     103 Proposer   103     5     NA
   4       1 Voice+Punishment     104 Proposer   104     2     NA
   5       1 Voice+Punishment     105 Proposer   105     2     NA
   6       1 Voice+Punishment     106 Proposer   106     3     NA
   7       1 Voice+Punishment     107 Proposer   107     0     NA
   8       1 Voice+Punishment     108 Proposer   108     1     NA
   9       1 Voice+Punishment     109 Proposer   109     4     NA
  10       1 Voice+Punishment     110 Proposer   110     2     NA
  # … with 150 more rows

Mutate (ii)

Treatment and Role are of type character
- More appropriate to have them of type factor

str(d.UG)

  Classes 'tbl_df', 'tbl' and 'data.frame': 160 obs. of  7 variables:
   $ Session  : num  1 1 1 1 1 1 1 1 1 1 ...
   $ Treatment: Factor w/ 2 levels "Voice","Voice+Punishment": 2 2 2 2 2 2 2 2 2 2 ...
   $ Subject  : num  101 102 103 104 105 106 107 108 109 110 ...
   $ Role     : Factor w/ 2 levels "Proposer","Responder": 1 1 1 1 1 1 1 1 1 1 ...
   $ Group    : num  101 102 103 104 105 106 107 108 109 110 ...
   $ Offer    : int  2 3 5 2 2 3 0 1 4 2 ...
   $ Reject   : int  NA NA NA NA NA NA NA NA NA NA ...

d.UG <- d.UG %>%
mutate(Treatment=as.factor(Treatment), Role=as.factor(Role))
str(d.UG)

  Classes 'tbl_df', 'tbl' and 'data.frame': 160 obs. of  7 variables:
   $ Session  : num  1 1 1 1 1 1 1 1 1 1 ...
   $ Treatment: Factor w/ 2 levels "Voice","Voice+Punishment": 2 2 2 2 2 2 2 2 2 2 ...
   $ Subject  : num  101 102 103 104 105 106 107 108 109 110 ...
   $ Role     : Factor w/ 2 levels "Proposer","Responder": 1 1 1 1 1 1 1 1 1 1 ...
   $ Group    : num  101 102 103 104 105 106 107 108 109 110 ...
   $ Offer    : int  2 3 5 2 2 3 0 1 4 2 ...
   $ Reject   : int  NA NA NA NA NA NA NA NA NA NA ...

Select

We are not interested in Session anymore
- We can infer it from the unique identifier of Subject
  - 10x = Session 1, 20x = Session 2, …
We also want to reorder columns in a meaningful way

d.UG <- d.UG %>%
select(Group, Subject, Role, Treatment, Offer, Reject)
d.UG

  # A tibble: 160 x 6
     Group Subject Role     Treatment        Offer Reject
     <dbl>   <dbl> <fct>    <fct>            <int>  <int>
   1   101     101 Proposer Voice+Punishment     2     NA
   2   102     102 Proposer Voice+Punishment     3     NA
   3   103     103 Proposer Voice+Punishment     5     NA
   4   104     104 Proposer Voice+Punishment     2     NA
   5   105     105 Proposer Voice+Punishment     2     NA
   6   106     106 Proposer Voice+Punishment     3     NA
   7   107     107 Proposer Voice+Punishment     0     NA
   8   108     108 Proposer Voice+Punishment     1     NA
   9   109     109 Proposer Voice+Punishment     4     NA
  10   110     110 Proposer Voice+Punishment     2     NA
  # … with 150 more rows

Rename

In our experiment, we want to clearly indicate that Subject and Group are unique identifiers

d.UG <- d.UG %>%
rename(Group.ID=Group,Subject.ID=Subject)
d.UG

  # A tibble: 160 x 6
     Group.ID Subject.ID Role     Treatment        Offer Reject
        <dbl>      <dbl> <fct>    <fct>            <int>  <int>
   1      101        101 Proposer Voice+Punishment     2     NA
   2      102        102 Proposer Voice+Punishment     3     NA
   3      103        103 Proposer Voice+Punishment     5     NA
   4      104        104 Proposer Voice+Punishment     2     NA
   5      105        105 Proposer Voice+Punishment     2     NA
   6      106        106 Proposer Voice+Punishment     3     NA
   7      107        107 Proposer Voice+Punishment     0     NA
   8      108        108 Proposer Voice+Punishment     1     NA
   9      109        109 Proposer Voice+Punishment     4     NA
  10      110        110 Proposer Voice+Punishment     2     NA
  # … with 150 more rows

Filter

Select only the data that refer to the Proposer and get rid of column Reject which is empty for the Proposer

d.UG %>%
filter(Role=="Proposer") %>%
select(-Reject)

  # A tibble: 80 x 5
     Group.ID Subject.ID Role     Treatment        Offer
        <dbl>      <dbl> <fct>    <fct>            <int>
   1      101        101 Proposer Voice+Punishment     2
   2      102        102 Proposer Voice+Punishment     3
   3      103        103 Proposer Voice+Punishment     5
   4      104        104 Proposer Voice+Punishment     2
   5      105        105 Proposer Voice+Punishment     2
   6      106        106 Proposer Voice+Punishment     3
   7      107        107 Proposer Voice+Punishment     0
   8      108        108 Proposer Voice+Punishment     1
   9      109        109 Proposer Voice+Punishment     4
  10      110        110 Proposer Voice+Punishment     2
  # … with 70 more rows

Filter (ii)

Select only the data that refer to the Proposer and get rid of column Offer which is empty for the Responder

d.UG %>% filter(Role=="Responder") %>%
select(-Offer)

  # A tibble: 80 x 5
     Group.ID Subject.ID Role      Treatment        Reject
        <dbl>      <dbl> <fct>     <fct>             <int>
   1      101        121 Responder Voice+Punishment      1
   2      102        122 Responder Voice+Punishment      0
   3      103        123 Responder Voice+Punishment      0
   4      104        124 Responder Voice+Punishment      0
   5      105        125 Responder Voice+Punishment      0
   6      106        126 Responder Voice+Punishment      0
   7      107        127 Responder Voice+Punishment      1
   8      108        128 Responder Voice+Punishment      1
   9      109        129 Responder Voice+Punishment      0
  10      110        130 Responder Voice+Punishment      0
  # … with 70 more rows

Join

We want to create a database in which each row collects the observations within a Group
- Both the offer and the reject choice

d.UG.2 <- full_join(
  d.UG %>% filter(Role=="Proposer") %>% select(-Reject),
  d.UG %>% filter(Role=="Responder") %>% select(-Offer),
  by="Group.ID"
) %>%
select(Group.ID,Treatment.x,Offer,Reject) %>%
rename(Treatment=Treatment.x)

d.UG.2

  # A tibble: 80 x 4
     Group.ID Treatment        Offer Reject
        <dbl> <fct>            <int>  <int>
   1      101 Voice+Punishment     2      1
   2      102 Voice+Punishment     3      0
   3      103 Voice+Punishment     5      0
   4      104 Voice+Punishment     2      0
   5      105 Voice+Punishment     2      0
   6      106 Voice+Punishment     3      0
   7      107 Voice+Punishment     0      1
   8      108 Voice+Punishment     1      1
   9      109 Voice+Punishment     4      0
  10      110 Voice+Punishment     2      0
  # … with 70 more rows

Spread

The spreading results is untidy data
- However, it can be useful as intermediate step
- e.g., to compute payoffs

d.UG.2 %>%
gather("Key","Value",c(Offer,Reject))

  # A tibble: 160 x 4
     Group.ID Treatment        Key   Value
        <dbl> <fct>            <chr> <int>
   1      101 Voice+Punishment Offer     2
   2      102 Voice+Punishment Offer     3
   3      103 Voice+Punishment Offer     5
   4      104 Voice+Punishment Offer     2
   5      105 Voice+Punishment Offer     2
   6      106 Voice+Punishment Offer     3
   7      107 Voice+Punishment Offer     0
   8      108 Voice+Punishment Offer     1
   9      109 Voice+Punishment Offer     4
  10      110 Voice+Punishment Offer     2
  # … with 150 more rows

Thank you

🦖

To contact me just write me an email

matteo.ploner@unitn.it

or write me on the forum of the course

Appendix

Assignments

Assignment 1

Create these two datasets

  # A tibble: 5 x 4
    Outcome     a     b     c
      <int> <int> <int> <dbl>
  1       1     1     5     1
  2       2     2     4     3
  3       3     3     3     5
  4       4     4     2     7
  5       5     5     1     9

  # A tibble: 5 x 4
    Outcome     d     e     f
      <int> <int> <int> <dbl>
  1       1     5     1     1
  2       2     4     2     5
  3       3     3     3     9
  4       4     2     4    13
  5       5     1     5    17

Join them in a unique dataset using Outcome
Create a new column sum_ad with the sum of a and f
Create a new dataset looking like this

  # A tibble: 3 x 4
    Outcome     a     f sum_ad
      <int> <int> <dbl>  <dbl>
  1       1     1     1      2
  2       3     3     9     12
  3       5     5    17     22

Assignment 2

Take the data from the UG (included here) and compute the payoffs in the game

Download UG.csv

References

Wickham, Hadley. 2014. “Tidy Data.” The Journal of Statistical Software 59 (10). http://www.jstatsoft.org/v59/i10/.

Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. " O’Reilly Media, Inc.".

A condensed online version available here.↩

R: Data Wrangling

Data Wrangling and Data Representation in R markdown"

Matteo Ploner

Introduction

Data Wrangling

Tidy data

Example of tidy data

Example of untidy data

Data wrangling

Coding Style

Mutate

Select

Rename

Arrange

Filter

Join: full

Join: full (ii)

Join: partial

Gathering

Spread

An Application to Experimental Data

The Experiment

Data frame

Mutate

Mutate (ii)

Select

Rename

Filter

Filter (ii)

Join

Spread

Thank you

Appendix

Assignments

Assignment 1

Assignment 2

References