R: Introduction

Data Wrangling and Data Representation in R markdown

Matteo Ploner

Introduction

References

  • This course illustrates techniques for data manipulation, visualization and reporting using R and R Markdown
    • The course draws on the following sources
      • Chang, Winston. 2012. R Graphics Cookbook: Practical Recipes for Visualizing Data. O’Reilly Media, Inc.
      • Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media, Inc.
      • Venables, W. N., D. M. Smith, and R Core Team. 2019. An Introduction to R.
  • Many useful resources can be found online

R

Description

  • From https://cran.r-project.org/
    • R is an integrated suite of software facilities for data manipulation, calculation and graphical display.
    • R can be regarded as an implementation of the S language, which was developed at Bell Laboratories by Rick Becker, John Chambers and Allan Wilks, and also forms the basis of the S-PLUS systems.

R version 3.6.1 (2019-07-05) – “Action of the Toes”
Copyright (C) 2019 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin15.6.0 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to redistribute it under certain conditions. Type ‘license()’ or ‘licence()’ for distribution details.

Natural language support but running in an English locale

R is a collaborative project with many contributors. Type ‘contributors()’ for more information and ‘citation()’ on how to cite R or R packages in publications.

Type ‘demo()’ for some demos, ‘help()’ for on-line help, or ‘help.start()’ for an HTML browser interface to help. Type ‘q()’ to quit R.

RStudio

Description

  • From https://www.rstudio.com/products/RStudio/
    • RStudio is an integrated development environment (IDE) for R
    • RStudio is available in open source and commercial editions and runs on the desktop (Windows, Mac, and Linux) or in a browser connected to RStudio Server or RStudio Server Pro (Debian/Ubuntu, RedHat/CentOS, and SUSE Linux).

The Interface

Basic data manipulations

Vectors

  • R operates on named data structures.
  • The simplest such structure is the numeric vector, which is a single entity consisting of an ordered collection of numbers.

  • Generate a vector of decreasing values from 10 to 0 in steps of 2
    • Use the function c() to “concatenate” values
  [1] 10  8  6  4  2  0
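
The code behind this output is not shown on the slide. A minimal sketch that reproduces it with c() follows; the seq() call is an alternative added here for comparison, not necessarily the one used in the course.

```r
# Spell out each value and concatenate them with c()
c(10, 8, 6, 4, 2, 0)

# Alternative (an assumption, not from the slide): generate the same sequence
seq(10, 0, by = -2)
```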

Assignment

  • Vectors can be assigned to a name using the <- operator
  [1] 10  8  6  4  2  0
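
A minimal sketch of the assignment, using the name d that the assignments in the appendix refer to:

```r
d <- c(10, 8, 6, 4, 2, 0)  # store the vector under the name d
d                           # typing the name prints the stored values
```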

Subsetting

  • Subsets of the elements of a vector may be selected by appending to the name of the vector an index vector in square brackets.
  [1] 8
  [1] 8 6 4
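
The two outputs are consistent with selecting position 2 and then positions 2 to 4 of d. A sketch, assuming d holds the vector defined earlier:

```r
d <- c(10, 8, 6, 4, 2, 0)

d[2]    # a single position: the second element
d[2:4]  # an index vector: elements 2, 3 and 4
```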

Vector arithmetic

  • Arithmetic with vectors
  [1] 100  64  36  16   4   0
  [1] 20 16 12  8  4  0
  [1] 5
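
The calls that produced these three results are not shown. The sketch below is one plausible reconstruction; in particular, treating the value 5 as the mean of d is an assumption.

```r
d <- c(10, 8, 6, 4, 2, 0)

d^2      # element-wise square
d * 2    # element-wise multiplication by a scalar
mean(d)  # a single summary value (assumed to be the source of the 5)
```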

Vector arithmetic (ii)

  • Arithmetic with vectors
  [1] 3.741657
  [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE
  [1] NaN
  [1] 0 0 0 0 0 0
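
Again the original calls are missing, so the following is a guess that matches the printed values: the standard deviation of d, a logical comparison, an undefined division, and multiplication by zero. Other expressions could give the same output.

```r
d <- c(10, 8, 6, 4, 2, 0)

sd(d)   # sample standard deviation, sqrt(14)
d > 2   # element-wise comparison returns a logical vector
0 / 0   # undefined result is reported as NaN ("not a number")
d * 0   # every element multiplied by zero
```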

Vector types

  • The numeric vector is just one of several data types available in R
  • Integer number
  [1] 10  8  6  4  2  0
  • Double precision number
  [1] 10  8  6  4  2  0
  • Complex
  [1] 10+0i  8+0i  6+0i  4+0i  2+0i  0+0i
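
A sketch of how the three representations can be obtained with base R conversion functions, assuming d is the vector defined earlier:

```r
d <- c(10, 8, 6, 4, 2, 0)

as.integer(d)  # integer storage
as.double(d)   # double precision storage (the default for numeric literals)
as.complex(d)  # complex numbers with a zero imaginary part
```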

Vector types (ii)

  • Factor
  [1] 10 8  6  4  2  0 
  Levels: 0 2 4 6 8 10
  • Ordinal values
  [1] 10 8  6  4  2  0 
  Levels: 0 < 2 < 4 < 6 < 8 < 10
  • Date
  [1] "2020-05-07" "2020-05-05" "2020-05-03" "2020-05-01" "2020-04-29"
  [6] "2020-04-27"
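
One way to reproduce these outputs is sketched below. The start date 2020-04-27 is an assumption inferred from the printed dates (each date equals that day plus the corresponding value of d); the slide may have used a different call.

```r
d <- c(10, 8, 6, 4, 2, 0)

f <- factor(d)                  # unordered categories, levels sorted numerically
o <- factor(d, ordered = TRUE)  # ordinal values: levels carry an order
dates <- as.Date("2020-04-27") + d  # assumed origin plus d days

f
o
dates
```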

Vector types (iii)

  • List
    • A list is a generic vector containing other objects.
  [[1]]
  [1] 10  8  6  4  2  0
  
  [[2]]
  [1] 100  64  36  16   4   0
  • Extract an object of the list
  [[1]]
  [1] 10  8  6  4  2  0
  [[1]]
  [1] 100  64  36  16   4   0
  • Extract objects of the list as vectors
  [1] 10  8  6  4  2  0
  [1] 100  64  36  16   4   0
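
The difference between the two kinds of extraction can be sketched as follows. The list name lst is hypothetical; the contents (d and its squares) are taken from the printed output.

```r
d <- c(10, 8, 6, 4, 2, 0)
lst <- list(d, d^2)  # a list holding two vectors

lst[1]    # single brackets: a list of length one
lst[2]
lst[[1]]  # double brackets: the vector itself
lst[[2]]
```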

Arrays and Matrices

Array

  • An array can be considered as a multiply subscripted collection of data entries, for example numeric.

  • A dimension vector is a vector of non-negative integers.
    • If its length is k then the array is k-dimensional
    • A matrix is a 2-dimensional array
  , , 1
  
       [,1] [,2] [,3] [,4]
  [1,]    1    4    7   10
  [2,]    2    5    8   11
  [3,]    3    6    9   12
  
  , , 2
  
       [,1] [,2] [,3] [,4]
  [1,]    1    4    7   10
  [2,]    2    5    8   11
  [3,]    3    6    9   12
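
A sketch that yields the array printed above, assuming it was built from the values 1 to 12 with dimension vector (3, 4, 2). The name a is hypothetical.

```r
# 3 rows, 4 columns, 2 slices; the 12 values are recycled to fill both slices
a <- array(1:12, dim = c(3, 4, 2))
a
dim(a)  # the dimension vector has length 3, so the array is 3-dimensional
```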

Data frame

  • A convenient format for rectangular data is the data frame
    • Two-dimensional, like a matrix, with named columns
    V1 V2 V3 V4
  1  1  4  7 10
  2  2  5  8 11
  3  3  6  9 12
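
The default column names V1 to V4 suggest the data frame was converted from a matrix. A sketch under that assumption (the name df is hypothetical):

```r
m <- matrix(1:12, nrow = 3)  # 3 x 4 matrix filled column by column
df <- as.data.frame(m)       # conversion assigns the names V1, V2, V3, V4
df
```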

Extract elements

  • To extract elements from a data frame, use coordinates in the form [row, col]
  • Extract the element in row 2 and column 3 (= 8)
  [1] 8
  • Extract row 2
    V1 V2 V3 V4
  2  2  5  8 11
  • Extract column 4
  [1] 10 11 12
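
The three extractions can be sketched as follows, assuming a data frame named df (a hypothetical name) with the contents shown earlier:

```r
df <- as.data.frame(matrix(1:12, nrow = 3))

df[2, 3]  # one element: row 2, column 3
df[2, ]   # leaving the column index empty returns the whole row
df[, 4]   # leaving the row index empty returns the whole column as a vector
```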

Extract elements (ii)

  • Extract column
    V2
  1  4
  2  5
  3  6
  • Extract column as vector
  [1] 4 5 6
  • Conditional
  [1] 11 12
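
A possible reconstruction of these three steps. The exact condition used on the slide is unknown; "values of V4 greater than 10" is an assumption that matches the output 11 12.

```r
df <- as.data.frame(matrix(1:12, nrow = 3))

df["V2"]           # single brackets keep the data frame structure
df[["V2"]]         # double brackets (or df$V2) return a plain vector
df$V4[df$V4 > 10]  # conditional: keep only the values of V4 above 10
```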

Rename cols and rows

       Col1 Col2 Col3 Col4
  Row1    1    4    7   10
  Row2    2    5    8   11
  Row3    3    6    9   12
  • Use names to retrieve values
  [1] 8
  • Alternative notation
  [1] 7 8 9
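
A sketch of the renaming and the name-based lookups, assuming the data frame is called df (hypothetical) and that base R colnames() and rownames() were used:

```r
df <- as.data.frame(matrix(1:12, nrow = 3))

colnames(df) <- paste0("Col", 1:4)  # Col1 ... Col4
rownames(df) <- paste0("Row", 1:3)  # Row1 ... Row3
df

df["Row2", "Col3"]  # use names instead of positions
df$Col3             # alternative notation: a whole column by name
```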

Operations on arrays

  • Sum
  [1] 78
  • Sum by columns
  Col1 Col2 Col3 Col4 
     6   15   24   33
  • Sum by rows
  Row1 Row2 Row3 
    22   26   30
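
These totals can be obtained with base R functions, sketched here on a hypothetical data frame df that carries the row and column names shown in the output:

```r
df <- as.data.frame(matrix(1:12, nrow = 3,
                           dimnames = list(paste0("Row", 1:3),
                                           paste0("Col", 1:4))))

sum(df)      # grand total of all cells
colSums(df)  # one total per column
rowSums(df)  # one total per row
```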

Operations on arrays (ii)

  • Transpose rows and cols
       Col1 Col2 Col3 Col4
  Row1    1    4    7   10
  Row2    2    5    8   11
  Row3    3    6    9   12
       Row1 Row2 Row3
  Col1    1    2    3
  Col2    4    5    6
  Col3    7    8    9
  Col4   10   11   12
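
A sketch of the transposition using t(), again with a hypothetical data frame df. Note that t() applied to a data frame returns a matrix.

```r
df <- as.data.frame(matrix(1:12, nrow = 3,
                           dimnames = list(paste0("Row", 1:3),
                                           paste0("Col", 1:4))))

df     # original: 3 rows, 4 columns
t(df)  # transposed: 4 rows, 3 columns
```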

Add and remove cols and rows

  • Add a column
       Col1 Col2 Col3 Col4 Col5
  Row1    1    4    7   10  -99
  Row2    2    5    8   11  -99
  Row3    3    6    9   12  -99
  • Remove a column
       Col1 Col2 Col3 Col4
  Row1    1    4    7   10
  Row2    2    5    8   11
  Row3    3    6    9   12
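
One common way to add and drop a column is sketched below; the slide may have used a different idiom (for example cbind()). The name df is hypothetical.

```r
df <- as.data.frame(matrix(1:12, nrow = 3,
                           dimnames = list(paste0("Row", 1:3),
                                           paste0("Col", 1:4))))

df$Col5 <- -99   # the single value is recycled to fill every row
df

df$Col5 <- NULL  # assigning NULL removes the column
df
```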

Import data

  • Import data from an external source
    • Common format is .csv
      • Values in a row are separated by a comma
    Col1.Col2.Col3.Col4
  1            1;4;7;10
  2            2;5;8;11
  3            3;6;9;12
  • If the file uses a different separator (here a semicolon), the correct separator must be specified
    Col1 Col2 Col3 Col4
  1    1    4    7   10
  2    2    5    8   11
  3    3    6    9   12
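
The two outputs above can be reproduced with a small semicolon-separated file. The sketch writes such a file to a temporary location (the file itself is an assumption, since the original data file is not available) and reads it twice with read.csv().

```r
# Create a small semicolon-separated file for illustration
path <- tempfile(fileext = ".csv")
writeLines(c("Col1;Col2;Col3;Col4",
             "1;4;7;10",
             "2;5;8;11",
             "3;6;9;12"), path)

wrong <- read.csv(path)             # default separator is a comma: one column
wrong

right <- read.csv(path, sep = ";")  # correct separator: four columns
right
```

read.csv2() is a base R shortcut whose default separator is already the semicolon.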

Tibbles

  • In the following we are going to use a special form of data frame: the tibble (Wickham and Grolemund 2016)
  • Part of the tidyverse library
  • Refined print method that shows only the first 10 rows, and all the columns that fit on screen.
  # A tibble: 3 x 4
     Col1  Col2  Col3  Col4
    <int> <int> <int> <int>
  1     1     4     7    10
  2     2     5     8    11
  3     3     6     9    12
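
A minimal sketch of building this tibble directly, assuming the tibble package (part of the tidyverse) is installed. The name tb is hypothetical.

```r
library(tibble)

tb <- tibble(Col1 = 1:3, Col2 = 4:6, Col3 = 7:9, Col4 = 10:12)
tb  # the print method shows the dimensions and the type of each column
```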

Tibbles (ii)

  • To convert a traditional data frame into a tibble
  # A tibble: 3 x 4
     Col1  Col2  Col3  Col4
    <int> <int> <int> <int>
  1     1     4     7    10
  2     2     5     8    11
  3     3     6     9    12
  • To control how many rows to print
  # A tibble: 3 x 4
     Col1  Col2  Col3  Col4
    <int> <int> <int> <int>
  1     1     4     7    10
  # … with 2 more rows
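
Both steps can be sketched with functions from the tibble package. The names df and tb are hypothetical, and the exact wording of the "more rows" footer depends on the installed package version.

```r
library(tibble)

df <- data.frame(Col1 = 1:3, Col2 = 4:6, Col3 = 7:9, Col4 = 10:12)

tb <- as_tibble(df)  # convert a traditional data frame into a tibble
tb

print(tb, n = 1)     # n sets how many rows are printed
```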

Appendix

Assignments

Assignment 1

  1. Take vector d as defined above
    • Sum up all the values in the vector
    • Transform the data type of d from integer to factor
      • What is the value in position 4?
      • Sum up all the values in the vector
    • Transform the data type of d from factor to integer
      • What is the value in position 4?
    • Transform the data type of d from integer to character
      • What is the value in position 4?
    • Transform the data type of d from character to integer
      • What is the value in position 4?
  2. Take vector d as defined above and extend it down to -10 in steps of -2

Assignment 2

  • Create a tibble that replicates the following dataset
  Column_1  Column_2  Column_3  Column_4
         1         2         7         8
         3         4         9        10
         5         6        11        12
  • Extract column “Column_2” and transform its content from integer to factor
  • Put the resulting factor back into the original “Column_2”
  • Limit printing to a maximum of 2 rows

References

Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media, Inc.