Introduction
References
- This course illustrates techniques for data manipulation, visualization and reporting using R and R Markdown
- The course refers to the following sources
- Chang, Winston. 2012. R Graphics Cookbook: Practical Recipes for Visualizing Data. O’Reilly Media, Inc.
- Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media, Inc.
- Venables, W. N., D. M. Smith, and R Core Team. 2019. An Introduction to R.
- Many useful resources can be found online
- e.g., Stack Overflow
R
Description
- From https://cran.r-project.org/
- R is an integrated suite of software facilities for data manipulation, calculation and graphical display.
- R can be regarded as an implementation of the S language, which was developed at Bell Laboratories by Rick Becker, John Chambers and Allan Wilks, and which also forms the basis of the S-PLUS systems.
R version 3.6.1 (2019-07-05) -- "Action of the Toes"
Copyright (C) 2019 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin15.6.0 (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type ‘license()’ or ‘licence()’ for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type ‘contributors()’ for more information and
‘citation()’ on how to cite R or R packages in publications.
Type ‘demo()’ for some demos, ‘help()’ for on-line help, or
‘help.start()’ for an HTML browser interface to help.
Type ‘q()’ to quit R.
RStudio
Description
- From https://www.rstudio.com/products/RStudio/
- RStudio is an integrated development environment (IDE) for R
- RStudio is available in open source and commercial editions and runs on the desktop (Windows, Mac, and Linux) or in a browser connected to RStudio Server or RStudio Server Pro (Debian/Ubuntu, RedHat/CentOS, and SUSE Linux).
The Interface
Basic data manipulations
Vectors
- R operates on named data structures.
The simplest such structure is the numeric vector, which is a single entity consisting of an ordered collection of numbers.
- Generate a vector of decreasing values from 10 to 0 in steps of 2
- Use the function c() to “concatenate” values
[1] 10 8 6 4 2 0
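The slide keeps only the printed result, so here is a minimal sketch of how such a vector could be built. The name d is taken from the assignments at the end of the course; the seq() variant is an added alternative, not necessarily what the original slide used.

```r
# Build the vector by concatenating the values explicitly
d <- c(10, 8, 6, 4, 2, 0)
print(d)

# Equivalent result with seq(): from 10 down to 0 in steps of -2
d_seq <- seq(10, 0, by = -2)
print(d_seq)
```

Both calls print the same six values; seq() is usually more convenient when the sequence is long.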
Subsetting
- Subsets of the elements of a vector may be selected by appending to the name of the vector an index vector in square brackets.
[1] 8
[1] 8 6 4
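The two results above are consistent with selecting a single position and a range of positions. A minimal sketch, assuming the vector d defined earlier and assuming positions 2 and 2 to 4 were the ones selected:

```r
d <- c(10, 8, 6, 4, 2, 0)

# A single index returns one element (indexing starts at 1)
d[2]      # 8

# An index vector returns several elements at once
d[2:4]    # 8 6 4
```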
Vector arithmetic
- Arithmetic with vectors
[1] 100 64 36 16 4 0
[1] 20 16 12 8 4 0
[1] 5
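Arithmetic operators work element by element on a vector. The original expressions are not preserved; the sketch below uses operations that would reproduce the three results shown (squaring, doubling and the mean), which is an assumption.

```r
d <- c(10, 8, 6, 4, 2, 0)

# Element-wise power: each value is squared
d^2       # 100 64 36 16 4 0

# Element-wise multiplication by a scalar
d * 2     # 20 16 12 8 4 0

# A summary function collapses the vector to a single number
mean(d)   # 5
```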
Vector arithmetic (ii)
- Arithmetic with vectors
[1] 3.741657
[1] TRUE TRUE TRUE TRUE FALSE FALSE
[1] NaN
[1] 0 0 0 0 0 0
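The expressions behind these four results are not preserved. One set of calls that would reproduce them, offered as an assumption, is the standard deviation, a logical comparison, an undefined division and a vector subtraction:

```r
d <- c(10, 8, 6, 4, 2, 0)

# Sample standard deviation
sd(d)     # 3.741657

# Comparisons are element-wise and return a logical vector
d > 2     # TRUE TRUE TRUE TRUE FALSE FALSE

# Undefined results are reported as NaN ("not a number")
0 / 0     # NaN

# Subtracting a vector from itself gives a vector of zeros
d - d     # 0 0 0 0 0 0
```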
Vector types
- The numeric vector is just one of several data types available in R
- Integer number
[1] 10 8 6 4 2 0
- Double precision number
[1] 10 8 6 4 2 0
- Complex
[1] 10+0i 8+0i 6+0i 4+0i 2+0i 0+0i
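A minimal sketch of the conversions listed above, using the base R as.*() functions. The variable names d_int, d_dbl and d_cpl are hypothetical and only serve the illustration.

```r
d <- c(10, 8, 6, 4, 2, 0)

# Integer number
d_int <- as.integer(d)
print(d_int)

# Double precision number (the default for numeric literals)
d_dbl <- as.double(d)
print(d_dbl)

# Complex number: the imaginary part is set to zero
d_cpl <- as.complex(d)
print(d_cpl)

# typeof() reports how each vector is stored
typeof(d_int)   # "integer"
typeof(d_dbl)   # "double"
typeof(d_cpl)   # "complex"
```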
Vector types (ii)
- Factor
[1] 10 8 6 4 2 0
Levels: 0 2 4 6 8 10
- Ordered factor (ordinal values)
[1] 10 8 6 4 2 0
Levels: 0 < 2 < 4 < 6 < 8 < 10
- Date
[1] "2020-05-07" "2020-05-05" "2020-05-03" "2020-05-01" "2020-04-29"
[6] "2020-04-27"
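These outputs can be reproduced as sketched below. The start date "2020-04-27" is an assumption inferred from the printed dates; the original slide may have used the current date instead.

```r
d <- c(10, 8, 6, 4, 2, 0)

# Factor: values become categories, levels are the sorted unique values
d_fac <- factor(d)
print(d_fac)

# Ordered factor: the levels carry a ranking (0 < 2 < ... < 10)
d_ord <- factor(d, ordered = TRUE)
print(d_ord)

# Date: adding a number to a Date shifts it by that many days
d_date <- as.Date("2020-04-27") + d
print(d_date)
```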
Vector types (iii)
- List
- A list is a generic vector containing other objects.
[[1]]
[1] 10 8 6 4 2 0
[[2]]
[1] 100 64 36 16 4 0
- Extract an object of the list
[[1]]
[1] 10 8 6 4 2 0
[[1]]
[1] 100 64 36 16 4 0
- Extract objects of the list as vectors
[1] 10 8 6 4 2 0
[1] 100 64 36 16 4 0
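The difference between the two kinds of extraction is the bracket type: single brackets return a shorter list, double brackets return the object itself. A minimal sketch, where the list name l is hypothetical:

```r
d <- c(10, 8, 6, 4, 2, 0)

# A list holding two vectors
l <- list(d, d^2)
print(l)

# Single brackets: the result is still a list (of length 1)
l[1]
l[2]

# Double brackets: the result is the stored vector
l[[1]]
l[[2]]
```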
Arrays and Matrices
Array
An array can be considered as a multiply subscripted collection of data entries, for example numeric.
- A dimension vector is a vector of non-negative integers.
- If its length is k then the array is k-dimensional
- A matrix is a 2-dimensional array
, , 1
[,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12
, , 2
[,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12
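An array with the two slices printed above can be obtained with array() and a dimension vector of length 3. This is a sketch; the name a is hypothetical. The values 1:12 are recycled to fill the second slice.

```r
# 3 rows, 4 columns, 2 slices; data are filled column by column
a <- array(1:12, dim = c(3, 4, 2))
print(a)

# The dimension vector has length 3, so the array is 3-dimensional
dim(a)    # 3 4 2

# A matrix is the 2-dimensional special case
m <- matrix(1:12, nrow = 3)
dim(m)    # 3 4
```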
Data frame
- A convenient format for rectangular data is the data frame
- Two-dimensional like a matrix, but each column can hold a different data type
V1 V2 V3 V4
1 1 4 7 10
2 2 5 8 11
3 3 6 9 12
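The default column names V1 to V4 suggest that the data frame was created from a matrix. A minimal sketch under that assumption; the name df is hypothetical.

```r
# Convert a 3 x 4 matrix into a data frame; columns are named V1..V4
df <- as.data.frame(matrix(1:12, nrow = 3))
print(df)

# str() shows the structure: one integer column per variable
str(df)
```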
Extract elements
- To extract elements from the array use coordinates in the form [row,col]
- Extract the element in row 2 and column 3 (= 8)
[1] 8
- Extract row 2
V1 V2 V3 V4
2 2 5 8 11
- Extract column 4
[1] 10 11 12
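The three extractions above can be written with the [row, col] notation as follows. This is a sketch on a hypothetical data frame df built as in the data frame slide.

```r
df <- as.data.frame(matrix(1:12, nrow = 3))

# Single element: row 2, column 3
df[2, 3]    # 8

# Whole row: leave the column index empty (result is a data frame)
df[2, ]

# Whole column: leave the row index empty (result is a vector)
df[, 4]     # 10 11 12
```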
Extract elements (ii)
- Extract column
V2
1 4
2 5
3 6
- Extract column as vector
[1] 4 5 6
- Conditional (extract only the values that satisfy a logical condition)
[1] 11 12
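A sketch of calls that would give these results, on a hypothetical data frame df. The condition V4 > 10 is an assumption chosen to reproduce the output 11 12.

```r
df <- as.data.frame(matrix(1:12, nrow = 3))

# Single brackets with one index keep the data frame structure
df[2]

# Double brackets (or $) return the column as a plain vector
df[[2]]     # 4 5 6
df$V2       # 4 5 6

# Conditional: a logical vector in the row position filters the rows
df[df$V4 > 10, 4]   # 11 12
```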
Rename cols and rows
Col1 Col2 Col3 Col4
Row1 1 4 7 10
Row2 2 5 8 11
Row3 3 6 9 12
- Use names to retrieve values
[1] 8
- Alternative notation
[1] 7 8 9
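Renaming and name-based retrieval could look like the sketch below; the use of paste0() to generate the names is a convenience chosen here, and df is a hypothetical name.

```r
df <- as.data.frame(matrix(1:12, nrow = 3))

# Assign new column and row names
colnames(df) <- paste0("Col", 1:4)
rownames(df) <- paste0("Row", 1:3)
print(df)

# Use names instead of positions to retrieve a value
df["Row2", "Col3"]   # 8

# Alternative notation: $ extracts a column by name
df$Col3              # 7 8 9
```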
Operations on arrays
- Sum
[1] 78
- Sum by columns
Col1 Col2 Col3 Col4
6 15 24 33
- Sum by rows
Row1 Row2 Row3
22 26 30
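The three sums can be computed with base R functions, as in this sketch on a hypothetical data frame df with the names assigned earlier:

```r
df <- as.data.frame(matrix(1:12, nrow = 3))
colnames(df) <- paste0("Col", 1:4)
rownames(df) <- paste0("Row", 1:3)

# Sum of all values
sum(df)       # 78

# Sum by columns (named vector, one entry per column)
colSums(df)   # 6 15 24 33

# Sum by rows (named vector, one entry per row)
rowSums(df)   # 22 26 30
```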
Operations on arrays (ii)
- Transpose rows and cols
Col1 Col2 Col3 Col4
Row1 1 4 7 10
Row2 2 5 8 11
Row3 3 6 9 12
Row1 Row2 Row3
Col1 1 2 3
Col2 4 5 6
Col3 7 8 9
Col4 10 11 12
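Transposition is done with t(). A minimal sketch on a hypothetical data frame df; note that t() returns a matrix, not a data frame.

```r
df <- as.data.frame(matrix(1:12, nrow = 3))
colnames(df) <- paste0("Col", 1:4)
rownames(df) <- paste0("Row", 1:3)

# Rows become columns and columns become rows
df_t <- t(df)
print(df_t)

# The result is a 4 x 3 matrix
dim(df_t)     # 4 3
```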
Add and remove cols and rows
- Add a column
Col1 Col2 Col3 Col4 Col5
Row1 1 4 7 10 -99
Row2 2 5 8 11 -99
Row3 3 6 9 12 -99
- Remove a column
Col1 Col2 Col3 Col4
Row1 1 4 7 10
Row2 2 5 8 11
Row3 3 6 9 12
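One way to add and then remove a column is through the $ notation, sketched below on a hypothetical data frame df. The original slide may have used a different approach such as cbind().

```r
df <- as.data.frame(matrix(1:12, nrow = 3))
colnames(df) <- paste0("Col", 1:4)
rownames(df) <- paste0("Row", 1:3)

# Add a column: the single value -99 is recycled for every row
df$Col5 <- -99
print(df)
n_cols_after_add <- ncol(df)

# Remove a column by assigning NULL to it
df$Col5 <- NULL
print(df)
```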
Import data
- Import data from an external source
- Common format is .csv
- Values in a row are separated by a comma
Col1.Col2.Col3.Col4
1 1;4;7;10
2 2;5;8;11
3 3;6;9;12
- The file above separates values with semicolons, so the correct separator needs to be specified
Col1 Col2 Col3 Col4
1 1 4 7 10
2 2 5 8 11
3 3 6 9 12
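The two results above can be reproduced with read.csv(). The sketch writes a small semicolon-separated file to a temporary location so that it is self-contained; the file and the variable names are hypothetical, since the original file is not available.

```r
# Create a small semicolon-separated file for the illustration
csv_file <- tempfile(fileext = ".csv")
writeLines(c("Col1;Col2;Col3;Col4",
             "1;4;7;10",
             "2;5;8;11",
             "3;6;9;12"), csv_file)

# Default separator is a comma: everything ends up in one column
wrong <- read.csv(csv_file)
print(wrong)

# Specify the separator to obtain four columns
right <- read.csv(csv_file, sep = ";")
print(right)
```

read.csv2() is a base R variant that uses the semicolon as its default separator.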
Tibbles
- In the following we are going to use a special form of data frame: the tibble (Wickham and Grolemund 2016)
- Available after loading the tidyverse library
- Refined print method that shows only the first 10 rows, and all the columns that fit on screen.
# A tibble: 3 x 4
Col1 Col2 Col3 Col4
<int> <int> <int> <int>
1 1 4 7 10
2 2 5 8 11
3 3 6 9 12
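A tibble like the one printed above can be created directly with tibble(). This is a sketch assuming the tibble package is installed (it is loaded by library(tidyverse) as well); the name tb is hypothetical.

```r
library(tibble)

# Each argument defines one column; 1:3 etc. are integer vectors
tb <- tibble(Col1 = 1:3, Col2 = 4:6, Col3 = 7:9, Col4 = 10:12)

# Printing shows the dimensions and the type of every column
print(tb)
```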
Tibbles (ii)
- To convert a traditional data frame into a tibble
# A tibble: 3 x 4
Col1 Col2 Col3 Col4
<int> <int> <int> <int>
1 1 4 7 10
2 2 5 8 11
3 3 6 9 12
- To control how many rows to print
# A tibble: 3 x 4
Col1 Col2 Col3 Col4
<int> <int> <int> <int>
1 1 4 7 10
# … with 2 more rows
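Conversion and print control can be sketched as follows, assuming the tibble package is installed. The names df and tb are hypothetical.

```r
library(tibble)

df <- as.data.frame(matrix(1:12, nrow = 3))
colnames(df) <- paste0("Col", 1:4)

# Convert a traditional data frame into a tibble
tb <- as_tibble(df)
print(tb)

# The n argument of print() limits the number of rows shown
print(tb, n = 1)
```

The limit only affects what is displayed; the tibble itself still holds all three rows.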
Appendix
Assignments
Assignment 1
- Take vector d as defined above
- Sum up all the values in the vector
- Transform the data type of d from integer to factor
- Which is the value in position 4?
- Sum up all the values in the vector
- Transform the data type of d from factor to integer
- Which is the value in position 4?
- Transform the data type of d from integer to character
- Which is the value in position 4?
- Transform the data type of d from character to integer
- Which is the value in position 4?
- Take vector d as defined above and extend it to go to -10 in steps of -2
Assignment 2
- Create a tibble that replicates the following dataset
Column_1 | Column_2 | Column_3 | Column_4 |
---|---|---|---|
1 | 2 | 7 | 8 |
3 | 4 | 9 | 10 |
5 | 6 | 11 | 12 |
- Extract column “Column_2” and transform the content from integers to factors
- Write the factors back into the original “Column_2”
- Set the max print to 2 rows
References
Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media, Inc.