R: Data Description

Data Wrangling and Data Representation in R

Matteo Ploner

Descriptive statistics

Sum of two dice

  • Take the sum of tossing two dice (repeated 50 times)
    • The theoretical distribution
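
A minimal sketch of how such data could be generated, assuming the packages dplyr and tibble. The seed and the object names dice and theory are illustrative choices, so the simulated values will differ from the output printed in these slides. The column names follow the toss table used in the slides.

```r
library(dplyr)
library(tibble)

set.seed(123)  # arbitrary seed, used only to make the sketch reproducible

# Simulate 50 tosses of two fair dice and sum the two outcomes
dice <- tibble(
  Toss  = 1:50,
  Out_1 = sample(1:6, size = 50, replace = TRUE),
  Out_2 = sample(1:6, size = 50, replace = TRUE)
) %>%
  mutate(Sum = Out_1 + Out_2)

# Theoretical distribution: the 36 (Out_1, Out_2) pairs are equally likely
theory <- expand.grid(Out_1 = 1:6, Out_2 = 1:6) %>%
  mutate(Sum = Out_1 + Out_2) %>%
  count(Sum) %>%
  mutate(prob = n / 36)
```

Here theory has one row per possible sum, with prob peaking at a sum of 7 (6 of the 36 pairs).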

Count

  • Count the occurrences of values
    • Possible outcomes are 2, 3, …, 12
  • count()
    • returns a new column n
  # A tibble: 11 x 2
       Sum     n
     <int> <int>
   1     2     1
   2     3     6
   3     4     7
   4     5     5
   5     6     4
   6     7     7
   7     8     7
   8     9     3
   9    10     3
  10    11     5
  11    12     2
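
A table of this shape could be produced as sketched below, assuming dplyr and a simulated data frame named dice (an illustrative name). The counts depend on the random draw and will not match the output above.

```r
library(dplyr)
library(tibble)

set.seed(123)  # arbitrary seed
dice <- tibble(
  Sum = sample(1:6, 50, replace = TRUE) + sample(1:6, 50, replace = TRUE)
)

# count() returns one row per distinct value of Sum,
# with the number of occurrences in a new column n
counts <- dice %>% count(Sum)
counts
```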

Relative frequencies

  • We compute the relative frequencies
    • prop.table()
  # A tibble: 11 x 3
       Sum     n  freq
     <int> <int> <dbl>
   1     2     1  0.02
   2     3     6  0.12
   3     4     7  0.14
   4     5     5  0.1 
   5     6     4  0.08
   6     7     7  0.14
   7     8     7  0.14
   8     9     3  0.06
   9    10     3  0.06
  10    11     5  0.1 
  11    12     2  0.04
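
One way to add the freq column, sketched under the assumption that the counts come from count() on a simulated data frame dice (names and seed are illustrative). prop.table() applied to a vector divides each element by the vector's total.

```r
library(dplyr)
library(tibble)

set.seed(123)  # arbitrary seed
dice <- tibble(
  Sum = sample(1:6, 50, replace = TRUE) + sample(1:6, 50, replace = TRUE)
)

freqs <- dice %>%
  count(Sum) %>%
  mutate(freq = prop.table(n))  # same as n / sum(n)
freqs
```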

Group_by

  • Descriptive statistics can be computed for subgroups
    • Use a variable to condition the computation
    • As an example, the frequency of even and odd values
  # A tibble: 2 x 3
    Nature     n  freq
    <chr>  <int> <dbl>
  1 Even      24  0.48
  2 Odd       26  0.52
  • use ungroup() to remove the grouping structure
    • What happens if we do not remove the grouping?
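
The sketch below illustrates the question, assuming dplyr and a simulated data frame dice. The helper column Nature mirrors the output above, while the object names are illustrative. If the grouping is kept, sum(n) is evaluated within each group, so every relative frequency equals 1.

```r
library(dplyr)
library(tibble)

set.seed(123)  # arbitrary seed
dice <- tibble(
  Sum = sample(1:6, 50, replace = TRUE) + sample(1:6, 50, replace = TRUE)
)

by_nature <- dice %>%
  mutate(Nature = if_else(Sum %% 2 == 0, "Even", "Odd")) %>%
  group_by(Nature) %>%
  count()  # the result is still grouped by Nature

# Grouping kept: n is summed within each group, so freq is 1 in both rows
still_grouped <- by_nature %>% mutate(freq = n / sum(n))

# Grouping removed: n is summed over the whole table
ungrouped <- by_nature %>% ungroup() %>% mutate(freq = n / sum(n))
```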

Distribution measures

  • Mean and standard deviation provide measures of central tendency and dispersion of the data, respectively
    • Median is another, robust, measure of central tendency
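
A small base R illustration of why the median is called robust. The values are made up for this example and are not the simulated tosses: a single extreme entry shifts the mean and the standard deviation markedly, while the median stays where it was.

```r
# Hypothetical sums, invented for illustration
sums <- c(4, 6, 7, 7, 8, 9, 10)

# Same data with the last value replaced by an extreme entry (e.g. a typo)
sums_outlier <- replace(sums, length(sums), 100)

c(mean = mean(sums), sd = sd(sums), median = median(sums))
c(mean = mean(sums_outlier), sd = sd(sums_outlier), median = median(sums_outlier))
```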

Summarise_at

  • The function summarise_at("") allows one to compute several statistics for a variable of interest
    • list(~n(),~mean, ~median, ~sd …)
  # A tibble: 1 x 6
        N  Mean Median    SD   Min   Max
    <int> <dbl>  <dbl> <dbl> <int> <int>
  1    50  6.76      7  2.79     2    12
      N   Mean   Median      SD   Min   Max
     50   6.76        7   2.789     2    12
  • Alternatively, one can use summarise() and specify the column explicitly
  # A tibble: 1 x 6
        n  Mean Median    SD   Min   Max
    <int> <dbl>  <dbl> <dbl> <int> <int>
  1    50  6.76      7  2.79     2    12
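
Both variants might look as follows; this is a sketch assuming dplyr and a simulated data frame dice, so the numbers will differ from the output above. The exact list of functions used on the slide is not shown, and the one below is a guess that matches the printed columns. In recent dplyr versions summarise_at() is superseded by across(), but it still works.

```r
library(dplyr)
library(tibble)

set.seed(123)  # arbitrary seed
dice <- tibble(
  Sum = sample(1:6, 50, replace = TRUE) + sample(1:6, 50, replace = TRUE)
)

# Variant 1: name the variable once, pass a named list of functions
stats_at <- dice %>%
  summarise_at("Sum", list(N = ~n(), Mean = mean, Median = median,
                           SD = sd, Min = min, Max = max))

# Variant 2: plain summarise(), naming the column in every call
stats <- dice %>%
  summarise(n = n(), Mean = mean(Sum), Median = median(Sum),
            SD = sd(Sum), Min = min(Sum), Max = max(Sum))
```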

Tables

A table

  • A table contains pieces of information organized by columns and rows
    • Two-dimension coordinates
          Column_1   Column_2   Column_3
  Row 1   1, 1       1, 2       1, 3
  Row 2   2, 1       2, 2       2, 3
  Row 3   3, 1       3, 2       3, 3

Table output: html

  • A simple table output can be obtained by invoking its name
  # A tibble: 1 x 6
        N  Mean Median    SD   Min   Max
    <int> <dbl>  <dbl> <dbl> <int> <int>
  1    50  6.76      7  2.79     2    12
  • The libraries knitr and kableExtra deliver nicer results!
Summary statistics
      N   Mean   Median      SD   Min   Max
     50   6.76        7   2.789     2    12
  • You can easily get the LaTeX code of your table for your papers
    • format="latex"
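
A sketch of both outputs, assuming the packages knitr and kableExtra are installed. The data frame stats simply re-enters the summary values printed above, and the object names are illustrative.

```r
library(knitr)
library(kableExtra)

# Summary statistics as printed in the slides
stats <- data.frame(N = 50L, Mean = 6.76, Median = 7,
                    SD = 2.789, Min = 2L, Max = 12L)

# HTML table with a caption, styled by kableExtra
html_table <- kable_styling(
  kable(stats, format = "html", caption = "Summary statistics"),
  full_width = FALSE
)

# The same table as LaTeX source, ready to paste into a paper
tex_table <- kable(stats, format = "latex", caption = "Summary statistics")
cat(tex_table)
```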

Graphical tables

  • Information contained in tables can be complemented with graphical support
    • library(formattable)
  • As an example, add a bar whose length is proportional to the cell content
  Toss  Out_1  Out_2  Sum
     1      3      5    8
     2      4      5    9
     3      4      6   10
     4      1      5    6
     5      4      4    8
     6      6      1    7
     7      1      6    7
     8      1      4    5
     9      1      3    4
    10      2      3    5
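
A possible way to obtain such bars, assuming the formattable package. The data frame re-enters the ten tosses listed above; the bar colour and the choice of the Sum column are illustrative.

```r
library(formattable)

# The ten tosses listed in the table
tosses <- data.frame(
  Toss  = 1:10,
  Out_1 = c(3, 4, 4, 1, 4, 6, 1, 1, 1, 2),
  Out_2 = c(5, 5, 6, 5, 4, 1, 6, 4, 3, 3),
  Sum   = c(8, 9, 10, 6, 8, 7, 7, 5, 4, 5)
)

# color_bar() draws, in each cell of Sum, a bar proportional to the cell value
tab <- formattable(tosses, list(Sum = color_bar("lightblue")))
tab
```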

Graphical representation

A Grammar for graphics

  • Wickham (2010) defines a layered grammar of graphics
    • Building a graph from multiple layers of data
  • The main layers of a graph are
    • data and aesthetic mappings
    • geometry objects
    • scales
    • facet specification
  • In addition we may have
    • statistical transformations
    • coordinate system

ggplot

  • We rely on the library ggplot2 to provide a graphical representation of data
    • Provide a data frame in the form of a tibble
    • Define the aesthetic mapping gathered from the data frame
      • x: x-dimension
      • y: y-dimension
      • fill: color to fill the graph
      • color: color of the graph
      • size: dimension of graph elements
      • label: labels in the graph

ggplot(data = DATA, mapping = aes(x = X, y = Y, fill = FILL))

  • This “sets the ground” for the graph
    • Still need to specify the exact “geometry” of the graph
  • Note: the mapping can also be defined at the geometry level

ggplot (ii)

  • Define the graph “canvas” p
    • On the x-axis, the sequence of tosses
    • On the y-axis, the outcome (sum)
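
The canvas could be defined as sketched here, assuming ggplot2 and a simulated data frame dice with columns Toss and Sum (seed and data are illustrative). Printing p shows the axes only, because no geometry has been added yet.

```r
library(ggplot2)

set.seed(123)  # arbitrary seed
dice <- data.frame(
  Toss = 1:50,
  Sum  = sample(1:6, 50, replace = TRUE) + sample(1:6, 50, replace = TRUE)
)

# Canvas: data plus aesthetic mapping, no geometry yet
p <- ggplot(data = dice, mapping = aes(x = Toss, y = Sum))
p
```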

Geometry

  • Geometries can be added to the canvas as layers
  • Common geometry functions
    • geom_point()
    • geom_line()
    • geom_smooth()
    • geom_bar()
    • geom_area()
    • …
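
A short sketch of adding geometries as layers, assuming ggplot2 and simulated data (object names are illustrative). Each + adds one layer on top of the previous ones.

```r
library(ggplot2)

set.seed(123)  # arbitrary seed
dice <- data.frame(
  Toss = 1:50,
  Sum  = sample(1:6, 50, replace = TRUE) + sample(1:6, 50, replace = TRUE)
)

p <- ggplot(dice, aes(x = Toss, y = Sum))   # canvas only

p_points <- p + geom_point()                # one point per toss
p_both   <- p + geom_line() + geom_point()  # line first, points drawn on top

# Bar chart of how often each sum occurs (geom_bar() counts rows by default)
p_bars <- ggplot(dice, aes(x = Sum)) + geom_bar()
```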

Different kinds of lines

  • linetype controls the style of line
    • Can be used inside aes(linetype=)
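
Both uses of linetype are sketched below, assuming ggplot2. The long-format data frame, with one row per toss and die, is simulated for illustration.

```r
library(ggplot2)

set.seed(123)  # arbitrary seed
long <- data.frame(
  Toss    = rep(1:50, times = 2),
  Die     = rep(c("Out_1", "Out_2"), each = 50),
  Outcome = sample(1:6, 100, replace = TRUE)
)

# Fixed style: the same line type for the whole layer
p_fixed <- ggplot(long[long$Die == "Out_1", ], aes(x = Toss, y = Outcome)) +
  geom_line(linetype = "dashed")

# Mapped style: one line type per die, with a legend
p_mapped <- ggplot(long, aes(x = Toss, y = Outcome, linetype = Die)) +
  geom_line()
```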

Different kinds of points

  • pch controls the style of points
    • Can be used inside aes(pch=)
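
A sketch of fixed and mapped point styles, assuming ggplot2 and simulated data. pch is the base R name; ggplot2 translates it to its own shape aesthetic, so either spelling can be used. The helper column Nature is illustrative.

```r
library(ggplot2)

set.seed(123)  # arbitrary seed
dice <- data.frame(
  Toss = 1:50,
  Sum  = sample(1:6, 50, replace = TRUE) + sample(1:6, 50, replace = TRUE)
)
dice$Nature <- ifelse(dice$Sum %% 2 == 0, "Even", "Odd")

# Fixed style: filled triangles (shape code 17) for every point
p_fixed <- ggplot(dice, aes(x = Toss, y = Sum)) +
  geom_point(shape = 17, size = 3)

# Mapped style: one point shape per value of Nature
p_mapped <- ggplot(dice, aes(x = Toss, y = Sum)) +
  geom_point(aes(pch = Nature), size = 3)
```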

Fitting

  • Add a smoothing function
    • By default, for \(N<1000\) the loess function is used
      • Local Polynomial Regression Fitting
    • Others can be specified, e.g. method=lm for linear fitting
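
The two fits could be added as follows; a sketch assuming ggplot2 and simulated data. With the default method, ggplot2 prints a message naming the smoother it picked.

```r
library(ggplot2)

set.seed(123)  # arbitrary seed
dice <- data.frame(
  Toss = 1:50,
  Sum  = sample(1:6, 50, replace = TRUE) + sample(1:6, 50, replace = TRUE)
)

p <- ggplot(dice, aes(x = Toss, y = Sum)) + geom_point()

# Default smoother: loess here, since there are fewer than 1000 observations
p_loess <- p + geom_smooth()

# Linear fit, without the confidence band
p_lm <- p + geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
```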

Axes

  • Axes provide a guide to read the graph
  • Possible to control
    • axis dimensions
    • axis type
    • tick marks
    • tick mark labels
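
A sketch of axis control through scale functions, assuming ggplot2 and simulated data. The titles, breaks, and labels are arbitrary choices made for illustration.

```r
library(ggplot2)

set.seed(123)  # arbitrary seed
dice <- data.frame(
  Toss = 1:50,
  Sum  = sample(1:6, 50, replace = TRUE) + sample(1:6, 50, replace = TRUE)
)

p_axes <- ggplot(dice, aes(x = Toss, y = Sum)) +
  geom_point() +
  # axis title, range, and tick positions
  scale_x_continuous(name = "Toss", limits = c(0, 50),
                     breaks = seq(0, 50, by = 10)) +
  # custom tick mark labels at selected positions
  scale_y_continuous(name = "Sum of two dice", limits = c(2, 12),
                     breaks = c(2, 7, 12),
                     labels = c("min (2)", "7", "max (12)"))
```

A different axis type, such as a logarithmic one, can be requested in the same way with scale_y_log10().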

Non-graphical elements: theme

  • You can easily control the size, the orientation, and the color of non-graphical elements with theme
    • Axis
      • axis.title, axis.text, axis.ticks …
    • Legend
      • legend.key, legend.key.size, legend.key.height, legend.key.width, legend.background, legend.margin, legend.spacing, legend.text, legend.text.align, legend.title, legend.position …
    • Facets
      • strip.background, strip.text …
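
A sketch touching one or two elements from each group, assuming ggplot2 and simulated data. The specific sizes, colours, and positions are arbitrary.

```r
library(ggplot2)

set.seed(123)  # arbitrary seed
dice <- data.frame(
  Toss = 1:50,
  Sum  = sample(1:6, 50, replace = TRUE) + sample(1:6, 50, replace = TRUE)
)
dice$Nature <- ifelse(dice$Sum %% 2 == 0, "Even", "Odd")

p_themed <- ggplot(dice, aes(x = Toss, y = Sum, colour = Nature)) +
  geom_point() +
  facet_wrap(~ Nature) +
  theme(
    axis.title       = element_text(size = 14, colour = "grey30"),  # axis
    axis.text.x      = element_text(angle = 45, hjust = 1),
    legend.position  = "bottom",                                    # legend
    legend.text      = element_text(size = 10),
    strip.background = element_rect(fill = "grey90"),               # facets
    strip.text       = element_text(face = "bold")
  )
```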

Colors

  • Colours can be used to fill a geometric element (e.g., bars, points) or to define its color (e.g., points, lines, …)
  • Variables can also be mapped to colors
    • aes(… color=VAR, fill=VAR)
  • Fill and color for a discrete variable
    • scale_fill_brewer() or scale_color_brewer() to use library(RColorBrewer) palettes
    • scale_fill_manual() or scale_color_manual() to manually define colors
  • Fill and color for a continuous variable
    • scale_fill_gradient() or scale_color_gradient() two-color gradient
    • scale_fill_gradientn() or scale_color_gradientn() n-color gradient, equally spaced
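
The four scale families are sketched below, assuming ggplot2 (with RColorBrewer available) and simulated data. The palette name and the colours are arbitrary choices.

```r
library(ggplot2)

set.seed(123)  # arbitrary seed
dice <- data.frame(
  Toss = 1:50,
  Sum  = sample(1:6, 50, replace = TRUE) + sample(1:6, 50, replace = TRUE)
)
dice$Nature <- ifelse(dice$Sum %% 2 == 0, "Even", "Odd")

bars   <- ggplot(dice, aes(x = Sum, fill = Nature)) + geom_bar()
points <- ggplot(dice, aes(x = Toss, y = Sum, color = Sum)) + geom_point()

# Discrete variable: a Brewer palette or manually chosen colors
p_brewer <- bars + scale_fill_brewer(palette = "Set2")
p_manual <- bars + scale_fill_manual(values = c(Even = "steelblue", Odd = "tomato"))

# Continuous variable: two-color or n-color gradient
p_grad  <- points + scale_color_gradient(low = "grey80", high = "darkred")
p_gradn <- points + scale_color_gradientn(colours = c("navy", "gold", "darkred"))
```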

RColorBrewer palettes