Python | Data Wrangling

An introduction to data management and manipulation in Python (1/3)
Author

Matteo Ploner (University of Trento, Italy)

Published

September 4, 2024

This notebook illustrates a simple data workflow using Python. The workflow includes the following steps:

Module loading

Import modules for data manipulation and visualization.

  • Pandas: library for data manipulation and analysis
  • Numpy: library for numerical computing
  • Seaborn: library for data visualization
import pandas as pd
import seaborn as sns
import numpy as np

Data Loading

Import a demo dataset from Seaborn library (the “famous” iris dataset) to illustrate the workflow.

A set of 150 records of flowers attributes taken by Ronald Fisher in 1936.

  • 50 samples for three species of iris: Iris setosa, Iris virginica and Iris versicolor
  • Five attributes: sepal length, sepal width, petal length, petal width and species.
# read csv from external source
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
# or, alternatively, use the built-in dataset
iris = sns.load_dataset("iris", cache=True, data_home=None)
iris
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica

150 rows × 5 columns

The dataset is tidy, with each variable forming a column, each observation forming a row

  • First, we rename the columns to make them more readable
iris.rename(columns={'sepal_length': 'Sepal Length', 'sepal_width': 'Sepal Width', 'petal_length': 'Petal Length', 'petal_width': 'Petal Width', 'species': 'Species'}, inplace=True)
iris.head()
Sepal Length Sepal Width Petal Length Petal Width Species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
  • Next, we reorder the columns by Sepal Length
iris.sort_values(by='Sepal Length', ascending=False).head()
Sepal Length Sepal Width Petal Length Petal Width Species
131 7.9 3.8 6.4 2.0 virginica
135 7.7 3.0 6.1 2.3 virginica
122 7.7 2.8 6.7 2.0 virginica
117 7.7 3.8 6.7 2.2 virginica
118 7.7 2.6 6.9 2.3 virginica

Data Subsetting

Subsetting is the process of creating a new dataset from an existing dataset based on some criteria.

Subsetting can be done by rows (observations), columns (variables) or both.

Subsetting rows

Sub-setting rows by index

  • Just pick the first 5 rows of the dataset
iris.iloc[0:5]
Sepal Length Sepal Width Petal Length Petal Width Species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

Select randomly

  • Select 5 random rows from the dataset
iris.sample(5)
Sepal Length Sepal Width Petal Length Petal Width Species
56 6.3 3.3 4.7 1.6 versicolor
96 5.7 2.9 4.2 1.3 versicolor
10 5.4 3.7 1.5 0.2 setosa
144 6.7 3.3 5.7 2.5 virginica
115 6.4 3.2 5.3 2.3 virginica

Select according to a condition

  • Select only rows with largest (smallest) petal length
iris.nlargest(5, 'Sepal Length')
Sepal Length Sepal Width Petal Length Petal Width Species
131 7.9 3.8 6.4 2.0 virginica
117 7.7 3.8 6.7 2.2 virginica
118 7.7 2.6 6.9 2.3 virginica
122 7.7 2.8 6.7 2.0 virginica
135 7.7 3.0 6.1 2.3 virginica
iris.nsmallest(5, 'Sepal Length')
Sepal Length Sepal Width Petal Length Petal Width Species
13 4.3 3.0 1.1 0.1 setosa
8 4.4 2.9 1.4 0.2 setosa
38 4.4 3.0 1.3 0.2 setosa
42 4.4 3.2 1.3 0.2 setosa
41 4.5 2.3 1.3 0.3 setosa

Sub-setting based on column values

  • We create a subset of the dataset with only the species “Iris setosa”.
iris[iris.Species == 'setosa'].head()
Sepal Length Sepal Width Petal Length Petal Width Species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

Sub-setting cols

Select columns by name

  • Select only the columns “Sepal Length” and “Species”
iris[['Sepal Length', 'Species']]
Sepal Length Species
0 5.1 setosa
1 4.9 setosa
2 4.7 setosa
3 4.6 setosa
4 5.0 setosa
... ... ...
145 6.7 virginica
146 6.3 virginica
147 6.5 virginica
148 6.2 virginica
149 5.9 virginica

150 rows × 2 columns

Select columns by index

  • Select only the first three columns
iris.iloc[:,[1,2,3]]
Sepal Width Petal Length Petal Width
0 3.5 1.4 0.2
1 3.0 1.4 0.2
2 3.2 1.3 0.2
3 3.1 1.5 0.2
4 3.6 1.4 0.2
... ... ... ...
145 3.0 5.2 2.3
146 2.5 5.0 1.9
147 3.0 5.2 2.0
148 3.4 5.4 2.3
149 3.0 5.1 1.8

150 rows × 3 columns

Select columns by regular expression

  • Select only columns that contain the word “Length”
iris.filter(regex='Length$')
Sepal Length Petal Length
0 5.1 1.4
1 4.9 1.4
2 4.7 1.3
3 4.6 1.5
4 5.0 1.4
... ... ...
145 6.7 5.2
146 6.3 5.0
147 6.5 5.2
148 6.2 5.4
149 5.9 5.1

150 rows × 2 columns

Sub-setting rows and columns jointly

  • Select only the rows for which the column Species is “setosa” and the columns “Sepal Length” and “Species”
iris.loc[iris.Species == 'setosa',['Sepal Length', 'Species']].head()
Sepal Length Species
0 5.1 setosa
1 4.9 setosa
2 4.7 setosa
3 4.6 setosa
4 5.0 setosa

Data Reshaping

Data reshaping is the process of transforming data from one format into another.

The dataset is in a wide format, with each row representing a single observation and each column representing a variable.

iris.head()
Sepal Length Sepal Width Petal Length Petal Width Species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

From wide to long format

  • Convert the dataset from wide to long format
    • In the long format, different variables are “collapsed” into two variables: one for the variable names and one for the variable values.
iris_melt = iris.melt(
  id_vars='Species',
  value_vars=['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width'],
  var_name='Measurement',
  value_name='Value_cm',
)
iris_melt
Species Measurement Value_cm
0 setosa Sepal Length 5.1
1 setosa Sepal Length 4.9
2 setosa Sepal Length 4.7
3 setosa Sepal Length 4.6
4 setosa Sepal Length 5.0
... ... ... ...
595 virginica Petal Width 2.3
596 virginica Petal Width 1.9
597 virginica Petal Width 2.0
598 virginica Petal Width 2.3
599 virginica Petal Width 1.8

600 rows × 3 columns

From long to wide format

! Duplicated entries: for a species there are several entries for the same attribute (e.g., sepal length) Solution: each row is a unique combination of species and attribute. reset:index() method is used to reset the index of the DataFrame.

iris_melt = iris.reset_index().melt(
  id_vars=['index', 'Species'],
  value_vars=['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width'],
  var_name='Measurement',
  value_name='Value_cm',
)

iris_melt.pivot(
   index=['index', 'Species'],
   columns='Measurement',
   values='Value_cm'
).reset_index().drop(columns=['index'])
Measurement Species Petal Length Petal Width Sepal Length Sepal Width
0 setosa 1.4 0.2 5.1 3.5
1 setosa 1.4 0.2 4.9 3.0
2 setosa 1.3 0.2 4.7 3.2
3 setosa 1.5 0.2 4.6 3.1
4 setosa 1.4 0.2 5.0 3.6
... ... ... ... ... ...
145 virginica 5.2 2.3 6.7 3.0
146 virginica 5.0 1.9 6.3 2.5
147 virginica 5.2 2.0 6.5 3.0
148 virginica 5.4 2.3 6.2 3.4
149 virginica 5.1 1.8 5.9 3.0

150 rows × 5 columns

Append data

  • Append datasets to each other

Append rows

  • Append datasets to each other with columns aligned by name
pd.concat(
[iris[iris.Species == 'setosa'],
iris[iris.Species == 'virginica']
]
)
Sepal Length Sepal Width Petal Length Petal Width Species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica

100 rows × 5 columns

Append side by side

pd.concat(
[iris[iris.Species == 'setosa'],
iris[iris.Species == 'virginica']
],
axis=1
)
Sepal Length Sepal Width Petal Length Petal Width Species Sepal Length Sepal Width Petal Length Petal Width Species
0 5.1 3.5 1.4 0.2 setosa NaN NaN NaN NaN NaN
1 4.9 3.0 1.4 0.2 setosa NaN NaN NaN NaN NaN
2 4.7 3.2 1.3 0.2 setosa NaN NaN NaN NaN NaN
3 4.6 3.1 1.5 0.2 setosa NaN NaN NaN NaN NaN
4 5.0 3.6 1.4 0.2 setosa NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ...
145 NaN NaN NaN NaN NaN 6.7 3.0 5.2 2.3 virginica
146 NaN NaN NaN NaN NaN 6.3 2.5 5.0 1.9 virginica
147 NaN NaN NaN NaN NaN 6.5 3.0 5.2 2.0 virginica
148 NaN NaN NaN NaN NaN 6.2 3.4 5.4 2.3 virginica
149 NaN NaN NaN NaN NaN 5.9 3.0 5.1 1.8 virginica

100 rows × 10 columns

Merge data

  • Merge datasets to each other based on a common column
A = iris.reset_index().iloc[0:5, 0:2]
A
index Sepal Length
0 0 5.1
1 1 4.9
2 2 4.7
3 3 4.6
4 4 5.0
B = iris.reset_index().iloc[2:7, [0, 3,4, 5]]
B
index Petal Length Petal Width Species
2 2 1.3 0.2 setosa
3 3 1.5 0.2 setosa
4 4 1.4 0.2 setosa
5 5 1.7 0.4 setosa
6 6 1.4 0.3 setosa

Dataset A and B are merged based on the column “index”

Inner join

  • Only the intersection of the keys is returned
pd.merge(
A,
B,
how='inner',
on='index'
)
index Sepal Length Petal Length Petal Width Species
0 2 4.7 1.3 0.2 setosa
1 3 4.6 1.5 0.2 setosa
2 4 5.0 1.4 0.2 setosa

Outer join

  • All keys are returned
pd.merge(
A,
B,
how='outer',
on='index'
)
index Sepal Length Petal Length Petal Width Species
0 0 5.1 NaN NaN NaN
1 1 4.9 NaN NaN NaN
2 2 4.7 1.3 0.2 setosa
3 3 4.6 1.5 0.2 setosa
4 4 5.0 1.4 0.2 setosa
5 5 NaN 1.7 0.4 setosa
6 6 NaN 1.4 0.3 setosa

Left join

  • All keys in the left DataFrame are returned
pd.merge(
A,
B,
how='left',
on='index'
)
index Sepal Length Petal Length Petal Width Species
0 0 5.1 NaN NaN NaN
1 1 4.9 NaN NaN NaN
2 2 4.7 1.3 0.2 setosa
3 3 4.6 1.5 0.2 setosa
4 4 5.0 1.4 0.2 setosa

Write data to external file

Write datasets to external files is useful to work with other tools or to share data with others.

CSV is a common format for data exchange.

iris.to_csv('./iris.csv', index=False)

but many other formats are available (e.g., Excel, JSON, SQL, etc.)

import xlsxwriter

iris.to_excel('./iris.xlsx', engine='xlsxwriter', index=False)

Sources

Books

  • McKinney, W. (2013). Python for data analysis. ” O’Reilly Media, Inc.”.
  • O’Neil, C., & Schutt, R. (2013). Doing data science: Straight talk from the frontline. ” O’Reilly Media, Inc.”

Websites