Python | Data Wrangling

An introduction to data management and manipulation in Python (1/3)

Author

Matteo Ploner (University of Trento, Italy)

Published

September 4, 2024

This notebook illustrates a simple data workflow using Python. The workflow includes the following steps:

Data loading
Data subsetting
Data transformation
- Data shaping
- Data merging
Data export

Module loading

Import modules for data manipulation and visualization.

Pandas: library for data manipulation and analysis
Numpy: library for numerical computing
Seaborn: library for data visualization

import pandas as pd
import seaborn as sns
import numpy as np

Data Loading

Import a demo dataset from Seaborn library (the “famous” iris dataset) to illustrate the workflow.

A set of 150 records of flowers attributes taken by Ronald Fisher in 1936.

50 samples for three species of iris: Iris setosa, Iris virginica and Iris versicolor
Five attributes: sepal length, sepal width, petal length, petal width and species.

# read csv from external source
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
# or, alternatively, use the built-in dataset
iris = sns.load_dataset("iris", cache=True, data_home=None)
iris

	sepal_length	sepal_width	petal_length	petal_width	species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa
...	...	...	...	...	...
145	6.7	3.0	5.2	2.3	virginica
146	6.3	2.5	5.0	1.9	virginica
147	6.5	3.0	5.2	2.0	virginica
148	6.2	3.4	5.4	2.3	virginica
149	5.9	3.0	5.1	1.8	virginica

150 rows × 5 columns

The dataset is tidy, with each variable forming a column, each observation forming a row

First, we rename the columns to make them more readable

iris.rename(columns={'sepal_length': 'Sepal Length', 'sepal_width': 'Sepal Width', 'petal_length': 'Petal Length', 'petal_width': 'Petal Width', 'species': 'Species'}, inplace=True)
iris.head()

	Sepal Length	Sepal Width	Petal Length	Petal Width	Species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa

Next, we reorder the columns by Sepal Length

iris.sort_values(by='Sepal Length', ascending=False).head()

	Sepal Length	Sepal Width	Petal Length	Petal Width	Species
131	7.9	3.8	6.4	2.0	virginica
135	7.7	3.0	6.1	2.3	virginica
122	7.7	2.8	6.7	2.0	virginica
117	7.7	3.8	6.7	2.2	virginica
118	7.7	2.6	6.9	2.3	virginica

Data Subsetting

Subsetting is the process of creating a new dataset from an existing dataset based on some criteria.

Subsetting can be done by rows (observations), columns (variables) or both.

Subsetting rows

Sub-setting rows by index

Just pick the first 5 rows of the dataset

iris.iloc[0:5]

	Sepal Length	Sepal Width	Petal Length	Petal Width	Species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa

Select randomly

Select 5 random rows from the dataset

iris.sample(5)

	Sepal Length	Sepal Width	Petal Length	Petal Width	Species
56	6.3	3.3	4.7	1.6	versicolor
96	5.7	2.9	4.2	1.3	versicolor
10	5.4	3.7	1.5	0.2	setosa
144	6.7	3.3	5.7	2.5	virginica
115	6.4	3.2	5.3	2.3	virginica

Select according to a condition

Select only rows with largest (smallest) petal length

iris.nlargest(5, 'Sepal Length')

	Sepal Length	Sepal Width	Petal Length	Petal Width	Species
131	7.9	3.8	6.4	2.0	virginica
117	7.7	3.8	6.7	2.2	virginica
118	7.7	2.6	6.9	2.3	virginica
122	7.7	2.8	6.7	2.0	virginica
135	7.7	3.0	6.1	2.3	virginica

iris.nsmallest(5, 'Sepal Length')

	Sepal Length	Sepal Width	Petal Length	Petal Width	Species
13	4.3	3.0	1.1	0.1	setosa
8	4.4	2.9	1.4	0.2	setosa
38	4.4	3.0	1.3	0.2	setosa
42	4.4	3.2	1.3	0.2	setosa
41	4.5	2.3	1.3	0.3	setosa

Sub-setting based on column values

We create a subset of the dataset with only the species “Iris setosa”.

iris[iris.Species == 'setosa'].head()

	Sepal Length	Sepal Width	Petal Length	Petal Width	Species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa

Sub-setting cols

Select columns by name

Select only the columns “Sepal Length” and “Species”

iris[['Sepal Length', 'Species']]

	Sepal Length	Species
0	5.1	setosa
1	4.9	setosa
2	4.7	setosa
3	4.6	setosa
4	5.0	setosa
...	...	...
145	6.7	virginica
146	6.3	virginica
147	6.5	virginica
148	6.2	virginica
149	5.9	virginica

150 rows × 2 columns

Select columns by index

Select only the first three columns

iris.iloc[:,[1,2,3]]

	Sepal Width	Petal Length	Petal Width
0	3.5	1.4	0.2
1	3.0	1.4	0.2
2	3.2	1.3	0.2
3	3.1	1.5	0.2
4	3.6	1.4	0.2
...	...	...	...
145	3.0	5.2	2.3
146	2.5	5.0	1.9
147	3.0	5.2	2.0
148	3.4	5.4	2.3
149	3.0	5.1	1.8

150 rows × 3 columns

Select columns by regular expression

Select only columns that contain the word “Length”

iris.filter(regex='Length$')

	Sepal Length	Petal Length
0	5.1	1.4
1	4.9	1.4
2	4.7	1.3
3	4.6	1.5
4	5.0	1.4
...	...	...
145	6.7	5.2
146	6.3	5.0
147	6.5	5.2
148	6.2	5.4
149	5.9	5.1

150 rows × 2 columns

Sub-setting rows and columns jointly

Select only the rows for which the column Species is “setosa” and the columns “Sepal Length” and “Species”

iris.loc[iris.Species == 'setosa',['Sepal Length', 'Species']].head()

	Sepal Length	Species
0	5.1	setosa
1	4.9	setosa
2	4.7	setosa
3	4.6	setosa
4	5.0	setosa

Data Reshaping

Data reshaping is the process of transforming data from one format into another.

The dataset is in a wide format, with each row representing a single observation and each column representing a variable.

iris.head()

	Sepal Length	Sepal Width	Petal Length	Petal Width	Species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa

From wide to long format

Convert the dataset from wide to long format
- In the long format, different variables are “collapsed” into two variables: one for the variable names and one for the variable values.

iris_melt = iris.melt(
  id_vars='Species',
  value_vars=['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width'],
  var_name='Measurement',
  value_name='Value_cm',
)
iris_melt

	Species	Measurement	Value_cm
0	setosa	Sepal Length	5.1
1	setosa	Sepal Length	4.9
2	setosa	Sepal Length	4.7
3	setosa	Sepal Length	4.6
4	setosa	Sepal Length	5.0
...	...	...	...
595	virginica	Petal Width	2.3
596	virginica	Petal Width	1.9
597	virginica	Petal Width	2.0
598	virginica	Petal Width	2.3
599	virginica	Petal Width	1.8

600 rows × 3 columns

From long to wide format

! Duplicated entries: for a species there are several entries for the same attribute (e.g., sepal length) Solution: each row is a unique combination of species and attribute. reset:index() method is used to reset the index of the DataFrame.

iris_melt = iris.reset_index().melt(
  id_vars=['index', 'Species'],
  value_vars=['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width'],
  var_name='Measurement',
  value_name='Value_cm',
)

iris_melt.pivot(
   index=['index', 'Species'],
   columns='Measurement',
   values='Value_cm'
).reset_index().drop(columns=['index'])

Measurement	Species	Petal Length	Petal Width	Sepal Length	Sepal Width
0	setosa	1.4	0.2	5.1	3.5
1	setosa	1.4	0.2	4.9	3.0
2	setosa	1.3	0.2	4.7	3.2
3	setosa	1.5	0.2	4.6	3.1
4	setosa	1.4	0.2	5.0	3.6
...	...	...	...	...	...
145	virginica	5.2	2.3	6.7	3.0
146	virginica	5.0	1.9	6.3	2.5
147	virginica	5.2	2.0	6.5	3.0
148	virginica	5.4	2.3	6.2	3.4
149	virginica	5.1	1.8	5.9	3.0

150 rows × 5 columns

Append data

Append datasets to each other

Append rows

Append datasets to each other with columns aligned by name

pd.concat(
[iris[iris.Species == 'setosa'],
iris[iris.Species == 'virginica']
]
)

	Sepal Length	Sepal Width	Petal Length	Petal Width	Species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa
...	...	...	...	...	...
145	6.7	3.0	5.2	2.3	virginica
146	6.3	2.5	5.0	1.9	virginica
147	6.5	3.0	5.2	2.0	virginica
148	6.2	3.4	5.4	2.3	virginica
149	5.9	3.0	5.1	1.8	virginica

100 rows × 5 columns

Append side by side

pd.concat(
[iris[iris.Species == 'setosa'],
iris[iris.Species == 'virginica']
],
axis=1
)

	Sepal Length	Sepal Width	Petal Length	Petal Width	Species	Sepal Length	Sepal Width	Petal Length	Petal Width	Species
0	5.1	3.5	1.4	0.2	setosa	NaN	NaN	NaN	NaN	NaN
1	4.9	3.0	1.4	0.2	setosa	NaN	NaN	NaN	NaN	NaN
2	4.7	3.2	1.3	0.2	setosa	NaN	NaN	NaN	NaN	NaN
3	4.6	3.1	1.5	0.2	setosa	NaN	NaN	NaN	NaN	NaN
4	5.0	3.6	1.4	0.2	setosa	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...
145	NaN	NaN	NaN	NaN	NaN	6.7	3.0	5.2	2.3	virginica
146	NaN	NaN	NaN	NaN	NaN	6.3	2.5	5.0	1.9	virginica
147	NaN	NaN	NaN	NaN	NaN	6.5	3.0	5.2	2.0	virginica
148	NaN	NaN	NaN	NaN	NaN	6.2	3.4	5.4	2.3	virginica
149	NaN	NaN	NaN	NaN	NaN	5.9	3.0	5.1	1.8	virginica

100 rows × 10 columns

Merge data

Merge datasets to each other based on a common column

A = iris.reset_index().iloc[0:5, 0:2]
A

	index	Sepal Length
0	0	5.1
1	1	4.9
2	2	4.7
3	3	4.6
4	4	5.0

B = iris.reset_index().iloc[2:7, [0, 3,4, 5]]
B

	index	Petal Length	Petal Width	Species
2	2	1.3	0.2	setosa
3	3	1.5	0.2	setosa
4	4	1.4	0.2	setosa
5	5	1.7	0.4	setosa
6	6	1.4	0.3	setosa

Dataset A and B are merged based on the column “index”

Inner join

Only the intersection of the keys is returned

pd.merge(
A,
B,
how='inner',
on='index'
)

	index	Sepal Length	Petal Length	Petal Width	Species
0	2	4.7	1.3	0.2	setosa
1	3	4.6	1.5	0.2	setosa
2	4	5.0	1.4	0.2	setosa

Outer join

All keys are returned

pd.merge(
A,
B,
how='outer',
on='index'
)

	index	Sepal Length	Petal Length	Petal Width	Species
0	0	5.1	NaN	NaN	NaN
1	1	4.9	NaN	NaN	NaN
2	2	4.7	1.3	0.2	setosa
3	3	4.6	1.5	0.2	setosa
4	4	5.0	1.4	0.2	setosa
5	5	NaN	1.7	0.4	setosa
6	6	NaN	1.4	0.3	setosa

Left join

All keys in the left DataFrame are returned

pd.merge(
A,
B,
how='left',
on='index'
)

	index	Sepal Length	Petal Length	Petal Width	Species
0	0	5.1	NaN	NaN	NaN
1	1	4.9	NaN	NaN	NaN
2	2	4.7	1.3	0.2	setosa
3	3	4.6	1.5	0.2	setosa
4	4	5.0	1.4	0.2	setosa

Write data to external file

Write datasets to external files is useful to work with other tools or to share data with others.

CSV is a common format for data exchange.

iris.to_csv('./iris.csv', index=False)

but many other formats are available (e.g., Excel, JSON, SQL, etc.)

import xlsxwriter

iris.to_excel('./iris.xlsx', engine='xlsxwriter', index=False)

Sources

Books

McKinney, W. (2013). Python for data analysis. ” O’Reilly Media, Inc.”.
O’Neil, C., & Schutt, R. (2013). Doing data science: Straight talk from the frontline. ” O’Reilly Media, Inc.”

	Sepal Width	Petal Length	Petal Width
0	3.5	1.4	0.2
1	3.0	1.4	0.2
2	3.2	1.3	0.2
3	3.1	1.5	0.2
4	3.6	1.4	0.2
...	...	...	...
145	3.0	5.2	2.3
146	2.5	5.0	1.9
147	3.0	5.2	2.0
148	3.4	5.4	2.3
149	3.0	5.1	1.8

	Sepal Length	Petal Length
0	5.1	1.4
1	4.9	1.4
2	4.7	1.3
3	4.6	1.5
4	5.0	1.4
...	...	...
145	6.7	5.2
146	6.3	5.0
147	6.5	5.2
148	6.2	5.4
149	5.9	5.1

	Sepal Width	Petal Length	Petal Width
0	3.5	1.4	0.2
1	3.0	1.4	0.2
2	3.2	1.3	0.2
3	3.1	1.5	0.2
4	3.6	1.4	0.2
...	...	...	...
145	3.0	5.2	2.3
146	2.5	5.0	1.9
147	3.0	5.2	2.0
148	3.4	5.4	2.3
149	3.0	5.1	1.8

	Sepal Length	Petal Length
0	5.1	1.4
1	4.9	1.4
2	4.7	1.3
3	4.6	1.5
4	5.0	1.4
...	...	...
145	6.7	5.2
146	6.3	5.0
147	6.5	5.2
148	6.2	5.4
149	5.9	5.1