import pandas as pd
import seaborn as sns
import numpy as np
Python | Data Wrangling
This notebook illustrates a simple data workflow using Python. The workflow includes the following steps:
- Data loading
- Data subsetting
- Data transformation
- Data shaping
- Data merging
- Data export
Module loading
Import modules for data manipulation and visualization.
- Pandas: library for data manipulation and analysis
- Numpy: library for numerical computing
- Seaborn: library for data visualization
Data Loading
Import a demo dataset from Seaborn library (the “famous” iris dataset) to illustrate the workflow.
A set of 150 records of flowers attributes taken by Ronald Fisher in 1936.
- 50 samples for three species of iris: Iris setosa, Iris virginica and Iris versicolor
- Five attributes: sepal length, sepal width, petal length, petal width and species.
# read csv from external source
= pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
iris # or, alternatively, use the built-in dataset
= sns.load_dataset("iris", cache=True, data_home=None)
iris iris
sepal_length | sepal_width | petal_length | petal_width | species | |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
... | ... | ... | ... | ... | ... |
145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |
150 rows × 5 columns
The dataset is tidy, with each variable forming a column, each observation forming a row
- First, we rename the columns to make them more readable
={'sepal_length': 'Sepal Length', 'sepal_width': 'Sepal Width', 'petal_length': 'Petal Length', 'petal_width': 'Petal Width', 'species': 'Species'}, inplace=True)
iris.rename(columns iris.head()
Sepal Length | Sepal Width | Petal Length | Petal Width | Species | |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
- Next, we reorder the columns by Sepal Length
='Sepal Length', ascending=False).head() iris.sort_values(by
Sepal Length | Sepal Width | Petal Length | Petal Width | Species | |
---|---|---|---|---|---|
131 | 7.9 | 3.8 | 6.4 | 2.0 | virginica |
135 | 7.7 | 3.0 | 6.1 | 2.3 | virginica |
122 | 7.7 | 2.8 | 6.7 | 2.0 | virginica |
117 | 7.7 | 3.8 | 6.7 | 2.2 | virginica |
118 | 7.7 | 2.6 | 6.9 | 2.3 | virginica |
Data Subsetting
Subsetting is the process of creating a new dataset from an existing dataset based on some criteria.
Subsetting can be done by rows (observations), columns (variables) or both.
Subsetting rows
Sub-setting rows by index
- Just pick the first 5 rows of the dataset
0:5] iris.iloc[
Sepal Length | Sepal Width | Petal Length | Petal Width | Species | |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
Select randomly
- Select 5 random rows from the dataset
5) iris.sample(
Sepal Length | Sepal Width | Petal Length | Petal Width | Species | |
---|---|---|---|---|---|
56 | 6.3 | 3.3 | 4.7 | 1.6 | versicolor |
96 | 5.7 | 2.9 | 4.2 | 1.3 | versicolor |
10 | 5.4 | 3.7 | 1.5 | 0.2 | setosa |
144 | 6.7 | 3.3 | 5.7 | 2.5 | virginica |
115 | 6.4 | 3.2 | 5.3 | 2.3 | virginica |
Select according to a condition
- Select only rows with largest (smallest) petal length
5, 'Sepal Length') iris.nlargest(
Sepal Length | Sepal Width | Petal Length | Petal Width | Species | |
---|---|---|---|---|---|
131 | 7.9 | 3.8 | 6.4 | 2.0 | virginica |
117 | 7.7 | 3.8 | 6.7 | 2.2 | virginica |
118 | 7.7 | 2.6 | 6.9 | 2.3 | virginica |
122 | 7.7 | 2.8 | 6.7 | 2.0 | virginica |
135 | 7.7 | 3.0 | 6.1 | 2.3 | virginica |
5, 'Sepal Length') iris.nsmallest(
Sepal Length | Sepal Width | Petal Length | Petal Width | Species | |
---|---|---|---|---|---|
13 | 4.3 | 3.0 | 1.1 | 0.1 | setosa |
8 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
38 | 4.4 | 3.0 | 1.3 | 0.2 | setosa |
42 | 4.4 | 3.2 | 1.3 | 0.2 | setosa |
41 | 4.5 | 2.3 | 1.3 | 0.3 | setosa |
Sub-setting based on column values
- We create a subset of the dataset with only the species “Iris setosa”.
== 'setosa'].head() iris[iris.Species
Sepal Length | Sepal Width | Petal Length | Petal Width | Species | |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
Sub-setting cols
Select columns by name
- Select only the columns “Sepal Length” and “Species”
'Sepal Length', 'Species']] iris[[
Sepal Length | Species | |
---|---|---|
0 | 5.1 | setosa |
1 | 4.9 | setosa |
2 | 4.7 | setosa |
3 | 4.6 | setosa |
4 | 5.0 | setosa |
... | ... | ... |
145 | 6.7 | virginica |
146 | 6.3 | virginica |
147 | 6.5 | virginica |
148 | 6.2 | virginica |
149 | 5.9 | virginica |
150 rows × 2 columns
Select columns by index
- Select only the first three columns
1,2,3]] iris.iloc[:,[
Sepal Width | Petal Length | Petal Width | |
---|---|---|---|
0 | 3.5 | 1.4 | 0.2 |
1 | 3.0 | 1.4 | 0.2 |
2 | 3.2 | 1.3 | 0.2 |
3 | 3.1 | 1.5 | 0.2 |
4 | 3.6 | 1.4 | 0.2 |
... | ... | ... | ... |
145 | 3.0 | 5.2 | 2.3 |
146 | 2.5 | 5.0 | 1.9 |
147 | 3.0 | 5.2 | 2.0 |
148 | 3.4 | 5.4 | 2.3 |
149 | 3.0 | 5.1 | 1.8 |
150 rows × 3 columns
Select columns by regular expression
- Select only columns that contain the word “Length”
filter(regex='Length$') iris.
Sepal Length | Petal Length | |
---|---|---|
0 | 5.1 | 1.4 |
1 | 4.9 | 1.4 |
2 | 4.7 | 1.3 |
3 | 4.6 | 1.5 |
4 | 5.0 | 1.4 |
... | ... | ... |
145 | 6.7 | 5.2 |
146 | 6.3 | 5.0 |
147 | 6.5 | 5.2 |
148 | 6.2 | 5.4 |
149 | 5.9 | 5.1 |
150 rows × 2 columns
Sub-setting rows and columns jointly
- Select only the rows for which the column Species is “setosa” and the columns “Sepal Length” and “Species”
== 'setosa',['Sepal Length', 'Species']].head() iris.loc[iris.Species
Sepal Length | Species | |
---|---|---|
0 | 5.1 | setosa |
1 | 4.9 | setosa |
2 | 4.7 | setosa |
3 | 4.6 | setosa |
4 | 5.0 | setosa |
Data Reshaping
Data reshaping is the process of transforming data from one format into another.
The dataset is in a wide format, with each row representing a single observation and each column representing a variable.
iris.head()
Sepal Length | Sepal Width | Petal Length | Petal Width | Species | |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
From wide to long format
- Convert the dataset from wide to long format
- In the long format, different variables are “collapsed” into two variables: one for the variable names and one for the variable values.
= iris.melt(
iris_melt ='Species',
id_vars=['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width'],
value_vars='Measurement',
var_name='Value_cm',
value_name
) iris_melt
Species | Measurement | Value_cm | |
---|---|---|---|
0 | setosa | Sepal Length | 5.1 |
1 | setosa | Sepal Length | 4.9 |
2 | setosa | Sepal Length | 4.7 |
3 | setosa | Sepal Length | 4.6 |
4 | setosa | Sepal Length | 5.0 |
... | ... | ... | ... |
595 | virginica | Petal Width | 2.3 |
596 | virginica | Petal Width | 1.9 |
597 | virginica | Petal Width | 2.0 |
598 | virginica | Petal Width | 2.3 |
599 | virginica | Petal Width | 1.8 |
600 rows × 3 columns
From long to wide format
! Duplicated entries: for a species there are several entries for the same attribute (e.g., sepal length) Solution: each row is a unique combination of species and attribute. reset:index()
method is used to reset the index of the DataFrame.
= iris.reset_index().melt(
iris_melt =['index', 'Species'],
id_vars=['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width'],
value_vars='Measurement',
var_name='Value_cm',
value_name
)
iris_melt.pivot(=['index', 'Species'],
index='Measurement',
columns='Value_cm'
values=['index']) ).reset_index().drop(columns
Measurement | Species | Petal Length | Petal Width | Sepal Length | Sepal Width |
---|---|---|---|---|---|
0 | setosa | 1.4 | 0.2 | 5.1 | 3.5 |
1 | setosa | 1.4 | 0.2 | 4.9 | 3.0 |
2 | setosa | 1.3 | 0.2 | 4.7 | 3.2 |
3 | setosa | 1.5 | 0.2 | 4.6 | 3.1 |
4 | setosa | 1.4 | 0.2 | 5.0 | 3.6 |
... | ... | ... | ... | ... | ... |
145 | virginica | 5.2 | 2.3 | 6.7 | 3.0 |
146 | virginica | 5.0 | 1.9 | 6.3 | 2.5 |
147 | virginica | 5.2 | 2.0 | 6.5 | 3.0 |
148 | virginica | 5.4 | 2.3 | 6.2 | 3.4 |
149 | virginica | 5.1 | 1.8 | 5.9 | 3.0 |
150 rows × 5 columns
Append data
- Append datasets to each other
Append rows
- Append datasets to each other with columns aligned by name
pd.concat(== 'setosa'],
[iris[iris.Species == 'virginica']
iris[iris.Species
] )
Sepal Length | Sepal Width | Petal Length | Petal Width | Species | |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
... | ... | ... | ... | ... | ... |
145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |
100 rows × 5 columns
Append side by side
pd.concat(== 'setosa'],
[iris[iris.Species == 'virginica']
iris[iris.Species
],=1
axis )
Sepal Length | Sepal Width | Petal Length | Petal Width | Species | Sepal Length | Sepal Width | Petal Length | Petal Width | Species | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | NaN | NaN | NaN | NaN | NaN |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | NaN | NaN | NaN | NaN | NaN |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | NaN | NaN | NaN | NaN | NaN |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | NaN | NaN | NaN | NaN | NaN |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
145 | NaN | NaN | NaN | NaN | NaN | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
146 | NaN | NaN | NaN | NaN | NaN | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
147 | NaN | NaN | NaN | NaN | NaN | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
148 | NaN | NaN | NaN | NaN | NaN | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
149 | NaN | NaN | NaN | NaN | NaN | 5.9 | 3.0 | 5.1 | 1.8 | virginica |
100 rows × 10 columns
Merge data
- Merge datasets to each other based on a common column
= iris.reset_index().iloc[0:5, 0:2]
A A
index | Sepal Length | |
---|---|---|
0 | 0 | 5.1 |
1 | 1 | 4.9 |
2 | 2 | 4.7 |
3 | 3 | 4.6 |
4 | 4 | 5.0 |
= iris.reset_index().iloc[2:7, [0, 3,4, 5]]
B B
index | Petal Length | Petal Width | Species | |
---|---|---|---|---|
2 | 2 | 1.3 | 0.2 | setosa |
3 | 3 | 1.5 | 0.2 | setosa |
4 | 4 | 1.4 | 0.2 | setosa |
5 | 5 | 1.7 | 0.4 | setosa |
6 | 6 | 1.4 | 0.3 | setosa |
Dataset A and B are merged based on the column “index”
Inner join
- Only the intersection of the keys is returned
pd.merge(
A,
B,='inner',
how='index'
on )
index | Sepal Length | Petal Length | Petal Width | Species | |
---|---|---|---|---|---|
0 | 2 | 4.7 | 1.3 | 0.2 | setosa |
1 | 3 | 4.6 | 1.5 | 0.2 | setosa |
2 | 4 | 5.0 | 1.4 | 0.2 | setosa |
Outer join
- All keys are returned
pd.merge(
A,
B,='outer',
how='index'
on )
index | Sepal Length | Petal Length | Petal Width | Species | |
---|---|---|---|---|---|
0 | 0 | 5.1 | NaN | NaN | NaN |
1 | 1 | 4.9 | NaN | NaN | NaN |
2 | 2 | 4.7 | 1.3 | 0.2 | setosa |
3 | 3 | 4.6 | 1.5 | 0.2 | setosa |
4 | 4 | 5.0 | 1.4 | 0.2 | setosa |
5 | 5 | NaN | 1.7 | 0.4 | setosa |
6 | 6 | NaN | 1.4 | 0.3 | setosa |
Left join
- All keys in the left DataFrame are returned
pd.merge(
A,
B,='left',
how='index'
on )
index | Sepal Length | Petal Length | Petal Width | Species | |
---|---|---|---|---|---|
0 | 0 | 5.1 | NaN | NaN | NaN |
1 | 1 | 4.9 | NaN | NaN | NaN |
2 | 2 | 4.7 | 1.3 | 0.2 | setosa |
3 | 3 | 4.6 | 1.5 | 0.2 | setosa |
4 | 4 | 5.0 | 1.4 | 0.2 | setosa |
Write data to external file
Write datasets to external files is useful to work with other tools or to share data with others.
CSV is a common format for data exchange.
'./iris.csv', index=False)
iris.to_csv(
but many other formats are available (e.g., Excel, JSON, SQL, etc.)
import xlsxwriter
'./iris.xlsx', engine='xlsxwriter', index=False) iris.to_excel(
Sources
Books
- McKinney, W. (2013). Python for data analysis. ” O’Reilly Media, Inc.”.
- O’Neil, C., & Schutt, R. (2013). Doing data science: Straight talk from the frontline. ” O’Reilly Media, Inc.”