import pandas as pd
import seaborn as sns
import numpy as npPython | Data Wrangling
This notebook illustrates a simple data workflow using Python. The workflow includes the following steps:
- Data loading
- Data subsetting
- Data transformation
- Data shaping
- Data merging
- Data export
Module loading
Import modules for data manipulation and visualization.
- Pandas: library for data manipulation and analysis
- Numpy: library for numerical computing
- Seaborn: library for data visualization
Data Loading
Import a demo dataset from Seaborn library (the “famous” iris dataset) to illustrate the workflow.
A set of 150 records of flowers attributes taken by Ronald Fisher in 1936.
- 50 samples for three species of iris: Iris setosa, Iris virginica and Iris versicolor
- Five attributes: sepal length, sepal width, petal length, petal width and species.
# read csv from external source
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
# or, alternatively, use the built-in dataset
iris = sns.load_dataset("iris", cache=True, data_home=None)
iris| sepal_length | sepal_width | petal_length | petal_width | species | |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |
150 rows × 5 columns
The dataset is tidy, with each variable forming a column, each observation forming a row
- First, we rename the columns to make them more readable
iris.rename(columns={'sepal_length': 'Sepal Length', 'sepal_width': 'Sepal Width', 'petal_length': 'Petal Length', 'petal_width': 'Petal Width', 'species': 'Species'}, inplace=True)
iris.head()| Sepal Length | Sepal Width | Petal Length | Petal Width | Species | |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
- Next, we reorder the columns by Sepal Length
iris.sort_values(by='Sepal Length', ascending=False).head()| Sepal Length | Sepal Width | Petal Length | Petal Width | Species | |
|---|---|---|---|---|---|
| 131 | 7.9 | 3.8 | 6.4 | 2.0 | virginica |
| 135 | 7.7 | 3.0 | 6.1 | 2.3 | virginica |
| 122 | 7.7 | 2.8 | 6.7 | 2.0 | virginica |
| 117 | 7.7 | 3.8 | 6.7 | 2.2 | virginica |
| 118 | 7.7 | 2.6 | 6.9 | 2.3 | virginica |
Data Subsetting
Subsetting is the process of creating a new dataset from an existing dataset based on some criteria.
Subsetting can be done by rows (observations), columns (variables) or both.
Subsetting rows
Sub-setting rows by index
- Just pick the first 5 rows of the dataset
iris.iloc[0:5]| Sepal Length | Sepal Width | Petal Length | Petal Width | Species | |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
Select randomly
- Select 5 random rows from the dataset
iris.sample(5)| Sepal Length | Sepal Width | Petal Length | Petal Width | Species | |
|---|---|---|---|---|---|
| 56 | 6.3 | 3.3 | 4.7 | 1.6 | versicolor |
| 96 | 5.7 | 2.9 | 4.2 | 1.3 | versicolor |
| 10 | 5.4 | 3.7 | 1.5 | 0.2 | setosa |
| 144 | 6.7 | 3.3 | 5.7 | 2.5 | virginica |
| 115 | 6.4 | 3.2 | 5.3 | 2.3 | virginica |
Select according to a condition
- Select only rows with largest (smallest) petal length
iris.nlargest(5, 'Sepal Length')| Sepal Length | Sepal Width | Petal Length | Petal Width | Species | |
|---|---|---|---|---|---|
| 131 | 7.9 | 3.8 | 6.4 | 2.0 | virginica |
| 117 | 7.7 | 3.8 | 6.7 | 2.2 | virginica |
| 118 | 7.7 | 2.6 | 6.9 | 2.3 | virginica |
| 122 | 7.7 | 2.8 | 6.7 | 2.0 | virginica |
| 135 | 7.7 | 3.0 | 6.1 | 2.3 | virginica |
iris.nsmallest(5, 'Sepal Length')| Sepal Length | Sepal Width | Petal Length | Petal Width | Species | |
|---|---|---|---|---|---|
| 13 | 4.3 | 3.0 | 1.1 | 0.1 | setosa |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
| 38 | 4.4 | 3.0 | 1.3 | 0.2 | setosa |
| 42 | 4.4 | 3.2 | 1.3 | 0.2 | setosa |
| 41 | 4.5 | 2.3 | 1.3 | 0.3 | setosa |
Sub-setting based on column values
- We create a subset of the dataset with only the species “Iris setosa”.
iris[iris.Species == 'setosa'].head()| Sepal Length | Sepal Width | Petal Length | Petal Width | Species | |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
Sub-setting cols
Select columns by name
- Select only the columns “Sepal Length” and “Species”
iris[['Sepal Length', 'Species']]| Sepal Length | Species | |
|---|---|---|
| 0 | 5.1 | setosa |
| 1 | 4.9 | setosa |
| 2 | 4.7 | setosa |
| 3 | 4.6 | setosa |
| 4 | 5.0 | setosa |
| ... | ... | ... |
| 145 | 6.7 | virginica |
| 146 | 6.3 | virginica |
| 147 | 6.5 | virginica |
| 148 | 6.2 | virginica |
| 149 | 5.9 | virginica |
150 rows × 2 columns
Select columns by index
- Select only the first three columns
iris.iloc[:,[1,2,3]]| Sepal Width | Petal Length | Petal Width | |
|---|---|---|---|
| 0 | 3.5 | 1.4 | 0.2 |
| 1 | 3.0 | 1.4 | 0.2 |
| 2 | 3.2 | 1.3 | 0.2 |
| 3 | 3.1 | 1.5 | 0.2 |
| 4 | 3.6 | 1.4 | 0.2 |
| ... | ... | ... | ... |
| 145 | 3.0 | 5.2 | 2.3 |
| 146 | 2.5 | 5.0 | 1.9 |
| 147 | 3.0 | 5.2 | 2.0 |
| 148 | 3.4 | 5.4 | 2.3 |
| 149 | 3.0 | 5.1 | 1.8 |
150 rows × 3 columns
Select columns by regular expression
- Select only columns that contain the word “Length”
iris.filter(regex='Length$')| Sepal Length | Petal Length | |
|---|---|---|
| 0 | 5.1 | 1.4 |
| 1 | 4.9 | 1.4 |
| 2 | 4.7 | 1.3 |
| 3 | 4.6 | 1.5 |
| 4 | 5.0 | 1.4 |
| ... | ... | ... |
| 145 | 6.7 | 5.2 |
| 146 | 6.3 | 5.0 |
| 147 | 6.5 | 5.2 |
| 148 | 6.2 | 5.4 |
| 149 | 5.9 | 5.1 |
150 rows × 2 columns
Sub-setting rows and columns jointly
- Select only the rows for which the column Species is “setosa” and the columns “Sepal Length” and “Species”
iris.loc[iris.Species == 'setosa',['Sepal Length', 'Species']].head()| Sepal Length | Species | |
|---|---|---|
| 0 | 5.1 | setosa |
| 1 | 4.9 | setosa |
| 2 | 4.7 | setosa |
| 3 | 4.6 | setosa |
| 4 | 5.0 | setosa |
Data Reshaping
Data reshaping is the process of transforming data from one format into another.
The dataset is in a wide format, with each row representing a single observation and each column representing a variable.
iris.head()| Sepal Length | Sepal Width | Petal Length | Petal Width | Species | |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
From wide to long format
- Convert the dataset from wide to long format
- In the long format, different variables are “collapsed” into two variables: one for the variable names and one for the variable values.
iris_melt = iris.melt(
id_vars='Species',
value_vars=['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width'],
var_name='Measurement',
value_name='Value_cm',
)
iris_melt| Species | Measurement | Value_cm | |
|---|---|---|---|
| 0 | setosa | Sepal Length | 5.1 |
| 1 | setosa | Sepal Length | 4.9 |
| 2 | setosa | Sepal Length | 4.7 |
| 3 | setosa | Sepal Length | 4.6 |
| 4 | setosa | Sepal Length | 5.0 |
| ... | ... | ... | ... |
| 595 | virginica | Petal Width | 2.3 |
| 596 | virginica | Petal Width | 1.9 |
| 597 | virginica | Petal Width | 2.0 |
| 598 | virginica | Petal Width | 2.3 |
| 599 | virginica | Petal Width | 1.8 |
600 rows × 3 columns
From long to wide format
! Duplicated entries: for a species there are several entries for the same attribute (e.g., sepal length) Solution: each row is a unique combination of species and attribute. reset:index() method is used to reset the index of the DataFrame.
iris_melt = iris.reset_index().melt(
id_vars=['index', 'Species'],
value_vars=['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width'],
var_name='Measurement',
value_name='Value_cm',
)
iris_melt.pivot(
index=['index', 'Species'],
columns='Measurement',
values='Value_cm'
).reset_index().drop(columns=['index'])| Measurement | Species | Petal Length | Petal Width | Sepal Length | Sepal Width |
|---|---|---|---|---|---|
| 0 | setosa | 1.4 | 0.2 | 5.1 | 3.5 |
| 1 | setosa | 1.4 | 0.2 | 4.9 | 3.0 |
| 2 | setosa | 1.3 | 0.2 | 4.7 | 3.2 |
| 3 | setosa | 1.5 | 0.2 | 4.6 | 3.1 |
| 4 | setosa | 1.4 | 0.2 | 5.0 | 3.6 |
| ... | ... | ... | ... | ... | ... |
| 145 | virginica | 5.2 | 2.3 | 6.7 | 3.0 |
| 146 | virginica | 5.0 | 1.9 | 6.3 | 2.5 |
| 147 | virginica | 5.2 | 2.0 | 6.5 | 3.0 |
| 148 | virginica | 5.4 | 2.3 | 6.2 | 3.4 |
| 149 | virginica | 5.1 | 1.8 | 5.9 | 3.0 |
150 rows × 5 columns
Append data
- Append datasets to each other
Append rows
- Append datasets to each other with columns aligned by name
pd.concat(
[iris[iris.Species == 'setosa'],
iris[iris.Species == 'virginica']
]
)| Sepal Length | Sepal Width | Petal Length | Petal Width | Species | |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |
100 rows × 5 columns
Append side by side
pd.concat(
[iris[iris.Species == 'setosa'],
iris[iris.Species == 'virginica']
],
axis=1
)| Sepal Length | Sepal Width | Petal Length | Petal Width | Species | Sepal Length | Sepal Width | Petal Length | Petal Width | Species | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | NaN | NaN | NaN | NaN | NaN |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | NaN | NaN | NaN | NaN | NaN |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | NaN | NaN | NaN | NaN | NaN |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | NaN | NaN | NaN | NaN | NaN |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 145 | NaN | NaN | NaN | NaN | NaN | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
| 146 | NaN | NaN | NaN | NaN | NaN | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
| 147 | NaN | NaN | NaN | NaN | NaN | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
| 148 | NaN | NaN | NaN | NaN | NaN | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
| 149 | NaN | NaN | NaN | NaN | NaN | 5.9 | 3.0 | 5.1 | 1.8 | virginica |
100 rows × 10 columns
Merge data
- Merge datasets to each other based on a common column
A = iris.reset_index().iloc[0:5, 0:2]
A| index | Sepal Length | |
|---|---|---|
| 0 | 0 | 5.1 |
| 1 | 1 | 4.9 |
| 2 | 2 | 4.7 |
| 3 | 3 | 4.6 |
| 4 | 4 | 5.0 |
B = iris.reset_index().iloc[2:7, [0, 3,4, 5]]
B| index | Petal Length | Petal Width | Species | |
|---|---|---|---|---|
| 2 | 2 | 1.3 | 0.2 | setosa |
| 3 | 3 | 1.5 | 0.2 | setosa |
| 4 | 4 | 1.4 | 0.2 | setosa |
| 5 | 5 | 1.7 | 0.4 | setosa |
| 6 | 6 | 1.4 | 0.3 | setosa |
Dataset A and B are merged based on the column “index”
Inner join
- Only the intersection of the keys is returned
pd.merge(
A,
B,
how='inner',
on='index'
)| index | Sepal Length | Petal Length | Petal Width | Species | |
|---|---|---|---|---|---|
| 0 | 2 | 4.7 | 1.3 | 0.2 | setosa |
| 1 | 3 | 4.6 | 1.5 | 0.2 | setosa |
| 2 | 4 | 5.0 | 1.4 | 0.2 | setosa |
Outer join
- All keys are returned
pd.merge(
A,
B,
how='outer',
on='index'
)| index | Sepal Length | Petal Length | Petal Width | Species | |
|---|---|---|---|---|---|
| 0 | 0 | 5.1 | NaN | NaN | NaN |
| 1 | 1 | 4.9 | NaN | NaN | NaN |
| 2 | 2 | 4.7 | 1.3 | 0.2 | setosa |
| 3 | 3 | 4.6 | 1.5 | 0.2 | setosa |
| 4 | 4 | 5.0 | 1.4 | 0.2 | setosa |
| 5 | 5 | NaN | 1.7 | 0.4 | setosa |
| 6 | 6 | NaN | 1.4 | 0.3 | setosa |
Left join
- All keys in the left DataFrame are returned
pd.merge(
A,
B,
how='left',
on='index'
)| index | Sepal Length | Petal Length | Petal Width | Species | |
|---|---|---|---|---|---|
| 0 | 0 | 5.1 | NaN | NaN | NaN |
| 1 | 1 | 4.9 | NaN | NaN | NaN |
| 2 | 2 | 4.7 | 1.3 | 0.2 | setosa |
| 3 | 3 | 4.6 | 1.5 | 0.2 | setosa |
| 4 | 4 | 5.0 | 1.4 | 0.2 | setosa |
Write data to external file
Write datasets to external files is useful to work with other tools or to share data with others.
CSV is a common format for data exchange.
iris.to_csv('./iris.csv', index=False)
but many other formats are available (e.g., Excel, JSON, SQL, etc.)
import xlsxwriter
iris.to_excel('./iris.xlsx', engine='xlsxwriter', index=False)Sources
Books
- McKinney, W. (2013). Python for data analysis. ” O’Reilly Media, Inc.”.
- O’Neil, C., & Schutt, R. (2013). Doing data science: Straight talk from the frontline. ” O’Reilly Media, Inc.”