Python | Data Representation

An introduction to data management and manipulation in Python (3/3)

Author

Matteo Ploner (University of Trento, Italy)

Published

September 4, 2024

This notebook is a simple example of how to use Python to provide a graphical representation of data.

The data used in this example is the Titanic dataset, which can be loaded through the Seaborn library.

import pandas as pd
import seaborn as sns
sns.set_theme(style="whitegrid")
import matplotlib.pyplot as plt

import warnings
warnings.simplefilter(action='ignore')

titanic = sns.load_dataset("titanic", cache=True, data_home=None).loc[:,['survived','pclass','sex','age','sibsp','parch','fare','embarked'] ]
titanic.head()

	survived	pclass	sex	age	sibsp	parch	fare	embarked
0	0	3	male	22.0	1	0	7.2500	S
1	1	1	female	38.0	1	0	71.2833	C
2	1	3	female	26.0	0	0	7.9250	S
3	1	1	female	35.0	1	0	53.1000	S
4	0	3	male	35.0	0	0	8.0500	S

Variable	Definition	Key
survived	Survival	0 = No, 1 = Yes
pclass	Ticket class	1 = 1st, 2 = 2nd, 3 = 3rd
sex	Sex
age	Age in years
sibsp	# of siblings / spouses aboard the Titanic
parch	# of parents / children aboard the Titanic
fare	Passenger fare
embarked	Port of Embarkation	C = Cherbourg, Q = Queenstown, S = Southampton

Barplots

A barplot is a graphical representation of the data in which the data values are represented by horizontal or vertical bars. The length of the bars is proportional to the values they represent.

surv_rate = pd.DataFrame((titanic["survived"].value_counts() / len(titanic))*100)
surv_rate.reset_index(inplace=True)
surv_rate.columns = ['survived', 'rate']

plt.figure(figsize=(9, 6))  # Set the figure size
g = sns.barplot(
    data=surv_rate,
    x='survived',
    y='rate',
    edgecolor="black",  # Set the color of the contour to black
    linewidth=1.5,      # Set the width of the contour lines
    palette=["darkgrey", "white"]
)

# Set the labels for the axes
g.set_xlabel("Status", fontsize=14)
g.set_ylabel("Survival Rate", fontsize=14)

# Set the x-axis labels
g.set_xticklabels(['Not Survived', 'Survived'])
# Set a title
g.set_title("Survival Rate of Titanic Passengers", fontsize=16, fontweight='bold', y=1.02)

# Adding text with frequencies on top of the bars
for p in g.patches:
    g.text(p.get_x() + p.get_width() / 2., p.get_height(), f'{p.get_height():.2f}%', 
           ha='center', va='bottom', color='black', fontsize=10)
    

plt.show()
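As a side note, the percentage table can also be built in one chained expression. The following is a minimal sketch, assuming a small made-up series in place of titanic["survived"]; the toy values are illustrative and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for titanic["survived"] (hypothetical values, not the real data)
survived = pd.Series([0, 0, 0, 1, 1], name="survived")

# normalize=True returns proportions instead of raw counts
surv_rate = (
    survived.value_counts(normalize=True)
    .mul(100)                  # proportions -> percentages
    .rename("rate")            # name of the value column
    .rename_axis("survived")   # name of the category column
    .reset_index()
)
print(surv_rate)
```

The result has the same two columns, survived and rate, that the plotting code above expects.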

Stacked bar chart

A stacked bar chart is a bar chart in which each bar is divided into segments. Each segment represents a different category, and the segment lengths add up to the total length of the bar.

Vertical

non_survived_rate = surv_rate.query('survived == 0')["rate"].values[0]
survived_rate = surv_rate.query('survived == 1')["rate"].values[0]


# Create a single stacked bar
plt.figure(figsize=(9, 6))  # Adjusted figure size for a single bar
plt.bar(0, non_survived_rate, color="darkgrey", edgecolor='black', label="Not Survived")
plt.bar(0, survived_rate, bottom=non_survived_rate, color="white", edgecolor='black', label="Survived")

# Adding the legend
plt.legend()

# Adding text with percentages at the center of each segment
plt.text(0, non_survived_rate / 2, f"{non_survived_rate:.2f}%", ha='center', va='center', color='black', fontsize=10)
plt.text(0, non_survived_rate + survived_rate / 2, f"{survived_rate:.2f}%", ha='center', va='center', color='black', fontsize=10)

# Set the labels for the axes
plt.ylabel("Percentage", fontsize=14)

# Set a title
plt.title("Survival Rate of Titanic Passengers", fontsize=16, fontweight='bold')

# Remove x-axis labels
plt.xticks([])

plt.show()

Horizontal

# Assuming surv_rate is a DataFrame with 'survived' and 'rate' columns
# Calculate the starting point for the second segment of the stacked bar
non_survived_rate = surv_rate.query('survived == 0')["rate"].values[0]
survived_rate = surv_rate.query('survived == 1')["rate"].values[0]

# Create a single stacked horizontal bar
plt.figure(figsize=(9, 6))  # Adjusted figure size for a single bar
plt.barh(0, non_survived_rate, color="darkgrey", edgecolor='black', label="Not Survived")
plt.barh(0, survived_rate, left=non_survived_rate, color="white", edgecolor='black', label="Survived")

# Adding the legend
plt.legend()

# Adding text with percentages at the center of each segment
plt.text(non_survived_rate / 2, 0, f"{non_survived_rate:.2f}%", va='center', ha='center', color='black', fontsize=10)
plt.text(non_survived_rate + survived_rate / 2, 0, f"{survived_rate:.2f}%", va='center', ha='center', color='black', fontsize=10)

# Set the labels for the axes
plt.xlabel("Percentage", fontsize=14)

# Set a title
plt.title("Survival Rate of Titanic Passengers", fontsize=16, fontweight='bold')

# Remove y-axis labels
plt.yticks([])

plt.show()
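Both stacked charts rely on the same arithmetic: each segment starts where the previous one ends, and its label sits at the segment midpoint. A minimal sketch of that calculation for any number of segments follows; the helper name segment_positions and the example percentages are assumptions made for illustration.

```python
import numpy as np

def segment_positions(values):
    """Return (starts, centers) for segments stacked end to end.

    Hypothetical helper, not part of Matplotlib or Seaborn.
    """
    values = np.asarray(values, dtype=float)
    ends = np.cumsum(values)          # where each segment ends
    starts = ends - values            # where each segment begins
    centers = starts + values / 2     # midpoint, used for the text label
    return starts, centers

# Illustrative percentages (not the actual Titanic rates)
rates = [60.0, 40.0]
starts, centers = segment_positions(rates)
print(starts)   # offsets to pass as bottom= (plt.bar) or left= (plt.barh)
print(centers)  # coordinates for plt.text
```

With more than two categories, the same starts array can be passed segment by segment as bottom= or left=, which avoids adding the offsets by hand.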

Survival rate by grouping variable

In this example, we will create a bar chart to represent the survival rate by ticket class.

# Calculate relative frequencies
relative_freq = titanic.groupby(['pclass', 'survived']).size()
relative_freq = relative_freq / relative_freq.groupby(level=0).sum()
# Convert to percentage
relative_freq = relative_freq * 100
relative_freq = relative_freq.reset_index(name='Percentage')

# Create the plot
g = sns.catplot(
    data=relative_freq,
    x='survived',
    y='Percentage',
    col='pclass',
    kind='bar',
    edgecolor="black",  # Set the color of the contour to black
    linewidth=1.5,      # Set the width of the contour lines
    palette=["darkgrey", "white"],
    height=6,  # Adjust the height of each facet
    aspect=.5,  # Adjust the width of each facet
    order=[0, 1]  # Ensure consistent ordering of bars (survived is stored as integers)
)

# Set the tick labels for the x-axis
g.set_xticklabels(["Not Survived", "Survived"])
# Set the labels for the y-axis
g.set_ylabels("Percentage")
# Set the labels for the x-axis
g.set_xlabels("Status")
# Set custom titles for each facet
g.set_titles("Passenger Class {col_name}")
g.fig.suptitle('Survival Rates by Passenger Class', fontsize=16, fontweight='bold', y=1.02)

# Iterate through each subplot / Facet in the grid
for ax in g.axes.flat:
    # For each bar in the subplot, place a label
    for p in ax.patches:
        # Get the height of the bar (which is the percentage in this case)
        height = p.get_height()
        # Place the text on top of the bar, including a percentage sign
        ax.text(p.get_x() + p.get_width() / 2., height + 0.5, f'{height:.2f}%', ha="center")
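The within-class percentages can also be obtained with pd.crosstab, which normalizes each row directly. The sketch below assumes a small made-up DataFrame instead of the Titanic data, so the numbers are illustrative only.

```python
import pandas as pd

# Toy data (hypothetical): two classes with four passengers each
toy = pd.DataFrame({
    "pclass":   [1, 1, 1, 1, 3, 3, 3, 3],
    "survived": [1, 1, 1, 0, 0, 0, 0, 1],
})

# normalize="index" makes every row (class) sum to 1
pct = pd.crosstab(toy["pclass"], toy["survived"], normalize="index") * 100

# Back to long format, the shape expected by sns.catplot
relative_freq = pct.reset_index().melt(
    id_vars="pclass", var_name="survived", value_name="Percentage"
)
print(pct)
print(relative_freq)
```

The wide table pct is convenient for reading the rates at a glance, while the long table has one row per class and outcome.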

Histograms

A histogram is a graphical representation of the distribution of a numeric variable. It groups the values into consecutive intervals (bins) and draws one bar per bin, with the bar height showing how many observations fall in that bin.

In the following example, we create a histogram to represent the distribution of the age of the passengers.

On top of the histogram, we add a kernel density estimate (KDE) plot. A KDE plot is a non-parametric way to estimate the probability density function of a random variable.

We also report the mean age of the passengers.

plt.figure(figsize=(9, 6))  # Set the figure size
sns.histplot(data=titanic, x="age", kde=True, bins=30, color='darkgrey', edgecolor='black', linewidth=1.5)
# Average line
plt.axvline(titanic["age"].mean(), color='red', linestyle='--', label='Mean Age')
# Add text mean
min_ylim, max_ylim = plt.ylim()
plt.text(titanic["age"].mean()+1, max_ylim*0.9, f'Mean Age: {titanic["age"].mean():.2f}', color='red', fontsize=10)
# Set the labels for the x-axis
plt.xlabel("Age", fontsize=14)
# Title for the plot
plt.title("Age Distribution of Titanic Passengers", fontsize=16, fontweight='bold')

plt.show()
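To see what the bars of a histogram encode, the binning can be reproduced with NumPy. This is a minimal sketch on a handful of made-up ages, including one missing value; np.histogram does not accept missing values, so they are dropped first.

```python
import numpy as np
import pandas as pd

# Toy ages with one missing value (hypothetical, for illustration)
ages = pd.Series([22.0, 38.0, 26.0, 35.0, np.nan, 54.0, 2.0, 27.0])

valid = ages.dropna()                         # remove missing values before binning
counts, edges = np.histogram(valid, bins=4)   # 4 equal-width bins over [min, max]

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"[{left:.0f}, {right:.0f}): {n} passengers")

print(f"Mean age: {valid.mean():.2f}")
```

Each count corresponds to the height of one bar. The last bin also includes its right edge, so the counts add up to the number of non-missing observations.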

Boxplots

A boxplot is a graphical representation of the distribution of a dataset.

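The boxplot is a standardized way of displaying the distribution of data based on a five-number summary: minimum, first quartile, median, third quartile, and maximum. With Seaborn's default settings, the whiskers extend to the most extreme observations within 1.5 times the interquartile range from the box, and points beyond them are drawn individually as outliers.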

In the following example, we create a boxplot to represent the distribution of the Fare paid by passengers by Passenger Class.

Without outliers

plt.figure(figsize=(9, 6))  # Set the figure size
# Boxplot without outliers (showfliers=False) and with a marker for the mean
g = sns.boxplot(
    data=titanic, x='pclass', y='fare', showfliers=False, showmeans=True, color="lightgrey",
    meanprops={'marker': 'o', 'markerfacecolor': 'grey', 'markeredgecolor': 'black', 'markersize': 8}
)

# add title
plt.title("Fare paid by Class", loc="left")
# add x-axis label
plt.xlabel("Passenger Class")
# add y-axis label
plt.ylabel("Fare")


# show the graph
plt.show()
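The quantities drawn in a boxplot can be computed by hand. The following sketch assumes a short made-up list of fares and the common 1.5 × IQR rule for the whiskers, which is the default in Matplotlib and Seaborn.

```python
import pandas as pd

# Toy fares with one extreme value (hypothetical, for illustration)
fares = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100], dtype=float)

# Quartiles define the box and the median line
q1, median, q3 = fares.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Fences: observations outside them are treated as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences
inside = fares[(fares >= lower_fence) & (fares <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = fares[(fares < lower_fence) | (fares > upper_fence)]

print(q1, median, q3)             # box and median
print(whisker_low, whisker_high)  # whisker ends
print(outliers.tolist())          # points hidden by showfliers=False
```

Setting showfliers=False only hides the outlying points; the box and whiskers are computed from all observations either way.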

With outliers and all data points

plt.figure(figsize=(9, 6))  # Set the figure size
# Draw the individual observations as jittered points
g = sns.stripplot(x='pclass', y='fare', data=titanic, color="grey", jitter=0.2, size=2.5)
# Add the boxplot on the same axes, including outliers and a marker for the mean
g = sns.boxplot(
    data=titanic, x='pclass', y='fare', showfliers=True, showmeans=True, color="lightgrey",
    meanprops={'marker': 'o', 'markerfacecolor': 'grey', 'markeredgecolor': 'black', 'markersize': 8}
)


# add title
plt.title("Fare paid by Class", loc="left")
# add x-axis label
plt.xlabel("Passenger Class")
# add y-axis label
plt.ylabel("Fare")


# show the graph
plt.show()

Lines and points

A line plot is a graphical representation of data in which the data values (points) are connected by a line.

Line plots are especially useful for showing trends over time. The current dataset has no time variable, so age serves as the ordered x-axis instead.

In the following example, we create a line plot to represent the average fare paid by passengers by age.

tbl = pd.DataFrame(titanic.groupby(["age"])["fare"].mean().round(2))
# Reset the index to use 'age' as a column
tbl.reset_index(inplace=True)

# Plotting the line plot
plt.figure(figsize=(9, 6))

# Plot the dots with a darker color
plt.scatter(tbl['age'], tbl['fare'], color='grey', marker='o')

# Plot the line with a lighter color
plt.plot(tbl['age'], tbl['fare'], color='lightgrey')

plt.title('Average Fare by Age')
plt.xlabel('Age')
plt.ylabel('Average Fare')
plt.grid(True)
plt.show()
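An average computed for each distinct age may rest on very few passengers, which can make the line look jagged. A possible check, sketched here on made-up data, is to compute the group size alongside the mean using named aggregation.

```python
import pandas as pd

# Toy data (hypothetical): a few passengers per age
toy = pd.DataFrame({
    "age":  [22.0, 22.0, 30.0, 30.0, 30.0, 41.0],
    "fare": [7.25, 8.05, 10.0, 20.0, 30.0, 50.0],
})

# Mean fare and number of passengers for each age; groupby sorts by age,
# so the rows are already in the order needed for a line plot
tbl = (
    toy.groupby("age", as_index=False)
    .agg(fare=("fare", "mean"), n=("fare", "size"))
    .round(2)
)
print(tbl)
```

The column n shows how many observations stand behind each point of the line.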

Heatmaps

A heatmap represents data values as colors, which makes it useful for displaying a third variable across the cells of a two-way table.

The color intensity encodes the magnitude of each value.

In the following example, we create a heatmap of the survival rate for each combination of passenger class and sex in the Titanic dataset.



tbl = pd.DataFrame(titanic.groupby(["pclass","sex"])["survived"].mean().round(2))
tbl.reset_index(inplace=True)
tbl_w = tbl.pivot(index='pclass', columns='sex', values='survived')

sns.heatmap(tbl_w, annot=True, fmt="g", cmap='viridis')
plt.show()
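The grouping and reshaping steps can also be combined with pivot_table, which aggregates and pivots in one call. The sketch below uses a small made-up DataFrame rather than the Titanic data.

```python
import pandas as pd

# Toy data (hypothetical): two classes, two passengers per class and sex
toy = pd.DataFrame({
    "pclass":   [1, 1, 1, 1, 3, 3, 3, 3],
    "sex":      ["female", "female", "male", "male",
                 "female", "female", "male", "male"],
    "survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Rows = class, columns = sex, cells = mean of survived (the survival rate)
tbl_w = toy.pivot_table(
    index="pclass", columns="sex", values="survived", aggfunc="mean"
).round(2)
print(tbl_w)
```

The resulting wide table has the same layout as tbl_w above and can be passed to sns.heatmap in the same way.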