import warnings
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
sns.set_theme(style="whitegrid")
warnings.simplefilter(action='ignore')
Python | Data Representation
This notebook is a simple example of how to use Python to represent data graphically.
The data used in this example is the Titanic dataset, which is available in the Seaborn library.
titanic = sns.load_dataset("titanic", cache=True, data_home=None).loc[:, ['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']]
titanic.head()
| | survived | pclass | sex | age | sibsp | parch | fare | embarked |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S |
Variable | Definition | Key |
---|---|---|
survived | Survival | 0 = No, 1 = Yes |
pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
sex | Sex | |
age | Age in years | |
sibsp | # of siblings / spouses aboard the Titanic | |
parch | # of parents / children aboard the Titanic | |
fare | Passenger fare | |
embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
Barplots
A barplot is a graphical representation of the data in which the data values are represented by horizontal or vertical bars. The length of the bars is proportional to the values they represent.
surv_rate = pd.DataFrame((titanic["survived"].value_counts() / len(titanic)) * 100)
surv_rate.reset_index(inplace=True)
surv_rate.columns = ['survived', 'rate']
plt.figure(figsize=(9, 6)) # Set the figure size
g = sns.barplot(
    data=surv_rate,
    x='survived',
    y='rate',
    edgecolor="black", # Set the color of the contour to black
    linewidth=1.5, # Set the width of the contour lines
    palette=["darkgrey", "white"]
)
# Set the labels for the axes
g.set_xlabel("Status", fontsize=14)
g.set_ylabel("Survival Rate", fontsize=14)
# Set the x-axis tick labels
g.set_xticklabels(['Not Survived', 'Survived'])
# Set a title
g.set_title("Survival Rate of Titanic Passengers", fontsize=16, fontweight='bold', y=1.02)
# Adding text with frequencies on top of the bars
for p in g.patches:
    g.text(p.get_x() + p.get_width() / 2., p.get_height(), f'{p.get_height():.2f}%',
           ha='center', va='bottom', color='black', fontsize=10)
plt.show()
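As a side note, the percentage table behind this barplot can also be built with the `normalize` option of `value_counts`. The sketch below uses a small made-up series (an assumption, not the real Titanic column) so that it runs without downloading any data; the name `surv_rate_toy` is hypothetical.

```python
import pandas as pd

# Hypothetical toy data standing in for titanic["survived"]
survived = pd.Series([0, 0, 0, 1, 1, 0, 1, 0], name="survived")

# normalize=True returns proportions instead of raw counts
rate = survived.value_counts(normalize=True).mul(100).sort_index()

# Rename explicitly so the column names do not depend on the pandas version
surv_rate_toy = rate.rename("rate").rename_axis("survived").reset_index()
print(surv_rate_toy)
```

Naming the columns explicitly avoids relying on the default names that `reset_index` would produce, which differ between pandas releases.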
Stacked bar chart
A stacked bar chart is a bar chart in which each bar is divided into segments. Each segment represents a different category of the data, and the segments are stacked so that the bar's total length equals the sum of its parts.
- Vertical
non_survived_rate = surv_rate.query('survived == 0')["rate"].values[0]
survived_rate = surv_rate.query('survived == 1')["rate"].values[0]
# Create a single stacked bar
plt.figure(figsize=(9, 6)) # Adjusted figure size for a single bar
plt.bar(0, non_survived_rate, color="darkgrey", edgecolor='black', label="Not Survived")
plt.bar(0, survived_rate, bottom=non_survived_rate, color="white", edgecolor='black', label="Survived")
# Adding the legend
plt.legend()
# Adding text with frequencies at the center of each segment
plt.text(0, non_survived_rate / 2, f"{non_survived_rate:.2f}%", ha='center', va='center', color='black', fontsize=10)
plt.text(0, float(non_survived_rate) + float(survived_rate) / 2, f"{survived_rate:.2f}%", ha='center', va='center', color='black', fontsize=10)
# Set the labels for the axes
plt.ylabel("Percentage", fontsize=14)
# Set a title
plt.title("Survival Rate of Titanic Passengers", fontsize=16, fontweight='bold')
# Remove x-axis labels
plt.xticks([])
plt.show()
- Horizontal
# Assuming surv_rate is a DataFrame with 'survived' and 'rate' columns
# Calculate the starting point for the second segment of the stacked bar
non_survived_rate = surv_rate.query('survived == 0')["rate"].values[0]
survived_rate = surv_rate.query('survived == 1')["rate"].values[0]
# Create a single stacked horizontal bar
plt.figure(figsize=(9, 6)) # Adjusted figure size for a single bar
plt.barh(0, non_survived_rate, color="darkgrey", edgecolor='black', label="Not Survived")
plt.barh(0, survived_rate, left=non_survived_rate, color="white", edgecolor='black', label="Survived")
# Adding the legend
plt.legend()
# Adding text with frequencies at the center of each segment
plt.text(non_survived_rate / 2, 0, f"{non_survived_rate:.2f}%", va='center', ha='center', color='black', fontsize=10)
plt.text(non_survived_rate + survived_rate / 2, 0, f"{survived_rate:.2f}%", va='center', ha='center', color='black', fontsize=10)
# Set the labels for the axes
plt.xlabel("Percentage", fontsize=14)
# Set a title
plt.title("Survival Rate of Titanic Passengers", fontsize=16, fontweight='bold')
# Remove y-axis labels
plt.yticks([])
plt.show()
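Both stacked charts rely on the same arithmetic: each segment starts where the previous segments end (the value passed to `bottom=` or `left=`), and its label sits at the segment's midpoint. A minimal sketch of that calculation for any number of segments, using made-up percentages (an assumption, not the real rates):

```python
import numpy as np

# Hypothetical segment sizes (percentages) for a single stacked bar
labels = ["Not Survived", "Survived"]
rates = np.array([60.0, 40.0])

# Each segment starts where the previous ones end
starts = np.concatenate(([0.0], np.cumsum(rates)[:-1]))

# Each label is placed at the midpoint of its segment
midpoints = starts + rates / 2

for label, start, mid in zip(labels, starts, midpoints):
    print(f"{label}: starts at {start:.1f}, label at {mid:.1f}")
```

The `starts` array could be passed as `bottom=` to `plt.bar` or as `left=` to `plt.barh`, and `midpoints` gives the coordinate for `plt.text`.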
Survival rate by grouping variable
In this example, we will create a bar chart to represent the survival rate by ticket class.
# Calculate relative frequencies within each passenger class
relative_freq = titanic.groupby(['pclass', 'survived']).size()
relative_freq = relative_freq / relative_freq.groupby(level=0).sum()
# Convert to percentage
relative_freq = relative_freq * 100
relative_freq = relative_freq.reset_index(name='Percentage')
# Create the plot
g = sns.catplot(
    data=relative_freq,
    x='survived',
    y='Percentage',
    col='pclass',
    kind='bar',
    edgecolor="black", # Set the color of the contour to black
    linewidth=1.5, # Set the width of the contour lines
    palette=["darkgrey", "white"],
    height=6, # Adjust the height of each facet
    aspect=.5, # Adjust the width of each facet
    order=[0, 1] # Ensure consistent ordering of bars (values match the integer 'survived' column)
)
# Set the tick labels for the x-axis
g.set_xticklabels(["Not Survived", "Survived"])
# Set the labels for the y-axis
g.set_ylabels("Percentage")
# Set the labels for the x-axis
g.set_xlabels("Status")
# Set custom titles for each facet
g.set_titles("Passenger Class {col_name}")
g.fig.suptitle('Survival Rates by Passenger Class', fontsize=16, fontweight='bold', y=1.02)
# Iterate through each subplot / facet in the grid
for ax in g.axes.flat:
    # For each bar in the subplot, place a label
    for p in ax.patches:
        # Get the height of the bar (which is the percentage in this case)
        height = p.get_height()
        # Place the text on top of the bar, including a percentage sign
        ax.text(p.get_x() + p.get_width() / 2., height + 0.5, f'{height:.2f}%', ha="center")
plt.show()
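The within-class percentages used for this chart can also be obtained with `pd.crosstab` and `normalize="index"`, which divides each row by its own total. The sketch below runs on a small made-up frame (an assumption, not the real Titanic data); `toy`, `pct`, and `long_pct` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy data: four passengers in class 1, four in class 3
toy = pd.DataFrame({
    "pclass":   [1, 1, 1, 1, 3, 3, 3, 3],
    "survived": [1, 1, 1, 0, 0, 0, 0, 1],
})

# normalize="index" makes every row (passenger class) sum to 1
pct = pd.crosstab(toy["pclass"], toy["survived"], normalize="index") * 100

# Back to long format, one row per (pclass, survived) pair
long_pct = pct.stack().rename("Percentage").reset_index()
print(long_pct)
```

The long format is what `sns.catplot` expects, while the wide `pct` table is easier to read at a glance.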
Histograms
A histogram is a graphical representation of the distribution of a dataset. It is a type of bar chart in which the values are grouped into bins and the height of each bar shows the frequency of the values in that bin.
In the following example, we create a histogram to represent the distribution of the age of the passengers.
On top of the histogram, we add a kernel density estimate (KDE) plot. A KDE plot is a non-parametric way to estimate the probability density function of a random variable.
We also mark the mean age of the passengers.
plt.figure(figsize=(9, 6)) # Set the figure size
sns.histplot(data=titanic, x="age", kde=True, bins=30, color='darkgrey', edgecolor='black', linewidth=1.5)
# Average line
plt.axvline(titanic["age"].mean(), color='red', linestyle='--', label='Mean Age')
# Add text mean
min_ylim, max_ylim = plt.ylim()
plt.text(titanic["age"].mean() + 1, max_ylim * 0.9, f'Mean Age: {titanic["age"].mean():.2f}', color='red', fontsize=10)
# Set the labels for the x-axis
plt.xlabel("Age", fontsize=14)
# Title for the plot
plt.title("Age Distribution of Titanic Passengers", fontsize=16, fontweight='bold')
plt.show()
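To make the idea of a kernel density estimate more concrete, here is a minimal sketch of a Gaussian KDE written with NumPy. It places one Gaussian "bump" on every observation and averages them. The function name `gaussian_kde_sketch` and the fixed bandwidth of 5 years are assumptions made for illustration; seaborn selects its own bandwidth, so this will not reproduce the curve above exactly. The ages are the five values shown in the `titanic.head()` output.

```python
import numpy as np

def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average of Gaussian kernels centered on each sample (illustrative only)."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance between every grid point and every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    # Mean over samples, rescaled by the bandwidth so the curve integrates to 1
    return kernels.mean(axis=1) / bandwidth

ages = np.array([22.0, 38.0, 26.0, 35.0, 35.0])  # first five ages in the table
grid = np.linspace(-20, 100, 1201)
density = gaussian_kde_sketch(ages, grid, bandwidth=5.0)

# A density should integrate to approximately 1 (simple Riemann sum)
area = np.sum(density) * (grid[1] - grid[0])
print(f"Area under the curve: {area:.4f}")
```

A smaller bandwidth gives a bumpier curve that follows individual observations, while a larger one gives a smoother curve.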
Boxplots
A boxplot is a graphical representation of the distribution of a dataset.
The boxplot is a standardized way of displaying the distribution of data based on a five-number summary: minimum, first quartile, median, third quartile, and maximum.
In the following example, we create a boxplot to represent the distribution of the fare paid by passengers in each passenger class.
- Without outliers
plt.figure(figsize=(9, 6)) # Set the figure size
g = sns.boxplot(data=titanic, x='pclass', y='fare', showfliers=False, showmeans=True, color="lightgrey",
                meanprops={'marker': 'o', 'markerfacecolor': 'grey', 'markeredgecolor': 'black', 'markersize': 8})
# add title
plt.title("Fare paid by Class", loc="left")
# add x-axis label
plt.xlabel("Passenger Class")
# add y-axis label
plt.ylabel("Fare")
# show the graph
plt.show()
- With outliers and all data points
plt.figure(figsize=(9, 6)) # Set the figure size
# add stripplot to show all data points
g = sns.stripplot(x='pclass', y='fare', data=titanic, color="grey", jitter=0.2, size=2.5)
g = sns.boxplot(data=titanic, x='pclass', y='fare', showfliers=True, showmeans=True, color="lightgrey",
                meanprops={'marker': 'o', 'markerfacecolor': 'grey', 'markeredgecolor': 'black', 'markersize': 8})
# add title
plt.title("Fare paid by Class", loc="left")
# add x-axis label
plt.xlabel("Passenger Class")
# add y-axis label
plt.ylabel("Fare")
# show the graph
plt.show()
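The numbers that a boxplot draws can be computed directly with `Series.quantile`. The sketch below uses the five fares shown in the `titanic.head()` output, so it is only an illustration on a tiny sample. It also computes the upper fence at 1.5 times the interquartile range, which is the default rule matplotlib uses to decide where the whisker ends and which points are drawn as outliers.

```python
import pandas as pd

# The five fares from the first rows of the table above
fares = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05])

# Minimum, first quartile, median, third quartile, maximum
summary = fares.quantile([0, 0.25, 0.5, 0.75, 1.0])
print(summary)

# Interquartile range and the upper fence used for the whisker
iqr = summary.loc[0.75] - summary.loc[0.25]
upper_fence = summary.loc[0.75] + 1.5 * iqr

# Points above the fence would be drawn as outliers
outliers = fares[fares > upper_fence]
print(f"IQR: {iqr:.3f}, upper fence: {upper_fence:.3f}, outliers: {len(outliers)}")
```

On the full dataset, the same calculation could be applied per class with `titanic.groupby('pclass')['fare']`.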
Lines and points
A line plot is a graphical representation of data in which the data values (points) are connected by a line.
It is useful to represent the trend of the data over time. Note: the current dataset does not offer a time variable.
In the following example, we create a line plot to represent the average fare paid by passengers by age.
tbl = pd.DataFrame(titanic.groupby(["age"])["fare"].mean().round(2))
# Reset the index to use 'age' as a column
tbl.reset_index(inplace=True)
# Plotting the line plot
plt.figure(figsize=(9, 6))
# Plot the dots with a darker color
plt.scatter(tbl['age'], tbl['fare'], color='grey', marker='o')
# Plot the line with a lighter color
plt.plot(tbl['age'], tbl['fare'], color='lightgrey')
plt.title('Average Fare by Age')
plt.xlabel('Age')
plt.ylabel('Average Fare')
plt.grid(True)
plt.show()
Heatmaps
A heatmap is a graphical representation of data in which the values are represented by colors. It is useful for showing three dimensions at once: two categorical variables on the axes and a third value encoded as color.
The colors represent the intensity of the data values.
In the following example, we create a heatmap to represent the survival rate for each combination of class and sex in the Titanic dataset.
tbl = pd.DataFrame(titanic.groupby(["pclass", "sex"])["survived"].mean().round(2))
tbl.reset_index(inplace=True)
tbl_w = tbl.pivot(index='pclass', columns='sex', values='survived')
sns.heatmap(tbl_w, annot=True, fmt="g", cmap='viridis')
plt.show()
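The wide table that feeds the heatmap (one row per class, one column per sex) could also be produced in a single step with `DataFrame.pivot_table`, which groups and reshapes at once. A small sketch on made-up data (an assumption, not the real Titanic rows); `toy` and `rates` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy data with the same three columns used above
toy = pd.DataFrame({
    "pclass":   [1, 1, 3, 3, 3, 3],
    "sex":      ["female", "male", "female", "female", "male", "male"],
    "survived": [1, 0, 1, 0, 0, 0],
})

# Mean of 'survived' for every (pclass, sex) combination, already in wide form
rates = toy.pivot_table(index="pclass", columns="sex", values="survived", aggfunc="mean")
print(rates)
```

Because `survived` is coded as 0 and 1, its mean within each cell is the survival rate for that cell.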