Python | Data Summary

An introduction to data management and manipulation in Python (2/3)

Author

Matteo Ploner (University of Trento, Italy)

Published

September 4, 2024

This notebook gives an illustration of how to use Python to summarize data.

The data used in this example is the diamond dataset, which is available in the Seaborn library.

import pandas as pd
import seaborn as sns

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

The dataset

diamonds = sns.load_dataset("diamonds", cache=True, data_home=None)
diamonds.head()

	carat	cut	color	clarity	depth	table	price	x	y	z
0	0.23	Ideal	E	SI2	61.5	55.0	326	3.95	3.98	2.43
1	0.21	Premium	E	SI1	59.8	61.0	326	3.89	3.84	2.31
2	0.23	Good	E	VS1	56.9	65.0	327	4.05	4.07	2.31
3	0.29	Premium	I	VS2	62.4	58.0	334	4.20	4.23	2.63
4	0.31	Good	J	SI2	63.3	58.0	335	4.34	4.35	2.75

Variable	Description	Values
price	price in US dollars	$326-$18,823
carat	weight of the diamond	0.2-5.01
cut	quality of the cut	Fair, Good, Very Good, Premium, Ideal
color	diamond color	J (worst) to D (best)
clarity	measurement of how clear the diamond is	I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)
x	length in mm	0-10.74
y	width in mm	0-58.9
z	depth in mm	0-31.8
depth	total depth percentage	43-79
table	width of top of diamond relative to widest point	43-95

# 1. How many rows and columns are there in diamonds DataFrame?
diamonds.shape[0]
# 2. What are the column names in diamonds DataFrame?
diamonds.columns

Index(['carat', 'cut', 'color', 'clarity', 'depth', 'table', 'price', 'x', 'y',
       'z'],
      dtype='object')

Description of the dataset

The dataset contains 10 variables and 53,940 observations. The variables are a mix of numerical and categorical variables.

Numerical variables: price, carat, x, y, z, depth, table
Categorical variables: cut, color, clarity

describe() is a method that provides a summary of the numerical variables in the dataset. It returns the count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum values of the numerical variables.

html_table = diamonds.describe().round(2).to_html(index=True)
display(html_table)

	carat	depth	table	price	x	y	z
count	53940.00	53940.00	53940.00	53940.00	53940.00	53940.00	53940.00
mean	0.80	61.75	57.46	3932.80	5.73	5.73	3.54
std	0.47	1.43	2.23	3989.44	1.12	1.14	0.71
min	0.20	43.00	43.00	326.00	0.00	0.00	0.00
25%	0.40	61.00	56.00	950.00	4.71	4.72	2.91
50%	0.70	61.80	57.00	2401.00	5.70	5.71	3.53
75%	1.04	62.50	59.00	5324.25	6.54	6.54	4.04
max	5.01	79.00	95.00	18823.00	10.74	58.90	31.80

We prefer to visualize the variables in rows rather than columns. This is because it is easier to compare the values of the variables when they are in rows.

stats = diamonds.describe()
stats['stats'] = stats.index

melt = stats.melt(id_vars='stats', var_name='Variable', value_name='Value')
melt.pivot(index='Variable', columns='stats', values='Value').round(2)

stats	25%	50%	75%	count	max	mean	min	std
Variable
carat	0.40	0.70	1.04	53940.0	5.01	0.80	0.2	0.47
depth	61.00	61.80	62.50	53940.0	79.00	61.75	43.0	1.43
price	950.00	2401.00	5324.25	53940.0	18823.00	3932.80	326.0	3989.44
table	56.00	57.00	59.00	53940.0	95.00	57.46	43.0	2.23
x	4.71	5.70	6.54	53940.0	10.74	5.73	0.0	1.12
y	4.72	5.71	6.54	53940.0	58.90	5.73	0.0	1.14
z	2.91	3.53	4.04	53940.0	31.80	3.54	0.0	0.71

For the categorical variables: cut, color, clarity we compute the frequency table.

from IPython.display import display, HTML

# Convert the value counts to a DataFrame for nicer formatting
cut_counts_df = diamonds['cut'].value_counts().reset_index()
cut_counts_df.columns = ['Cut', 'Frequency']

# Convert the DataFrame to an HTML table and display it in Jupyter Notebook
html_table = cut_counts_df.to_html(index=False)
display(HTML(html_table))

Cut	Frequency
Ideal	21551
Premium	13791
Very Good	12082
Good	4906
Fair	1610

# Convert the color value counts to a DataFrame for better HTML formatting
color_counts_df = diamonds['color'].value_counts().reset_index()
color_counts_df.columns = ['Color', 'Frequency']

# Convert the DataFrame to an HTML table
html_table = color_counts_df.to_html(index=False)
display(HTML(html_table))

Color	Frequency
G	11292
E	9797
F	9542
H	8304
D	6775
I	5422
J	2808


# Convert the clarity value counts to a DataFrame and then to HTML
clarity_counts_html = diamonds['clarity'].value_counts().to_frame().reset_index().rename(columns={'index': 'Clarity', 'clarity': 'Frequency'}).to_html(index=False)

# Display the HTML table
display(HTML(clarity_counts_html))

Frequency	count
SI1	13065
VS2	12258
SI2	9194
VS1	8171
VVS2	5066
VVS1	3655
IF	1790
I1	741

From absolute frequencies we compute the relative frequencies.

diamonds['clarity'].value_counts(normalize=True).to_frame().round(3)

	proportion
clarity
SI1	0.242
VS2	0.227
SI2	0.170
VS1	0.151
VVS2	0.094
VVS1	0.068
IF	0.033
I1	0.014

We put absolute frequency and relative frequency next to each other to make it easier to compare the values.

pd.merge(
    diamonds['clarity'].value_counts(),
    diamonds['clarity'].value_counts(normalize=True).round(3),
    on="clarity",
    how="inner"
)

	count	proportion
clarity
SI1	13065	0.242
VS2	12258	0.227
SI2	9194	0.170
VS1	8171	0.151
VVS2	5066	0.094
VVS1	3655	0.068
IF	1790	0.033
I1	741	0.014

Create variables from existing variables

We can create new variables from existing variables. Several functions can be used to create new variables.

diamonds.head()

	carat	cut	color	clarity	depth	table	price	x	y	z
0	0.23	Ideal	E	SI2	61.5	55.0	326	3.95	3.98	2.43
1	0.21	Premium	E	SI1	59.8	61.0	326	3.89	3.84	2.31
2	0.23	Good	E	VS1	56.9	65.0	327	4.05	4.07	2.31
3	0.29	Premium	I	VS2	62.4	58.0	334	4.20	4.23	2.63
4	0.31	Good	J	SI2	63.3	58.0	335	4.34	4.35	2.75

Mathematical operations

From the variable price we can create a new variable price_discount as follows:

$price\_discount = price - price*0.1$

diamonds["price_discount"] = diamonds["price"] - diamonds["price"] * 0.1
diamonds.head()

	carat	cut	color	clarity	depth	table	price	x	y	z	price_discount
0	0.23	Ideal	E	SI2	61.5	55.0	326	3.95	3.98	2.43	293.4
1	0.21	Premium	E	SI1	59.8	61.0	326	3.89	3.84	2.31	293.4
2	0.23	Good	E	VS1	56.9	65.0	327	4.05	4.07	2.31	294.3
3	0.29	Premium	I	VS2	62.4	58.0	334	4.20	4.23	2.63	300.6
4	0.31	Good	J	SI2	63.3	58.0	335	4.34	4.35	2.75	301.5

Conditional operations

We can condition the new variable on values in other variables as well.

For example, we can create a new variable Top as follows:

If the cut is Ideal or Premium, then Top is Yes; otherwise, Top is No.

def func(mod):
    if mod['cut'] == 'Ideal' or mod['cut'] == 'Premium':
        return 'Yes'
    else:
        return 'No'
diamonds['Top'] = diamonds.apply(func, axis=1)
diamonds.head()

	carat	cut	color	clarity	depth	table	price	x	y	z	price_discount	Top
0	0.23	Ideal	E	SI2	61.5	55.0	326	3.95	3.98	2.43	293.4	Yes
1	0.21	Premium	E	SI1	59.8	61.0	326	3.89	3.84	2.31	293.4	Yes
2	0.23	Good	E	VS1	56.9	65.0	327	4.05	4.07	2.31	294.3	No
3	0.29	Premium	I	VS2	62.4	58.0	334	4.20	4.23	2.63	300.6	Yes
4	0.31	Good	J	SI2	63.3	58.0	335	4.34	4.35	2.75	301.5	No

Summary statistics

We can apply several functions to summarize the data.

count(): count the number of observations
mean(): compute the mean
median(): compute the median
std(): compute the standard deviation
min(): compute the minimum value
max(): compute the maximum value

columns_of_interest = ['price', 'carat', 'x', 'y', 'z', 'depth', 'table']
summary_data = []

for column in columns_of_interest:
    summary_data.append([
        diamonds[column].count(), 
        diamonds[column].mean(),
        diamonds[column].median(),
        diamonds[column].std(),
        diamonds[column].min(),
        diamonds[column].max()
    ])

summary_df = pd.DataFrame(
    summary_data, 
    columns=['Count', 'Mean', 'Median', 'StdDev', 'Min', 'Max'], 
    index=["Price", "Carat", "X", "Y", "Z", "Depth", "Table"]
)

# Display the DataFrame
summary_df.round(2).to_html()
display(HTML(summary_df.round(2).to_html()))

	Count	Mean	Median	StdDev	Min	Max
Price	53940	3932.80	2401.00	3989.44	326.0	18823.00
Carat	53940	0.80	0.70	0.47	0.2	5.01
X	53940	5.73	5.70	1.12	0.0	10.74
Y	53940	5.73	5.71	1.14	0.0	58.90
Z	53940	3.54	3.53	0.71	0.0	31.80
Depth	53940	61.75	61.80	1.43	43.0	79.00
Table	53940	57.46	57.00	2.23	43.0	95.00

Summary statistics conditional on a categorical variable

Summarize the data is to compute the summary statistics conditional on a categorical variable.

For example, we can compute the descriptive statistics of price for each level of the variable color.



# Group the data by 'color', calculate descriptive statistics for 'price', and round the results to 2 decimal places
price_summary_by_color = diamonds.groupby("color")["price"].describe().round(2).sort_values(by=('mean'), ascending=False)

# Display the summary statistics
display(HTML(price_summary_by_color.to_html()))

	count	mean	std	min	25%	50%	75%	max
color
J	2808.0	5323.82	4438.19	335.0	1860.5	4234.0	7695.00	18710.0
I	5422.0	5091.87	4722.39	334.0	1120.5	3730.0	7201.75	18823.0
H	8304.0	4486.67	4215.94	337.0	984.0	3460.0	5980.25	18803.0
G	11292.0	3999.14	4051.10	354.0	931.0	2242.0	6048.00	18818.0
F	9542.0	3724.89	3784.99	342.0	982.0	2343.5	4868.25	18791.0
D	6775.0	3169.95	3356.59	357.0	911.0	1838.0	4213.50	18693.0
E	9797.0	3076.75	3344.16	326.0	882.0	1739.0	4003.00	18731.0

Summary statistics using pivot tables

Conditional summary statistics can be computed using pivot tables.

diamonds.pivot_table(index='color', values=['price'], aggfunc=['count','mean', 'std', 'min', 'max']).round(2).sort_values(by=('mean', 'price'), ascending=False)

	count	mean	std	min	max
	price	price	price	price	price
color
J	2808	5323.82	4438.19	335	18710
I	5422	5091.87	4722.39	334	18823
H	8304	4486.67	4215.94	337	18803
G	11292	3999.14	4051.10	354	18818
F	9542	3724.89	3784.99	342	18791
D	6775	3169.95	3356.59	357	18693
E	9797	3076.75	3344.16	326	18731

This is efficient and neat