Python | Data Summary

An introduction to data management and manipulation in Python (2/3)
Author

Matteo Ploner (University of Trento, Italy)

Published

September 4, 2024

This notebook gives an illustration of how to use Python to summarize data.

The data used in this example is the diamond dataset, which is available in the Seaborn library.

import pandas as pd
import seaborn as sns

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

The dataset

diamonds = sns.load_dataset("diamonds", cache=True, data_home=None)
diamonds.head()
carat cut color clarity depth table price x y z
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75
Variable Description Values
price price in US dollars $326-$18,823
carat weight of the diamond 0.2-5.01
cut quality of the cut Fair, Good, Very Good, Premium, Ideal
color diamond color J (worst) to D (best)
clarity measurement of how clear the diamond is I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)
x length in mm 0-10.74
y width in mm 0-58.9
z depth in mm 0-31.8
depth total depth percentage 43-79
table width of top of diamond relative to widest point 43-95
# 1. How many rows and columns are there in diamonds DataFrame?
diamonds.shape[0]
# 2. What are the column names in diamonds DataFrame?
diamonds.columns
Index(['carat', 'cut', 'color', 'clarity', 'depth', 'table', 'price', 'x', 'y',
       'z'],
      dtype='object')

Description of the dataset

The dataset contains 10 variables and 53,940 observations. The variables are a mix of numerical and categorical variables.

  • Numerical variables: price, carat, x, y, z, depth, table
  • Categorical variables: cut, color, clarity

describe() is a method that provides a summary of the numerical variables in the dataset. It returns the count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum values of the numerical variables.

html_table = diamonds.describe().round(2).to_html(index=True)
display(html_table)
carat depth table price x y z
count 53940.00 53940.00 53940.00 53940.00 53940.00 53940.00 53940.00
mean 0.80 61.75 57.46 3932.80 5.73 5.73 3.54
std 0.47 1.43 2.23 3989.44 1.12 1.14 0.71
min 0.20 43.00 43.00 326.00 0.00 0.00 0.00
25% 0.40 61.00 56.00 950.00 4.71 4.72 2.91
50% 0.70 61.80 57.00 2401.00 5.70 5.71 3.53
75% 1.04 62.50 59.00 5324.25 6.54 6.54 4.04
max 5.01 79.00 95.00 18823.00 10.74 58.90 31.80
  • We prefer to visualize the variables in rows rather than columns. This is because it is easier to compare the values of the variables when they are in rows.
stats = diamonds.describe()
stats['stats'] = stats.index

melt = stats.melt(id_vars='stats', var_name='Variable', value_name='Value')
melt.pivot(index='Variable', columns='stats', values='Value').round(2)
stats 25% 50% 75% count max mean min std
Variable
carat 0.40 0.70 1.04 53940.0 5.01 0.80 0.2 0.47
depth 61.00 61.80 62.50 53940.0 79.00 61.75 43.0 1.43
price 950.00 2401.00 5324.25 53940.0 18823.00 3932.80 326.0 3989.44
table 56.00 57.00 59.00 53940.0 95.00 57.46 43.0 2.23
x 4.71 5.70 6.54 53940.0 10.74 5.73 0.0 1.12
y 4.72 5.71 6.54 53940.0 58.90 5.73 0.0 1.14
z 2.91 3.53 4.04 53940.0 31.80 3.54 0.0 0.71

For the categorical variables: cut, color, clarity we compute the frequency table.

from IPython.display import display, HTML

# Convert the value counts to a DataFrame for nicer formatting
cut_counts_df = diamonds['cut'].value_counts().reset_index()
cut_counts_df.columns = ['Cut', 'Frequency']

# Convert the DataFrame to an HTML table and display it in Jupyter Notebook
html_table = cut_counts_df.to_html(index=False)
display(HTML(html_table))
Cut Frequency
Ideal 21551
Premium 13791
Very Good 12082
Good 4906
Fair 1610
# Convert the color value counts to a DataFrame for better HTML formatting
color_counts_df = diamonds['color'].value_counts().reset_index()
color_counts_df.columns = ['Color', 'Frequency']

# Convert the DataFrame to an HTML table
html_table = color_counts_df.to_html(index=False)
display(HTML(html_table))
Color Frequency
G 11292
E 9797
F 9542
H 8304
D 6775
I 5422
J 2808

# Convert the clarity value counts to a DataFrame and then to HTML
clarity_counts_html = diamonds['clarity'].value_counts().to_frame().reset_index().rename(columns={'index': 'Clarity', 'clarity': 'Frequency'}).to_html(index=False)

# Display the HTML table
display(HTML(clarity_counts_html))
Frequency count
SI1 13065
VS2 12258
SI2 9194
VS1 8171
VVS2 5066
VVS1 3655
IF 1790
I1 741

From absolute frequencies we compute the relative frequencies.

diamonds['clarity'].value_counts(normalize=True).to_frame().round(3)
proportion
clarity
SI1 0.242
VS2 0.227
SI2 0.170
VS1 0.151
VVS2 0.094
VVS1 0.068
IF 0.033
I1 0.014

We put absolute frequency and relative frequency next to each other to make it easier to compare the values.

pd.merge(
    diamonds['clarity'].value_counts(),
    diamonds['clarity'].value_counts(normalize=True).round(3),
    on="clarity",
    how="inner"
)
count proportion
clarity
SI1 13065 0.242
VS2 12258 0.227
SI2 9194 0.170
VS1 8171 0.151
VVS2 5066 0.094
VVS1 3655 0.068
IF 1790 0.033
I1 741 0.014

Create variables from existing variables

We can create new variables from existing variables. Several functions can be used to create new variables.

diamonds.head()
carat cut color clarity depth table price x y z
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75

Mathematical operations

From the variable price we can create a new variable price_discount as follows:

\(price\_discount = price - price*0.1\)

diamonds["price_discount"] = diamonds["price"] - diamonds["price"] * 0.1
diamonds.head()
carat cut color clarity depth table price x y z price_discount
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43 293.4
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31 293.4
2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31 294.3
3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63 300.6
4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75 301.5

Conditional operations

We can condition the new variable on values in other variables as well.

For example, we can create a new variable Top as follows:

  • If the cut is Ideal or Premium, then Top is Yes; otherwise, Top is No.
def func(mod):
    if mod['cut'] == 'Ideal' or mod['cut'] == 'Premium':
        return 'Yes'
    else:
        return 'No'
diamonds['Top'] = diamonds.apply(func, axis=1)
diamonds.head()
carat cut color clarity depth table price x y z price_discount Top
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43 293.4 Yes
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31 293.4 Yes
2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31 294.3 No
3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63 300.6 Yes
4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75 301.5 No

Summary statistics

We can apply several functions to summarize the data.

  • count(): count the number of observations
  • mean(): compute the mean
  • median(): compute the median
  • std(): compute the standard deviation
  • min(): compute the minimum value
  • max(): compute the maximum value
columns_of_interest = ['price', 'carat', 'x', 'y', 'z', 'depth', 'table']
summary_data = []

for column in columns_of_interest:
    summary_data.append([
        diamonds[column].count(), 
        diamonds[column].mean(),
        diamonds[column].median(),
        diamonds[column].std(),
        diamonds[column].min(),
        diamonds[column].max()
    ])

summary_df = pd.DataFrame(
    summary_data, 
    columns=['Count', 'Mean', 'Median', 'StdDev', 'Min', 'Max'], 
    index=["Price", "Carat", "X", "Y", "Z", "Depth", "Table"]
)

# Display the DataFrame
summary_df.round(2).to_html()
display(HTML(summary_df.round(2).to_html()))
Count Mean Median StdDev Min Max
Price 53940 3932.80 2401.00 3989.44 326.0 18823.00
Carat 53940 0.80 0.70 0.47 0.2 5.01
X 53940 5.73 5.70 1.12 0.0 10.74
Y 53940 5.73 5.71 1.14 0.0 58.90
Z 53940 3.54 3.53 0.71 0.0 31.80
Depth 53940 61.75 61.80 1.43 43.0 79.00
Table 53940 57.46 57.00 2.23 43.0 95.00

Summary statistics conditional on a categorical variable

Summarize the data is to compute the summary statistics conditional on a categorical variable.

  • For example, we can compute the descriptive statistics of price for each level of the variable color.


# Group the data by 'color', calculate descriptive statistics for 'price', and round the results to 2 decimal places
price_summary_by_color = diamonds.groupby("color")["price"].describe().round(2).sort_values(by=('mean'), ascending=False)

# Display the summary statistics
display(HTML(price_summary_by_color.to_html()))
count mean std min 25% 50% 75% max
color
J 2808.0 5323.82 4438.19 335.0 1860.5 4234.0 7695.00 18710.0
I 5422.0 5091.87 4722.39 334.0 1120.5 3730.0 7201.75 18823.0
H 8304.0 4486.67 4215.94 337.0 984.0 3460.0 5980.25 18803.0
G 11292.0 3999.14 4051.10 354.0 931.0 2242.0 6048.00 18818.0
F 9542.0 3724.89 3784.99 342.0 982.0 2343.5 4868.25 18791.0
D 6775.0 3169.95 3356.59 357.0 911.0 1838.0 4213.50 18693.0
E 9797.0 3076.75 3344.16 326.0 882.0 1739.0 4003.00 18731.0

Summary statistics using pivot tables

Conditional summary statistics can be computed using pivot tables.

diamonds.pivot_table(index='color', values=['price'], aggfunc=['count','mean', 'std', 'min', 'max']).round(2).sort_values(by=('mean', 'price'), ascending=False)
count mean std min max
price price price price price
color
J 2808 5323.82 4438.19 335 18710
I 5422 5091.87 4722.39 334 18823
H 8304 4486.67 4215.94 337 18803
G 11292 3999.14 4051.10 354 18818
F 9542 3724.89 3784.99 342 18791
D 6775 3169.95 3356.59 357 18693
E 9797 3076.75 3344.16 326 18731
  • This is efficient and neat