import pandas as pd
import seaborn as sns
import warnings
='ignore', category=FutureWarning) warnings.simplefilter(action
Python | Data Summary
This notebook gives an illustration of how to use Python to summarize data.
The data used in this example is the diamond dataset, which is available in the Seaborn
library.
The dataset
= sns.load_dataset("diamonds", cache=True, data_home=None)
diamonds diamonds.head()
carat | cut | color | clarity | depth | table | price | x | y | z | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
Variable | Description | Values |
---|---|---|
price | price in US dollars | $326-$18,823 |
carat | weight of the diamond | 0.2-5.01 |
cut | quality of the cut | Fair, Good, Very Good, Premium, Ideal |
color | diamond color | J (worst) to D (best) |
clarity | measurement of how clear the diamond is | I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best) |
x | length in mm | 0-10.74 |
y | width in mm | 0-58.9 |
z | depth in mm | 0-31.8 |
depth | total depth percentage | 43-79 |
table | width of top of diamond relative to widest point | 43-95 |
# 1. How many rows and columns are there in diamonds DataFrame?
0]
diamonds.shape[# 2. What are the column names in diamonds DataFrame?
diamonds.columns
Index(['carat', 'cut', 'color', 'clarity', 'depth', 'table', 'price', 'x', 'y',
'z'],
dtype='object')
Description of the dataset
The dataset contains 10 variables and 53,940 observations. The variables are a mix of numerical and categorical variables.
- Numerical variables: price, carat, x, y, z, depth, table
- Categorical variables: cut, color, clarity
describe() is a method that provides a summary of the numerical variables in the dataset. It returns the count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum values of the numerical variables.
= diamonds.describe().round(2).to_html(index=True)
html_table display(html_table)
carat | depth | table | price | x | y | z | |
---|---|---|---|---|---|---|---|
count | 53940.00 | 53940.00 | 53940.00 | 53940.00 | 53940.00 | 53940.00 | 53940.00 |
mean | 0.80 | 61.75 | 57.46 | 3932.80 | 5.73 | 5.73 | 3.54 |
std | 0.47 | 1.43 | 2.23 | 3989.44 | 1.12 | 1.14 | 0.71 |
min | 0.20 | 43.00 | 43.00 | 326.00 | 0.00 | 0.00 | 0.00 |
25% | 0.40 | 61.00 | 56.00 | 950.00 | 4.71 | 4.72 | 2.91 |
50% | 0.70 | 61.80 | 57.00 | 2401.00 | 5.70 | 5.71 | 3.53 |
75% | 1.04 | 62.50 | 59.00 | 5324.25 | 6.54 | 6.54 | 4.04 |
max | 5.01 | 79.00 | 95.00 | 18823.00 | 10.74 | 58.90 | 31.80 |
- We prefer to visualize the variables in rows rather than columns. This is because it is easier to compare the values of the variables when they are in rows.
= diamonds.describe()
stats 'stats'] = stats.index
stats[
= stats.melt(id_vars='stats', var_name='Variable', value_name='Value')
melt ='Variable', columns='stats', values='Value').round(2) melt.pivot(index
stats | 25% | 50% | 75% | count | max | mean | min | std |
---|---|---|---|---|---|---|---|---|
Variable | ||||||||
carat | 0.40 | 0.70 | 1.04 | 53940.0 | 5.01 | 0.80 | 0.2 | 0.47 |
depth | 61.00 | 61.80 | 62.50 | 53940.0 | 79.00 | 61.75 | 43.0 | 1.43 |
price | 950.00 | 2401.00 | 5324.25 | 53940.0 | 18823.00 | 3932.80 | 326.0 | 3989.44 |
table | 56.00 | 57.00 | 59.00 | 53940.0 | 95.00 | 57.46 | 43.0 | 2.23 |
x | 4.71 | 5.70 | 6.54 | 53940.0 | 10.74 | 5.73 | 0.0 | 1.12 |
y | 4.72 | 5.71 | 6.54 | 53940.0 | 58.90 | 5.73 | 0.0 | 1.14 |
z | 2.91 | 3.53 | 4.04 | 53940.0 | 31.80 | 3.54 | 0.0 | 0.71 |
For the categorical variables: cut, color, clarity we compute the frequency table.
from IPython.display import display, HTML
# Convert the value counts to a DataFrame for nicer formatting
= diamonds['cut'].value_counts().reset_index()
cut_counts_df = ['Cut', 'Frequency']
cut_counts_df.columns
# Convert the DataFrame to an HTML table and display it in Jupyter Notebook
= cut_counts_df.to_html(index=False)
html_table display(HTML(html_table))
Cut | Frequency |
---|---|
Ideal | 21551 |
Premium | 13791 |
Very Good | 12082 |
Good | 4906 |
Fair | 1610 |
# Convert the color value counts to a DataFrame for better HTML formatting
= diamonds['color'].value_counts().reset_index()
color_counts_df = ['Color', 'Frequency']
color_counts_df.columns
# Convert the DataFrame to an HTML table
= color_counts_df.to_html(index=False)
html_table display(HTML(html_table))
Color | Frequency |
---|---|
G | 11292 |
E | 9797 |
F | 9542 |
H | 8304 |
D | 6775 |
I | 5422 |
J | 2808 |
# Convert the clarity value counts to a DataFrame and then to HTML
= diamonds['clarity'].value_counts().to_frame().reset_index().rename(columns={'index': 'Clarity', 'clarity': 'Frequency'}).to_html(index=False)
clarity_counts_html
# Display the HTML table
display(HTML(clarity_counts_html))
Frequency | count |
---|---|
SI1 | 13065 |
VS2 | 12258 |
SI2 | 9194 |
VS1 | 8171 |
VVS2 | 5066 |
VVS1 | 3655 |
IF | 1790 |
I1 | 741 |
From absolute frequencies we compute the relative frequencies.
'clarity'].value_counts(normalize=True).to_frame().round(3) diamonds[
proportion | |
---|---|
clarity | |
SI1 | 0.242 |
VS2 | 0.227 |
SI2 | 0.170 |
VS1 | 0.151 |
VVS2 | 0.094 |
VVS1 | 0.068 |
IF | 0.033 |
I1 | 0.014 |
We put absolute frequency and relative frequency next to each other to make it easier to compare the values.
pd.merge('clarity'].value_counts(),
diamonds['clarity'].value_counts(normalize=True).round(3),
diamonds[="clarity",
on="inner"
how )
count | proportion | |
---|---|---|
clarity | ||
SI1 | 13065 | 0.242 |
VS2 | 12258 | 0.227 |
SI2 | 9194 | 0.170 |
VS1 | 8171 | 0.151 |
VVS2 | 5066 | 0.094 |
VVS1 | 3655 | 0.068 |
IF | 1790 | 0.033 |
I1 | 741 | 0.014 |
Create variables from existing variables
We can create new variables from existing variables. Several functions can be used to create new variables.
diamonds.head()
carat | cut | color | clarity | depth | table | price | x | y | z | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
Mathematical operations
From the variable price we can create a new variable price_discount as follows:
\(price\_discount = price - price*0.1\)
"price_discount"] = diamonds["price"] - diamonds["price"] * 0.1
diamonds[ diamonds.head()
carat | cut | color | clarity | depth | table | price | x | y | z | price_discount | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 | 293.4 |
1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 | 293.4 |
2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 | 294.3 |
3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 | 300.6 |
4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 | 301.5 |
Conditional operations
We can condition the new variable on values in other variables as well.
For example, we can create a new variable Top as follows:
- If the cut is Ideal or Premium, then Top is Yes; otherwise, Top is No.
def func(mod):
if mod['cut'] == 'Ideal' or mod['cut'] == 'Premium':
return 'Yes'
else:
return 'No'
'Top'] = diamonds.apply(func, axis=1)
diamonds[ diamonds.head()
carat | cut | color | clarity | depth | table | price | x | y | z | price_discount | Top | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 | 293.4 | Yes |
1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 | 293.4 | Yes |
2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 | 294.3 | No |
3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 | 300.6 | Yes |
4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 | 301.5 | No |
Summary statistics
We can apply several functions to summarize the data.
- count(): count the number of observations
- mean(): compute the mean
- median(): compute the median
- std(): compute the standard deviation
- min(): compute the minimum value
- max(): compute the maximum value
= ['price', 'carat', 'x', 'y', 'z', 'depth', 'table']
columns_of_interest = []
summary_data
for column in columns_of_interest:
summary_data.append([
diamonds[column].count(),
diamonds[column].mean(),
diamonds[column].median(),
diamonds[column].std(),min(),
diamonds[column].max()
diamonds[column].
])
= pd.DataFrame(
summary_df
summary_data, =['Count', 'Mean', 'Median', 'StdDev', 'Min', 'Max'],
columns=["Price", "Carat", "X", "Y", "Z", "Depth", "Table"]
index
)
# Display the DataFrame
round(2).to_html()
summary_df.round(2).to_html())) display(HTML(summary_df.
Count | Mean | Median | StdDev | Min | Max | |
---|---|---|---|---|---|---|
Price | 53940 | 3932.80 | 2401.00 | 3989.44 | 326.0 | 18823.00 |
Carat | 53940 | 0.80 | 0.70 | 0.47 | 0.2 | 5.01 |
X | 53940 | 5.73 | 5.70 | 1.12 | 0.0 | 10.74 |
Y | 53940 | 5.73 | 5.71 | 1.14 | 0.0 | 58.90 |
Z | 53940 | 3.54 | 3.53 | 0.71 | 0.0 | 31.80 |
Depth | 53940 | 61.75 | 61.80 | 1.43 | 43.0 | 79.00 |
Table | 53940 | 57.46 | 57.00 | 2.23 | 43.0 | 95.00 |
Summary statistics conditional on a categorical variable
Summarize the data is to compute the summary statistics conditional on a categorical variable.
- For example, we can compute the descriptive statistics of price for each level of the variable color.
# Group the data by 'color', calculate descriptive statistics for 'price', and round the results to 2 decimal places
= diamonds.groupby("color")["price"].describe().round(2).sort_values(by=('mean'), ascending=False)
price_summary_by_color
# Display the summary statistics
display(HTML(price_summary_by_color.to_html()))
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
color | ||||||||
J | 2808.0 | 5323.82 | 4438.19 | 335.0 | 1860.5 | 4234.0 | 7695.00 | 18710.0 |
I | 5422.0 | 5091.87 | 4722.39 | 334.0 | 1120.5 | 3730.0 | 7201.75 | 18823.0 |
H | 8304.0 | 4486.67 | 4215.94 | 337.0 | 984.0 | 3460.0 | 5980.25 | 18803.0 |
G | 11292.0 | 3999.14 | 4051.10 | 354.0 | 931.0 | 2242.0 | 6048.00 | 18818.0 |
F | 9542.0 | 3724.89 | 3784.99 | 342.0 | 982.0 | 2343.5 | 4868.25 | 18791.0 |
D | 6775.0 | 3169.95 | 3356.59 | 357.0 | 911.0 | 1838.0 | 4213.50 | 18693.0 |
E | 9797.0 | 3076.75 | 3344.16 | 326.0 | 882.0 | 1739.0 | 4003.00 | 18731.0 |
Summary statistics using pivot tables
Conditional summary statistics can be computed using pivot tables.
='color', values=['price'], aggfunc=['count','mean', 'std', 'min', 'max']).round(2).sort_values(by=('mean', 'price'), ascending=False) diamonds.pivot_table(index
count | mean | std | min | max | |
---|---|---|---|---|---|
price | price | price | price | price | |
color | |||||
J | 2808 | 5323.82 | 4438.19 | 335 | 18710 |
I | 5422 | 5091.87 | 4722.39 | 334 | 18823 |
H | 8304 | 4486.67 | 4215.94 | 337 | 18803 |
G | 11292 | 3999.14 | 4051.10 | 354 | 18818 |
F | 9542 | 3724.89 | 3784.99 | 342 | 18791 |
D | 6775 | 3169.95 | 3356.59 | 357 | 18693 |
E | 9797 | 3076.75 | 3344.16 | 326 | 18731 |
- This is efficient and neat