import pandas as pd
import seaborn as sns
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)Python | Data Summary
This notebook gives an illustration of how to use Python to summarize data.
The data used in this example is the diamond dataset, which is available in the Seaborn library.
The dataset
diamonds = sns.load_dataset("diamonds", cache=True, data_home=None)
diamonds.head()| carat | cut | color | clarity | depth | table | price | x | y | z | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
| Variable | Description | Values |
|---|---|---|
| price | price in US dollars | $326-$18,823 |
| carat | weight of the diamond | 0.2-5.01 |
| cut | quality of the cut | Fair, Good, Very Good, Premium, Ideal |
| color | diamond color | J (worst) to D (best) |
| clarity | measurement of how clear the diamond is | I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best) |
| x | length in mm | 0-10.74 |
| y | width in mm | 0-58.9 |
| z | depth in mm | 0-31.8 |
| depth | total depth percentage | 43-79 |
| table | width of top of diamond relative to widest point | 43-95 |
# 1. How many rows and columns are there in diamonds DataFrame?
diamonds.shape[0]
# 2. What are the column names in diamonds DataFrame?
diamonds.columnsIndex(['carat', 'cut', 'color', 'clarity', 'depth', 'table', 'price', 'x', 'y',
'z'],
dtype='object')
Description of the dataset
The dataset contains 10 variables and 53,940 observations. The variables are a mix of numerical and categorical variables.
- Numerical variables: price, carat, x, y, z, depth, table
- Categorical variables: cut, color, clarity
describe() is a method that provides a summary of the numerical variables in the dataset. It returns the count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum values of the numerical variables.
html_table = diamonds.describe().round(2).to_html(index=True)
display(html_table)| carat | depth | table | price | x | y | z | |
|---|---|---|---|---|---|---|---|
| count | 53940.00 | 53940.00 | 53940.00 | 53940.00 | 53940.00 | 53940.00 | 53940.00 |
| mean | 0.80 | 61.75 | 57.46 | 3932.80 | 5.73 | 5.73 | 3.54 |
| std | 0.47 | 1.43 | 2.23 | 3989.44 | 1.12 | 1.14 | 0.71 |
| min | 0.20 | 43.00 | 43.00 | 326.00 | 0.00 | 0.00 | 0.00 |
| 25% | 0.40 | 61.00 | 56.00 | 950.00 | 4.71 | 4.72 | 2.91 |
| 50% | 0.70 | 61.80 | 57.00 | 2401.00 | 5.70 | 5.71 | 3.53 |
| 75% | 1.04 | 62.50 | 59.00 | 5324.25 | 6.54 | 6.54 | 4.04 |
| max | 5.01 | 79.00 | 95.00 | 18823.00 | 10.74 | 58.90 | 31.80 |
- We prefer to visualize the variables in rows rather than columns. This is because it is easier to compare the values of the variables when they are in rows.
stats = diamonds.describe()
stats['stats'] = stats.index
melt = stats.melt(id_vars='stats', var_name='Variable', value_name='Value')
melt.pivot(index='Variable', columns='stats', values='Value').round(2)| stats | 25% | 50% | 75% | count | max | mean | min | std |
|---|---|---|---|---|---|---|---|---|
| Variable | ||||||||
| carat | 0.40 | 0.70 | 1.04 | 53940.0 | 5.01 | 0.80 | 0.2 | 0.47 |
| depth | 61.00 | 61.80 | 62.50 | 53940.0 | 79.00 | 61.75 | 43.0 | 1.43 |
| price | 950.00 | 2401.00 | 5324.25 | 53940.0 | 18823.00 | 3932.80 | 326.0 | 3989.44 |
| table | 56.00 | 57.00 | 59.00 | 53940.0 | 95.00 | 57.46 | 43.0 | 2.23 |
| x | 4.71 | 5.70 | 6.54 | 53940.0 | 10.74 | 5.73 | 0.0 | 1.12 |
| y | 4.72 | 5.71 | 6.54 | 53940.0 | 58.90 | 5.73 | 0.0 | 1.14 |
| z | 2.91 | 3.53 | 4.04 | 53940.0 | 31.80 | 3.54 | 0.0 | 0.71 |
For the categorical variables: cut, color, clarity we compute the frequency table.
from IPython.display import display, HTML
# Convert the value counts to a DataFrame for nicer formatting
cut_counts_df = diamonds['cut'].value_counts().reset_index()
cut_counts_df.columns = ['Cut', 'Frequency']
# Convert the DataFrame to an HTML table and display it in Jupyter Notebook
html_table = cut_counts_df.to_html(index=False)
display(HTML(html_table))| Cut | Frequency |
|---|---|
| Ideal | 21551 |
| Premium | 13791 |
| Very Good | 12082 |
| Good | 4906 |
| Fair | 1610 |
# Convert the color value counts to a DataFrame for better HTML formatting
color_counts_df = diamonds['color'].value_counts().reset_index()
color_counts_df.columns = ['Color', 'Frequency']
# Convert the DataFrame to an HTML table
html_table = color_counts_df.to_html(index=False)
display(HTML(html_table))| Color | Frequency |
|---|---|
| G | 11292 |
| E | 9797 |
| F | 9542 |
| H | 8304 |
| D | 6775 |
| I | 5422 |
| J | 2808 |
# Convert the clarity value counts to a DataFrame and then to HTML
clarity_counts_html = diamonds['clarity'].value_counts().to_frame().reset_index().rename(columns={'index': 'Clarity', 'clarity': 'Frequency'}).to_html(index=False)
# Display the HTML table
display(HTML(clarity_counts_html))| Frequency | count |
|---|---|
| SI1 | 13065 |
| VS2 | 12258 |
| SI2 | 9194 |
| VS1 | 8171 |
| VVS2 | 5066 |
| VVS1 | 3655 |
| IF | 1790 |
| I1 | 741 |
From absolute frequencies we compute the relative frequencies.
diamonds['clarity'].value_counts(normalize=True).to_frame().round(3)| proportion | |
|---|---|
| clarity | |
| SI1 | 0.242 |
| VS2 | 0.227 |
| SI2 | 0.170 |
| VS1 | 0.151 |
| VVS2 | 0.094 |
| VVS1 | 0.068 |
| IF | 0.033 |
| I1 | 0.014 |
We put absolute frequency and relative frequency next to each other to make it easier to compare the values.
pd.merge(
diamonds['clarity'].value_counts(),
diamonds['clarity'].value_counts(normalize=True).round(3),
on="clarity",
how="inner"
)| count | proportion | |
|---|---|---|
| clarity | ||
| SI1 | 13065 | 0.242 |
| VS2 | 12258 | 0.227 |
| SI2 | 9194 | 0.170 |
| VS1 | 8171 | 0.151 |
| VVS2 | 5066 | 0.094 |
| VVS1 | 3655 | 0.068 |
| IF | 1790 | 0.033 |
| I1 | 741 | 0.014 |
Create variables from existing variables
We can create new variables from existing variables. Several functions can be used to create new variables.
diamonds.head()| carat | cut | color | clarity | depth | table | price | x | y | z | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
Mathematical operations
From the variable price we can create a new variable price_discount as follows:
\(price\_discount = price - price*0.1\)
diamonds["price_discount"] = diamonds["price"] - diamonds["price"] * 0.1
diamonds.head()| carat | cut | color | clarity | depth | table | price | x | y | z | price_discount | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 | 293.4 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 | 293.4 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 | 294.3 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 | 300.6 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 | 301.5 |
Conditional operations
We can condition the new variable on values in other variables as well.
For example, we can create a new variable Top as follows:
- If the cut is Ideal or Premium, then Top is Yes; otherwise, Top is No.
def func(mod):
if mod['cut'] == 'Ideal' or mod['cut'] == 'Premium':
return 'Yes'
else:
return 'No'
diamonds['Top'] = diamonds.apply(func, axis=1)
diamonds.head()| carat | cut | color | clarity | depth | table | price | x | y | z | price_discount | Top | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 | 293.4 | Yes |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 | 293.4 | Yes |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 | 294.3 | No |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 | 300.6 | Yes |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 | 301.5 | No |
Summary statistics
We can apply several functions to summarize the data.
- count(): count the number of observations
- mean(): compute the mean
- median(): compute the median
- std(): compute the standard deviation
- min(): compute the minimum value
- max(): compute the maximum value
columns_of_interest = ['price', 'carat', 'x', 'y', 'z', 'depth', 'table']
summary_data = []
for column in columns_of_interest:
summary_data.append([
diamonds[column].count(),
diamonds[column].mean(),
diamonds[column].median(),
diamonds[column].std(),
diamonds[column].min(),
diamonds[column].max()
])
summary_df = pd.DataFrame(
summary_data,
columns=['Count', 'Mean', 'Median', 'StdDev', 'Min', 'Max'],
index=["Price", "Carat", "X", "Y", "Z", "Depth", "Table"]
)
# Display the DataFrame
summary_df.round(2).to_html()
display(HTML(summary_df.round(2).to_html()))| Count | Mean | Median | StdDev | Min | Max | |
|---|---|---|---|---|---|---|
| Price | 53940 | 3932.80 | 2401.00 | 3989.44 | 326.0 | 18823.00 |
| Carat | 53940 | 0.80 | 0.70 | 0.47 | 0.2 | 5.01 |
| X | 53940 | 5.73 | 5.70 | 1.12 | 0.0 | 10.74 |
| Y | 53940 | 5.73 | 5.71 | 1.14 | 0.0 | 58.90 |
| Z | 53940 | 3.54 | 3.53 | 0.71 | 0.0 | 31.80 |
| Depth | 53940 | 61.75 | 61.80 | 1.43 | 43.0 | 79.00 |
| Table | 53940 | 57.46 | 57.00 | 2.23 | 43.0 | 95.00 |
Summary statistics conditional on a categorical variable
Summarize the data is to compute the summary statistics conditional on a categorical variable.
- For example, we can compute the descriptive statistics of price for each level of the variable color.
# Group the data by 'color', calculate descriptive statistics for 'price', and round the results to 2 decimal places
price_summary_by_color = diamonds.groupby("color")["price"].describe().round(2).sort_values(by=('mean'), ascending=False)
# Display the summary statistics
display(HTML(price_summary_by_color.to_html()))| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| color | ||||||||
| J | 2808.0 | 5323.82 | 4438.19 | 335.0 | 1860.5 | 4234.0 | 7695.00 | 18710.0 |
| I | 5422.0 | 5091.87 | 4722.39 | 334.0 | 1120.5 | 3730.0 | 7201.75 | 18823.0 |
| H | 8304.0 | 4486.67 | 4215.94 | 337.0 | 984.0 | 3460.0 | 5980.25 | 18803.0 |
| G | 11292.0 | 3999.14 | 4051.10 | 354.0 | 931.0 | 2242.0 | 6048.00 | 18818.0 |
| F | 9542.0 | 3724.89 | 3784.99 | 342.0 | 982.0 | 2343.5 | 4868.25 | 18791.0 |
| D | 6775.0 | 3169.95 | 3356.59 | 357.0 | 911.0 | 1838.0 | 4213.50 | 18693.0 |
| E | 9797.0 | 3076.75 | 3344.16 | 326.0 | 882.0 | 1739.0 | 4003.00 | 18731.0 |
Summary statistics using pivot tables
Conditional summary statistics can be computed using pivot tables.
diamonds.pivot_table(index='color', values=['price'], aggfunc=['count','mean', 'std', 'min', 'max']).round(2).sort_values(by=('mean', 'price'), ascending=False)| count | mean | std | min | max | |
|---|---|---|---|---|---|
| price | price | price | price | price | |
| color | |||||
| J | 2808 | 5323.82 | 4438.19 | 335 | 18710 |
| I | 5422 | 5091.87 | 4722.39 | 334 | 18823 |
| H | 8304 | 4486.67 | 4215.94 | 337 | 18803 |
| G | 11292 | 3999.14 | 4051.10 | 354 | 18818 |
| F | 9542 | 3724.89 | 3784.99 | 342 | 18791 |
| D | 6775 | 3169.95 | 3356.59 | 357 | 18693 |
| E | 9797 | 3076.75 | 3344.16 | 326 | 18731 |
- This is efficient and neat