Visualizing Data Distributions with pandas.DataFrame.boxplot
Purpose
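- Visualize the distribution of numerical data in your DataFrame columns.
- Present key statistics through the box and whiskers: the box spans the quartiles (25th, 50th, and 75th percentiles), and by default the whiskers extend to the most extreme values within 1.5 times the interquartile range of the box. Points beyond the whiskers are drawn individually as outliers.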
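To make the statistics behind the box and whiskers concrete, the sketch below computes them by hand for the Age values used in the examples. It assumes the default whisker rule of 1.5 times the interquartile range; the variable names are illustrative only.

```python
import pandas as pd

# Same Age values as in the examples in this article
ages = pd.Series([25, 30, 22, 38, 40, 28], name="Age")

# The box: 25th, 50th, and 75th percentiles
q1, median, q3 = ages.quantile([0.25, 0.50, 0.75])
iqr = q3 - q1

# Assumed default whisker rule: values within 1.5 * IQR of the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside the fences
inside = ages[(ages >= lower_fence) & (ages <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything outside the fences would be drawn as an individual outlier point
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"whiskers: {whisker_low} to {whisker_high}, outliers: {list(outliers)}")
```

For this small sample no value falls outside the fences, so the whiskers reach the minimum and maximum of the data.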
How it works
- Input
  You can call df.boxplot() on your DataFrame df to create a boxplot for all numeric columns by default. Alternatively, you can specify the columns you want to visualize using the column parameter.
- Grouping (Optional)
  If you want to create boxplots for groups within your data, you can use the by parameter. This allows you to compare distributions across different categories in a single plot.
Customization (Optional)
- Output
  By default (return_type='axes'), the method returns the Matplotlib axes object on which the plot is drawn. You can pass return_type='dict' to get a dictionary containing the lines that make up the boxplot elements (boxes, caps, whiskers, medians) for further customization.
- Appearance
  You can control aspects like font size (fontsize), label rotation (rot), grid visibility (grid), and figure size (figsize) using the provided arguments.
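The return value and appearance options described above can be combined as in the following sketch. The specific colors, line widths, and title are arbitrary choices for illustration, not recommended settings.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'Age': [25, 30, 22, 38, 40, 28],
                   'Height': [170, 182, 168, 175, 180, 172]})

# Create the figure explicitly so its size is controlled in one place
fig, ax = plt.subplots(figsize=(6, 4))

# return_type='dict' gives access to the individual line artists
parts = df.boxplot(ax=ax, return_type='dict',
                   fontsize=10, rot=45, grid=False)

# Restyle selected elements after the plot has been drawn
for median_line in parts['medians']:
    median_line.set(color='red', linewidth=2)
for box in parts['boxes']:
    box.set(linewidth=1.5)

ax.set_title("Age and Height with highlighted medians")
fig.tight_layout()
```

Call plt.show() afterwards to display the figure, as in the examples in this article.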
Example 1: Basic Boxplot for all numeric columns
import pandas as pd
import matplotlib.pyplot as plt
# Create a sample DataFrame
data = {'Age': [25, 30, 22, 38, 40, 28],
        'Height': [170, 182, 168, 175, 180, 172]}
df = pd.DataFrame(data)
# Create a boxplot
df.boxplot()
plt.title("Distribution of Age and Height")
plt.show()
Example 2: Boxplot for a specific column
# Create a boxplot for the 'Age' column only
df.boxplot(column=['Age'])
plt.title("Distribution of Age")
plt.show()
Example 3: Boxplots grouped by a category
# Add a 'Gender' category
data['Gender'] = ['M', 'M', 'F', 'M', 'F', 'F']
df = pd.DataFrame(data)
# Create boxplots for Age by Gender
df.boxplot(by='Gender', column=['Age'])
plt.suptitle("Distribution of Age by Gender")
plt.show()
seaborn.boxplot
- Pros
  Provides polished default styling and a higher-level interface built on top of Matplotlib. Offers additional features like color palettes, violin plots (which show data density), and better legend handling.
- Cons
  Requires an additional library (seaborn) to be installed.
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
# Sample DataFrame (same as the grouped example, including the 'Gender' column)
data = {'Age': [25, 30, 22, 38, 40, 28],
        'Height': [170, 182, 168, 175, 180, 172],
        'Gender': ['M', 'M', 'F', 'M', 'F', 'F']}
df = pd.DataFrame(data)
# Create a seaborn boxplot of Age for each Gender
sns.boxplot(data=df, x="Gender", y="Age", showmeans=True)  # showmeans=True marks each group's mean
plt.title("Distribution of Age by Gender (Seaborn)")
plt.show()
Matplotlib boxplot directly
- Pros
  Offers full control over plot elements for customization.
- Cons
  Requires more code than pandas or seaborn for a basic boxplot.
import pandas as pd
import matplotlib.pyplot as plt
# Sample DataFrame (same as the first example)
data = {'Age': [25, 30, 22, 38, 40, 28],
        'Height': [170, 182, 168, 175, 180, 172]}
df = pd.DataFrame(data)
# Create a boxplot with Matplotlib; pass the column as a 1-D Series so a single box is drawn
# vert=False draws a horizontal boxplot (newer Matplotlib versions prefer orientation='horizontal')
plt.boxplot(df['Age'], vert=False)
plt.title("Distribution of Age (Matplotlib)")
plt.show()
Other visualization libraries
- Plotly
  Interactive boxplots with zooming and panning capabilities.
- Bokeh
  Another interactive visualization library with a focus on web-based plots.
These libraries offer different functionalities and trade-offs in terms of ease of use, customization, and interactivity. Choose the one that best suits your project requirements.
Other plot types for exploring distributions
- Histograms
  Visualize the frequency distribution of data points.
- Violin plots
  Show data density along with quartiles, useful for skewed or multimodal distributions.
- ECDF plots
  Show the empirical cumulative distribution function for comparing distributions.
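As a rough illustration of two of these plot types, the sketch below draws a histogram and an ECDF of the Age values side by side. The ECDF is computed by hand with NumPy rather than with a dedicated library function, and the bin count is an arbitrary choice for such a small sample.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

ages = pd.Series([25, 30, 22, 38, 40, 28], name="Age")

# ECDF: sort the values; the i-th sorted value has cumulative proportion i / n
x = np.sort(ages.to_numpy())
y = np.arange(1, len(x) + 1) / len(x)

fig, (ax_hist, ax_ecdf) = plt.subplots(1, 2, figsize=(8, 3))

# Histogram: counts of values falling into each bin
ages.plot.hist(ax=ax_hist, bins=4, title="Histogram of Age")

# ECDF drawn as a step function that jumps at each observed value
ax_ecdf.step(x, y, where="post")
ax_ecdf.set(title="ECDF of Age", xlabel="Age", ylabel="Proportion <= x")

fig.tight_layout()
```

Call plt.show() afterwards to display the figure.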