Visualizing Data Distributions with pandas.DataFrame.boxplot


Purpose

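  • Visualize the distribution of numerical data in your DataFrame columns.
  • Present key statistics through the box and whiskers: the box spans the 25th to 75th percentiles, a line inside it marks the median (50th percentile), and the whiskers extend to the most extreme values that lie within 1.5 times the interquartile range of the box (Matplotlib's default). Values beyond the whiskers are drawn as individual outlier points, so the whiskers mark the minimum and maximum only when there are no outliers.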
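
To make the box-and-whisker statistics concrete, the sketch below computes them by hand with pandas. It assumes Matplotlib's default whisker rule (1.5 times the interquartile range), and the sample values, including the deliberately extreme 75, are made up for illustration.

```python
import pandas as pd

# Hypothetical sample values; 75 is included on purpose to act as an outlier
ages = pd.Series([25, 30, 22, 38, 40, 28, 75], name="Age")

# Quartiles that define the box and the median line
q1, median, q3 = ages.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1  # interquartile range (height of the box)

# Assumed default whisker rule: reach at most 1.5 * IQR beyond the box
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside those limits
lower_whisker = ages[ages >= lower_limit].min()
upper_whisker = ages[ages <= upper_limit].max()

# Anything outside the limits would be drawn as an individual point
outliers = ages[(ages < lower_limit) | (ages > upper_limit)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"whiskers: {lower_whisker} to {upper_whisker}")
print(f"outliers: {outliers.tolist()}")
```

Here the upper whisker stops at 40 rather than at the maximum, because 75 lies beyond the upper limit and is reported as an outlier instead.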

How it works

  1. Input
    Call df.boxplot() on your DataFrame df to create a boxplot for every numeric column by default. Alternatively, pass the column names you want to visualize through the column parameter.
  2. Grouping (Optional)
    If you want to create boxplots for groups within your data, you can use the by parameter. This allows you to compare distributions across different categories in a single plot.

Customization (Optional)

  • Output
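    By default, the method returns the Matplotlib Axes object the plot is drawn on (return_type='axes'). Pass return_type='dict' to get a dictionary of the Matplotlib lines that make up the boxplot (boxes, caps, whiskers, medians, fliers) for further customization, or return_type='both' to get the axes and the dictionary together. When by is used, the return value is a collection covering the grouped subplots rather than a single object.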
  • Appearance
    You can control aspects like font size (fontsize), label rotation (rot), grid visibility (grid), and figure size (figsize) using the provided arguments.
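
As a rough illustration of the options above, the following sketch combines the appearance arguments with return_type='dict' and then restyles the median lines. The sample data and the chosen colors and sizes are assumptions made for this example only.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Sample data (hypothetical values for illustration)
df = pd.DataFrame({'Age': [25, 30, 22, 38, 40, 28],
                   'Height': [170, 182, 168, 175, 180, 172]})

# Appearance arguments plus return_type='dict' to get the drawn line objects back
parts = df.boxplot(return_type='dict',
                   fontsize=10,     # tick label font size
                   rot=45,          # rotate the column labels
                   grid=False,      # hide the background grid
                   figsize=(6, 4))  # figure size in inches

# The dictionary maps element names to lists of Matplotlib line objects
for median_line in parts['medians']:
    median_line.set(color='red', linewidth=2)

plt.title("Customized boxplot")
```

Call plt.show() afterwards to display the figure. Because parts holds ordinary Matplotlib artists, any of their usual setters can be applied in the same way to the boxes, whiskers, or caps.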


Example 1: Basic Boxplot for all numeric columns

import pandas as pd
import matplotlib.pyplot as plt

# Create a sample DataFrame
data = {'Age': [25, 30, 22, 38, 40, 28],
        'Height': [170, 182, 168, 175, 180, 172]}
df = pd.DataFrame(data)

# Create a boxplot
df.boxplot()
plt.title("Distribution of Age and Height")
plt.show()

Example 2: Boxplot for a specific column

# Create a boxplot for the 'Age' column only
df.boxplot(column=['Age'])
plt.title("Distribution of Age")
plt.show()

Example 3: Boxplot grouped by a category

# Add a 'Gender' category
data['Gender'] = ['M', 'M', 'F', 'M', 'F', 'F']
df = pd.DataFrame(data)

# Create boxplots for Age by Gender
df.boxplot(by='Gender', column=['Age'])
# With by=..., pandas sets its own figure-level title; suptitle replaces it
plt.suptitle("Distribution of Age by Gender")
plt.show()


seaborn.boxplot

  • Cons
    Requires an additional library (seaborn) to be installed.
  • Pros
    Provides a higher-level interface built on top of Matplotlib with more polished default styling. Offers additional features such as color palettes, better legend handling, and related plot types such as violin plots (sns.violinplot), which also show data density.

import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

# Sample DataFrame (same data as the earlier examples, including 'Gender')
data = {'Age': [25, 30, 22, 38, 40, 28],
        'Height': [170, 182, 168, 175, 180, 172],
        'Gender': ['M', 'M', 'F', 'M', 'F', 'F']}
df = pd.DataFrame(data)

# Create seaborn boxplot of Age for each Gender
sns.boxplot(x="Gender", y="Age", data=df, showmeans=True)  # showmeans marks each group's mean
plt.title("Distribution of Age by Gender (Seaborn)")
plt.show()

Matplotlib boxplot directly

  • Cons
    Requires more code compared to pandas or seaborn for a basic boxplot.
  • Pros
    Offers full control over plot elements for customization.

import pandas as pd
import matplotlib.pyplot as plt

# Sample DataFrame (same Age and Height data as the earlier examples)
data = {'Age': [25, 30, 22, 38, 40, 28],
        'Height': [170, 182, 168, 175, 180, 172]}
df = pd.DataFrame(data)

# Create boxplot with Matplotlib from a single column
plt.boxplot(df['Age'], vert=False)  # vert=False draws the box horizontally
plt.title("Distribution of Age (Matplotlib)")
plt.show()
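
To show what that finer control can look like, here is a minimal sketch that fills each box with its own color and restyles the median lines. The colors and the two-column layout are choices made for this example, not part of the original walkthrough.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Sample data (same values as the examples above)
df = pd.DataFrame({'Age': [25, 30, 22, 38, 40, 28],
                   'Height': [170, 182, 168, 175, 180, 172]})

fig, ax = plt.subplots()

# patch_artist=True draws the boxes as filled patches that can be colored
bp = ax.boxplot([df['Age'].to_numpy(), df['Height'].to_numpy()],
                patch_artist=True)

# Style each element individually through the returned dictionary
for box, color in zip(bp['boxes'], ['lightblue', 'lightgreen']):
    box.set_facecolor(color)
for median_line in bp['medians']:
    median_line.set(color='black', linewidth=2)

# Boxes are placed at x = 1, 2, ... by default, so label those positions
ax.set_xticks([1, 2])
ax.set_xticklabels(['Age', 'Height'])
ax.set_title("Customized boxplot (Matplotlib)")
```

Call plt.show() to display the result. The same dictionary also exposes 'whiskers', 'caps', and 'fliers' for further styling.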

Other visualization libraries

  • Plotly
    Produces interactive boxplots with zooming and panning capabilities.
  • Bokeh
    Another interactive visualization library, focused on plots rendered in the web browser.

These libraries differ in ease of use, customization, and interactivity. Choose the one that best suits your project requirements.

Alternative plot types

  • ECDF plots
    Show the empirical cumulative distribution function for comparing distributions.
  • Histograms
    Visualize the frequency distribution of data points.
  • Violin plots
    Show data density along with quartiles, useful for skewed or multimodal distributions whose shape a boxplot hides.
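
For comparison, the three plot types above can be sketched side by side using only Matplotlib and NumPy. This is an illustrative sketch: the sample values are the Age column from the earlier examples, and the ECDF is computed by hand rather than with a dedicated library function.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Age values from the earlier examples
ages = pd.Series([25, 30, 22, 38, 40, 28], name='Age')
values = ages.to_numpy()

fig, (ax_ecdf, ax_hist, ax_violin) = plt.subplots(1, 3, figsize=(12, 4))

# ECDF: fraction of observations less than or equal to each sorted value
x = np.sort(values)
y = np.arange(1, len(x) + 1) / len(x)
ax_ecdf.step(x, y, where='post')
ax_ecdf.set_title("ECDF")

# Histogram: counts of observations per bin
ax_hist.hist(values, bins=5)
ax_hist.set_title("Histogram")

# Violin plot: kernel density estimate with the median marked
ax_violin.violinplot(values, showmedians=True)
ax_violin.set_title("Violin plot")

fig.tight_layout()
```

Call plt.show() to display the figure. With only six data points the density estimate in the violin plot is very rough, so treat it as a demonstration of the API rather than a meaningful picture of the data.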