Visualizing Data Distribution with pandas.DataFrame.plot.box


  • pandas.DataFrame: The core data structure in pandas, a two-dimensional labeled table that holds data in rows and columns.

  • plot: An accessor on the DataFrame that provides plotting functionality. It lets you create various charts and visualizations directly from your DataFrame.

  • box: A method on the plot accessor that generates box plots. A box plot summarizes the distribution of numerical data through its quartiles: the box extends from the first quartile (Q1) to the third quartile (Q3), with a line at the median (Q2). Whiskers extend from the box to the most extreme data points that lie within 1.5 times the interquartile range of the box (Matplotlib's default), and any points beyond the whiskers are drawn individually as outliers.
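
To make the quartile and whisker terms concrete, here is a minimal sketch that computes the numbers a box plot draws. The sample values are made up for illustration, and the 1.5 × IQR rule is Matplotlib's default setting rather than something specific to pandas.

```python
import pandas as pd

# Hypothetical sample values, chosen only for illustration
ages = pd.Series([25, 30, 22, 38, 28, 40, 75], name="Age")

# The box spans Q1..Q3 and the line inside it marks the median (Q2)
q1, median, q3 = ages.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Default whisker rule: whiskers reach the most extreme data points
# that still lie within 1.5 * IQR of the box edges
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
whisker_low = ages[ages >= lower_fence].min()
whisker_high = ages[ages <= upper_fence].max()

# Anything beyond the fences is drawn as an individual outlier point
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"whiskers: {whisker_low} to {whisker_high}")
print(f"outliers: {outliers.tolist()}")
```

With these values the upper whisker stops at 40 and the value 75 is reported as an outlier, which matches how it would appear on the plot.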

When you use df.plot.box(), pandas automatically creates a box plot for each numerical column in the DataFrame df.
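
As a rough illustration of the "numerical columns" behavior, the sketch below selects the numeric columns explicitly before plotting, so it is clear which columns end up with a box. The DataFrame contents are made up, and the explicit select_dtypes step is an assumption added for clarity rather than something df.plot.box() requires.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd

# Hypothetical data: one text column and two numeric columns
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 22, 38],
    "Score": [85, 92, 78, 65],
})

# Pick out the numeric columns explicitly; "Name" is left out
numeric_cols = df.select_dtypes(include="number").columns.tolist()
print(numeric_cols)

# One box per selected column, all drawn on the same axes
ax = df[numeric_cols].plot.box()
```

Selecting columns first also makes it easy to plot only a subset, for example df[["Age"]].plot.box().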

Here are some key points to remember:

  • You can customize the box plot with additional keyword arguments to plot.box, such as title for the plot title and color for the boxes. The method returns a Matplotlib Axes object, so axis labels and other details can also be adjusted afterwards.
  • By default, a box is drawn for every numerical column in the DataFrame. To plot only some of them, select those columns first, for example df[['Age', 'Height']].plot.box().
  • This method works on numerical data only. Non-numeric columns, such as names or other categorical text, are left out of the plot.
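
The customization options mentioned above can be sketched as follows. This is only one possible combination: the title text, the color names, and the "Value" axis label are arbitrary choices for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd

df = pd.DataFrame({
    "Age": [25, 30, 22, 38, 28, 40],
    "Height": [170, 182, 168, 175, 172, 180],
})

# title, color, grid and figsize are passed as keyword arguments;
# the color dict assigns a color to each part of the box plot
ax = df.plot.box(
    title="Age and Height",
    color={
        "boxes": "steelblue",
        "whiskers": "gray",
        "medians": "darkred",
        "caps": "gray",
    },
    grid=True,
    figsize=(6, 4),
)

# The returned Matplotlib Axes can be adjusted further
ax.set_ylabel("Value")
```

Working with the returned Axes keeps the customization tied to this particular plot instead of relying on whichever figure happens to be current.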


Example 1: Basic Box Plot

import pandas as pd
import matplotlib.pyplot as plt

# Create a sample DataFrame with numerical columns
data = {'Age': [25, 30, 22, 38, 28, 40],
        'Height': [170, 182, 168, 175, 172, 180]}
df = pd.DataFrame(data)

# Create a box plot for all numerical columns
df.plot.box()

# Display the plot (optional, you might need this depending on your environment)
plt.show()

This code creates a DataFrame with two columns, "Age" and "Height", containing numerical data. Then, it uses df.plot.box() to generate a box plot for both columns on the same graph. Finally, it displays the plot using plt.show().

Example 2: Box Plot Grouped by a Category

import pandas as pd
import matplotlib.pyplot as plt

# Sample data with a categorical column
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily', 'Frank'],
        'Age': [25, 30, 22, 38, 28, 40],
        'Score': [85, 92, 78, 65, 90, 82]}
df = pd.DataFrame(data)

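# Create a separate horizontal box of Age for each Name using DataFrame.boxplot.
# Each name appears only once here, so every box collapses to a single line;
# grouping is more informative when a category has several rows.
df.boxplot(by='Name', column='Age', vert=False)

# Remove the automatic "Boxplot grouped by Name" super-title
plt.suptitle('')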

# Customize the plot with title, labels, and grid
plt.title('Age Distribution by Name')
plt.xlabel('Age')
plt.ylabel('Name')
plt.grid(True)

# Display the plot
plt.show()
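
In the example above every name occurs once, so each group holds a single value. As a hedged variation, the sketch below groups by a made-up Team column that repeats, which gives each box several observations to summarize. The Team labels and the reuse of the Score values are assumptions for illustration only.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: each team has three scores
df = pd.DataFrame({
    "Team": ["A", "A", "A", "B", "B", "B"],
    "Score": [85, 92, 78, 65, 90, 82],
})

# Draw one box of Score per Team on an Axes we create ourselves
fig, ax = plt.subplots()
df.boxplot(column="Score", by="Team", ax=ax)

# Replace the automatic titles with our own
fig.suptitle("")
ax.set_title("Score Distribution by Team")
ax.set_xlabel("Team")
ax.set_ylabel("Score")

# The median line of each box corresponds to the per-group median
group_medians = df.groupby("Team")["Score"].median()
print(group_medians)
```

Passing ax explicitly makes it unambiguous which Axes the title and labels apply to.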


Alternatives to df.plot.box

  • Seaborn: This popular library, built on top of Matplotlib, provides a high-level interface for statistical graphics. Its sns.boxplot function offers:

    • Grouping: You can group data by a categorical variable and draw the box plots for each group side by side.
    • Coloring: You can customize the colors of the boxes and other plot elements for better visual distinction.
    • Statistical aesthetics: Seaborn's default style for box plots is more polished than Matplotlib's.
  • Plotly: This library creates interactive visualizations, for example through plotly.express.box. Its box plots offer:

    • Interactivity: Users can zoom, pan, and hover over elements for additional information.
    • Customization: A wide range of options lets you adjust nearly every aspect of the box plot.
    • Web-based output: The resulting visualizations render in the browser and can be shared online.
  • Bokeh: Another library focused on interactive visualizations, with capabilities similar to Plotly's:

    • Interactivity: Users can interact with the plot to explore the data.
    • Customization: Extensive options are available for adjusting the box plot's appearance.
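
As a rough illustration of the Seaborn grouping described above, here is a minimal sketch using sns.boxplot. It assumes Seaborn is installed, and the Team and Score columns are made-up data for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd
import seaborn as sns

# Hypothetical data: each team has three scores
df = pd.DataFrame({
    "Team": ["A", "A", "A", "B", "B", "B"],
    "Score": [85, 92, 78, 65, 90, 82],
})

# x is the categorical grouping variable, y the numeric variable;
# Seaborn draws one box per team side by side
ax = sns.boxplot(data=df, x="Team", y="Score")
ax.set_title("Score Distribution by Team")
```

Seaborn takes the axis labels from the column names, so the grouped plot needs little extra setup.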

Choosing the right alternative depends on your needs

  • Plotly/Bokeh: Perfect if you need interactive visualizations for exploration and sharing online.
  • Seaborn: Ideal for creating publication-quality box plots with clear grouping and aesthetics.