pandas.DataFrame.quantile Explained: Your Guide to Calculating Percentiles in DataFrames


What are quantiles?

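Quantiles are cut points that divide a sorted dataset into groups holding equal shares of the values. Percentiles are quantiles expressed on a 0 to 100 scale, so the 0.25 quantile is the 25th percentile. The most common quantile is the median, which is the 50th percentile and splits the data in half. Other common quantiles include: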

  • 25th percentile (Q1): the first quartile. About 25% of the values fall at or below it.
  • 75th percentile (Q3): the third quartile. About 75% of the values fall at or below it.
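
To make the definitions above concrete, here is a minimal sketch on a made-up Series of nine scores (the data and variable names are illustrative assumptions, not part of any real dataset). It computes Q1, the median, and Q3, then checks what share of the values lies at or below Q3.

```python
import pandas as pd

# Hypothetical sample: nine exam scores, listed in sorted order for readability.
scores = pd.Series([52, 58, 61, 67, 70, 74, 79, 85, 93])

q1 = scores.quantile(0.25)      # first quartile (25th percentile)
median = scores.quantile(0.50)  # second quartile (50th percentile)
q3 = scores.quantile(0.75)      # third quartile (75th percentile)
print(q1, median, q3)

# Share of values at or below Q3. This is close to 0.75, but a small
# sample will not hit it exactly.
share_at_or_below_q3 = (scores <= q3).mean()
print(share_at_or_below_q3)
```

With only nine values the shares are approximate, which is why the definitions say "about" 25% and 75%.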

How does pandas.DataFrame.quantile work?

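df.quantile(q=0.5, axis=0, numeric_only=False, interpolation='linear')
  • q (default: 0.5)
    This is the quantile you want to calculate, given as a fraction between 0 and 1. It can be a single float (e.g., 0.25 for Q1) or an array of quantiles.
  • axis (default: 0)
    This specifies the axis to compute along. axis=0 works down the rows and returns one quantile per column; axis=1 works across the columns and returns one quantile per row.
  • numeric_only (default: False in pandas 2.0 and later, True in earlier versions)
    When True, only numeric columns are considered. When False, datetime and timedelta columns are included as well, while columns that cannot be ordered numerically (such as strings) generally raise an error. Passing the argument explicitly keeps the behavior the same across versions.
  • interpolation (default: 'linear')
    This option determines how to handle quantiles that fall between two data points. The options are 'linear', 'lower', 'higher', 'midpoint', and 'nearest'.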
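
The effect of the interpolation options is easiest to see on a quantile that falls between two data points. The sketch below uses a small made-up column (the values and the results dictionary are illustrative assumptions) where the 0.25 quantile lands three quarters of the way from 10 to 20.

```python
import pandas as pd

# Hypothetical data: four evenly spaced values.
df = pd.DataFrame({'value': [10, 20, 30, 40]})

# The 0.25 quantile sits at position 0.25 * (4 - 1) = 0.75,
# that is, between the first value (10) and the second value (20).
results = {}
for option in ['linear', 'lower', 'higher', 'midpoint', 'nearest']:
    results[option] = df['value'].quantile(0.25, interpolation=option)
    print(option, results[option])
```

Here 'linear' gives 17.5, 'lower' gives 10, 'higher' gives 20, 'midpoint' gives 15, and 'nearest' gives 20, because position 0.75 is closer to the second value than to the first.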

Return value

  • If q is a single float, the method returns a Series indexed by column name, containing that quantile for each column.
  • If q is an array, the method returns a DataFrame where the index represents the quantiles, and the columns represent the original DataFrame's columns. Each cell contains the corresponding quantile value.
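
The following sketch inspects the returned objects directly. The DataFrame and variable names are made up for illustration; it also shows that with axis=1 the result is indexed by the original row labels instead of the column names.

```python
import pandas as pd

# Hypothetical two-column frame.
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [10, 20, 30, 40]})

single = df.quantile(0.5)              # Series, one entry per column
multiple = df.quantile([0.25, 0.75])   # DataFrame, one row per quantile
per_row = df.quantile(0.5, axis=1)     # Series, one entry per row

print(type(single).__name__, list(single.index))
print(type(multiple).__name__, list(multiple.index), list(multiple.columns))
print(type(per_row).__name__, list(per_row.index))
```

The Series returned for a single float is also named after the requested quantile (here 0.5), which helps when several results are later combined.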

Example

import pandas as pd

data = {'age': [25, 30, 35, 40], 'salary': [50000, 70000, 80000, 90000]}
df = pd.DataFrame(data)

# Calculate median (50th percentile) for each column
median = df.quantile(0.5)
print(median)

# Calculate quartiles (25th and 75th percentile) for each column
quartiles = df.quantile([0.25, 0.75])
print(quartiles)

This prints the median of each column as a Series (32.5 for 'age' and 75000.0 for 'salary'), followed by a DataFrame with one row for each requested quartile.



Calculating multiple quantiles

import pandas as pd

data = {'price': [10, 15, 20, 25, 30, 35, 40], 'rating': [3.5, 4.2, 5.0, 4.8, 4.1, 3.9, 3.7]}
df = pd.DataFrame(data)

# Get percentiles 10th, 50th, and 90th for each column
percentiles = df.quantile([0.1, 0.5, 0.9])
print(percentiles)

This code calculates the 10th, 50th (median), and 90th percentiles for both 'price' and 'rating' columns and stores the results in a new DataFrame named percentiles.

Calculating quantiles on specific columns

import pandas as pd

data = {'age': [22, 28, 31, 35, 42], 'height': [170, 182, 175, 168, 185]}
df = pd.DataFrame(data)

# Get quartiles (25th and 75th) only for the 'age' column
quartiles_age = df['age'].quantile([0.25, 0.75])
print(quartiles_age)

This code calculates the quartiles only for the 'age' column and stores the results in a Series named quartiles_age.

Considering non-numeric data (with numeric_only=False)

import pandas as pd

data = {'date': pd.to_datetime(['2023-01-01', '2023-02-15', '2023-03-31', '2023-04-10']),
        'stock_price': [100, 120, 115, 130]}
df = pd.DataFrame(data)

# Get median for both 'date' (datetime) and 'stock_price'
median_all = df.quantile(0.5, numeric_only=False)
print(median_all)

This code includes a datetime column named 'date'. By setting numeric_only=False, it calculates the median for both the 'date' column (as the median date) and the 'stock_price' column.
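
Because the default for numeric_only has differed between pandas versions, it can help to pass it explicitly. The sketch below is one way to do that under stated assumptions: the frame, including its text column 'name', is made up for illustration, and the column selection before the second call is a design choice, not something the method requires.

```python
import pandas as pd

# Hypothetical frame with a text, a numeric, and a datetime column.
df = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Cy', 'Dee'],
    'score': [70, 80, 90, 100],
    'joined': pd.to_datetime(['2023-01-01', '2023-01-03',
                              '2023-01-05', '2023-01-07']),
})

# numeric_only=True keeps only the numeric column and drops the rest.
numeric_median = df.quantile(0.5, numeric_only=True)
print(numeric_median)

# To include the datetime column, select the orderable columns first so the
# text column is never passed to quantile, then set numeric_only=False.
mixed_median = df[['score', 'joined']].quantile(0.5, numeric_only=False)
print(mixed_median)
```

Selecting columns up front avoids relying on how a given pandas version treats columns it cannot compute a quantile for.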



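Alternatives to pandas.DataFrame.quantile

Using numpy.percentile

  • numpy.percentile is a function from the NumPy library used for calculating percentiles (quartiles included) on NumPy arrays. It expects percentiles on a 0 to 100 scale, whereas pandas expects fractions between 0 and 1.
  • It can be faster than pandas.DataFrame.quantile for simple calculations, especially on smaller datasets, because it skips pandas' index and label handling.
  • However, it returns plain values without column labels, is aimed at numeric arrays, and does not skip missing values: if the input contains NaN, the result is NaN unless you use numpy.nanpercentile.
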
import pandas as pd
import numpy as np

data = {'price': [10, 15, 20, 25, 30]}
df = pd.DataFrame(data)

# Calculate median (50th percentile) using numpy
median_numpy = np.percentile(df['price'], 50)

# Calculate it using pandas for comparison
median_pandas = df['price'].quantile(0.5)

print(median_numpy, median_pandas)
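
One practical difference between the two is how missing values are treated. The sketch below uses a made-up price column with one NaN (the data and variable names are illustrative assumptions) to compare pandas, numpy.percentile, and numpy.nanpercentile.

```python
import numpy as np
import pandas as pd

# Hypothetical prices with one missing value.
prices = pd.Series([10.0, 15.0, np.nan, 25.0, 30.0])

# pandas skips NaN, so this is the median of 10, 15, 25 and 30.
pandas_median = prices.quantile(0.5)

# numpy.percentile propagates NaN (note the 0 to 100 scale).
numpy_median = np.percentile(prices.to_numpy(), 50)

# numpy.nanpercentile ignores NaN and matches the pandas result.
numpy_nan_median = np.nanpercentile(prices.to_numpy(), 50)

print(pandas_median, numpy_median, numpy_nan_median)
```

If the data may contain gaps, either drop them before calling numpy.percentile or use numpy.nanpercentile.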

Looping through columns

  • If you only need basic quantile calculations on specific columns, you can loop through those columns and use built-ins like sorted and list indexing to calculate the desired quantiles.
  • This approach is less efficient and less readable than pandas.DataFrame.quantile, but it might be suitable for simple cases.
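
A minimal sketch of the looping approach follows. The helper linear_quantile is a hypothetical function written for this example, not part of pandas; it assumes the same position rule as linear interpolation, q * (n - 1), and does not handle missing values.

```python
import pandas as pd

def linear_quantile(values, q):
    """Return the q-th quantile (0 <= q <= 1) using linear interpolation."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    position = q * (len(ordered) - 1)          # fractional index into the sorted list
    lower = int(position)                      # data point at or below the position
    upper = min(lower + 1, len(ordered) - 1)   # next data point, capped at the end
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

data = {'age': [22, 28, 31, 35, 42], 'height': [170, 182, 175, 168, 185]}
df = pd.DataFrame(data)

# Loop over the columns of interest and collect Q1, the median, and Q3.
results = {}
for column in ['age', 'height']:
    values = df[column].tolist()
    results[column] = [linear_quantile(values, q) for q in (0.25, 0.5, 0.75)]

print(results)
```

For this data the loop agrees with df.quantile([0.25, 0.5, 0.75]), but it takes noticeably more code, which is the trade-off described above.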

Alternative libraries for data analysis

  • Libraries like Dask or Vaex are designed for handling larger datasets and may calculate quantiles differently from pandas (for example, approximately).
  • These libraries are typically used for specific big data scenarios and have steeper learning curves compared to pandas.

Which approach to choose depends on:

  • The size of your data (for performance considerations)
  • The level of control you need over interpolation and data types
  • Your familiarity with other libraries