pandas.DataFrame.quantile Explained: Your Guide to Calculating Percentiles in DataFrames

What are quantiles?

Quantiles are cut points that divide a sorted dataset into equal-sized groups. Percentiles are quantiles expressed on a 0-100 scale, so the 0.25 quantile is the same as the 25th percentile. The most common quantile is the median, which is the 50th percentile and splits the data in half. Other common quantiles include:

25th percentile (Q1): the lower quartile. About 25% of the values fall at or below it.
75th percentile (Q3): the upper quartile. About 75% of the values fall at or below it.
Together with the median, Q1 and Q3 divide the data into fourths.
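
As a small illustration of these definitions, the sketch below computes Q1, the median, and Q3 for nine evenly spaced values. The sample data and variable names are made up for this example.

```python
import pandas as pd

# Nine evenly spaced values make the quartiles easy to check by eye.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

q1 = values.quantile(0.25)      # 25th percentile
median = values.quantile(0.50)  # 50th percentile
q3 = values.quantile(0.75)      # 75th percentile
print(q1, median, q3)

# The median splits the data in half: equally many values on each side.
below = int((values < median).sum())
above = int((values > median).sum())
print(below, above)
```

With such a small sample the shares on each side of Q1 and Q3 are only approximately 25% and 75%; the approximation improves as the dataset grows.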

How does pandas.DataFrame.quantile work?

df.quantile(q=0.5, axis=0, numeric_only=False, interpolation='linear')

q (default: 0.5)
This is the quantile value you want to calculate, given as a fraction between 0 and 1. It can be a single float (e.g., 0.25 for Q1) or an array of quantiles.
axis (default: 0)
This specifies the direction of the calculation. With axis=0 (or 'index'), quantiles are computed down the rows, giving one value per column. With axis=1 (or 'columns'), they are computed across the columns, giving one value per row.
numeric_only (default: False since pandas 2.0; True in earlier versions)
Set this to True to restrict the calculation to numeric columns. With False, datetime and timedelta columns are included as well.
interpolation (default: 'linear')
This option determines how to handle quantiles that fall between two data points. The options are 'linear', 'lower', 'higher', 'midpoint', and 'nearest'.
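
To make the interpolation and axis parameters more concrete, here is a minimal sketch on made-up data. The quantile q=0.4 is chosen because it falls between two data points, so each interpolation option can give a different answer.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40])

# q=0.4 lands at position 0.4 * (4 - 1) = 1.2 in the sorted data,
# that is, between the values 20 and 30.
results = {
    method: float(s.quantile(0.4, interpolation=method))
    for method in ["linear", "lower", "higher", "midpoint", "nearest"]
}
print(results)

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})
per_column = df.quantile(0.5, axis=0)  # one median per column
per_row = df.quantile(0.5, axis=1)     # one median per row
print(per_column)
print(per_row)
```

Here 'linear' gives roughly 22, 'lower' and 'nearest' give 20, 'higher' gives 30, and 'midpoint' gives 25.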

Return value

If q is a single float, the method returns a Series whose index holds the column names and whose values are the quantile for each column. The Series is named after q.
If q is an array, the method returns a DataFrame where the index represents the quantiles, and the columns represent the original DataFrame's columns. Each cell contains the corresponding quantile value.
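
The two return shapes can be checked directly. This is a small sketch using illustrative data; the variable names are arbitrary.

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 30, 35, 40],
                   "salary": [50000, 70000, 80000, 90000]})

single = df.quantile(0.5)             # scalar q -> Series
several = df.quantile([0.25, 0.75])   # list of q -> DataFrame
one_item_list = df.quantile([0.5])    # a one-element list still gives a DataFrame

print(type(single).__name__, single.name, list(single.index))
print(type(several).__name__, list(several.index), list(several.columns))
print(type(one_item_list).__name__, one_item_list.shape)
```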

Example

import pandas as pd

data = {'age': [25, 30, 35, 40], 'salary': [50000, 70000, 80000, 90000]}
df = pd.DataFrame(data)

# Calculate median (50th percentile) for each column
median = df.quantile(0.5)
print(median)

# Calculate quartiles (25th and 75th percentile) for each column
quartiles = df.quantile([0.25, 0.75])
print(quartiles)

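This prints the median of each column as a Series (32.5 for 'age' and 75000.0 for 'salary'), followed by a DataFrame with one row per requested quartile and one column each for 'age' and 'salary'.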

Calculating multiple quantiles

import pandas as pd

data = {'price': [10, 15, 20, 25, 30, 35, 40], 'rating': [3.5, 4.2, 5.0, 4.8, 4.1, 3.9, 3.7]}
df = pd.DataFrame(data)

# Get percentiles 10th, 50th, and 90th for each column
percentiles = df.quantile([0.1, 0.5, 0.9])
print(percentiles)

This code calculates the 10th, 50th (median), and 90th percentiles for both 'price' and 'rating' columns and stores the results in a new DataFrame named percentiles.
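
Because the result is indexed by the quantile values themselves, individual entries can be looked up with .loc. The following sketch repeats the data above; the names median_row, price_p90, and price_spread are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"price": [10, 15, 20, 25, 30, 35, 40],
                   "rating": [3.5, 4.2, 5.0, 4.8, 4.1, 3.9, 3.7]})
percentiles = df.quantile([0.1, 0.5, 0.9])

# One row of the result: the median of every column, as a Series
median_row = percentiles.loc[0.5]

# A single cell: the 90th percentile of 'price'
price_p90 = percentiles.loc[0.9, "price"]

# Range covered by the middle 80% of prices
price_spread = percentiles.loc[0.9, "price"] - percentiles.loc[0.1, "price"]

print(median_row)
print(price_p90, price_spread)
```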

Calculating quantiles on specific columns

import pandas as pd

data = {'age': [22, 28, 31, 35, 42], 'height': [170, 182, 175, 168, 185]}
df = pd.DataFrame(data)

# Get quartiles (25th and 75th) only for the 'age' column
quartiles_age = df['age'].quantile([0.25, 0.75])
print(quartiles_age)

This code calculates the quartiles only for the 'age' column and stores the results in a Series named quartiles_age.

Considering non-numeric data (with numeric_only=False)

import pandas as pd

data = {'date': pd.to_datetime(['2023-01-01', '2023-02-15', '2023-03-31', '2023-04-10']),
        'stock_price': [100, 120, 115, 130]}
df = pd.DataFrame(data)

# Get median for both 'date' (datetime) and 'stock_price'
median_all = df.quantile(0.5, numeric_only=False)
print(median_all)

This code includes a datetime column named 'date'. By setting numeric_only=False, it calculates the median for both the 'date' column (as the median date) and the 'stock_price' column.

Using numpy.percentile

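numpy.percentile is a function from the NumPy library used for calculating percentiles (quartiles included) on NumPy arrays. It expects percentiles on a 0-100 scale (50 for the median), whereas pandas expects fractions between 0 and 1 (0.5 for the median).
Because it works directly on arrays, it can be faster than pandas.DataFrame.quantile for simple calculations, especially on smaller datasets, although the difference depends on the data and the library versions.
However, it does not handle labels, mixed-type DataFrames, or non-numeric data the way pandas does, and it does not skip missing values: a NaN in the input produces a NaN result unless you use numpy.nanpercentile. Interpolation is controlled through its own parameter (named method in NumPy 1.22 and later, interpolation in older releases).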

import pandas as pd
import numpy as np

data = {'price': [10, 15, 20, 25, 30]}
df = pd.DataFrame(data)

# Calculate median (50th percentile) using numpy
median_numpy = np.percentile(df['price'], 50)

# Calculate it using pandas for comparison
median_pandas = df['price'].quantile(0.5)

print(median_numpy, median_pandas)
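
One practical difference between the two is how missing values are treated. The sketch below, on made-up data, compares the pandas result with numpy.percentile and numpy.nanpercentile when the column contains a NaN.

```python
import numpy as np
import pandas as pd

prices = pd.Series([10.0, 15.0, np.nan, 25.0, 30.0])

# pandas skips NaN values when computing the quantile
pandas_median = prices.quantile(0.5)

# numpy.percentile propagates NaN; note the 0-100 scale (50, not 0.5)
numpy_median = np.percentile(prices.to_numpy(), 50)

# numpy.nanpercentile ignores NaN values
numpy_nan_median = np.nanpercentile(prices.to_numpy(), 50)

print(pandas_median, numpy_median, numpy_nan_median)
```

If the data may contain gaps, numpy.nanpercentile is the closer match to what pandas does by default.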

Looping through columns

If you only need basic quantile calculations on specific columns, you can loop through those columns and use built-in tools like sorted and indexing to calculate the desired quantiles.
This approach is less efficient and less readable than pandas.DataFrame.quantile, but it might be suitable for simple cases.
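
A minimal sketch of this manual approach is shown here. The helper manual_quantile is hypothetical (it is not part of pandas or NumPy) and assumes a non-empty column without missing values; it uses linear interpolation between neighbouring sorted values so that it agrees with the pandas default.

```python
import math
import pandas as pd

def manual_quantile(values, q):
    """Hypothetical helper: linearly interpolated quantile of a non-empty sequence."""
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(values)
    position = q * (len(ordered) - 1)   # fractional index into the sorted data
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

df = pd.DataFrame({"age": [22, 28, 31, 35, 42],
                   "height": [170, 182, 175, 168, 185]})

# Loop over the columns of interest and collect Q1 and Q3 for each
manual_quartiles = {}
for column in ["age", "height"]:
    column_values = df[column].tolist()
    manual_quartiles[column] = (manual_quantile(column_values, 0.25),
                                manual_quantile(column_values, 0.75))
print(manual_quartiles)
```

For these columns the loop should give the same quartiles as df[['age', 'height']].quantile([0.25, 0.75]), which is a convenient way to check a hand-written version.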

Alternative libraries for data analysis

Libraries like Dask or Vaex are designed for handling larger datasets and might offer different functionalities for calculating quantiles.
These libraries are typically used for specific big data scenarios and have steeper learning curves compared to pandas.

Which approach fits best depends on:

The size of your data (for performance considerations)
The level of control you need over interpolation and data types
Your familiarity with other libraries