pandas.DataFrame.quantile Explained: Your Guide to Calculating Percentiles in DataFrames
What are quantiles?
Quantiles are cut points that divide a sorted dataset into equal-sized groups; percentiles are quantiles expressed on a 0 to 100 scale. The most common quantile is the median, which is the 50th percentile and splits the data in half. Other common quantiles include:
- 25th percentile (Q1): the first quartile; about 25% of the values fall at or below it.
- 75th percentile (Q3): the third quartile; about 75% of the values fall at or below it.
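To make these definitions concrete, here is a small sketch that checks what share of the values falls at or below each quartile. It assumes pandas is installed and uses made-up data (the integers 1 to 100); the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical data: the integers 1 through 100
values = pd.Series(range(1, 101))

q1 = values.quantile(0.25)
median = values.quantile(0.50)
q3 = values.quantile(0.75)

# Share of the values that fall at or below each cut point
share_q1 = (values <= q1).mean()
share_median = (values <= median).mean()
share_q3 = (values <= q3).mean()

print(q1, median, q3)
print(share_q1, share_median, share_q3)
```

Each share should come out at roughly the fraction that names the quantile: a quarter of the values at or below Q1, half at or below the median, and three quarters at or below Q3.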
How does pandas.DataFrame.quantile work?
df.quantile(q=0.5, axis=0, numeric_only=False, interpolation='linear')
- q (default: 0.5)
This is the quantile you want to calculate, given as a fraction between 0 and 1. It can be a single float (e.g., 0.25 for Q1) or an array of quantiles.
- axis (default: 0)
This specifies the direction of the calculation: axis=0 computes the quantile of each column (working down the rows), and axis=1 computes the quantile of each row (working across the columns).
- numeric_only (default: False in pandas 2.0 and later, True in earlier versions)
When True, only numeric columns are considered. When False, datetime and timedelta columns are included as well. Passing the value explicitly keeps your code consistent across pandas versions.
- interpolation (default: 'linear')
This option determines how to handle quantiles that fall between two data points. The options are 'linear', 'lower', 'higher', 'midpoint', and 'nearest'.
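The effect of the interpolation and axis parameters is easiest to see on tiny inputs. The sketch below uses made-up data in which the requested quantile (q=0.4) falls between the second and third sorted values, so each interpolation option picks a different answer; it then runs the same median calculation along both axes.

```python
import pandas as pd

# Hypothetical data; q=0.4 falls between the 2nd and 3rd sorted values
s = pd.Series([10, 20, 30, 40])

results = {
    method: s.quantile(0.4, interpolation=method)
    for method in ["linear", "lower", "higher", "midpoint", "nearest"]
}
for method, value in results.items():
    print(f"{method:>8}: {value}")

# axis=0 gives one value per column, axis=1 gives one value per row
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})
per_column = df.quantile(0.5, axis=0)
per_row = df.quantile(0.5, axis=1)
print(per_column)
print(per_row)
```

Here 'lower' and 'higher' return the neighbouring data points (20 and 30), 'midpoint' returns their average, and 'linear' weights them by how far the quantile position lies between them.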
Return value
- If q is a single float, the method returns a Series containing the quantile value for each column, indexed by column name.
- If q is an array, the method returns a DataFrame where the index represents the quantiles, and the columns represent the original DataFrame's columns. Each cell contains the corresponding quantile value.
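The two return shapes can be checked directly. This is a minimal sketch with made-up 'age' and 'salary' data; it also shows how to pull a single value out of the DataFrame result by quantile and column name.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({"age": [25, 30, 35, 40], "salary": [50000, 70000, 80000, 90000]})

single = df.quantile(0.5)            # single float -> Series
several = df.quantile([0.25, 0.75])  # list of floats -> DataFrame

print(type(single).__name__, list(single.index))
print(type(several).__name__, list(several.index), list(several.columns))

# Look up one value by quantile (row label) and column name
q3_salary = several.loc[0.75, "salary"]
print(q3_salary)
```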
Example
import pandas as pd
data = {'age': [25, 30, 35, 40], 'salary': [50000, 70000, 80000, 90000]}
df = pd.DataFrame(data)
# Calculate median (50th percentile) for each column
median = df.quantile(0.5)
print(median)
# Calculate quartiles (25th and 75th percentile) for each column
quartiles = df.quantile([0.25, 0.75])
print(quartiles)
This will output the median and quartiles for the 'age' and 'salary' columns in your DataFrame.
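With the default 'linear' interpolation, each of these results can be reproduced by hand: the quantile sits at fractional position q * (n - 1) in the sorted values, and the result is interpolated between the two neighbouring data points. The sketch below checks this for the 25th percentile of 'age'; linear_quantile is a hypothetical helper written for this illustration, not a pandas function.

```python
import math
import pandas as pd

def linear_quantile(values, q):
    """Hypothetical helper: quantile with linear interpolation."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)   # fractional index into the sorted data
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

ages = [25, 30, 35, 40]
by_hand = linear_quantile(ages, 0.25)       # position 0.75 -> 25 + 0.75 * 5
by_pandas = pd.Series(ages).quantile(0.25)
print(by_hand, by_pandas)
```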
Calculating multiple quantiles
import pandas as pd
data = {'price': [10, 15, 20, 25, 30, 35, 40], 'rating': [3.5, 4.2, 5.0, 4.8, 4.1, 3.9, 3.7]}
df = pd.DataFrame(data)
# Get percentiles 10th, 50th, and 90th for each column
percentiles = df.quantile([0.1, 0.5, 0.9])
print(percentiles)
This code calculates the 10th, 50th (median), and 90th percentiles for both 'price' and 'rating' columns and stores the results in a new DataFrame named percentiles.
Calculating quantiles on specific columns
import pandas as pd
data = {'age': [22, 28, 31, 35, 42], 'height': [170, 182, 175, 168, 185]}
df = pd.DataFrame(data)
# Get quartiles (25th and 75th) only for the 'age' column
quartiles_age = df['age'].quantile([0.25, 0.75])
print(quartiles_age)
This code calculates the quartiles only for the 'age' column and stores the results in a Series named quartiles_age.
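The same idea extends to a subset of several columns. In this sketch the 'weight' column is a hypothetical addition, included only so that there is something to leave out; selecting with double brackets yields a sub-DataFrame, so the result is a DataFrame rather than a Series.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [22, 28, 31, 35, 42],
    "height": [170, 182, 175, 168, 185],
    "weight": [60, 82, 77, 58, 90],  # hypothetical extra column
})

# Double brackets select a sub-DataFrame, so the result is a DataFrame too
quartiles_subset = df[["age", "height"]].quantile([0.25, 0.75])
print(quartiles_subset)
```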
Considering non-numeric data (with numeric_only=False)
import pandas as pd
data = {'date': pd.to_datetime(['2023-01-01', '2023-02-15', '2023-03-31', '2023-04-10']),
'stock_price': [100, 120, 115, 130]}
df = pd.DataFrame(data)
# Get median for both 'date' (datetime) and 'stock_price'
median_all = df.quantile(0.5, numeric_only=False)
print(median_all)
This code includes a datetime column named 'date'. By setting numeric_only=False, it calculates the median for both the 'date' column (as the median date) and the 'stock_price' column.
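Text columns are a different matter, since a quantile of strings has no meaning. As an assumption worth verifying on your own pandas version: recent releases are expected to raise a TypeError when quantile meets a string column without numeric_only=True. The sketch below, using a made-up 'ticker' column, passes numeric_only=True explicitly so the text column is skipped.

```python
import pandas as pd

df = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC", "DDD"],  # hypothetical text column
    "stock_price": [100, 120, 115, 130],
})

# Passing numeric_only=True explicitly skips the text column
median_numeric = df.quantile(0.5, numeric_only=True)
print(median_numeric)
```

The result contains only 'stock_price'; the 'ticker' column is left out instead of causing an error.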
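Using numpy.percentile
numpy.percentile is a function from the NumPy library used for calculating percentiles (quartiles included) on NumPy arrays. It expects percentiles on a 0 to 100 scale (50 for the median), whereas quantile expects fractions between 0 and 1.
- It's generally faster than pandas.DataFrame.quantile for simple calculations, especially on smaller datasets.
- It has its own set of interpolation options, selected through the method parameter (named interpolation before NumPy 1.22).
- However, it returns plain NumPy values without column labels, and it does not skip missing values (numpy.nanpercentile does), so it offers less convenience for labeled or mixed-type data.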
import pandas as pd
import numpy as np
data = {'price': [10, 15, 20, 25, 30]}
df = pd.DataFrame(data)
# Calculate median (50th percentile) using numpy
median_numpy = np.percentile(df['price'], 50)
# Calculate it using pandas for comparison
median_pandas = df['price'].quantile(0.5)
print(median_numpy, median_pandas)
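One practical difference shows up as soon as the data has gaps. The sketch below, using a made-up Series with one missing value, compares numpy.percentile, numpy.nanpercentile, and the pandas quantile method.

```python
import numpy as np
import pandas as pd

prices = pd.Series([10.0, np.nan, 30.0])  # hypothetical data with a gap

median_numpy = np.percentile(prices.to_numpy(), 50)         # NaN propagates
median_nan_aware = np.nanpercentile(prices.to_numpy(), 50)  # ignores NaN
median_pandas = prices.quantile(0.5)                        # ignores NaN

print(median_numpy, median_nan_aware, median_pandas)
```

numpy.percentile lets the missing value propagate into the result, while numpy.nanpercentile and pandas both ignore it and work with the remaining values.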
Looping through columns
- If you only need basic quantile calculations on specific columns, you can loop through those columns and use built-in tools like sorted and slicing to calculate the desired quantiles.
- This approach is less efficient and less readable compared to pandas.DataFrame.quantile but might be suitable for simple cases.
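As a rough sketch of this approach, the loop below computes the median of two columns using only sorted and slicing. The data is made up, and median_by_slicing is a hypothetical helper name; it covers the median only and is not a general replacement for quantile.

```python
import pandas as pd

df = pd.DataFrame({"age": [22, 28, 31, 35, 42], "height": [170, 182, 175, 168, 185]})

def median_by_slicing(values):
    """Hypothetical helper: median from the middle of the sorted values."""
    ordered = sorted(values)
    n = len(ordered)
    # The slice holds one value if n is odd, two if n is even
    middle = ordered[(n - 1) // 2 : n // 2 + 1]
    return sum(middle) / len(middle)

medians = {}
for column in ["age", "height"]:
    medians[column] = median_by_slicing(df[column])

print(medians)
```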
Alternative libraries for data analysis
- Libraries like Dask or Vaex are designed for handling larger datasets and might offer different functionalities for calculating quantiles.
- These libraries are typically used for specific big data scenarios and have steeper learning curves compared to pandas.
Which approach to choose depends on:
- The size of your data (for performance considerations)
- The level of control you need over interpolation and data types
- Your familiarity with other libraries