Understanding pandas.Categorical for Efficient Categorical Data Representation
pandas.Categorical
- Represents categorical data: a data type with a limited set of possible values (categories).
- Useful for encoding variables with qualitative or ordinal characteristics (e.g., colors, sizes, customer ratings).
- Saves memory compared to string columns when the same few categories repeat many times, because each value is stored as a small integer code.
- Optionally maintains an order for ordinal variables (e.g., shirt sizes).
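The memory claim above is easy to check on your own data. The following is a minimal sketch, assuming a made-up column of three labels repeated many times; the exact byte counts depend on your pandas version and platform, so none are shown.

```python
import pandas as pd

# Hypothetical data: three distinct labels repeated many times
labels = ["small", "medium", "large"] * 10_000

as_object = pd.Series(labels, dtype="object")
as_category = as_object.astype("category")

# deep=True also counts the Python string objects, not just the pointers to them
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object dtype:   {object_bytes} bytes")
print(f"category dtype: {category_bytes} bytes")
```

The saving shrinks as the number of distinct values approaches the number of rows, so a column of mostly unique strings gains little from the conversion.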
Key Points
- Creation
  - From an existing array-like object (list, NumPy array) using pd.Categorical().
  - By specifying dtype="category" when creating a pandas Series.
  - By converting a Series or DataFrame column using series.astype("category").
- Properties
  - categories: Index containing the valid categories.
  - codes: Integer codes giving the position of each value in categories (-1 marks a missing value).
  - ordered: Boolean indicating whether the categories have a meaningful order (default False).
- Functionality
  - Reorder categories using reorder_categories().
  - Add or remove categories using add_categories() or remove_categories().
  - Access values, codes, or categories using indexing (e.g., categorical[0] for the value, categorical.codes[0] for its code, categorical.categories[0] for a category).
  - Perform operations such as sorting and counting while preserving categorical information.
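The three creation routes and the three properties can be sketched together as follows. The size labels are illustrative; note that when no categories are given, pandas infers them from the data in sorted order.

```python
import pandas as pd

sizes = ["small", "large", "medium", "small"]

# 1. Directly from a list, with explicit categories and an order
cat = pd.Categorical(sizes, categories=["small", "medium", "large"], ordered=True)

# 2. By passing dtype="category" to the Series constructor
s1 = pd.Series(sizes, dtype="category")

# 3. By converting an existing Series
s2 = pd.Series(sizes).astype("category")

print(list(cat.categories))      # ['small', 'medium', 'large']
print(cat.codes.tolist())        # [0, 2, 1, 0]
print(cat.ordered)               # True

# Inferred categories are sorted and unordered
print(list(s1.cat.categories))   # ['large', 'medium', 'small']
print(s1.cat.ordered)            # False
```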
Relationship to pandas Arrays
- pandas.Categorical is itself a pandas array: it is an extension array type (a subclass of pandas.api.extensions.ExtensionArray).
- When a pandas Series or DataFrame column has the category dtype, its values are stored in a Categorical, which is accessible through the .array attribute of the Series.
- This combination leverages the strengths of both:
  - Categorical data representation with memory efficiency.
  - Vectorized operations that can work on the integer codes instead of the repeated values.
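To see how a categorical Series relates to its backing array, you can inspect the .array attribute. This is a small sketch with made-up color labels:

```python
import pandas as pd
from pandas.api.extensions import ExtensionArray

s = pd.Series(["red", "green", "red"], dtype="category")

# The values of a category-dtype Series live in a Categorical
backing = s.array
print(type(backing).__name__)               # Categorical
print(isinstance(backing, ExtensionArray))  # True

# Inferred categories are sorted, so "green" is code 0 and "red" is code 1
print(backing.codes.tolist())               # [1, 0, 1]
```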
import pandas as pd

data = ["apple", "banana", "orange", "apple"]
categories = ["apple", "banana", "citrus"]  # "orange" is deliberately left out

# Create an ordered Categorical; values not listed in categories become NaN
categorical = pd.Categorical(data, categories=categories, ordered=True)
print(categorical)
# Output (the categories dtype label can differ between pandas versions):
# ['apple', 'banana', NaN, 'apple']
# Categories (3, object): ['apple' < 'banana' < 'citrus']
print(categorical.codes)  # Output: [ 0  1 -1  0]

# Access values, codes, and categories
value = categorical[0]
code = categorical.codes[0]
category_name = categorical.categories[1]
print(f"Value at position 0: {value}")  # Output: Value at position 0: apple
print(f"Code for 'apple': {code}")  # Output: Code for 'apple': 0
print(f"Category at index 1: {category_name}")  # Output: Category at index 1: banana

# Reorder categories; the method returns a new Categorical
# (the inplace argument was removed in pandas 2.0)
categorical = categorical.reorder_categories(["banana", "apple", "citrus"])
print(list(categorical.categories))  # Output: ['banana', 'apple', 'citrus']
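Setting ordered=True matters because it enables comparisons, min/max, and sorting by category order rather than alphabetically. A brief sketch, assuming the usual small < medium < large ordering for shirt sizes:

```python
import pandas as pd

sizes = pd.Categorical(
    ["medium", "small", "large", "small"],
    categories=["small", "medium", "large"],
    ordered=True,
)

# min() and max() follow the category order, not alphabetical order
print(sizes.min(), sizes.max())      # small large

# Element-wise comparison against a category
print((sizes > "small").tolist())    # [True, False, True, False]

# Sorting also respects the declared order
print(list(sizes.sort_values()))     # ['small', 'small', 'medium', 'large']
```

On an unordered Categorical, min(), max(), and the < and > comparisons raise a TypeError.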
Handling Missing Values
import pandas as pd
import numpy as np

data = ["apple", "banana", np.nan, "apple"]
categories = ["apple", "banana", "orange"]
categorical = pd.Categorical(data, categories=categories)

# Check for missing values
print(categorical.isna())  # Output: [False False  True False]

# Replace missing values with a category; the fill value must already be
# one of the categories, so add it first
categorical = categorical.add_categories("missing").fillna("missing")
print(categorical)
# Output (the categories dtype label can differ between pandas versions):
# ['apple', 'banana', 'missing', 'apple']
# Categories (4, object): ['apple', 'banana', 'orange', 'missing']
print(categorical.codes)  # Output: [0 1 3 0]
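Missing values are not a category of their own: pandas stores them with the code -1. The same happens to any value that is not in the category list, which can silently turn typos into NaN. A short sketch with an illustrative "kiwi" value that is not among the categories:

```python
import numpy as np
import pandas as pd

cat = pd.Categorical(["apple", np.nan, "kiwi"], categories=["apple", "banana"])

# Both the explicit NaN and the unlisted "kiwi" get the code -1
print(cat.codes.tolist())          # [0, -1, -1]
print(cat.isna().tolist())         # [False, True, True]
print("kiwi" in cat.categories)    # False
```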
Using with DataFrames
import pandas as pd

data = {"fruit": ["apple", "banana", "orange", "apple"],
        "size": ["large", "medium", "small", "medium"]}
categories_fruit = ["apple", "banana", "orange"]
categories_size = ["small", "medium", "large"]

df = pd.DataFrame({"fruit": pd.Categorical(data["fruit"], categories=categories_fruit),
                   "size": pd.Categorical(data["size"], categories=categories_size)})
print(df)
# Output:
#     fruit    size
# 0   apple   large
# 1  banana  medium
# 2  orange   small
# 3   apple  medium

# Group by a categorical column and count rows;
# observed=False also reports categories that have no rows
fruit_counts = df.groupby("fruit", observed=False).size()
print(fruit_counts)
# Output:
# fruit
# apple     2
# banana    1
# orange    1
# dtype: int64
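A categorical column only sorts by its category order when the dtype is ordered. The sketch below uses a hypothetical item column and an assumed small < medium < large ordering to show a DataFrame sorted by size rather than alphabetically:

```python
import pandas as pd

df = pd.DataFrame({"item": ["a", "b", "c", "d"],
                   "size": ["large", "medium", "small", "medium"]})

# Declare the order once and reuse the dtype wherever it is needed
size_dtype = pd.CategoricalDtype(["small", "medium", "large"], ordered=True)
df["size"] = df["size"].astype(size_dtype)

# kind="stable" keeps tied rows in their original order
sorted_df = df.sort_values("size", kind="stable")
print(sorted_df["size"].tolist())  # ['small', 'medium', 'medium', 'large']
print(sorted_df["item"].tolist())  # ['c', 'b', 'd', 'a']
```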
Performing Operations
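import pandas as pd
from pandas.api.types import union_categoricals

data = ["apple", "banana", "apple", "orange"]
categories = ["apple", "banana", "citrus"]
# "orange" is not one of the categories, so it becomes NaN
categorical = pd.Categorical(data, categories=categories)

# Get the frequency of each category
# (NaN is excluded by default; unused categories are reported as 0)
category_counts = categorical.value_counts()
print(category_counts)
# Output (the footer line varies between pandas versions):
# apple     2
# banana    1
# citrus    0

# Combine Categoricals; pd.concat() only accepts Series and DataFrame
# objects, so use union_categoricals() instead
cat1 = pd.Categorical(["apple", "banana"])
cat2 = pd.Categorical(["orange", "apple"])
combined = union_categoricals([cat1, cat2])
print(combined)
# Output (the categories dtype label can differ between pandas versions):
# ['apple', 'banana', 'orange', 'apple']
# Categories (3, object): ['apple', 'banana', 'orange']
print(combined.codes)  # Output: [0 1 2 0]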
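Concatenating categorical Series needs similar care: if the inputs have different category sets, the result falls back to a non-categorical dtype. One way to avoid that, sketched here with an assumed shared category list, is to cast every input to the same CategoricalDtype first:

```python
import pandas as pd

s1 = pd.Series(["apple", "banana"], dtype="category")
s2 = pd.Series(["orange", "apple"], dtype="category")

# Different category sets: the concatenated result is no longer categorical
plain = pd.concat([s1, s2], ignore_index=True)
print(isinstance(plain.dtype, pd.CategoricalDtype))  # False

# Sharing one dtype up front keeps the result categorical
shared = pd.CategoricalDtype(["apple", "banana", "orange"])
kept = pd.concat([s1.astype(shared), s2.astype(shared)], ignore_index=True)
print(isinstance(kept.dtype, pd.CategoricalDtype))   # True
print(kept.cat.codes.tolist())                       # [0, 1, 2, 0]
```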
Using with String Methods (Limited Support)
import pandas as pd

data = ["apple", "Banana", "apple", "Orange"]
categorical = pd.Categorical(data)

# A Categorical has no .str accessor; wrap it in a Series to use string methods
lowered = pd.Series(categorical).str.lower()
print(lowered.tolist())  # Output: ['apple', 'banana', 'apple', 'orange']
print(lowered.dtype)  # No longer category: the result is a plain string column
Remember that string methods are only available through the .str accessor of a Series, and their results generally do not keep the category dtype. When possible, prefer methods designed for categorical data, such as rename_categories().
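If the goal is only to change the labels, rename_categories() accepts a callable and keeps the data categorical. A minimal sketch; it assumes the transformed labels stay unique, since pandas rejects duplicate category names:

```python
import pandas as pd

cat = pd.Categorical(["apple", "Banana", "apple", "Orange"])

# Apply str.lower to each category label; the integer codes stay untouched
lowered = cat.rename_categories(str.lower)

print(type(lowered).__name__)     # Categorical
print(list(lowered))              # ['apple', 'banana', 'apple', 'orange']

# Inferred categories were sorted with uppercase first, and that order is kept
print(list(lowered.categories))   # ['banana', 'orange', 'apple']
```

Because only the category labels are touched, this does one string operation per category instead of one per row.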
String Columns
- Simplest approach.
- Stores each value as a string.
- Can be inefficient for repeated categories because the same strings are stored again and again.
- Not ideal for performing categorical-specific operations.
Object dtype
- Similar to string columns but more generic.
- Can hold various data types within the same column.
- Not specifically designed for categorical data, making operations like sorting or counting less efficient.
Dictionary Mapping
- Create a dictionary mapping integer codes to category names.
- Codes and names are stored separately, so you have to keep the two in sync yourself.
- Custom logic needed for operations like sorting or counting.
Custom Enum Classes (Python 3.4+)
- Define an enum class to represent the categories.
- More explicit representation of categories.
- Requires custom code for data manipulation and may not integrate seamlessly with pandas operations.
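To make the trade-off concrete, here is a rough sketch of the dictionary-mapping approach next to pandas.Categorical. The mapping names (code_to_name, name_to_code) are hypothetical and exist only for this illustration:

```python
import pandas as pd

data = ["apple", "banana", "apple", "orange"]

# Manual approach: a hand-maintained mapping between codes and names
code_to_name = {0: "apple", 1: "banana", 2: "orange"}
name_to_code = {name: code for code, name in code_to_name.items()}
manual_codes = [name_to_code[value] for value in data]

# Counting needs custom logic
manual_counts = {}
for code in manual_codes:
    name = code_to_name[code]
    manual_counts[name] = manual_counts.get(name, 0) + 1

# pandas.Categorical stores the same codes and counts with a built-in method
cat = pd.Categorical(data, categories=list(code_to_name.values()))
print(manual_codes == cat.codes.tolist())  # True
print(manual_counts)                       # {'apple': 2, 'banana': 1, 'orange': 1}
print(cat.value_counts().to_dict())        # {'apple': 2, 'banana': 1, 'orange': 1}
```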
When to Choose pandas.Categorical
- Use pandas.Categorical when you have categorical data with a limited set of possible values.
  - It offers memory efficiency compared to string columns, especially for repeated categories.
  - You can leverage built-in methods for sorting, counting, and other categorical-specific operations.
- Consider an alternative when:
  - You need more generic data storage capabilities beyond categories (e.g., object dtype).
  - Memory usage is a critical concern and you're willing to write custom code for handling categories (e.g., dictionary mapping).