Understanding pandas.Categorical for Efficient Categorical Data Representation


pandas.Categorical

  • Represents categorical data: a variable that takes its values from a limited, usually fixed, set of categories.
  • Useful for encoding qualitative or ordinal variables (e.g., colors, sizes, customer ratings).
  • Saves memory compared to string columns when values repeat, because each value is stored as a small integer code that points into the list of categories.
  • Can optionally keep the categories in a defined order (e.g., shirt sizes: small < medium < large).
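
The memory claim above can be checked directly. The following is a minimal sketch, assuming a made-up column of three labels repeated many times (the data and variable names are illustrative only); actual savings depend on how many distinct values a column has relative to its length.

```python
import pandas as pd

# Hypothetical data: three distinct labels repeated many times.
sizes = pd.Series(["small", "medium", "large"] * 10_000, dtype=object)
as_category = sizes.astype("category")

# deep=True counts the Python string objects, not just the pointers to them.
object_bytes = sizes.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object dtype:   {object_bytes:,} bytes")
print(f"category dtype: {category_bytes:,} bytes")

# With only three categories, pandas can store each code in a single byte.
print(as_category.cat.codes.dtype)
```

The categorical version stores one small integer per row plus a single copy of each label, which is why it comes out smaller when labels repeat.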

Key Points

  • Creation
    • From an existing array-like object (list, NumPy array) using pd.Categorical().
    • By specifying dtype="category" when creating a pandas Series.
    • By converting a Series or DataFrame column using series.astype("category").
  • Properties
    • categories: Index containing the valid categories.
    • codes: Integer array giving, for each element, the position of its category in categories (-1 marks a missing value).
    • ordered: Boolean indicating whether the categories have a meaningful order (default False).
  • Functionality
    • Perform operations such as sorting and counting while preserving categorical information.
    • Access an element with indexing (e.g., categorical[0]), its integer code with categorical.codes[0], and a category label with categorical.categories[0].
    • Add or remove categories using add_categories() or remove_categories().
    • Reorder categories using reorder_categories(). These methods return a new Categorical rather than modifying the original.
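
The creation routes and properties listed above can be sketched as follows. The sample values and the size ordering are illustrative assumptions, not taken from a real dataset.

```python
import pandas as pd

values = ["small", "large", "medium", "small"]
size_order = ["small", "medium", "large"]  # illustrative ordering

# 1. Directly from a list, with explicit, ordered categories.
cat = pd.Categorical(values, categories=size_order, ordered=True)

# 2. While building a Series.
s1 = pd.Series(values, dtype="category")

# 3. By converting an existing Series.
s2 = pd.Series(values).astype("category")

print(list(cat.categories))  # ['small', 'medium', 'large']
print(cat.codes.tolist())    # [0, 2, 1, 0]
print(cat.ordered)           # True

# Without explicit categories, pandas infers them in sorted order, unordered.
print(list(s1.cat.categories))  # ['large', 'medium', 'small']
print(s1.cat.ordered)           # False

# Ordered categoricals support comparisons and min/max.
print(cat.min(), cat.max())      # small large
print((cat > "small").tolist())  # [False, True, True, False]
```

Passing the categories explicitly is what gives "small" the lowest code here; relying on inference would sort the labels alphabetically instead.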

Relationship to pandas Arrays

  • pandas.Categorical is itself a pandas extension array (a subclass of ExtensionArray).
  • When a pandas Series or DataFrame column has a categorical dtype, its values are stored in a Categorical, which you can retrieve with Series.array.
  • This combination leverages the strengths of both:
    • Categorical data representation with memory efficiency.
    • Vectorized operations that work on the integer codes rather than on the original values.
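
One way to see how a categorical Series stores its values is to inspect Series.array. The sketch below assumes a recent pandas release, where this attribute returns a Categorical and Categorical is a subclass of ExtensionArray; the sample colors are made up.

```python
import pandas as pd

s = pd.Series(["red", "green", "red"], dtype="category")

# The array that backs the Series.
backing = s.array
print(type(backing).__name__)                                 # Categorical
print(isinstance(backing, pd.api.extensions.ExtensionArray))  # True

# The same integer codes are visible through the array and the .cat accessor.
print(backing.codes.tolist())  # [1, 0, 1]
print(s.cat.codes.tolist())    # [1, 0, 1]
```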

Basic Usage

import pandas as pd

data = ["apple", "banana", "orange", "apple"]
# "citrus" is declared but unused; "orange" is deliberately left out
categories = ["apple", "banana", "citrus"]

# Create an ordered Categorical (an array, not a Series)
categorical = pd.Categorical(data, categories=categories, ordered=True)

# Values missing from `categories` become NaN, so "orange" is lost here
print(categorical.tolist())        # Output: ['apple', 'banana', nan, 'apple']
print(categorical.codes.tolist())  # Output: [0, 1, -1, 0]  (-1 marks a missing value)

# Access a value, its integer code, and a category label
value = categorical[0]
code = categorical.codes[0]
category_name = categorical.categories[1]

print(f"Value at position 0: {value}")          # Output: Value at position 0: apple
print(f"Code at position 0: {code}")            # Output: Code at position 0: 0
print(f"Category at index 1: {category_name}")  # Output: Category at index 1: banana

# Reorder categories. The method returns a new Categorical rather than
# modifying in place (the inplace argument was removed in pandas 2.0).
categorical = categorical.reorder_categories(["banana", "apple", "citrus"])
print(list(categorical.categories))  # Output: ['banana', 'apple', 'citrus']


Handling Missing Values

import pandas as pd
import numpy as np

data = ["apple", "banana", np.nan, "apple"]
categories = ["apple", "banana", "orange"]

categorical = pd.Categorical(data, categories=categories)

# Check for missing values (stored internally as code -1)
print(categorical.isna().tolist())  # Output: [False, False, True, False]

# Replace missing values with a label. The label must already be a
# category, so add it first; otherwise fillna raises an error.
categorical = categorical.add_categories("missing").fillna("missing")
print(categorical.tolist())          # Output: ['apple', 'banana', 'missing', 'apple']
print(list(categorical.categories))  # Output: ['apple', 'banana', 'orange', 'missing']
print(categorical.codes.tolist())    # Output: [0, 1, 3, 0]

Using with DataFrames

import pandas as pd

data = {"fruit": ["apple", "banana", "orange", "apple"],
        "size": ["large", "medium", "small", "medium"]}
categories_fruit = ["apple", "banana", "orange"]
categories_size = ["small", "medium", "large"]

df = pd.DataFrame({"fruit": pd.Categorical(data["fruit"], categories=categories_fruit),
                   "size": pd.Categorical(data["size"], categories=categories_size)})

print(df)
# Output:
#     fruit    size
# 0   apple   large
# 1  banana  medium
# 2  orange   small
# 3   apple  medium

# Group by a categorical column and count rows per category.
# observed=False keeps categories that have no rows; passing it explicitly
# avoids depending on the default, which differs between pandas versions.
fruit_counts = df.groupby("fruit", observed=False).size()
print(fruit_counts)
# Output:
# fruit
# apple     2
# banana    1
# orange    1
# dtype: int64

Performing Operations

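import pandas as pd
from pandas.api.types import union_categoricals

data = ["apple", "banana", "apple", "orange"]
categories = ["apple", "banana", "citrus"]

# "orange" is not a listed category, so it is stored as NaN
categorical = pd.Categorical(data, categories=categories)

# Get the frequency of each category. Unused categories are reported
# with a count of 0, and NaN is dropped by default.
category_counts = categorical.value_counts()
print(category_counts.to_dict())  # Output: {'apple': 2, 'banana': 1, 'citrus': 0}

# Combine two categoricals. pd.concat only accepts Series and DataFrame
# objects, so use union_categoricals for bare Categorical arrays.
cat1 = pd.Categorical(["apple", "banana"])
cat2 = pd.Categorical(["orange", "apple"])
combined = union_categoricals([cat1, cat2])
print(combined.tolist())          # Output: ['apple', 'banana', 'orange', 'apple']
print(list(combined.categories))  # Output: ['apple', 'banana', 'orange']
print(combined.codes.tolist())    # Output: [0, 1, 2, 0]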

Using with String Methods (Limited Support)

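import pandas as pd

data = ["apple", "Banana", "apple", "Orange"]

# The .str accessor lives on Series, not on a bare Categorical
series = pd.Series(data, dtype="category")

# String methods work, but the result is a plain string Series
lowered = series.str.lower()
print(lowered.tolist())  # Output: ['apple', 'banana', 'apple', 'orange']

# To stay categorical, transform the category labels instead
renamed = series.cat.rename_categories(str.lower)
print(renamed.dtype)  # Output: category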

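Remember that the .str accessor is available only on a Series with a categorical dtype, not on a bare Categorical, and that most string methods return an ordinary Series that has lost its categorical dtype. When you need to keep the categorical information, prefer the methods designed for it under the .cat accessor, such as rename_categories().

Alternatives to pandas.Categorical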

String Columns

  • Simplest approach.
  • Stores each value as its own string.
  • Can be inefficient for repeated values because the same text is stored many times.
  • Not ideal for categorical-specific operations such as custom ordering.

Object dtype

  • The generic dtype that string columns have traditionally used in pandas.
  • Can hold values of different Python types within the same column.
  • Not specifically designed for categorical data, making operations like sorting or counting less efficient.

Dictionary Mapping

  • Create a dictionary mapping integer codes to category names and store only the codes in the data.
  • The mapping lives outside the data, so it has to be stored, passed around, and kept in sync by hand.
  • Custom logic is needed for operations like sorting by label or counting by name.
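
The dictionary approach can be sketched as follows. The mapping, the fruit names, and the variable names are hypothetical and only illustrate the manual bookkeeping involved.

```python
import pandas as pd

# Hypothetical mapping, maintained by hand.
code_to_name = {0: "apple", 1: "banana", 2: "orange"}
name_to_code = {name: code for code, name in code_to_name.items()}

fruits = pd.Series(["apple", "banana", "orange", "apple"])

# Encode: replace each label with its integer code.
codes = fruits.map(name_to_code)
print(codes.tolist())  # [0, 1, 2, 0]

# Decode: the mapping has to be applied again whenever labels are needed.
decoded = codes.map(code_to_name)
print(decoded.tolist())  # ['apple', 'banana', 'orange', 'apple']

# Counting by label requires decoding first.
counts = decoded.value_counts()
print(counts["apple"])  # 2
```

This is essentially what pandas.Categorical does internally, except that here both dictionaries must be carried alongside the data by the caller.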

Custom Enum Classes (Python 3.4+)

  • Define an enum class to represent the categories.
  • More explicit representation of categories.
  • Requires custom code for data manipulation and may not integrate seamlessly with pandas operations.
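
A minimal sketch of the enum approach, assuming a hypothetical Size enum and made-up input data. It also shows one possible way to bridge back to pandas.Categorical when pandas operations are needed.

```python
from enum import Enum

import pandas as pd


class Size(Enum):
    """Hypothetical enum; the numeric values define the intended order."""
    SMALL = 1
    MEDIUM = 2
    LARGE = 3


raw = ["large", "small", "medium", "small"]
members = [Size[name.upper()] for name in raw]

# A Series of Enum members falls back to the generic object dtype.
s = pd.Series(members)
print(s.dtype)  # object

# Plain Enum members are not orderable, so sorting needs an explicit key.
sorted_names = [m.name for m in sorted(members, key=lambda m: m.value)]
print(sorted_names)  # ['SMALL', 'SMALL', 'MEDIUM', 'LARGE']

# Bridge to pandas: reuse the enum's definition order as category order.
cat = pd.Categorical(
    [m.name for m in members],
    categories=[m.name for m in Size],
    ordered=True,
)
print(cat.codes.tolist())  # [2, 0, 1, 0]
```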

When to Choose pandas.Categorical

  • Use pandas.Categorical when you have categorical data with a limited set of possible values.
  • It offers memory efficiency compared to string columns, especially for repeated categories.
  • You can leverage built-in methods for sorting, counting, and other categorical-specific operations.

When to Consider Alternatives

  • Memory usage is a critical concern and you want full control over the integer encoding, at the cost of writing custom code for handling categories (e.g., dictionary mapping).
  • You need more generic data storage capabilities beyond categories (e.g., object dtype).