Understanding pandas.Series.nunique: Your Guide to Counting Unique Values

Functionality

nunique examines the values in a Series and returns an integer: the number of distinct values.
By default, it excludes missing values (such as NaN or None) from the count.

Benefits

Useful for tasks like counting distinct categories, checking whether a column contains duplicates, or gauging how varied a column is before analyzing its distribution.
Helps understand the variety of data present in a Series.

Example

import pandas as pd

# Create a Series with some data
data = ["apple", "banana", "orange", "apple", "pear"]
fruits = pd.Series(data)

# Count the number of unique fruits
unique_fruits = fruits.nunique()

print(unique_fruits)  # Output: 4

In this example, nunique counts the number of unique fruits ("apple", "banana", "orange", "pear"), excluding duplicates, and returns 4.

nunique is related to the built-in len function: len counts every element of the Series (duplicates and missing values included), while nunique counts only the distinct values.
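The difference between the two is easy to check directly. Here is a minimal sketch that reuses the fruit data from the example above; the variable names are only illustrative.

```python
import pandas as pd

fruits = pd.Series(["apple", "banana", "orange", "apple", "pear"])

total_elements = len(fruits)        # counts every element, duplicates included
distinct_values = fruits.nunique()  # counts each distinct value once

print(total_elements)   # Output: 5
print(distinct_values)  # Output: 4
```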
You can control how missing values are handled using the dropna parameter:
- dropna=True (default): Excludes missing values.
- dropna=False: Includes missing values in the count.

Including Missing Values

import pandas as pd

# Create a Series with missing values
data = ["apple", "banana", None, "orange", "apple"]
fruits = pd.Series(data)

# Count unique elements including missing values
unique_fruits_with_missing = fruits.nunique(dropna=False)

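print(unique_fruits_with_missing)  # Output: 4

With dropna=False, the missing value (None) is counted as one additional distinct value alongside "apple", "banana", and "orange", so the result is 4. With the default dropna=True, the same Series would give 3.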
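Repeated missing values are a related case worth checking. The sketch below uses a hypothetical readings Series (not part of the original example) and suggests that with dropna=False all NaN entries together contribute a single extra value rather than one per occurrence.

```python
import numpy as np
import pandas as pd

# Two NaN entries alongside two distinct numbers (illustrative data)
readings = pd.Series([1.0, np.nan, 2.0, np.nan, 1.0])

without_missing = readings.nunique()           # NaN ignored
with_missing = readings.nunique(dropna=False)  # NaN counted once

print(without_missing)  # Output: 2
print(with_missing)     # Output: 3
```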

Counting Unique Integers

# Create a Series with integer data
numbers = pd.Series([1, 2, 2, 3, 1, 4])

# Count unique numbers
unique_numbers = numbers.nunique()

print(unique_numbers)  # Output: 4

This example shows nunique working with numerical data (integers) and excluding duplicates.

Using nunique with DataFrames

import pandas as pd

# Create a DataFrame; each column is a Series
data = {'fruits': ["apple", "banana", "orange", "apple", "pear"],
        'numbers': [1, 2, 2, 3, 1]}
df = pd.DataFrame(data)

# Count unique fruits
unique_fruits_df = df['fruits'].nunique()

print(unique_fruits_df)  # Output: 4

This code demonstrates using nunique with a Series within a DataFrame. It counts the unique fruits in the 'fruits' column.
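If you want the count for every column at once, DataFrame also has its own nunique method, which returns one count per column. A minimal sketch, assuming the same data as above:

```python
import pandas as pd

df = pd.DataFrame({
    'fruits': ["apple", "banana", "orange", "apple", "pear"],
    'numbers': [1, 2, 2, 3, 1],
})

# DataFrame.nunique() returns a Series indexed by column name
unique_per_column = df.nunique()

print(unique_per_column['fruits'])   # Output: 4
print(unique_per_column['numbers'])  # Output: 3
```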

Using len with set

import pandas as pd

# Create a Series
data = ["apple", "banana", "orange", "apple", "pear"]
fruits = pd.Series(data)

# Convert Series to set to remove duplicates and then get the length
unique_fruits_set = len(set(fruits.values))

print(unique_fruits_set)  # Output: 4

This approach converts the Series values to a set, which inherently removes duplicates. Then, len is used to get the number of elements in the set (unique values).
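One caveat with the set approach is that a set does not skip missing values the way nunique does by default. The sketch below uses a small illustrative Series with one missing entry to show the difference, and one way to bring the two counts back in line by calling dropna first.

```python
import pandas as pd

fruits = pd.Series(["apple", None, "apple", "pear"])

set_count = len(set(fruits))                      # the missing entry is kept as a set element
nunique_count = fruits.nunique()                  # missing values dropped by default
set_count_no_missing = len(set(fruits.dropna()))  # drop missing values before building the set

print(set_count)             # Output: 3
print(nunique_count)         # Output: 2
print(set_count_no_missing)  # Output: 2
```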

Using value_counts

# Count occurrences of each value
fruit_counts = fruits.value_counts()

# Get the number of unique values (length of the value_counts result)
unique_fruits_counts = len(fruit_counts)

print(unique_fruits_counts)  # Output: 4

Here, value_counts creates a Series showing the frequency of each value. The length of this Series (len(fruit_counts)) represents the number of unique values.
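As a self-contained sketch of how the two results relate: value_counts gives the frequency of each value, and its length should match nunique as long as both use the same dropna setting (both drop missing values by default).

```python
import pandas as pd

fruits = pd.Series(["apple", "banana", "orange", "apple", "pear"])

# Frequency of each distinct value, indexed by the value itself
fruit_counts = fruits.value_counts()

print(fruit_counts["apple"])                  # Output: 2
print(len(fruit_counts) == fruits.nunique())  # Output: True
```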

Looping with conditional statements (Less efficient)

unique_count = 0
seen = set()
for value in fruits.values:
    if value not in seen:
        unique_count += 1
        seen.add(value)

print(unique_count)  # Output: 4

This approach iterates through the Series values and keeps track of the values it has already encountered in a set (seen). If a value is not yet in the set, it is being seen for the first time, so the counter (unique_count) is incremented and the value is added to the set. This method is generally less efficient than the other options, especially for large datasets.

An explicit loop is generally discouraged for large datasets because it processes one element at a time in Python rather than in pandas' optimized code.
The set-based approach is a workable alternative, but it can be slower than nunique on very large datasets, and it does not skip missing values.
If you also need the number of occurrences of each unique value, value_counts is the better fit, since it returns those counts along with the distinct values.
For most cases, pandas.Series.nunique is the recommended choice due to its efficiency and clarity.
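To compare the approaches on your own data, something like the following sketch may help. The test data, the count_with_loop helper, and the approaches dictionary are all hypothetical names introduced here for illustration. No timings are quoted because they depend on the machine, the pandas version, and the data.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical test data: 100,000 integers drawn from 0..999
rng = np.random.default_rng(0)
values = pd.Series(rng.integers(0, 1000, size=100_000))

def count_with_loop(series):
    """Count distinct values with an explicit Python loop."""
    seen = set()
    for value in series.values:
        seen.add(value)
    return len(seen)

approaches = {
    "nunique": lambda: values.nunique(),
    "len(set(...))": lambda: len(set(values.values)),
    "len(value_counts())": lambda: len(values.value_counts()),
    "explicit loop": lambda: count_with_loop(values),
}

# All approaches should agree on data without missing values
results = {name: func() for name, func in approaches.items()}
print(results)

# Time each approach; results will vary from run to run
for name, func in approaches.items():
    seconds = timeit.timeit(func, number=3)
    print(f"{name}: {seconds:.4f} s for 3 runs")
```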