Understanding pandas.Series.nunique: Your Guide to Counting Unique Values


Functionality

  • It examines the values in the Series and returns an integer representing the count of distinct values.
  • By default, it excludes missing values (represented as NaN or None) from the count.

Benefits

  • Useful for tasks like identifying unique categories, filtering duplicates, or analyzing data distribution.
  • Helps understand the variety of data present in a Series.

Example

import pandas as pd

# Create a Series with some data
data = ["apple", "banana", "orange", "apple", "pear"]
fruits = pd.Series(data)

# Count the number of unique fruits
unique_fruits = fruits.nunique()

print(unique_fruits)  # Output: 4

In this example, nunique counts the number of unique fruits ("apple", "banana", "orange", "pear"), excluding duplicates, and returns 4.

  • nunique differs from the built-in len: len(series) counts every element (duplicates and missing values included), while nunique counts only the distinct values. See the short comparison after this list.
  • You can control how missing values are handled using the dropna parameter:
    • dropna=True (default): Excludes missing values.
    • dropna=False: Includes missing values in the count.
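
To make the comparison with len concrete, here is the Series from the example above counted both ways (this is just an illustration of the point in the first bullet):

import pandas as pd

fruits = pd.Series(["apple", "banana", "orange", "apple", "pear"])

print(len(fruits))       # Output: 5 (every element, duplicates included)
print(fruits.nunique())  # Output: 4 (distinct values only)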


Including Missing Values

import pandas as pd

# Create a Series with missing values
data = ["apple", "banana", None, "orange", "apple"]
fruits = pd.Series(data)

# Count unique elements including missing values
unique_fruits_with_missing = fruits.nunique(dropna=False)

print(unique_fruits_with_missing)  # Output: 4

With dropna=False, the missing value (None) is counted as its own distinct value, so the result is 4 rather than the 3 that the default dropna=True would return.
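
For comparison, calling nunique with its default settings on the same Series drops the missing value:

# Default dropna=True: None is excluded from the count
print(fruits.nunique())  # Output: 3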

Counting Unique Integers

import pandas as pd

# Create a Series with integer data
numbers = pd.Series([1, 2, 2, 3, 1, 4])

# Count unique numbers
unique_numbers = numbers.nunique()

print(unique_numbers)  # Output: 4

This example shows nunique working with numerical data (integers) and excluding duplicates.

Using nunique with DataFrames

import pandas as pd

# Create a DataFrame with a Series as a column
data = {'fruits': ["apple", "banana", "orange", "apple", "pear"],
        'numbers': [1, 2, 2, 3, 1]}
df = pd.DataFrame(data)

# Count unique fruits
unique_fruits_df = df['fruits'].nunique()

print(unique_fruits_df)  # Output: 4

This code demonstrates calling nunique on a single DataFrame column, which is itself a Series: it counts the unique fruits in the 'fruits' column.
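
DataFrames also have their own nunique method, which applies the count column by column. As a quick illustration on the df defined above:

# Calling nunique on the whole DataFrame returns one count per column
print(df.nunique())
# fruits     4
# numbers    3
# dtype: int64

The rest of this guide looks at alternative ways to get the same counts without using nunique.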



Using len with set

import pandas as pd

# Create a Series
data = ["apple", "banana", "orange", "apple", "pear"]
fruits = pd.Series(data)

# Convert Series to set to remove duplicates and then get the length
unique_fruits_set = len(set(fruits.values))

print(unique_fruits_set)  # Output: 4

This approach converts the Series values to a set, which inherently removes duplicates, and then len returns the number of elements in the set (the unique values). Note that, unlike nunique with its default settings, a set will generally keep a missing value (None or NaN) as one of its elements, so the two counts can differ on data with missing entries.
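
To see that difference on data with a missing value, here is a small illustration that rebuilds the earlier Series containing None:

# The set keeps None as an element, while nunique drops it by default
fruits_missing = pd.Series(["apple", "banana", None, "orange", "apple"])

print(fruits_missing.nunique())         # Output: 3
print(len(set(fruits_missing.values)))  # Output: 4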

Using value_counts

# Count occurrences of each value
fruit_counts = fruits.value_counts()

# Get the number of unique values (length of Series)
unique_fruits_counts = len(fruit_counts)

print(unique_fruits_counts)  # Output: 4

Here, value_counts creates a Series showing the frequency of each value. The length of this Series (len(fruit_counts)) represents the number of unique values.
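
Like nunique, value_counts skips missing values by default and accepts a dropna parameter, so the two stay consistent on data with missing entries. A small sketch using the Series with a None from earlier:

# value_counts drops None by default; dropna=False keeps it as its own row
fruits_missing = pd.Series(["apple", "banana", None, "orange", "apple"])

print(len(fruits_missing.value_counts()))              # Output: 3
print(len(fruits_missing.value_counts(dropna=False)))  # Output: 4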

Looping with conditional statements (less efficient)

unique_count = 0
seen = set()
for value in fruits.values:
    if value not in seen:
        unique_count += 1
        seen.add(value)

print(unique_count)  # Output: 4

This approach iterates through the Series values. It keeps track of seen values in a set (seen). If a value is not found in the set, it's considered unique, and the counter (unique_count) is incremented. This method is generally less efficient than the other options, especially for large datasets.

  • An explicit Python loop is generally discouraged for large datasets because it processes one element at a time instead of using pandas' optimized internals.
  • The set-based approach works, but it materializes the values as plain Python objects, so it can be slower and use more memory than nunique on very large Series.
  • If you also need the frequency of each unique value, value_counts is the better fit, since it returns both the values and their counts.
  • For most cases, pandas.Series.nunique is the recommended choice due to its efficiency and clarity. A rough way to check the performance claims yourself is sketched below.
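
If you want to verify the performance claims on your own machine, a rough benchmark along these lines works; the Series size, label count, and timing settings here are arbitrary choices for illustration:

import timeit

import numpy as np
import pandas as pd

# Build a large Series of repeated string labels so the differences are measurable
rng = np.random.default_rng(0)
big = pd.Series(rng.integers(0, 1_000, size=1_000_000).astype(str))

def loop_count(s):
    # The explicit-loop approach shown above
    seen = set()
    count = 0
    for value in s.values:
        if value not in seen:
            count += 1
            seen.add(value)
    return count

approaches = [
    ("nunique", lambda: big.nunique()),
    ("len(set(...))", lambda: len(set(big.values))),
    ("len(value_counts())", lambda: len(big.value_counts())),
    ("explicit loop", lambda: loop_count(big)),
]

# Time each approach a few times and report the best observed run
for label, func in approaches:
    best = min(timeit.repeat(func, number=1, repeat=3))
    print(f"{label}: {best:.3f} s")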