Extracting Information from Strings with NumPy's char.partition()


Functionality

  • Splits each element of a NumPy array of strings (or a single string) at the first occurrence of a specified separator.
  • Returns a new array holding three parts for each input element:
    • The part before the separator
    • The separator itself
    • The part after the separator (everything to the right of the first occurrence)
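
As a rough illustration of the behavior described above, the following sketch compares np.char.partition() with Python's built-in str.partition() applied one string at a time. The sample strings and the variable names (words, vectorized, manual) are made up for this example.

```python
import numpy as np

# Hypothetical sample data for illustration
words = np.array(['key=value', 'a=b=c', 'plain'])

# Vectorized partition over the whole array
vectorized = np.char.partition(words, '=')

# The same result built with plain Python, one string at a time
manual = np.array([w.partition('=') for w in words])

print(vectorized)
print(np.array_equal(vectorized, manual))
```

Note that 'a=b=c' is split only at the first '=', so everything after it ('b=c') stays together in the third part.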

Syntax

import numpy as np

output = np.char.partition(input_array, separator)

Parameters

  • input_array: A NumPy array of strings, or a single string.
  • separator: The substring (delimiter) at which to split. It is matched literally; regular expressions are not supported.

Return Value

  • A new NumPy array with one more dimension than the input: the input shape plus a trailing axis of length 3. Along that axis, each element holds:
    • The part before the separator
    • The separator itself
    • The part after the separator
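
To make the shape of the return value concrete, here is a small sketch with a 2-D input. The array contents and the names grid, parts, before, and after are invented for illustration.

```python
import numpy as np

# Hypothetical 2-D input array of strings
grid = np.array([['a-b', 'c-d'],
                 ['e-f', 'gh']])

parts = np.char.partition(grid, '-')

print(grid.shape)    # shape of the input
print(parts.shape)   # input shape plus a trailing axis of length 3

# Index the last axis to pull out one piece for every element
before = parts[..., 0]
after = parts[..., 2]
print(before)
print(after)
```

Indexing with `...` keeps the example independent of how many dimensions the input has.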

Behavior if Separator Not Found

  • If the separator is not found in a string element, the three parts for that element are:
    • The original string itself
    • Two empty strings, one for the separator and one for the part after the separator

Example

import numpy as np

data = np.array(['apple-banana-cherry', 'grapefruit', 'orange'])
separator = '-'

result = np.char.partition(data, separator)

print(result)

Output

[['apple' '-' 'banana-cherry']
 ['grapefruit' '' '']
 ['orange' '' '']]

Key Points

  • char.partition() operates element-wise on the input array.
  • It's useful for splitting strings at a delimiter and extracting a specific part.
  • It splits only at the first occurrence of the separator. To split at every occurrence, use np.char.split() or str.split(); for regular expressions, apply Python's re module element-wise, since the np.char functions match separators literally.
  • The np.char module is considered legacy in recent NumPy releases, which point new code to numpy.strings instead, so check the documentation for your NumPy version.


Extracting File Extensions

import numpy as np

filenames = np.array(['image.jpg', 'data.csv', 'report.pdf', 'noname'])
separator = '.'

extensions = np.char.partition(filenames, separator)[:, -1]  # Extract only extensions

print(extensions)

This code splits each filename at the first dot (.) and keeps the part after it in a separate array. Because 'noname' contains no dot, its extension is an empty string.
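
Filenames with more than one dot behave differently, because the split happens at the first dot. One possible way to handle them, sketched below, is np.char.rpartition(), which splits at the last occurrence instead. The filenames and variable names here are hypothetical.

```python
import numpy as np

# Hypothetical filenames, including one with two dots and one with none
filenames = np.array(['archive.tar.gz', 'image.jpg', 'noname'])

# Splitting at the FIRST dot keeps everything after it
first_dot = np.char.partition(filenames, '.')[:, -1]

# Splitting at the LAST dot isolates the final extension
rparts = np.char.rpartition(filenames, '.')

# When no dot is found, rpartition puts the whole string in the last slot,
# so use the separator column to detect that case
has_dot = rparts[:, 1] == '.'
last_dot = np.where(has_dot, rparts[:, 2], '')

print(first_dot)
print(last_dot)
```

The has_dot check matters because rpartition() places an unsplit string in the last position rather than the first.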

Handling Missing Separators

import numpy as np

data = np.array(['apple', 'banana-cherry', 'orange'])
separator = '-'

result = np.char.partition(data, separator)

# Check for missing separators (empty second element)
missing_separator = result[:, 1] == ''
print(data[missing_separator])  # Print elements without separators

This code builds a boolean mask from the separator column, uses it to select the elements of data that lack the separator (-), and prints them.
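
The same mask can also be used the other way round, to keep only the elements that were actually split. The sketch below assumes a made-up list of "key=value" entries and builds a dictionary from the well-formed ones; the names entries, has_sep, config, and skipped are illustrative only.

```python
import numpy as np

# Hypothetical entries; 'debug' has no '=' separator
entries = np.array(['host=localhost', 'debug', 'port=8080'])

parts = np.char.partition(entries, '=')

# True where the separator was found
has_sep = parts[:, 1] != ''

# Keep keys and values only for entries that were split
keys = parts[has_sep, 0]
values = parts[has_sep, 2]
config = dict(zip(keys.tolist(), values.tolist()))

# Entries without a separator, for reporting or separate handling
skipped = entries[~has_sep]

print(config)
print(skipped)
```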

Using Regular Expressions (more advanced)

import re
import numpy as np

text = np.array(['This is a sentence. Here is another.', 'No separators here'])
pattern = re.compile(r'\.')  # Regex for a literal dot (period)

# np.char.split() matches its separator literally, so apply re.split() per element
result = [pattern.split(s) for s in text]

print(result)

This code uses a regular expression (r'\.') with Python's re module to split each string at every period (.) and returns a list of the substrings for each element. The np.char functions do not interpret regular expressions, so the pattern is applied to one element at a time.



str.split()

  • Python's built-in string method; widely used and flexible for splitting strings.
  • Operates on individual strings, not on NumPy arrays directly.
  • Can be applied element-wise to a NumPy array with a helper such as np.vectorize().

Example

import numpy as np

data = np.array(['apple-banana-cherry', 'grapefruit', 'orange'])
separator = '-'

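def split_func(string, sep):
  # Each call returns a list whose length depends on the string
  return string.split(sep)

# otypes=[object] keeps each result as a list inside an object array
result = np.vectorize(split_func, otypes=[object])(data, separator)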

print(result)

np.char.split()

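  • Similar to str.split(), but works on NumPy arrays directly.
  • Returns an object array in which each element is a list of substrings.
  • Matches the separator literally; it does not support regular expressions. With no separator, it splits on runs of whitespace.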

Example (basic split)

import numpy as np

data = np.array(['apple-banana-cherry', 'grapefruit', 'orange'])
separator = '-'

result = np.char.split(data, separator)

print(result)
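
Because each string can yield a different number of pieces, the result of np.char.split() is an object array of Python lists rather than a regular 2-D string array. The following sketch, with illustrative variable names (pieces, counts, first), shows one way to work with it.

```python
import numpy as np

data = np.array(['apple-banana-cherry', 'grapefruit', 'orange'])

pieces = np.char.split(data, '-')

# Each element is a Python list, so the array has dtype object
print(pieces.dtype)

# Per-element operations need a loop or comprehension over the lists
counts = np.array([len(p) for p in pieces])
first = np.array([p[0] for p in pieces])

print(counts)
print(first)
```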

Example (splitting on whitespace)

import numpy as np

text = np.array(['This is a sentence. Here is another.', 'No separators here'])

# With no separator, each element is split on runs of whitespace (like str.split())
result = np.char.split(text)

print(result)

  • To split at every occurrence of a separator, use np.char.split(), or str.split() with vectorization; for regular expressions, apply Python's re module element-wise.
  • If you only need to split at the first occurrence and work with NumPy arrays, char.partition() is sufficient (np.char is a legacy module in recent NumPy releases, so check the documentation for your version).