Essential Functionalities for Working with pandas
Data Structures
- Series
A one-dimensional labeled array that can hold data of various types (integers, strings, Python objects, etc.). It's analogous to a list or NumPy array, but with labels attached to each data point, making it easier to select and access elements.
import pandas as pd
data = [1, 2, 3, 4, 5]
s = pd.Series(data, index=['a', 'b', 'c', 'd', 'e'])
print(s)
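As a minimal sketch of what the labels make possible, the Series above can be queried by label instead of by position. The variable names `second` and `subset` are illustrative and not part of the original example.

```python
import pandas as pd

# Same Series as in the example above
s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# Select a single element by its label
second = s['b']

# Select several elements at once by passing a list of labels
subset = s[['a', 'e']]

print(second)
print(subset)
```

Passing a list of labels returns a new Series that keeps the selected labels as its index.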
- DataFrame
A two-dimensional labeled data structure with rows and columns, similar to a spreadsheet or SQL table. Each column is a Series, and each column can have its own data type, so you can store and work with different data types within the same table.
data = {'col1': [1, 2, 3], 'col2': ['a', 'b', 'c']}
df = pd.DataFrame(data)
print(df)
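To illustrate that a DataFrame is built from Series, the short sketch below pulls one column out of the DataFrame above and inspects it. The name `col` is illustrative, and the exact dtype reported for the string column can vary between pandas versions.

```python
import pandas as pd

# Same DataFrame as in the example above
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['a', 'b', 'c']})

# Selecting a single column returns a Series
col = df['col1']

print(type(col))         # the pandas Series class
print(col.dtype)         # an integer dtype
print(df['col2'].dtype)  # a string-like dtype; the exact name depends on the pandas version
```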
Viewing and Inspecting Data
- head()
Displays the first few rows (default: 5) of a DataFrame or Series, providing a quick glimpse at the beginning of your data.
print(df.head())
- tail()
Shows the last few rows (default: 5) of a DataFrame or Series, useful for examining the end of your data.
print(df.tail())
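Both head() and tail() accept an optional row count. The sketch below uses a hypothetical ten-row DataFrame named `numbers` so that the effect of the argument is visible.

```python
import pandas as pd

# Hypothetical DataFrame with ten rows (values 0 through 9)
numbers = pd.DataFrame({'n': range(10)})

# Ask for a specific number of rows instead of the default 5
first_three = numbers.head(3)
last_two = numbers.tail(2)

print(first_three)
print(last_two)
```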
- shape
Returns a tuple representing the dimensions of the DataFrame, indicating the number of rows and columns.
print(df.shape)
- info()
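Prints a concise summary of a DataFrame, including the index range, the non-null count and data type of each column, and memory usage. The method prints the summary itself and returns None, so call it directly rather than wrapping it in print().
df.info()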
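Because info() writes its report directly and returns None, one way to keep the text for later use is to pass a buffer through the `buf` parameter. This is a sketch; the names `result`, `buffer`, and `summary` are illustrative.

```python
import io

import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['a', 'b', 'c']})

# info() prints its report and returns None
result = df.info()

# Redirect the report into an in-memory text buffer instead of the console
buffer = io.StringIO()
df.info(buf=buffer)
summary = buffer.getvalue()

print(result)
print(summary)
```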
- columns
Returns an Index object holding the column names (labels) of the DataFrame. It behaves much like a list, and you can convert it to one when needed.
print(df.columns)
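If a plain Python list of column names is needed, the Index returned by `columns` can be converted. A short sketch, with `cols` and `col_list` as illustrative names:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['a', 'b', 'c']})

# columns is a pandas Index, not a list
cols = df.columns

# Convert it to a regular Python list
col_list = df.columns.tolist()

print(type(cols))
print(col_list)
```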
- dtypes
Returns a Series displaying the data type of each column in the DataFrame.
print(df.dtypes)
Selection and Indexing
- Accessing by Label
You can directly access elements or rows/columns using their labels (index or column names).
print(df['col1']) # Access column 'col1'
print(df.loc[0]) # Access first row by label (index 0)
- Boolean Indexing
Employ boolean expressions to filter and select specific rows or columns based on conditions.
filtered_df = df[df['col1'] > 2] # Select rows where 'col1' is greater than 2
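Conditions can also be combined. The sketch below uses `&` (and) and `|` (or), with each condition wrapped in parentheses because these operators bind more tightly than comparisons. The names `mask`, `both`, and `either` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['a', 'b', 'c']})

# Rows where col1 is greater than 1 AND col2 is not 'c'
mask = (df['col1'] > 1) & (df['col2'] != 'c')
both = df[mask]

# Rows where col1 equals 1 OR col2 equals 'c'
either = df[(df['col1'] == 1) | (df['col2'] == 'c')]

print(both)
print(either)
```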
- Integer-based Indexing
Use positional integers (zero-based) to access rows or columns by their position.
print(df.iloc[1]) # Access second row by position (index 1)
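The difference between label-based and position-based access is easier to see when the index is not the default 0, 1, 2. The following sketch assumes a hypothetical DataFrame with row labels 10, 20, and 30.

```python
import pandas as pd

# Hypothetical DataFrame whose row labels differ from the row positions
df = pd.DataFrame({'col1': [1, 2, 3]}, index=[10, 20, 30])

by_label = df.loc[10]       # row whose label is 10
by_position = df.iloc[0]    # row at position 0 (the same row here)

# Both accessors also take a column: label for loc, position for iloc
value_by_label = df.loc[20, 'col1']   # row label 20, column 'col1'
value_by_position = df.iloc[2, 0]     # third row, first column

print(by_label)
print(by_position)
print(value_by_label, value_by_position)
```

With this index, `df.loc[0]` would raise a KeyError because no row carries the label 0, while `df.iloc[0]` still returns the first row.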
- describe()
Generates summary statistics for numerical columns (mean, standard deviation, quartiles, etc.), offering a high-level overview of the data's distribution.
print(df.describe())
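By default describe() summarizes only the numerical columns. As a sketch, passing `include='all'` extends the summary to the other columns as well; the names `numeric_summary` and `full_summary` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['a', 'b', 'c']})

# Default: numerical columns only
numeric_summary = df.describe()

# include='all' also reports on non-numerical columns
full_summary = df.describe(include='all')

print(numeric_summary)
print(full_summary)
```

Statistics that do not apply to a column, such as the mean of a text column, appear as NaN in the combined summary.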
- idxmin()/idxmax()
idxmin() returns the index label of the minimum value in a Series or column, and idxmax() returns the index label of the maximum value.
print(df['col1'].idxmin()) # Index of the minimum value in 'col1'
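Since these methods return a label rather than a value, a common follow-up is to pass that label to `loc` to retrieve the whole row. A minimal sketch, with `max_label` and `row` as illustrative names:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['a', 'b', 'c']})

# Label of the row holding the largest value in col1
max_label = df['col1'].idxmax()

# Use the label to fetch the full row
row = df.loc[max_label]

print(max_label)
print(row)
```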
Creating Series and DataFrame
import pandas as pd
# Create a Series from a list with labels
data = ['apple', 'banana', 'cherry', 'date', 'elderberry']
fruits = pd.Series(data, index=['Fruit 1', 'Fruit 2', 'Fruit 3', 'Fruit 4', 'Fruit 5'])
print(fruits)
# Create a DataFrame from a dictionary
data = {'City': ['New York', 'Los Angeles', 'Chicago', 'Houston'],
        'Population': [8.8, 3.9, 2.7, 2.3]}  # population in millions
cities = pd.DataFrame(data)
print(cities)
Viewing and Inspecting Data
# Display the first 2 rows of the 'cities' DataFrame
print(cities.head(2))
# Display the last element of the 'fruits' Series
print(fruits.tail(1))
# Get the dimensions (rows, columns) of the 'cities' DataFrame
print(cities.shape)
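# Show information about the 'fruits' Series (data type, memory usage)
# info() prints its report and returns None, so call it without print();
# Series.info() requires pandas 1.4 or later
fruits.info()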
# Access column names of the 'cities' DataFrame
print(cities.columns)
# Check data types of each column in the 'cities' DataFrame
print(cities.dtypes)
Selection and Indexing
# Access the 'Population' column of the 'cities' DataFrame
print(cities['Population'])
# Access the second element (position 1) of the 'fruits' Series
print(fruits.iloc[1])
# Select rows where 'Population' is greater than 3 million in 'cities'
filtered_cities = cities[cities['Population'] > 3]
print(filtered_cities)
# Access the element with label 'Fruit 3' in the 'fruits' Series
print(fruits.loc['Fruit 3'])
# Get descriptive statistics for numerical columns in 'cities' DataFrame
print(cities.describe())
# Find the index of the minimum value in the 'Population' column
min_population_city = cities['Population'].idxmin()
print(min_population_city) # This will print the label of the row with the minimum population
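With the default integer index, idxmin() returns a row number rather than a city name. One possible approach, sketched below, is to make the 'City' column the index first so that the returned label is the city itself. The names `by_city`, `smallest`, and `largest` are illustrative.

```python
import pandas as pd

cities = pd.DataFrame({
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston'],
    'Population': [8.8, 3.9, 2.7, 2.3],  # in millions
})

# Use the city names as row labels
by_city = cities.set_index('City')

# idxmin()/idxmax() now return city names instead of integer positions
smallest = by_city['Population'].idxmin()
largest = by_city['Population'].idxmax()

print(smallest)
print(largest)
```

set_index() returns a new DataFrame, so the original `cities` keeps its integer index.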
- Data Exploration and Analysis Tools
This emphasizes the functionalities used to examine and understand your data (viewing, head/tail, info, describe).
- Core DataFrame Operations
This focuses on the essential operations you can perform on DataFrames, such as selection, filtering, and manipulation.
- Getting Started with pandas
This highlights the introductory concepts for users who are new to pandas.
- Fundamental Data Structures and Operations
This emphasizes the core building blocks (Series and DataFrame) and the basic operations you can perform on them (creation, selection, filtering).
The most suitable term depends on the context. If you're targeting beginners, "Getting Started with pandas" might be appropriate. If you're focusing on core operations, "Core DataFrame Operations" could be better.
- pandas API Reference
The official pandas documentation uses the term "API Reference" to refer to the comprehensive list of functions, methods, and attributes available in the library. This is a more technical term suited for advanced users or reference purposes.