Understanding PyTorch's Distributed Checkpointing: lookup_object() in DefaultSavePlanner
Distributed Checkpoints in PyTorch
Distributed Checkpoint (DCP) facilitates saving and loading models across multiple processes (ranks) running in parallel during distributed training. It offers several advantages over traditional saving methods such as torch.save:
- Flexibility: DCP natively handles distributed tensor types such as ShardedTensor and DTensor.
- Load-Time Resharding: A model saved under one cluster configuration can be loaded under a different one; DCP reshards the checkpoint at load time (see the sketch after this list).
- Parallelism: Each rank saves only its local shards, which parallelizes I/O and avoids gathering the full model onto a single rank.
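The sketch below illustrates the save/load round trip that makes load-time resharding possible. It assumes a recent PyTorch (roughly 2.2+) where dcp.save() and dcp.load() are available, uses the built-in FileSystemWriter/FileSystemReader, and stands in a plain nn.Linear for a real (possibly sharded) model; the checkpoint directory name is a placeholder:

import torch
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint import FileSystemReader, FileSystemWriter

CHECKPOINT_DIR = "checkpoint_dir"   # placeholder path
model = torch.nn.Linear(8, 8)       # stand-in for a real (possibly sharded) model

# Save: in a distributed job every rank calls this, and each rank writes only
# its local shards plus shared metadata. It also runs in a single process.
dcp.save({"model": model.state_dict()}, storage_writer=FileSystemWriter(CHECKPOINT_DIR))

# Load, possibly under a different world size: the passed-in state_dict is
# filled in place, with shards redistributed to match the current layout.
state_dict = {"model": model.state_dict()}
dcp.load(state_dict, storage_reader=FileSystemReader(CHECKPOINT_DIR))
model.load_state_dict(state_dict["model"])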
DefaultSavePlanner and lookup_object()
The DefaultSavePlanner class plays a central role in how objects are saved during checkpointing. It builds the per-rank and global save plans and, during the write phase, resolves each planned item in the model's state dictionary to the actual data that the storage layer writes out.
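As a point of reference, the planner is an object you construct and hand to the save call. A minimal sketch, assuming a recent PyTorch release where these constructor flags exist (shown with their defaults):

from torch.distributed.checkpoint import DefaultSavePlanner

planner = DefaultSavePlanner(
    flatten_state_dict=True,       # nested dicts become dotted keys such as "model.weight"
    flatten_sharded_tensors=True,  # flatten ShardedTensor nested inside ShardedTensor (2D parallelism)
)
# The planner is then passed to the save call:
#   dcp.save(state_dict, storage_writer=..., planner=planner)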
The lookup_object(index) method of DefaultSavePlanner takes a MetadataIndex, which identifies an entry in the state dictionary by its fully qualified name (and, for sharded tensors, a particular shard), and returns the corresponding object. It serves the following purposes:
- Object Resolution: Given a MetadataIndex, it looks up the actual object in the planner's state dictionary. That object can be a plain tensor, a shard of a ShardedTensor or DTensor, or a non-tensor value such as optimizer hyperparameters or any custom stateful object included in the state dictionary.
- Extension Point: The method is deliberately small so that subclasses of DefaultSavePlanner can override it, for example to resolve objects from somewhere other than the in-memory state dictionary.
- Write Path: User code does not call it directly. During the write phase the storage writer calls resolve_data(write_item), which first calls lookup_object() to fetch the object and then transform_object() to turn it into something writable (non-tensor objects are serialized with torch.save into an in-memory buffer), as shown in the sketch after this list.
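The write path inside the default planner is small enough to paraphrase. The sketch below mirrors what DefaultSavePlanner does; the method names match the real extension hooks, but the class itself is a simplified stand-in and the exact helper imports may differ slightly between PyTorch versions:

import io
from typing import Any, Union

import torch
from torch.distributed.checkpoint.metadata import MetadataIndex
from torch.distributed.checkpoint.planner import WriteItem, WriteItemType
from torch.distributed.checkpoint.utils import find_state_dict_object


class SketchOfDefaultSavePlanner:
    def __init__(self, state_dict):
        self.state_dict = state_dict

    def lookup_object(self, index: MetadataIndex) -> Any:
        # Resolve a MetadataIndex (fqn plus optional shard info) to the concrete object.
        return find_state_dict_object(self.state_dict, index)

    def transform_object(self, write_item: WriteItem, obj: Any):
        # Non-tensor objects are serialized into an in-memory buffer with torch.save.
        if write_item.type == WriteItemType.BYTE_IO:
            buffer = io.BytesIO()
            torch.save(obj, buffer)
            obj = buffer
        return obj

    def resolve_data(self, write_item: WriteItem) -> Union[torch.Tensor, io.BytesIO]:
        # Called by the storage writer for every item it is about to write.
        return self.transform_object(write_item, self.lookup_object(write_item.index))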
In Summary
- torch.distributed.checkpoint.DefaultSavePlanner.lookup_object() is part of PyTorch's Distributed Checkpoint framework. It is an extension hook called by the framework rather than a user-facing API (see the short illustration after this list).
- It is responsible for resolving each planned item in the state dictionary (identified by a MetadataIndex) to the actual object that needs to be written.
- Together with transform_object(), it implements resolve_data(), which is how the storage layer obtains data during a save.
- If you need to customize the saving behavior for specific objects, you can subclass DefaultSavePlanner and override these hooks. This is an advanced technique and requires a deeper understanding of the Distributed Checkpoint API.
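Although you would not do this in normal use, the hook can be called directly to see what it returns. A minimal single-process sketch with toy keys and values (the set_up_planner() keyword arguments vary slightly across PyTorch versions):

import torch
from torch.distributed.checkpoint import DefaultSavePlanner
from torch.distributed.checkpoint.metadata import MetadataIndex

# Toy state dict: one tensor entry and one non-tensor entry.
state_dict = {"weight": torch.randn(4, 4), "step": 10}

planner = DefaultSavePlanner()
planner.set_up_planner(state_dict, is_coordinator=True)

# A MetadataIndex identifies an entry by its fully qualified name (fqn).
print(type(planner.lookup_object(MetadataIndex("weight"))))  # <class 'torch.Tensor'>
print(planner.lookup_object(MetadataIndex("step")))          # 10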
import torch
import torch.distributed as dist
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint import DefaultSavePlanner, FileSystemWriter

def train_model(model, optimizer, rank, world_size):
    # ... (Your training loop here)

    # Save a checkpoint using DCP. dcp.save() is a collective call: every rank
    # participates and writes only its own local shards.
    state_dict = {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
    }
    dcp.save(
        state_dict,
        storage_writer=FileSystemWriter("checkpoint_dir"),  # placeholder directory
        # DefaultSavePlanner.lookup_object() is called internally (via
        # resolve_data()) for every item the storage writer asks for.
        planner=DefaultSavePlanner(),
    )

if __name__ == "__main__":
    # ... (Distributed initialization code here, e.g. dist.init_process_group)
    model = MyModel()  # placeholder model class
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    train_model(model, optimizer, rank=dist.get_rank(), world_size=dist.get_world_size())
- The train_model function handles training and checkpointing.
- dcp.save() is a collective operation: every rank calls it and each rank writes only its local shards, while the coordinator rank (rank 0 by default) also writes the checkpoint metadata. (On PyTorch versions before 2.2, the equivalent entry point is save_state_dict().)
- We pass a DefaultSavePlanner instance through the planner argument.
- During the write phase, DefaultSavePlanner.lookup_object() is called internally (via resolve_data()) for every item the storage writer is about to write (parameter shard, optimizer state, etc.); transform_object() then puts the resolved object into a writable form.
- The actual I/O is handled by the storage writer (FileSystemWriter here), which stores each rank's data along with a metadata file describing the whole checkpoint (the overall call order is sketched below).
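For orientation, this is roughly the order in which dcp.save() drives a SavePlanner; it is a paraphrase of the protocol rather than the literal implementation:

# 1. planner.set_up_planner(state_dict, ...)        # on every rank
# 2. planner.create_local_plan()                    # per-rank list of WriteItems
# 3. planner.create_global_plan(all_local_plans)    # on the coordinator rank only
# 4. planner.finish_plan(final_local_plan)          # back on every rank
# 5. storage_writer.write_data(plan, planner)       # the writer calls
#        planner.resolve_data(write_item) for each item, which in turn calls
#        lookup_object() and transform_object()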
- You normally do not call lookup_object() yourself; DCP invokes it through resolve_data() during the write phase. It is exposed as a method so that subclasses can override it.
- This is a simplified example. The actual implementation of lookup_object() also resolves individual shards of ShardedTensor entries and the local portion of DTensor entries, not just whole values.
Custom Storage Writer
- Pass a custom storage_writer to dcp.save(). The built-in FileSystemWriter writes each rank's data to files inside a checkpoint directory; a fully custom writer subclasses torch.distributed.checkpoint.StorageWriter and implements its methods (set_up_storage_writer, prepare_local_plan, write_data, finish, and so on).
- During the write phase the storage writer calls back into the planner (resolve_data(), and therefore lookup_object()) to obtain the data for each write item, then decides where and how those bytes are stored.
- This approach gives you full control over the I/O side of saving without touching the planning logic.

import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint import DefaultSavePlanner, FileSystemWriter

# FileSystemWriter is the built-in StorageWriter; replace it with your own
# StorageWriter subclass to control where and how the data is written.
writer = FileSystemWriter("checkpoint_dir")

# ... (Training code)
dcp.save(state_dict, storage_writer=writer, planner=DefaultSavePlanner())
Subclassing DefaultSavePlanner
- Create a subclass of DefaultSavePlanner.
- Override its extension hooks, such as lookup_object() and transform_object(), to customize how objects are resolved and converted for different object types.
- This approach lets you modify specific pieces of the planner's logic while leveraging its core functionality (planning, flattening, deduplication).
Example (Simplified)
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint import DefaultSavePlanner, FileSystemWriter

class CustomSavePlanner(DefaultSavePlanner):
    def lookup_object(self, index):
        # Custom logic around object resolution (here: simple logging)
        obj = super().lookup_object(index)
        print(f"resolving {index.fqn} -> {type(obj).__name__}")
        return obj

# ... (Training code)
save_planner = CustomSavePlanner()
dcp.save(state_dict, storage_writer=FileSystemWriter("checkpoint_dir"), planner=save_planner)
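As a rule of thumb, lookup_object() is the hook to override when you want to change where or how objects are fetched (for example, resolving them from a container other than the in-memory state dictionary), while transform_object() is the place to change how a fetched object is converted before it is written; both are intentionally small so a subclass can replace one without reimplementing the rest of the planner.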