Beyond set_up_storage_writer(): Alternative Approaches for Distributed Checkpointing in PyTorch
Distributed Checkpoint (DCP) in PyTorch
DCP is a mechanism in PyTorch that facilitates saving and loading large models across multiple processes (ranks) in a distributed training environment. It offers several advantages over traditional saving methods like `torch.save`:
- **ShardedTensor and DTensor support**: DCP seamlessly handles sharded tensors and distributed tensors, which are essential for training large models on distributed systems.
- **Load-time resharding**: it supports loading models saved in one cluster configuration onto a different cluster layout.
- **Parallelism**: DCP enables concurrent saving and loading operations from all ranks, improving performance.
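As a complement to the save example shown later in this article, the loading side is symmetrical, and a checkpoint written on one cluster layout can be loaded onto another. The following is a minimal sketch, assuming PyTorch 2.2 or newer (where `torch.distributed.checkpoint.load` accepts a `checkpoint_id` directory path; older releases use `load_state_dict` with an explicit `FileSystemReader`). The `my_checkpoint_dir` directory name is hypothetical and matches the save example below:

```python
import torch
import torch.distributed.checkpoint as dcp

# Rebuild the model on the current cluster layout; thanks to load-time
# resharding it does not have to match the layout used at save time.
model = torch.nn.Linear(10, 5)
state_dict = {"model": model.state_dict()}

# DCP loads into the provided state_dict in place; we then apply it to the model.
dcp.load(state_dict=state_dict, checkpoint_id="my_checkpoint_dir")
model.load_state_dict(state_dict["model"])
```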
torch.distributed.checkpoint.StorageWriter.set_up_storage_writer()
- **Storage abstraction**: DCP abstracts the storage backend, allowing you to save checkpoints to different locations (e.g., file systems, cloud storage). The storage writer handles the communication with the chosen storage service.
- **Process group and rank awareness**: it ensures the storage writer is configured for the current process group (the set of distributed processes) and rank (the unique identifier for each process), so each rank can write its local shard of the model to the appropriate location.
- **Storage writer setup**: `set_up_storage_writer(is_coordinator)` is a method of the `StorageWriter` base class; the concrete writer you use (e.g., `FileSystemWriter` for saving to a file system) implements it to prepare itself before a save, and the flag tells it whether the current rank coordinates the checkpoint metadata.
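To see where this hook sits, here is a minimal sketch, assuming PyTorch 2.2 or newer: `FileSystemWriter` is DCP's built-in file-system writer, and the save entry point calls `set_up_storage_writer(is_coordinator)` on whatever writer you pass in. The `LoggingFileSystemWriter` subclass below is purely hypothetical and exists only to make that call visible:

```python
import torch
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint import FileSystemWriter

# Hypothetical subclass that only logs when DCP configures the writer.
class LoggingFileSystemWriter(FileSystemWriter):
    def set_up_storage_writer(self, is_coordinator: bool) -> None:
        # DCP calls this once per save; is_coordinator marks the rank that
        # writes the global checkpoint metadata.
        print(f"storage writer set up (coordinator={is_coordinator})")
        super().set_up_storage_writer(is_coordinator)

# Assumes a single process or an already-initialized process group.
model = torch.nn.Linear(10, 5)
writer = LoggingFileSystemWriter("my_checkpoint_dir")
dcp.save(state_dict={"model": model.state_dict()}, storage_writer=writer)
```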
Usage (Indirectly through DCP API)
While you wouldn't directly call `set_up_storage_writer()`, you interact with DCP through the following entry point:

- `torch.distributed.checkpoint.state_dict_saver.save(state_dict, ...)` (re-exported as `torch.distributed.checkpoint.save()` in recent releases): the primary entry point for saving a distributed model using DCP. It internally calls `set_up_storage_writer()` on the configured storage writer.
- Interact with DCP through `state_dict_saver.save()` for model checkpointing; `set_up_storage_writer()` is an internal function that handles storage writer configuration.
- DCP provides efficient saving and loading of large models across multiple processes.
- DCP is designed for distributed training scenarios.
The following example assumes PyTorch 2.2 or newer:

```python
import torch
import torch.distributed.checkpoint as dcp

# Assuming you've initialized the distributed process group (or run a single process)
def save_checkpoint(model, checkpoint_dir):
    """Saves a model checkpoint using Distributed Checkpoint."""
    # This call triggers the creation and configuration of the storage writer
    dcp.save(
        state_dict={"model": model.state_dict()},
        checkpoint_id=checkpoint_dir,  # a directory; each rank writes its own shard file inside it
    )

# Example usage
model = torch.nn.Linear(10, 5)
save_checkpoint(model, "my_checkpoint_dir")
```
- **Import necessary libraries**: we import `torch` and `torch.distributed.checkpoint` (aliased here as `dcp`).
- **save_checkpoint function**: takes the model and a checkpoint directory as arguments.
- **dcp.save**: we call `torch.distributed.checkpoint.save()` (the `state_dict_saver.save()` entry point). Internally, this function:
  - creates a distributed checkpoint context;
  - calls `set_up_storage_writer()` (not shown explicitly) to configure the storage writer based on the distributed process group and rank;
  - saves the model's state dictionary (`{"model": model.state_dict()}`) using the storage writer. Passing a directory path as `checkpoint_id` lets DCP construct a default file-system writer, and each rank writes its own shard file inside that directory.
- This is a simplified example. In a real application, you might wrap your model with `torch.nn.parallel.DistributedDataParallel` (DDP) for training across multiple GPUs on one or more nodes, as sketched below.
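For context, here is a hedged sketch of that DDP setup, assuming a `torchrun` launch, the NCCL backend, and the `save_checkpoint` helper defined in the example above:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for every process.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(10, 5).to(local_rank)
ddp_model = DDP(model, device_ids=[local_rank])

# Checkpoint the underlying module with the save_checkpoint helper from above.
save_checkpoint(ddp_model.module, "my_checkpoint_dir")

dist.destroy_process_group()
```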
If you don't need the full features of DCP (like load-time resharding), you can handle checkpointing manually. This involves:
- Saving each rank's local model state dictionary to a separate file using a suitable mechanism (e.g., `torch.save`).
- Coordinating across ranks using distributed communication primitives (e.g., `dist.barrier`, `dist.broadcast`, `dist.gather`) to ensure all data is saved consistently.
This approach gives you more control but requires more manual implementation than DCP, as the sketch below illustrates.
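A minimal sketch of this manual approach follows; the helper names are hypothetical, the process group is assumed to be already initialized, and each rank is assumed to load back onto the same layout it saved from (no resharding):

```python
import os
import torch
import torch.distributed as dist

def save_manual_checkpoint(model, checkpoint_dir):
    """Each rank writes its own state dict with plain torch.save."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    path = os.path.join(checkpoint_dir, f"model_rank{dist.get_rank()}.pt")
    torch.save(model.state_dict(), path)
    # Barrier so no rank moves on before every file is on disk.
    dist.barrier()

def load_manual_checkpoint(model, checkpoint_dir):
    """Each rank reads back the file it wrote; no resharding is possible."""
    path = os.path.join(checkpoint_dir, f"model_rank{dist.get_rank()}.pt")
    model.load_state_dict(torch.load(path, map_location="cpu"))
    dist.barrier()
```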