Beyond set_up_storage_writer(): Alternative Approaches for Distributed Checkpointing in PyTorch


Distributed Checkpoint (DCP) in PyTorch

DCP is a mechanism in PyTorch that facilitates saving and loading large models across multiple processes (ranks) in a distributed training environment. It offers several advantages over traditional saving methods like torch.save:

  • ShardedTensor and DTensor Support
    DCP seamlessly handles sharded tensors and distributed tensors, which are essential for training large models on distributed systems.
  • Load-Time Resharding
    It supports saving a model under one cluster configuration and loading it under a different cluster layout or world size (a short load sketch follows this list).
  • Parallelism
    DCP enables concurrent saving and loading operations from all ranks, improving performance.
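
For instance, a checkpoint saved by one job can be restored by a job with a different world size. Here is a minimal sketch of the load side, assuming a checkpoint directory named "my_checkpoint" written earlier by dcp.save() (see the example later in this post) and an already-initialized process group:

import torch
import torch.distributed.checkpoint as dcp

# The model here may be partitioned differently than it was at save time;
# dcp.load() reshards the saved tensors to match the tensors in state_dict.
model = torch.nn.Linear(10, 5)
state_dict = {"model": model.state_dict()}

dcp.load(state_dict, checkpoint_id="my_checkpoint")  # loads in place
model.load_state_dict(state_dict["model"])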

torch.distributed.checkpoint.StorageWriter.set_up_storage_writer()

  1. Storage Abstraction
    DCP abstracts the storage backend, allowing you to save checkpoints to different locations (e.g., file systems, cloud storage). The storage writer handles the communication with the chosen storage service.
  2. Process Group and Rank Awareness
    During a save, DCP calls set_up_storage_writer(is_coordinator) on the writer, telling it whether the current rank coordinates the checkpoint. Together with the process group (the set of distributed processes) and rank (the unique identifier for each process), this lets each rank write its local shard of the model to the appropriate location.
  3. Storage Writer Initialization
    The concrete writer object (e.g., FileSystemWriter for saving to a local or shared file system) is created from the arguments you pass to save(); set_up_storage_writer() then initializes that instance before any data is written. A minimal custom-writer sketch follows this list.
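
To make the initialization step concrete, here is a minimal sketch of a writer that subclasses the built-in FileSystemWriter and overrides set_up_storage_writer() just to observe the is_coordinator flag DCP passes in; the logging is illustrative, not part of the DCP API.

import torch.distributed as dist
from torch.distributed.checkpoint import FileSystemWriter

class LoggingFileSystemWriter(FileSystemWriter):
  """FileSystemWriter that reports when DCP initializes it (illustrative only)."""

  def set_up_storage_writer(self, is_coordinator: bool) -> None:
    # DCP calls this once per save; is_coordinator is True on the single rank
    # that coordinates the checkpoint (e.g., writing the shared metadata).
    print(f"rank {dist.get_rank()}: coordinator={is_coordinator}")
    super().set_up_storage_writer(is_coordinator)

# Passed explicitly instead of a checkpoint_id, e.g.:
# dcp.save(state_dict, storage_writer=LoggingFileSystemWriter("my_checkpoint"))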

Usage (Indirectly through DCP API)

While you wouldn't call set_up_storage_writer() directly, you interact with DCP through its public saving API:

  • torch.distributed.checkpoint.state_dict_saver.save(state_dict, ...) (also exposed as torch.distributed.checkpoint.save()): the primary entry point for saving a distributed model using DCP. It internally calls set_up_storage_writer() on the configured storage writer; a short sketch with an explicitly constructed writer follows this list.
  • set_up_storage_writer() is an internal method that handles storage writer initialization; you normally never call it yourself.
  • DCP provides efficient saving and loading of large models across multiple processes and is designed specifically for distributed training scenarios.
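
For orientation, here is a minimal sketch of the same call with an explicitly constructed writer; the fuller example below lets DCP derive the writer from a checkpoint_id path instead. The directory name is illustrative, and an initialized process group is assumed.

import torch
import torch.distributed.checkpoint as dcp

# Assumes the process group is already initialized (e.g., via torchrun).
model = torch.nn.Linear(10, 5)
writer = dcp.FileSystemWriter("my_checkpoint")  # "my_checkpoint" is a directory

dcp.save(state_dict={"model": model.state_dict()}, storage_writer=writer)
# Internally, save() calls writer.set_up_storage_writer(is_coordinator=...)
# on every rank before any tensor data is written.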


import torch
import torch.distributed as dist
import torch.distributed.checkpoint as dcp

# Assuming you've initialized the distributed process group
# (e.g., dist.init_process_group(backend="nccl") when launched via torchrun)

def save_checkpoint(model, checkpoint_dir):
  """Saves a model checkpoint using Distributed Checkpoint."""

  # This call triggers the creation and configuration of the storage writer
  dcp.save(
      state_dict={"model": model.state_dict()},
      checkpoint_id=checkpoint_dir,  # a directory; each rank writes its shard into it
  )

# Example usage
model = torch.nn.Linear(10, 5)
save_checkpoint(model, "my_checkpoint")
  1. Import necessary libraries
    We import torch, torch.distributed (as dist), and torch.distributed.checkpoint (as dcp).
  2. save_checkpoint function
    This function takes the model and a checkpoint directory as arguments.
  3. dcp.save
    We call dcp.save() (i.e., torch.distributed.checkpoint.state_dict_saver.save()). This function internally:
    • Creates a distributed checkpoint context.
    • Calls set_up_storage_writer() (not shown explicitly) to configure the storage writer based on the distributed process group and rank.
    • Saves the model's state dictionary ({"model": model.state_dict()}) using the storage writer. Because checkpoint_id points at a directory, DCP writes per-rank shard files plus shared metadata into it rather than a single .pt file.
  • This is a simplified example. In a real application, you might wrap your model with torch.nn.parallel.DistributedDataParallel (DDP) for training across multiple GPUs on a single node, as sketched below.
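
A rough sketch of that DDP setup when launched with torchrun (one process per GPU; the NCCL backend and device handling are assumptions about your environment):

import os
import torch
import torch.distributed as dist
import torch.distributed.checkpoint as dcp
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")      # torchrun sets the rendezvous env vars
local_rank = int(os.environ["LOCAL_RANK"])   # also set by torchrun
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(10, 5).cuda()
ddp_model = DDP(model, device_ids=[local_rank])

# ... training loop ...

# Save the underlying module's state dict so keys carry no "module." prefix.
dcp.save(
    state_dict={"model": ddp_model.module.state_dict()},
    checkpoint_id="my_checkpoint",
)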


Alternative: Manual Checkpointing

  • If you don't need DCP's full feature set (such as load-time resharding), you can handle checkpointing manually; a sketch follows this list. This involves:
    • Saving each rank's local model state dictionary to a separate file (e.g., with torch.save).
    • Coordinating across ranks with distributed communication primitives (e.g., dist.barrier, dist.broadcast, or dist.gather) so that all data is saved consistently before training continues.
  • This approach gives you more control but requires more manual implementation than DCP.
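
A minimal sketch of that manual approach (the helper names and filename pattern are illustrative): each rank writes its own file with torch.save, and a barrier keeps the ranks in step.

import torch
import torch.distributed as dist

def save_checkpoint_manually(model, prefix):
  """Each rank saves its own local state dict with plain torch.save."""
  rank = dist.get_rank()
  torch.save(model.state_dict(), f"{prefix}.rank{rank}.pt")
  # Ensure every rank has finished writing before anyone moves on
  # (e.g., before pruning older checkpoints or resuming training).
  dist.barrier()

def load_checkpoint_manually(model, prefix):
  """Load the shard this rank wrote earlier; assumes the same world size."""
  rank = dist.get_rank()
  state = torch.load(f"{prefix}.rank{rank}.pt", map_location="cpu")
  model.load_state_dict(state)

Unlike DCP, this provides no load-time resharding: the job that restores the checkpoint must use the same world size and partitioning as the job that saved it.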