Delving into torch.distributed.checkpoint.LoadPlanner.set_up_planner() for Distributed Checkpointing in PyTorch


Distributed Checkpoint in PyTorch

Distributed training lets you train large models across multiple processes, often spread over several machines, for faster training. Distributed Checkpoint (DCP) helps manage the checkpoints (saved model states) produced during such training.

LoadPlanner.set_up_planner() Function

This method is part of PyTorch's distributed checkpoint (DCP) functionality and is called on every participating rank (process) in the distributed job. It marks the beginning of the checkpoint-loading process and gives the planner the state dictionary that the checkpoint will be loaded into.

  1. Signal Start of Loading
    When set_up_planner() is called on a rank, it marks the start of that rank's part of the load: the planner receives the state dictionary it will load into and can initialize any per-rank bookkeeping. Because the call happens on every rank, all processes enter the loading phase together and stay synchronized for the steps that follow.

  2. Preparation for Local Planning
    This call is the starting point of the loading process on each rank. The planner uses it to capture whatever the next step needs; for example, the default planner records the target state dictionary, the checkpoint metadata, and whether the rank is the coordinator, so that create_local_plan() can later map checkpoint entries onto local tensors. A minimal sketch of a custom planner appears right after this list.
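To make this concrete, here is a minimal sketch of a custom planner, assuming a recent PyTorch 2.x install. The class name LoggingLoadPlanner is invented for illustration; it defers all real work to DefaultLoadPlanner and forwards arguments with *args/**kwargs because the exact set_up_planner() signature (state dict, checkpoint metadata, coordinator flag) has shifted slightly between releases.

```python
# Minimal sketch: a planner that announces set_up_planner() on each rank and
# then defers to the default behavior. LoggingLoadPlanner is a made-up name.
import torch.distributed as dist
from torch.distributed.checkpoint.default_planner import DefaultLoadPlanner


class LoggingLoadPlanner(DefaultLoadPlanner):
    def set_up_planner(self, *args, **kwargs):
        rank = dist.get_rank() if dist.is_initialized() else 0
        print(f"[rank {rank}] set_up_planner(): checkpoint load is starting")
        # DefaultLoadPlanner records the target state_dict (and metadata)
        # here so the later planning steps can use them.
        super().set_up_planner(*args, **kwargs)
```

An instance of such a planner can be passed to the DCP load API via its planner argument, as shown in the loading example later in this article.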

What Happens Next?

Following set_up_planner(), the create_local_plan() function is typically called on every rank. It examines the local state dictionary (a mapping from parameter names to tensors) together with the checkpoint's metadata and produces a local plan describing which pieces of the checkpoint that rank needs to read and where they go.

The local plans from all ranks are then gathered on the coordinator rank, merged into a global plan that coordinates the overall load, and the finalized per-rank plans are distributed back out. This ensures efficient and consistent loading of the checkpoint across all participating processes.
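As a sketch of where these hooks live (again building on DefaultLoadPlanner; the class name is invented), create_local_plan() runs on every rank, while create_global_plan() runs only on the coordinator rank, which merges everyone's local plans:

```python
# Sketch of the planning hooks that follow set_up_planner().
from torch.distributed.checkpoint.default_planner import DefaultLoadPlanner


class VerboseLoadPlanner(DefaultLoadPlanner):
    def create_local_plan(self):
        # Runs on every rank: plan only the reads this rank needs.
        plan = super().create_local_plan()
        print(f"local plan contains {len(plan.items)} read items")
        return plan

    def create_global_plan(self, global_plans):
        # Runs on the coordinator rank only: merge and adjust all local plans.
        print(f"merging {len(global_plans)} local plans into a global plan")
        return super().create_global_plan(global_plans)
```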

Key Points to Remember

  • It might perform preparatory actions for subsequent local planning.
  • It signals the start of the checkpoint loading process.
  • set_up_planner() is a distributed function, meaning it's called on all ranks.


Where to Learn More

  1. Official PyTorch Tutorial

The PyTorch documentation offers a comprehensive tutorial, "Getting Started with Distributed Checkpoint (DCP)". This tutorial walks you through saving and loading a checkpoint using the DCP APIs, including load(). The load() function internally triggers set_up_planner() and the subsequent planning steps on each rank.
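A rough sketch of that flow is shown below, assuming a recent PyTorch release where torch.distributed.checkpoint.load() is available (older releases expose load_state_dict() instead); the checkpoint directory and the tiny model are placeholders. Each rank runs the same code, and the load() call is what drives set_up_planner() and the planning steps behind the scenes.

```python
# Hedged loading sketch; run one copy per rank (e.g. via torchrun).
# "checkpoint_dir" and the small model are illustrative placeholders.
import torch
import torch.distributed.checkpoint as dcp

model = torch.nn.Linear(16, 16)               # stand-in for your real model
state_dict = {"model": model.state_dict()}

# load() fills state_dict in place; a custom planner (such as the
# LoggingLoadPlanner sketched earlier) could be passed via planner=.
dcp.load(
    state_dict,
    storage_reader=dcp.FileSystemReader("checkpoint_dir"),
)

model.load_state_dict(state_dict["model"])
```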

  2. PyTorch Examples

The PyTorch GitHub repository provides examples demonstrating distributed training with DistributedDataParallel (DDP). While not explicitly focused on set_up_planner(), these examples showcase the overall workflow for distributed training and checkpointing, where loading a distributed checkpoint would involve set_up_planner(). You can explore the examples/distributed/ddp/main.py file in that repository.
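For orientation, here is a condensed, hypothetical sketch (not the actual main.py from the repository) of how a DDP script might hand its state to DCP, assuming a recent PyTorch release with torch.distributed.checkpoint.save(); loading the resulting directory with dcp.load() is what would later invoke set_up_planner():

```python
# Hypothetical DDP + DCP save sketch; launch with e.g.
#   torchrun --nproc_per_node=2 train.py
# The model, backend, and directory name are placeholders.
import torch
import torch.distributed as dist
import torch.distributed.checkpoint as dcp
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("gloo")               # "nccl" on GPU clusters
model = DDP(torch.nn.Linear(16, 16))

# ... training loop omitted ...

# Every rank participates in the collective save.
dcp.save(
    {"model": model.state_dict()},
    storage_writer=dcp.FileSystemWriter("ddp_checkpoint"),
)

dist.destroy_process_group()
```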

  3. Code Inspection

If you're comfortable with Python code inspection, you can delve into the PyTorch source code for torch.distributed.checkpoint. Here, you might find references to set_up_planner() within the implementation of the load() function. However, this approach requires familiarity with PyTorch internals and might be more advanced.
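If you want to try that, a few lines of standard-library introspection will show where the package lives on disk and what the default planner's set_up_planner() does in your installed version:

```python
# Locate the installed torch.distributed.checkpoint package and print the
# source of the default planner's set_up_planner() for this PyTorch version.
import inspect
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.default_planner import DefaultLoadPlanner

print(dcp.__file__)
print(inspect.getsource(DefaultLoadPlanner.set_up_planner))
```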



Alternatives to Consider

  1. Manual Load Balancing

If you have a strong understanding of the checkpoint structure and how the data is distributed across ranks, you could potentially design a custom load-balancing scheme, manually coordinating which ranks load which parts of the checkpoint data. However, this approach is complex, error-prone, and less scalable than letting DCP's planner (starting with set_up_planner()) coordinate the load for you.
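Purely for illustration, such a scheme might look like the hypothetical sketch below, where each rank reads a pre-split shard file named after its rank; the file layout, names, and splitting strategy are all assumptions, and this is exactly the bookkeeping DCP's planner normally does for you.

```python
# Hypothetical manual load balancing: one torch.save()-produced shard per rank.
# The file names and splitting scheme are invented for illustration.
import torch
import torch.distributed as dist

dist.init_process_group("gloo")
rank = dist.get_rank()

# Each rank reads only its own shard of the checkpoint.
shard = torch.load(f"checkpoint/shard_{rank}.pt", map_location="cpu")

# It would then apply just its slice of the parameters, e.g.:
# model.load_state_dict(shard, strict=False)   # model definition omitted
```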

  2. Simplified Checkpointing

If you don't need the full functionality of distributed checkpointing, you could consider a simpler approach for saving and loading model states. Options include:

  • Using torch.save() and torch.load() alongside DistributedDataParallel (DDP) for basic checkpointing on a single machine or on a cluster with shared storage; see the sketch below.
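A common version of that pattern, assuming shared storage and a DDP-wrapped model (the path and helper names are illustrative), has rank 0 write a single file and every rank read it back onto its own device:

```python
# Simplified single-file checkpointing alongside DDP. CKPT_PATH and the
# helper names are illustrative.
import torch
import torch.distributed as dist

CKPT_PATH = "model_ckpt.pt"

def save_checkpoint(ddp_model):
    if dist.get_rank() == 0:            # only one rank writes the file
        torch.save(ddp_model.module.state_dict(), CKPT_PATH)
    dist.barrier()                      # wait until the file exists everywhere

def load_checkpoint(ddp_model, device):
    state = torch.load(CKPT_PATH, map_location=device)
    ddp_model.module.load_state_dict(state)
```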

  3. Alternative Distributed Frameworks

If you're open to exploring distributed training frameworks other than PyTorch, frameworks such as TensorFlow offer their own distributed checkpointing mechanisms, with their own planning and loading strategies. However, this would require switching frameworks and potentially rewriting parts of your training script.