Fine-Tuning the Journey: CosineAnnealingLR for Effective Learning Rate Control in PyTorch


Purpose

  • CosineAnnealingLR is a learning rate scheduler in PyTorch that dynamically adjusts the learning rate during training.
  • It implements a cosine annealing strategy, gradually reducing the learning rate from its initial value to a minimum value along a cosine curve.

How it Works

  1. Scheduler Creation

    • You create a CosineAnnealingLR object, providing the optimizer (e.g., SGD, Adam) and T_max as arguments:

      optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
      scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

    • T_max is the number of iterations (epochs or batches) over which the learning rate is annealed from its maximum down to its minimum value, i.e. one half-cycle of the cosine curve.
  2. Learning Rate Update

    • At each iteration (epoch or batch), CosineAnnealingLR calculates a new learning rate using the following formula:

      new_lr = eta_min + (eta_max - eta_min) / 2 * (1 + cos(pi * iter / T_max))
      
      • eta_min (default: 0) is the minimum learning rate after the annealing process.
      • eta_max (default: the initial learning rate set in the optimizer) is the maximum learning rate.
      • iter is the current iteration number.
      • pi is the mathematical constant pi (approximately 3.14159).
    • As iter increases from 0 to T_max, cos(pi * iter / T_max) falls smoothly from 1 to -1, so the learning rate decreases from eta_max to eta_min along a cosine curve without abrupt drops; the short sketch below illustrates this.
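
    To make the schedule concrete, here is a small standalone sketch that evaluates the formula above with the values from the snippet in step 1 (eta_max = 0.1, eta_min = 0, T_max = 100); the printed values should match what CosineAnnealingLR produces when stepped once per iteration:

      import math

      eta_max, eta_min, T_max = 0.1, 0.0, 100  # values from the snippet in step 1

      def cosine_lr(it):
          # Closed-form cosine annealing formula from above
          return eta_min + (eta_max - eta_min) / 2 * (1 + math.cos(math.pi * it / T_max))

      for it in (0, 25, 50, 75, 100):
          print(f"iter {it:3d}: lr = {cosine_lr(it):.5f}")
      # iter   0: lr = 0.10000
      # iter  25: lr = 0.08536
      # iter  50: lr = 0.05000
      # iter  75: lr = 0.01464
      # iter 100: lr = 0.00000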

Benefits

  • Improved Convergence
    A well-chosen cosine annealing schedule can help the model converge more smoothly and efficiently.
  • Gradual Learning Rate Adjustment
    The cosine annealing approach avoids sudden learning rate drops, potentially preventing the model from oscillating or getting stuck in local minima.

Integration with Training Loop

  1. Create Optimizer and Scheduler

    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
    
  2. Training Loop

    for epoch in range(num_epochs):
        optimizer.zero_grad()  # Clear gradients
        # ... forward pass and loss calculation produce `loss` ...
        loss.backward()
        optimizer.step()
        scheduler.step()  # Update learning rate after each epoch (or batch)
    

Additional Notes

  • Consider other learning rate scheduling techniques like ReduceLROnPlateau or StepLR depending on your training needs.
  • Experiment with different T_max values to find the optimal annealing schedule for your specific dataset and model architecture.
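  • CosineAnnealingLR can also be stepped once per batch rather than once per epoch; in that case set T_max to the total number of batches, as in the self-contained sketch below (the tiny model and random data are only placeholders):

    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset

    # Placeholder model and random data just to make the sketch runnable
    model = nn.Linear(10, 1)
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    train_dataloader = DataLoader(TensorDataset(torch.randn(64, 10), torch.randn(64, 1)), batch_size=16)

    num_epochs = 5
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=num_epochs * len(train_dataloader))  # anneal over all batches

    for epoch in range(num_epochs):
        for x, y in train_dataloader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
            scheduler.step()  # one scheduler step per batch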


Example 1: Cosine Annealing with Epoch-Based Updates

This example shows how to use CosineAnnealingLR with updates happening after each epoch:

import torch
from torch import nn
from torch.optim import SGD
from torch.optim.lr_scheduler import CosineAnnealingLR

# Define a simple model (replace with your actual model)
class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.linear = nn.Linear(10, 1)

    def forward(self, x):
        return self.linear(x)

# Create model and optimizer
model = MyModel()
optimizer = SGD(model.parameters(), lr=0.1)

# Set up cosine annealing scheduler with T_max epochs
num_epochs = 100
scheduler = CosineAnnealingLR(optimizer, T_max=num_epochs)

# Dummy inputs, targets, and loss so the loop runs end to end; replace with your own data
inputs = torch.randn(32, 10)
targets = torch.randn(32, 1)
loss_fn = nn.MSELoss()

# Training loop
for epoch in range(num_epochs):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)  # forward pass and loss calculation
    loss.backward()                         # backward pass
    optimizer.step()

    # Update learning rate after each epoch
    scheduler.step()

    # Print or log learning rate for monitoring
    print(f"Epoch: {epoch+1}, Learning Rate: {optimizer.param_groups[0]['lr']:.5f}")

Example 2: Cosine Annealing with Warm Restarts (Custom Implementation)

This example demonstrates a hand-rolled version of cosine annealing with warm restarts. Note that PyTorch ships a ready-made implementation, torch.optim.lr_scheduler.CosineAnnealingWarmRestarts; the custom function below is only meant to illustrate the underlying logic.

import math

import torch
from torch import nn
from torch.optim import SGD

def cosine_annealing_warm_restarts(optimizer, epoch, T_0, T_mult=1, eta_min=0, eta_max=None):
    """
    Custom cosine annealing with warm restarts.

    Args:
        optimizer (torch.optim.Optimizer): Optimizer whose learning rate is adjusted.
        epoch (int): Current iteration (epoch or batch) index, starting at 0.
        T_0 (int): Number of iterations in the first cycle.
        T_mult (int): Multiplier applied to the cycle length after each restart.
        eta_min (float, optional): Minimum learning rate. Defaults to 0.
        eta_max (float, optional): Maximum learning rate. Defaults to the optimizer's initial lr.
    """
    if eta_max is None:
        eta_max = optimizer.defaults['lr']

    # Locate the position within the current cycle: strip off completed cycles,
    # growing the cycle length by T_mult after each restart.
    t_cur, t_i = epoch, T_0
    while t_cur >= t_i:
        t_cur -= t_i
        t_i *= T_mult

    # Standard cosine annealing within the current cycle
    eta_t = eta_min + (eta_max - eta_min) / 2 * (1 + math.cos(math.pi * t_cur / t_i))
    for param_group in optimizer.param_groups:
        param_group['lr'] = eta_t

# Define model and optimizer (same as Example 1)
# ...

# Set up warm restarts with T_0 epochs and T_mult multiplier
T_0 = 10  # Initial number of epochs per cycle
T_mult = 2  # Multiplier to increase T_0 for subsequent cycles

# Training loop (model, optimizer, dummy data, loss_fn, and num_epochs as in Example 1)
for epoch in range(num_epochs):
    # Set this epoch's learning rate (restarts to eta_max whenever a cycle completes)
    cosine_annealing_warm_restarts(optimizer, epoch, T_0, T_mult)

    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)  # forward pass and loss calculation
    loss.backward()
    optimizer.step()

    # Print or log learning rate for monitoring
    print(f"Epoch: {epoch+1}, Learning Rate: {optimizer.param_groups[0]['lr']:.5f}")


ReduceLROnPlateau

  • This scheduler reduces the learning rate when a monitored metric (typically the validation loss) stops improving for a specified number of epochs (patience).
  • It's useful when you want to improve convergence in scenarios where the validation loss stagnates.
  • Use: torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1, patience=10).
  • Adjust factor (the learning rate reduction multiplier) and patience based on your needs; see the sketch below.
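
  Unlike the other schedulers listed here, ReduceLROnPlateau.step() takes the monitored metric as an argument; a minimal sketch, where evaluate() and val_dataloader are hypothetical stand-ins for your own validation code:

    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1, patience=10)

    for epoch in range(num_epochs):
        # ... training steps ...
        val_loss = evaluate(model, val_dataloader)  # hypothetical validation helper
        scheduler.step(val_loss)                    # reduce lr if val_loss has plateaued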

MultiStepLR

  • This scheduler decreases the learning rate by a predefined factor (gamma) at specific epochs defined by milestones.
  • It offers a more rigid approach compared to cosine annealing, suitable when you have a good understanding of how the training should progress.
  • Use: torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 80], gamma=0.1).
  • Modify milestones and gamma based on your learning rate decay preferences (see the sketch below).
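
  With the settings above and an initial learning rate of 0.1, the rate is 0.1 for epochs 0-29, 0.01 from epoch 30, and 0.001 from epoch 80; a quick sketch to see the drops (optimizer defined as before):

    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 80], gamma=0.1)

    for epoch in range(100):
        if epoch in (0, 30, 80):
            # Rate used during this epoch: 0.1, then 0.01, then 0.001 (starting from lr=0.1)
            print(f"epoch {epoch}: lr = {optimizer.param_groups[0]['lr']:.4f}")
        # ... training steps ...
        scheduler.step()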

ExponentialLR

  • This scheduler decays the learning rate by a factor of gamma every time step() is called (typically once per epoch or batch).
  • It's a simpler alternative to cosine annealing, but it decays aggressively early in training, whereas cosine annealing keeps the rate high at first and drops it mainly toward the end.
  • Use: torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9).
  • Adjust gamma to control the aggressiveness of learning rate decay (see the sketch below).
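
  To gauge the decay speed: starting from lr = 0.1 with gamma = 0.9, the rate after t calls to step() is 0.1 * 0.9**t, i.e. roughly 0.035 after 10 epochs and about 0.0005 after 50:

    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

    for epoch in range(50):
        # ... training steps ...
        scheduler.step()
    print(scheduler.get_last_lr()[0])  # ~0.000515 when starting from lr=0.1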

OneCycleLR (from torch.optim.lr_scheduler; the one-cycle policy popularized by fastai)

  • This scheduler implements a one-cycle policy: the learning rate first rises to max_lr, then falls with a cosine annealing-like decrease.
  • It can be a good choice for getting a strong schedule without extensive manual tuning, once max_lr is chosen.
  • Use: torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=0.1, epochs=num_epochs, steps_per_epoch=len(train_dataloader)).
  • Because the schedule spans epochs * steps_per_epoch steps, call scheduler.step() after every batch.
  • Adjust max_lr based on your initial learning rate preference; a batch-level sketch follows.
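
  A minimal sketch of the batch-level stepping, assuming model, optimizer, loss_fn, train_dataloader, and num_epochs are already defined:

    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr=0.1, epochs=num_epochs, steps_per_epoch=len(train_dataloader))

    for epoch in range(num_epochs):
        for x, y in train_dataloader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
            scheduler.step()  # OneCycleLR expects one step per batch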

Custom Schedulers

  • You can create custom learning rate schedulers based on your specific needs. This allows for fine-grained control over how the learning rate changes during training.
  • Implement logic to adjust the learning rate based on factors like validation loss, training progress, or other metrics, as in the sketch below.
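
  One lightweight way to build such a custom schedule is torch.optim.lr_scheduler.LambdaLR, which scales the optimizer's initial learning rate by whatever your function returns; the warm-up-then-decay rule below is only an illustrative choice, not a recommendation:

    def warmup_then_decay(epoch, warmup_epochs=5):
        # Linear warm-up over the first few epochs, then a gentle inverse decay (illustrative)
        if epoch < warmup_epochs:
            return (epoch + 1) / warmup_epochs
        return 1.0 / (1.0 + 0.1 * (epoch - warmup_epochs))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_then_decay)

    for epoch in range(num_epochs):
        # ... training steps ...
        scheduler.step()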