Fine-Tuning the Journey: CosineAnnealingLR for Effective Learning Rate Control in PyTorch
Purpose
- CosineAnnealingLR is a learning rate scheduler used in PyTorch to dynamically adjust the learning rate during training.
- It implements a cosine annealing strategy, which gradually reduces the learning rate from its initial value to a minimum value following a cosine curve.
How it Works
- You create a CosineAnnealingLR object, providing the optimizer (e.g., SGD, Adam) and T_max as arguments:
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
T_max represents the number of iterations (epochs or batches) that constitute one full cycle of the cosine curve.
- Learning Rate Update
At each iteration (epoch or batch), CosineAnnealingLR calculates a new learning rate using the following formula:
new_lr = eta_min + (eta_max - eta_min) / 2 * (1 + cos(pi * iter / T_max))
where:
eta_min (default: 0) is the minimum learning rate after the annealing process.
eta_max (default: the initial learning rate set in the optimizer) is the maximum learning rate.
iter is the current iteration number.
pi is the mathematical constant pi (approximately 3.14159).
- The cosine term decreases from 1 to -1 as iter goes from 0 to T_max, producing a smooth learning rate decrease from eta_max to eta_min over T_max iterations, following a cosine curve (a quick sanity-check sketch follows).
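To sanity-check the formula, you can compare it directly with the learning rate PyTorch actually sets. The following is a minimal sketch (not part of the original examples); the eta_min=0.001 value and the tiny placeholder model are chosen purely for illustration.

import math
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Linear(10, 1)                        # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = CosineAnnealingLR(optimizer, T_max=100, eta_min=0.001)

eta_max, eta_min, T_max = 0.1, 0.001, 100
for it in range(5):
    actual = optimizer.param_groups[0]['lr']          # lr currently set by the scheduler
    expected = eta_min + (eta_max - eta_min) / 2 * (1 + math.cos(math.pi * it / T_max))
    print(f"iter {it}: scheduler lr = {actual:.6f}, formula lr = {expected:.6f}")
    optimizer.step()                                  # dummy update; step the optimizer before the scheduler
    scheduler.step()

The two columns should agree up to floating-point rounding.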
Benefits
- Improved Convergence: A well-chosen cosine annealing schedule can help the model converge more smoothly and efficiently.
- Gradual Learning Rate Adjustment: The cosine annealing approach avoids sudden learning rate drops, potentially preventing the model from oscillating or getting stuck in local minima.
Integration with Training Loop
- Create Optimizer and Scheduler
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
- Training Loop
for epoch in range(num_epochs):
    # ... training steps ...
    optimizer.zero_grad()  # Clear gradients
    loss.backward()
    optimizer.step()
    scheduler.step()  # Update learning rate after each epoch (or batch)
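scheduler.step() can also be called once per batch instead of once per epoch; in that case T_max should count optimizer updates rather than epochs. A minimal sketch of the per-batch variant, assuming a DataLoader named train_dataloader and a loss function loss_fn are already defined (both names are placeholders, not from the original example):

num_epochs = 100
steps_per_epoch = len(train_dataloader)               # assumed DataLoader
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=num_epochs * steps_per_epoch)

for epoch in range(num_epochs):
    for inputs, targets in train_dataloader:
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)        # loss_fn is assumed to be defined elsewhere
        loss.backward()
        optimizer.step()
        scheduler.step()                              # one scheduler step per batch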
Additional Notes
- Consider other learning rate scheduling techniques like ReduceLROnPlateau or StepLR, depending on your training needs.
- Experiment with different T_max values to find the optimal annealing schedule for your specific dataset and model architecture (a quick way to compare candidate schedules is sketched below).
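One quick way to compare T_max values before committing to a full training run is to simulate the schedule on a throwaway optimizer and inspect the resulting learning rates. This is a rough sketch, not tied to any real model:

import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

def simulate_schedule(T_max, base_lr=0.1):
    """Return the learning rates a CosineAnnealingLR schedule would produce over one cycle."""
    dummy = torch.optim.SGD([torch.nn.Parameter(torch.zeros(1))], lr=base_lr)
    scheduler = CosineAnnealingLR(dummy, T_max=T_max)
    lrs = []
    for _ in range(T_max + 1):
        lrs.append(dummy.param_groups[0]['lr'])
        dummy.step()                                  # dummy optimizer step
        scheduler.step()
    return lrs

for T_max in (25, 50, 100):
    lrs = simulate_schedule(T_max)
    print(f"T_max={T_max}: lr halfway = {lrs[T_max // 2]:.4f}, lr at end of cycle = {lrs[T_max]:.4f}")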
Example 1: Cosine Annealing with Epoch-Based Updates
This example shows how to use CosineAnnealingLR with updates applied after each epoch:
import torch
from torch import nn
from torch.optim import SGD
from torch.optim.lr_scheduler import CosineAnnealingLR

# Define a simple model (replace with your actual model)
class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.linear = nn.Linear(10, 1)

    def forward(self, x):
        return self.linear(x)

# Create model and optimizer
model = MyModel()
optimizer = SGD(model.parameters(), lr=0.1)

# Set up cosine annealing scheduler with T_max epochs
num_epochs = 100
scheduler = CosineAnnealingLR(optimizer, T_max=num_epochs)

# Training loop
for epoch in range(num_epochs):
    # ... training steps (data loading, forward pass, loss calculation) ...
    # Dummy batch so the example runs end to end; replace with your real data.
    inputs = torch.randn(32, 10)
    targets = torch.randn(32, 1)

    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(inputs), targets)
    loss.backward()
    optimizer.step()

    # Update learning rate after each epoch
    scheduler.step()

    # Print or log learning rate for monitoring
    print(f"Epoch: {epoch+1}, Learning Rate: {optimizer.param_groups[0]['lr']:.5f}")
Example 2: Cosine Annealing with Warm Restarts (Custom Implementation)
This example demonstrates a custom implementation of cosine annealing with warm restarts, a feature that is not available in CosineAnnealingLR itself (PyTorch provides it separately as torch.optim.lr_scheduler.CosineAnnealingWarmRestarts; see the sketch after this example).
import math
import torch
from torch import nn
from torch.optim import SGD

class CustomCosineWarmRestarts:
    """
    Custom cosine annealing with warm restarts scheduler.

    Args:
        optimizer (torch.optim.Optimizer): Optimizer to adjust the learning rate for.
        T_0 (int): Number of iterations (epochs or batches) in the first cycle.
        T_mult (float): Multiplier applied to the cycle length after each restart.
        eta_min (float, optional): Minimum learning rate. Defaults to 0.
    """
    def __init__(self, optimizer, T_0, T_mult=1, eta_min=0.0):
        self.optimizer = optimizer
        self.T_i = T_0                 # length of the current cycle
        self.T_mult = T_mult
        self.eta_min = eta_min
        # Remember each group's initial lr; this is eta_max for every cycle.
        self.base_lrs = [group['lr'] for group in optimizer.param_groups]
        self.t = 0                     # position within the current cycle

    def step(self):
        """Advance one iteration and update the optimizer's learning rates."""
        self.t += 1
        if self.t >= self.T_i:         # end of cycle: warm restart
            self.t = 0
            self.T_i = int(self.T_i * self.T_mult)
        for group, eta_max in zip(self.optimizer.param_groups, self.base_lrs):
            group['lr'] = self.eta_min + (eta_max - self.eta_min) / 2 * (
                1 + math.cos(math.pi * self.t / self.T_i))

# Define model and optimizer (same as Example 1)
# ...

# Set up warm restarts with T_0 epochs and T_mult multiplier
T_0 = 10     # Initial number of epochs per cycle
T_mult = 2   # Multiplier to increase the cycle length for subsequent cycles
scheduler = CustomCosineWarmRestarts(optimizer, T_0, T_mult)

# Training loop
for epoch in range(num_epochs):
    # ... training steps (data loading, forward pass, loss calculation) ...
    # Dummy batch as in Example 1; replace with your real data.
    inputs = torch.randn(32, 10)
    targets = torch.randn(32, 1)

    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(inputs), targets)
    loss.backward()
    optimizer.step()
    scheduler.step()

    # Print or log learning rate for monitoring
    print(f"Epoch: {epoch+1}, Learning Rate: {optimizer.param_groups[0]['lr']:.5f}")
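For comparison, PyTorch ships an equivalent built-in scheduler, torch.optim.lr_scheduler.CosineAnnealingWarmRestarts, which is usually the simpler choice in practice. A minimal usage sketch, reusing the optimizer and num_epochs from the examples above:

from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

# T_0: epochs in the first cycle; T_mult: factor by which each subsequent cycle grows.
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2, eta_min=0.0)

for epoch in range(num_epochs):
    # ... training steps (forward pass, loss calculation, backward pass, optimizer.step()) ...
    scheduler.step()   # the cosine curve restarts at epochs 10, 30, 70, ...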
ReduceLROnPlateau
- This scheduler dynamically reduces the learning rate when the validation loss plateaus for a specified number of epochs (patience).
- Use: torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1, patience=10).
- Adjust factor (the learning rate reduction multiplier) and patience based on your needs.
- It's useful when you want to prevent overfitting and improve convergence in scenarios where the validation loss stagnates.
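Unlike the schedulers above, ReduceLROnPlateau must be given the monitored metric when you call step(). A minimal sketch, where evaluate and val_dataloader are hypothetical placeholders for your own validation code:

scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1, patience=10)

for epoch in range(num_epochs):
    # ... training steps ...
    val_loss = evaluate(model, val_dataloader)   # hypothetical validation helper
    scheduler.step(val_loss)                     # lr is multiplied by factor if val_loss has not improved for patience epochs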
MultiStepLR
- This scheduler decreases the learning rate by a predefined factor (gamma) at specific epochs defined by milestones.
- Use: torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 80], gamma=0.1).
- Modify milestones and gamma based on your learning rate decay preferences.
- It offers a more rigid approach compared to cosine annealing, suitable when you have a good understanding of how the training should progress.
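To illustrate the milestone behaviour: with milestones=[30, 80] and gamma=0.1, a base learning rate of 0.1 drops to 0.01 at epoch 30 and to 0.001 at epoch 80. A minimal sketch, assuming the optimizer (initial lr 0.1) and num_epochs from the earlier examples:

scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 80], gamma=0.1)

for epoch in range(num_epochs):
    if epoch in (29, 30, 79, 80):
        print(f"epoch {epoch}: lr = {optimizer.param_groups[0]['lr']:.4f}")   # shows the drops at the milestones
    # ... training steps (including optimizer.step()) ...
    scheduler.step()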
ExponentialLR
- This scheduler exponentially decays the learning rate after each epoch (or batch) by a factor of gamma.
- Use: torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9).
- Adjust gamma to control how aggressively the learning rate decays.
- It's a simpler alternative to cosine annealing but can be less smooth in its learning rate adjustments.
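With ExponentialLR the learning rate after n steps is simply base_lr * gamma**n, so gamma=0.9 shrinks a 0.1 base rate to roughly 0.035 after 10 epochs. A minimal sketch, again reusing the optimizer (initial lr 0.1) from the earlier examples:

scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

for epoch in range(10):
    # ... training steps (including optimizer.step()) ...
    scheduler.step()

print(optimizer.param_groups[0]['lr'])   # ~0.1 * 0.9**10 ≈ 0.0349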
OneCycleLR (from torch.optim.lr_scheduler or external libraries like fastai)
- This scheduler implements a one-cycle policy: an initial rise in the learning rate followed by a cosine annealing-like decrease.
- Use: torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=0.1, epochs=num_epochs, steps_per_epoch=len(train_dataloader)).
- Adjust max_lr (the peak learning rate of the cycle) based on your preference.
- It can be a good choice when you want a sensible schedule without extensive manual tuning.
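Note that OneCycleLR is designed to be stepped once per batch, not per epoch, so it needs the total number of optimizer updates up front. A minimal sketch, assuming train_dataloader and loss_fn are defined as in the per-batch example earlier:

scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.1, epochs=num_epochs, steps_per_epoch=len(train_dataloader))

for epoch in range(num_epochs):
    for inputs, targets in train_dataloader:
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)   # loss_fn is assumed to be defined elsewhere
        loss.backward()
        optimizer.step()
        scheduler.step()                         # exactly one step per batch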
Custom Schedulers
- You can create custom learning rate schedulers based on your specific needs. This allows for fine-grained control over how the learning rate changes during training.
- Implement logic to adjust the learning rate based on factors like validation loss, training progress, or other metrics (a minimal sketch follows).
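A convenient starting point for custom logic driven by training progress is LambdaLR, which scales the optimizer's base learning rate by whatever function of the epoch you provide. The sketch below uses a linear warmup followed by a constant rate; the 5-epoch warmup length is only an illustration, and optimizer and num_epochs are reused from the earlier examples:

def warmup_then_constant(epoch, warmup_epochs=5):
    """Multiplier applied to the base lr: ramp up linearly, then hold at 1.0."""
    return min(1.0, (epoch + 1) / warmup_epochs)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_then_constant)

for epoch in range(num_epochs):
    # ... training steps (including optimizer.step()) ...
    scheduler.step()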