Fine-Tuning the Optimization Process: Alternatives to torch.optim.Optimizer.add_param_group() in PyTorch
Purpose
- torch.optim.Optimizer.add_param_group() allows you to dynamically add a new group of parameters to an existing optimizer during training.
- It's particularly useful in scenarios like:
- Fine-tuning pre-trained models: You might initially freeze certain layers (their weights are not updated) for stability, but later want to fine-tune them by making them trainable and adding them to the optimizer using add_param_group().
- Applying different learning rates to different parameter groups: You can create groups with specific learning rates or other optimization hyperparameters to tailor the update process for different parts of your model.
Function Arguments
param_group (dict): This dictionary defines the new parameter group you're adding. It should contain the following keys (values can be omitted to fall back to the optimizer's defaults):
- params (iterable): An iterable of PyTorch tensors representing the parameters to be optimized in this group. This could be individual tensors or the parameters from a layer's parameters() method.
- lr (float, optional): The learning rate to apply to this parameter group. If not specified, the optimizer's default learning rate will be used.
- momentum (float, optional): The momentum factor (used in optimizers like SGD with momentum) for this group. If not specified, the optimizer's default momentum will be used.
- weight_decay (float, optional): The L2 penalty coefficient (weight decay) for this group. If not specified, the optimizer's default weight decay will be used.
- Other optimizer-specific options (refer to the documentation for the specific optimizer you're using for a complete list); the sketch below shows a group that sets several of these keys.
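A minimal sketch, assuming an SGD optimizer (since momentum is an SGD-style option): a new group can override any of these keys, and keys you omit fall back to the defaults the optimizer was constructed with, which you can verify through optimizer.param_groups.
import torch
from torch import nn
from torch.optim import SGD

base = nn.Linear(10, 10)
extra = nn.Linear(10, 10)

# Optimizer defaults for every group: lr=0.1, momentum=0.9 (weight_decay defaults to 0)
optimizer = SGD(base.parameters(), lr=0.1, momentum=0.9)

# The new group overrides lr and weight_decay; momentum falls back to the default 0.9
optimizer.add_param_group({
    "params": extra.parameters(),
    "lr": 0.01,
    "weight_decay": 1e-4,
})

for i, group in enumerate(optimizer.param_groups):
    print(i, group["lr"], group["momentum"], group["weight_decay"])
Each entry in optimizer.param_groups is a dict holding the group's parameters together with its resolved hyperparameters.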
Example
import torch
from torch import nn
from torch.optim import Adam

# A small model where the first layer starts out frozen and the second is trainable
frozen_layer = nn.Linear(20, 10)
trainable_layer = nn.Linear(10, 5)
for param in frozen_layer.parameters():
    param.requires_grad = False  # Freeze the first layer
model = nn.Sequential(frozen_layer, trainable_layer)

# Create an optimizer initially using only the trainable layer's parameters
optimizer = Adam(trainable_layer.parameters())

# Later, decide to fine-tune the previously frozen layer
for param in frozen_layer.parameters():
    param.requires_grad = True  # Make it trainable

# Create a new parameter group with a different (lower) learning rate for that layer
new_param_group = {"params": frozen_layer.parameters(), "lr": 0.001}

# Add this group to the optimizer
optimizer.add_param_group(new_param_group)

# Now both layers will be updated during training
- Remember to set requires_grad=True on the parameters of previously frozen layers before adding them to a parameter group.
- You can add multiple parameter groups with different configurations to achieve more granular control over the optimization process.
- add_param_group() modifies the optimizer's internal state; the new group only affects optimizer.step() calls made after you add it, as in the sketch below.
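For instance, here is a minimal sketch of unfreezing a layer partway through training, using random stand-in data and an arbitrarily chosen unfreeze point at epoch 2; steps taken before the call leave the frozen layer untouched, and steps after it update both layers:
import torch
from torch import nn
from torch.optim import Adam

backbone = nn.Linear(10, 10)
head = nn.Linear(10, 1)
for param in backbone.parameters():
    param.requires_grad = False  # Start with the backbone frozen

model = nn.Sequential(backbone, head)
optimizer = Adam(head.parameters(), lr=1e-3)

for epoch in range(5):
    if epoch == 2:  # Assumed unfreeze point for this sketch
        for param in backbone.parameters():
            param.requires_grad = True
        optimizer.add_param_group({"params": backbone.parameters(), "lr": 1e-4})

    x, y = torch.randn(32, 10), torch.randn(32, 1)  # Random stand-in data
    loss = nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()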
Applying different learning rates to different parameter groups
import torch
from torch import nn
from torch.optim import SGD

# Define a model with weights and biases
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 20)
        self.fc2 = nn.Linear(20, 5)

    def forward(self, x):
        return self.fc2(self.fc1(x))

model = MyModel()

# Create separate parameter groups for weights and biases
param_group1 = {'params': [param for name, param in model.named_parameters() if 'weight' in name]}
param_group2 = {'params': [param for name, param in model.named_parameters() if 'bias' in name], 'lr': 0.01}  # Lower learning rate for biases

# Create the optimizer; a group without its own 'lr' uses the default lr=0.1
optimizer = SGD([param_group1, param_group2], lr=0.1)
In this example, we create separate parameter groups for weights and biases. The weights are updated with the optimizer's default learning rate of 0.1, while the biases use the group-specific learning rate of 0.01. Per-group settings like this let you treat different kinds of parameters differently without writing a custom update rule.
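Continuing the example above, you can confirm what each group ended up with by reading optimizer.param_groups (the same list that learning-rate schedulers adjust):
for i, group in enumerate(optimizer.param_groups):
    print(f"group {i}: lr={group['lr']}, {len(group['params'])} tensors")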
Fine-tuning a pre-trained model with different learning rates
import torch
from torch import nn
from torchvision import models
from torch.optim import Adam

# Load a pre-trained model (e.g., ResNet)
# (newer torchvision versions use the weights= argument instead of pretrained=)
model = models.resnet18(pretrained=True)

# Replace the final fully connected layer for the new task (it starts untrained)
model.fc = nn.Linear(model.fc.in_features, 10)  # 10 output classes, as an example

# Create parameter groups with different learning rates:
# - Lower learning rate for the pre-trained backbone (gentle fine-tuning)
# - Higher learning rate for the newly added final layer (faster adaptation)
param_group1 = {'params': [param for name, param in model.named_parameters() if 'fc' not in name], 'lr': 0.001}
param_group2 = {'params': [param for name, param in model.named_parameters() if 'fc' in name], 'lr': 0.01}

# Create optimizer with separate learning rates
optimizer = Adam([param_group1, param_group2])
This example fine-tunes a pre-trained ResNet by replacing its final fully connected layer for the new task and creating two parameter groups: one for the pre-trained backbone with a lower learning rate (so the pre-trained weights change only gently), and one for the newly added layer with a higher learning rate (so it adapts quickly). If you instead want the backbone completely frozen, set requires_grad=False on its parameters and leave them out of the optimizer.
Adding L2 weight decay to a specific parameter group
import torch
from torch import nn
from torch.optim import Adam

# Define a model
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 20)
        self.fc2 = nn.Linear(20, 5)

    def forward(self, x):
        return self.fc2(self.fc1(x))

model = MyModel()

# Create a parameter group with L2 weight decay for the first layer
param_group1 = {'params': model.fc1.parameters(), 'weight_decay': 0.001}
param_group2 = {'params': model.fc2.parameters()}  # No weight decay

# Create optimizer with L2 weight decay for fc1 only
optimizer = Adam([param_group1, param_group2])
Here, we create separate parameter groups for each layer of the model. The first layer's parameter group includes weight_decay=0.001
to apply L2 weight regularization, while the second layer's group doesn't have weight decay. This allows for more fine-grained control over regularization within your model.
Multiple Optimizers
- If you need drastically different learning rates or optimization algorithms for different parts of your model, consider creating separate optimizers for each portion.
- For example, you might use Adam for the main body of your network and SGD with momentum for the final layers, as sketched below.
- This approach provides strict separation but can be more complex to manage, since each optimizer must be stepped and zeroed separately.
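A rough sketch of that split, using Adam for the body and SGD with momentum for the final layer (the model here is a stand-in, and the learning rates are illustrative):
import torch
from torch import nn
from torch.optim import Adam, SGD

body = nn.Sequential(nn.Linear(10, 20), nn.ReLU())
final = nn.Linear(20, 5)
model = nn.Sequential(body, final)

# One optimizer per portion of the model
body_optimizer = Adam(body.parameters(), lr=1e-3)
final_optimizer = SGD(final.parameters(), lr=1e-2, momentum=0.9)

x, y = torch.randn(8, 10), torch.randint(0, 5, (8,))  # Random stand-in data
loss = nn.functional.cross_entropy(model(x), y)

body_optimizer.zero_grad()
final_optimizer.zero_grad()
loss.backward()
body_optimizer.step()   # Adam update for the body
final_optimizer.step()  # SGD-with-momentum update for the final layer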
Manual Parameter Update
- In very specific scenarios, you might choose to manually update specific parameters during training. This involves:
- Accessing individual parameters using model.layer.weight or model.layer.bias.
- Calculating the update according to the chosen optimization algorithm (e.g., subtracting the learning rate times the gradient).
- Assigning the updated value back to the parameter (see the sketch after this list).
- This approach is highly manual and error-prone, so use it cautiously and only for specific research purposes.
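A minimal sketch of a hand-written SGD-style update for a single linear layer, with random stand-in data; the key points are performing the update inside torch.no_grad() and clearing the gradients afterwards:
import torch
from torch import nn

layer = nn.Linear(10, 1)
lr = 0.01

x, y = torch.randn(32, 10), torch.randn(32, 1)  # Random stand-in data
loss = nn.functional.mse_loss(layer(x), y)
loss.backward()

with torch.no_grad():  # Updates must not be tracked by autograd
    layer.weight -= lr * layer.weight.grad
    layer.bias -= lr * layer.bias.grad
    layer.weight.grad.zero_()  # Clear gradients before the next step
    layer.bias.grad.zero_()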
Model Wrapping
- If you need complex optimization strategies beyond what add_param_group() offers, consider wrapping your model with a custom class that handles parameter updates; a bare-bones sketch follows this list.
- This allows fine-grained control over how parameters are grouped and updated during training.
- This is an advanced approach and requires an in-depth understanding of optimization algorithms.
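As a bare-bones sketch of the idea (the class name, grouping rule, and hyperparameters here are purely illustrative), the wrapper owns the model and decides internally how each group of parameters is updated, so the training loop only ever talks to the wrapper:
import torch
from torch import nn
from torch.optim import Adam, SGD

class OptimizedModel:
    """Illustrative wrapper that owns the model and its update strategy."""

    def __init__(self, model: nn.Module):
        self.model = model
        # Grouping rule (illustrative): biases via SGD, everything else via Adam
        weights = [p for n, p in model.named_parameters() if "bias" not in n]
        biases = [p for n, p in model.named_parameters() if "bias" in n]
        self.optimizers = [
            Adam(weights, lr=1e-3),
            SGD(biases, lr=1e-2),
        ]

    def zero_grad(self):
        for opt in self.optimizers:
            opt.zero_grad()

    def step(self):
        for opt in self.optimizers:
            opt.step()

# Usage: the training loop only interacts with the wrapper
wrapped = OptimizedModel(nn.Linear(10, 5))
x, y = torch.randn(8, 10), torch.randn(8, 5)  # Random stand-in data
loss = nn.functional.mse_loss(wrapped.model(x), y)
wrapped.zero_grad()
loss.backward()
wrapped.step()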
Choosing the Right Approach
The best alternative depends on your specific needs and the level of control you want.
- Model wrapping is a highly customized approach for researchers who need intricate control over the optimization process.
- Creating multiple optimizers or performing manual parameter updates is appropriate for advanced scenarios requiring very specific optimization strategies.
- For most cases, add_param_group() is the simplest and recommended approach for adding parameter groups with different learning rates or other hyperparameters.