Customizing Quantization in PyTorch FX with ConvertCustomConfig
PyTorch Quantization and ConvertCustomConfig
PyTorch quantization is a technique for optimizing deep learning models by converting them from using high-precision floating-point numbers (e.g., 32-bit floats) to lower-precision integer representations (e.g., 8-bit integers). This reduces model size and improves inference speed on hardware that efficiently handles integer operations.
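As a quick illustration (a minimal sketch; the toy model is a placeholder, not from the article), post-training dynamic quantization converts the weights of selected layers to int8:
import torch

# Dynamic quantization stores Linear weights as int8 and runs the matmuls
# with integer kernels at inference time.
float_model = torch.nn.Sequential(
    torch.nn.Linear(64, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10)
).eval()
int8_model = torch.ao.quantization.quantize_dynamic(
    float_model, {torch.nn.Linear}, dtype=torch.qint8
)
print(int8_model)  # Linear layers are replaced with DynamicQuantizedLinear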
- Backward Compatibility: This class offers a way to stay compatible with older quantization workflows that relied on manual, dictionary-based configuration. It lets you specify conversion details that would otherwise be handled automatically by the newer PyTorch FX quantization machinery.
Key Methods of ConvertCustomConfig
- from_dict(convert_custom_config_dict)
  This static method creates a ConvertCustomConfig object from a dictionary of configuration options. These options can include:
  - an observed-to-quantized mapping: a dictionary that maps observed custom module classes (produced during the preparation stage) to their corresponding quantized module classes. This is useful if you have custom modules that must be swapped for specific quantized implementations during conversion.
  - preserved_attributes: a list of attribute names that should be kept during conversion. By default, attributes that are not used in forward are dropped from the converted model; this lets you override that behavior if necessary.
- set_observed_to_quantized_mapping(observed_class, quantized_class)
  This method explicitly sets the mapping between an observed module class and its quantized counterpart. It is equivalent to providing the mapping in the from_dict method.
- set_preserved_attributes(attributes_to_preserve)
  This method specifies the list of attribute names you want to keep during conversion. It is the same as providing the preserved_attributes option in the from_dict method.
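For example, the two construction styles look like this (a minimal sketch; ObservedCustomModule, QuantizedCustomModule, and the attribute name are hypothetical placeholders, not part of the PyTorch API):
import torch
from torch.ao.quantization.fx.custom_config import ConvertCustomConfig

# Placeholder classes standing in for a user-defined observed/quantized module pair.
class ObservedCustomModule(torch.nn.Module): ...
class QuantizedCustomModule(torch.nn.Module): ...

config = (
    ConvertCustomConfig()
    .set_observed_to_quantized_mapping(ObservedCustomModule, QuantizedCustomModule)
    .set_preserved_attributes(["scale_hint"])  # hypothetical attribute name
)
# The same configuration can be round-tripped through a plain dictionary.
same_config = ConvertCustomConfig.from_dict(config.to_dict())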
In essence, ConvertCustomConfig is a tool for fine-tuning the conversion stage of PyTorch FX-based quantization in scenarios where you need to control how observed custom modules are swapped for quantized implementations or preserve certain attributes.
Example
The sketch below shows how ConvertCustomConfig fits into the prepare/convert workflow. It assumes a recent PyTorch release with the FX graph mode quantization APIs; the module classes and the preserved attribute name are illustrative placeholders.
import torch
import torch.ao.nn.quantized as nnq
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx
from torch.ao.quantization.fx.custom_config import PrepareCustomConfig, ConvertCustomConfig

class MyCustomModule(torch.nn.Module):           # float custom module (placeholder)
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(8, 8)
    def forward(self, x):
        return self.linear(x)

class ObservedMyCustomModule(torch.nn.Module):   # inserted by prepare_fx
    def __init__(self, linear):
        super().__init__()
        self.linear = linear
    def forward(self, x):
        return self.linear(x)
    @classmethod
    def from_float(cls, mod):
        observed = cls(mod.linear)
        observed.qconfig = mod.qconfig
        return observed

class QuantizedMyCustomModule(torch.nn.Module):  # inserted by convert_fx
    def __init__(self, linear):
        super().__init__()
        self.linear = linear
    def forward(self, x):
        return self.linear(x)
    @classmethod
    def from_observed(cls, mod):
        mod.linear.qconfig, mod.linear.activation_post_process = mod.qconfig, mod.activation_post_process
        return cls(nnq.Linear.from_float(mod.linear))

class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.custom = MyCustomModule()
        self.some_important_attribute = "keep me"
    def forward(self, x):
        return self.custom(x)

model = Model().eval()
example_inputs = (torch.randn(1, 8),)
prepare_custom_config = (PrepareCustomConfig()
    .set_float_to_observed_mapping(MyCustomModule, ObservedMyCustomModule)
    .set_preserved_attributes(["some_important_attribute"]))
convert_custom_config = (ConvertCustomConfig()
    .set_observed_to_quantized_mapping(ObservedMyCustomModule, QuantizedMyCustomModule)
    .set_preserved_attributes(["some_important_attribute"]))
prepared = prepare_fx(model, get_default_qconfig_mapping("fbgemm"), example_inputs,
                      prepare_custom_config=prepare_custom_config)
prepared(*example_inputs)  # calibration pass
quantized_model = convert_fx(prepared, convert_custom_config=convert_custom_config)
- We define a custom float module MyCustomModule, an observed counterpart ObservedMyCustomModule with a from_float classmethod, and a quantized counterpart QuantizedMyCustomModule with a from_observed classmethod; these hooks tell FX quantization how to move between the three stages.
- The PrepareCustomConfig tells prepare_fx to replace MyCustomModule instances with ObservedMyCustomModule during preparation.
- In the ConvertCustomConfig object:
  - set_observed_to_quantized_mapping specifies that ObservedMyCustomModule instances observed during preparation will be converted to QuantizedMyCustomModule.
  - set_preserved_attributes ensures that some_important_attribute is not discarded during conversion (it is listed in the prepare config as well, so it survives both stages).
- The prepare_fx and convert_fx functions from torch.ao.quantization.quantize_fx drive the FX-based quantization workflow, with the custom config objects passed alongside the qconfig mapping and other options.
Alternatives
Quantization Aware Training (QAT)
- This is an approach where you train (or fine-tune) the model with simulated quantization noise during the training process itself. It often preserves accuracy better than post-training quantization methods.
- PyTorch provides built-in support for QAT: you assign a QAT qconfig to the model (e.g., torch.ao.quantization.get_default_qat_qconfig), call prepare_qat (or prepare_qat_fx in graph mode), fine-tune, and then convert the model. The qconfig object defines the quantization configuration for activations and weights and allows for granular control over the process.
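A minimal eager-mode QAT sketch (the toy model, dummy loss, and training loop are placeholders, not from the original article; a real eager-mode deployment would also wrap the model with QuantStub/DeQuantStub as described in the next section):
import torch
from torch.ao.quantization import get_default_qat_qconfig, prepare_qat, convert

model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU(), torch.nn.Linear(16, 4))
model.train()
model.qconfig = get_default_qat_qconfig("fbgemm")
qat_model = prepare_qat(model)               # insert fake-quantize modules

optimizer = torch.optim.SGD(qat_model.parameters(), lr=0.01)
for _ in range(10):                          # fine-tune with simulated quantization noise
    out = qat_model(torch.randn(8, 16))
    loss = out.pow(2).mean()                 # dummy loss for illustration
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

qat_model.eval()
int8_model = convert(qat_model)              # swap fake-quant modules for real int8 ops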
Lower-Level Quantization APIs
- PyTorch offers lower-level APIs for constructing quantized models directly. These provide more control over the quantization process, but require a deeper understanding of the quantization mechanisms.
- You can use modules like torch.ao.quantization.QuantStub and torch.ao.quantization.DeQuantStub to mark where tensors enter and leave the quantized part of your model, wrap layers between them, and then drive the prepare/calibrate/convert steps yourself so computations run in lower precision (see the sketch after this list).
- This approach offers maximum flexibility, but requires more manual effort and expertise.
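A minimal sketch of this eager-mode flow (the stubbed model is a placeholder; "fbgemm" assumes an x86 backend):
import torch
from torch.ao.quantization import QuantStub, DeQuantStub, get_default_qconfig, prepare, convert

class StubbedModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()       # float -> quantized boundary
        self.linear = torch.nn.Linear(16, 4)
        self.dequant = DeQuantStub()   # quantized -> float boundary
    def forward(self, x):
        return self.dequant(self.linear(self.quant(x)))

model = StubbedModel().eval()
model.qconfig = get_default_qconfig("fbgemm")
prepared = prepare(model)              # insert observers
prepared(torch.randn(4, 16))           # calibration pass
quantized = convert(prepared)          # swap in quantized modules
print(quantized(torch.randn(4, 16)))   # the Linear now runs in int8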
Third-Party Quantization Libraries
- Several third-party libraries, such as ONNX Runtime or TensorFlow Lite Micro, provide quantization tools that can be applied to exported PyTorch models. These libraries often offer additional optimizations and hardware-specific support.
- They might require adapting your model to their specific workflows, but can be useful for deploying models on specific platforms.
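As one concrete route (a sketch that assumes the onnxruntime package is installed; the model and file names are placeholders), a PyTorch model can be exported to ONNX and then quantized with ONNX Runtime's dynamic quantizer:
import torch
from onnxruntime.quantization import quantize_dynamic, QuantType

# Export a toy PyTorch model to ONNX, then let ONNX Runtime quantize its weights to int8.
model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU(), torch.nn.Linear(16, 4)).eval()
torch.onnx.export(model, torch.randn(1, 16), "model_fp32.onnx")
quantize_dynamic("model_fp32.onnx", "model_int8.onnx", weight_type=QuantType.QInt8)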
- For most cases, QAT is the recommended approach: it offers a good balance between accuracy and performance.
- Lower-level APIs are useful when you need maximum control over quantization or for research purposes.
- Consider third-party libraries if you need hardware-specific optimizations or deployment on specific platforms.