8.2. Quantization
Quantization reduces model precision from 32-bit floating point to lower bit widths (8-bit, 4-bit, or 2-bit integers), dramatically reducing model size and improving inference speed.
8.2.1. Overview
Why quantize?

- **Smaller models:** 4x reduction (float32 → int8)
- **Faster inference:** Integer operations execute faster than floating point
- **NPU requirement:** TI's NPU requires quantized models
- **Lower power:** Reduced memory bandwidth
8.2.2. Configuration Parameters
Quantization in Tiny ML Tensorlab is controlled by four parameters in the
training section of the config YAML:
| Option | Values | Description |
|---|---|---|
| `quantization` | 0, 1, 2 | Quantization mode. |
| `quantization_method` | 'QAT', 'PTQ' | Quantization method. Only applicable when `quantization` is 1 or 2. |
| `quantization_weight_bitwidth` | 2, 4, 8 | Bit width for weight quantization. Only applicable when `quantization` is 1 or 2. |
| `quantization_activation_bitwidth` | 2, 4, 8 | Bit width for activation quantization. Only applicable when `quantization` is 1 or 2. |
Note
`quantization_method`, `quantization_weight_bitwidth`, and `quantization_activation_bitwidth` are only used when `quantization` is set to 1 or 2. When `quantization` is 0 (floating point training), these parameters have no effect.
8.2.3. Quantization Modes
Floating Point Training (quantization: 0)
Standard float32 training with no quantization applied:
```yaml
training:
  model_name: 'CLS_4k_NPU'
  quantization: 0
```
Standard PyTorch Quantization (quantization: 1)
Uses the standard PyTorch quantization APIs (`GenericTinyMLQATFxModule` / `GenericTinyMLPTQFxModule`). Suitable for general-purpose CPU deployment:
```yaml
training:
  model_name: 'CLS_4k'
  quantization: 1
  quantization_method: 'QAT'
  quantization_weight_bitwidth: 8
  quantization_activation_bitwidth: 8
```
For more details on the underlying wrappers, see Quantization Wrapper Architecture below.
TI Style Optimised Quantization (quantization: 2)
Uses TI's NPU-optimised quantization (`TINPUTinyMLQATFxModule` / `TINPUTinyMLPTQFxModule`). This incorporates the constraints of the TI NPU hardware accelerator and is required for NPU deployment:
```yaml
training:
  model_name: 'CLS_4k_NPU'
  quantization: 2
  quantization_method: 'QAT'
  quantization_weight_bitwidth: 8
  quantization_activation_bitwidth: 8
```
8.2.4. Quantization Methods
Post-Training Quantization (PTQ)
Quantizes a trained float model after training:
```yaml
training:
  model_name: 'CLS_4k_NPU'
  quantization: 2
  quantization_method: 'PTQ'
  quantization_weight_bitwidth: 8
  quantization_activation_bitwidth: 8
```
Pros: Fast, simple, no retraining required
Cons: May lose accuracy for some models
Quantization-Aware Training (QAT)
Simulates quantization during training for better accuracy retention:
```yaml
training:
  model_name: 'CLS_4k_NPU'
  quantization: 2
  quantization_method: 'QAT'
  quantization_weight_bitwidth: 8
  quantization_activation_bitwidth: 8
```
Pros: Better accuracy retention
Cons: Longer training time
8.2.5. Bit Widths
8-bit Quantization
Most common choice, good accuracy retention:
```yaml
training:
  quantization: 2
  quantization_method: 'QAT'
  quantization_weight_bitwidth: 8
  quantization_activation_bitwidth: 8
```
- Model size: 4x smaller than float32
- Accuracy loss: Usually <1%
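As a rough illustration of what 8-bit quantization does, the sketch below (plain Python, not the toolchain's internal implementation) maps floats to int8 with a single symmetric scale and back:

```python
# Minimal sketch of symmetric 8-bit quantization: one scale maps
# floats into the signed int8 range [-128, 127] and back.
def quantize_int8(values, scale):
    """Quantize floats to signed 8-bit integers with the given scale."""
    return [max(-128, min(127, round(v / scale))) for v in values]

def dequantize_int8(q_values, scale):
    """Map int8 values back to approximate floats."""
    return [q * scale for q in q_values]

weights = [0.50, -0.25, 0.124, -0.9]
scale = max(abs(w) for w in weights) / 127   # symmetric: scale from max |w|
q = quantize_int8(weights, scale)
approx = dequantize_int8(q, scale)

# Each dequantized value stays within one scale step of the original
assert all(abs(a - w) <= scale for a, w in zip(approx, weights))
```

The rounding error per value is bounded by half a scale step, which is why 8-bit quantization typically costs so little accuracy.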
4-bit Quantization
Aggressive compression for size-constrained devices:
```yaml
training:
  quantization: 2
  quantization_method: 'QAT'
  quantization_weight_bitwidth: 4
  quantization_activation_bitwidth: 4
```
- Model size: 8x smaller than float32
- Accuracy loss: 1-5% typical
2-bit Quantization
Maximum compression, limited use cases:
```yaml
training:
  quantization: 2
  quantization_method: 'QAT'
  quantization_weight_bitwidth: 2
  quantization_activation_bitwidth: 2
```
- Model size: 16x smaller than float32
- Accuracy loss: Can be significant
Note
Weight and activation bit widths can be set independently. For example, you can use 8-bit activations with 4-bit weights:
```yaml
training:
  quantization: 2
  quantization_method: 'QAT'
  quantization_weight_bitwidth: 4
  quantization_activation_bitwidth: 8
```
8.2.6. NPU Quantization Requirements
TI’s NPU requires TI style optimised quantization (quantization: 2):
```yaml
common:
  target_device: 'F28P55'
training:
  model_name: 'CLS_4k_NPU'
  quantization: 2
  quantization_method: 'QAT'
  quantization_weight_bitwidth: 8
  quantization_activation_bitwidth: 8
```
NPU Constraints:

- Must use `quantization: 2` (TI style optimised)
- INT8 or INT4 bit widths recommended
- Symmetric quantization preferred
- Per-channel quantization for weights
- Per-tensor quantization for activations
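The per-channel vs. per-tensor distinction can be illustrated with a small sketch (plain Python, illustrative only; the TINPU wrapper derives these scales internally):

```python
# Per-channel weight scales: each output channel gets its own symmetric
# scale, so a channel with small weights keeps fine resolution.
def per_channel_scales(weight_rows, qmax=127):
    """One symmetric scale per output channel (row of the weight matrix)."""
    return [max(abs(w) for w in row) / qmax for row in weight_rows]

# Per-tensor activation scale: one scale for the whole tensor.
def per_tensor_scale(tensor, qmax=127):
    """A single symmetric scale for the whole tensor."""
    return max(abs(v) for v in tensor) / qmax

weights = [[0.8, -0.2], [0.05, 0.01]]   # two output channels
channel_scales = per_channel_scales(weights)
tensor_scale = per_tensor_scale([v for row in weights for v in row])

# The small channel gets a much finer scale than the shared one
assert channel_scales[1] < tensor_scale
```

Per-channel scaling is why weight quantization tolerates low bit widths better than a single shared scale would.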
8.2.7. Output Files
After quantization, you’ll find:
```text
.../training/
├── base/
│   └── best_model.pt            # Float32 model
└── quantization/
    ├── best_model.onnx          # Quantized ONNX
    └── quantization_config.yaml
```
8.2.8. Accuracy Comparison
Typical accuracy retention by bit width:
| Precision | Size Reduction | Speed Improvement | Accuracy Drop |
|---|---|---|---|
| Float32 | 1x (baseline) | 1x (baseline) | 0% |
| INT8 | 4x | 2-4x | <1% |
| INT4 | 8x | 3-6x | 1-5% |
| INT2 | 16x | 4-8x | 5-15% |
Note: Results vary by model and task.
8.2.9. Troubleshooting Accuracy Loss
If quantization hurts accuracy:
1. Try QAT instead of PTQ:
```yaml
training:
  quantization: 2
  quantization_method: 'QAT'
  quantization_weight_bitwidth: 8
  quantization_activation_bitwidth: 8
```
2. Use higher bit widths:
If using 4-bit or 2-bit, try 8-bit first:
```yaml
training:
  quantization_weight_bitwidth: 8
  quantization_activation_bitwidth: 8
```
3. Keep activations at higher precision:
Use higher activation bit width with lower weight bit width:
```yaml
training:
  quantization_weight_bitwidth: 4
  quantization_activation_bitwidth: 8
```
4. Increase model size:
A larger model may tolerate quantization better.
8.2.10. Best Practices
- **Start with INT8:** Best balance of compression and accuracy
- **Use QAT for critical applications:** When accuracy is paramount
- **Use TI optimised quantization for NPU:** Set `quantization: 2` for NPU targets
- **Compare float vs quantized:** Always measure the accuracy drop
- **Test on target device:** Verify behavior matches simulation
8.2.11. Example: Full Quantization Workflow
```yaml
common:
  task_type: 'generic_timeseries_classification'
  target_device: 'F28P55'
dataset:
  dataset_name: 'dc_arc_fault_example_dsk'
data_processing_feature_extraction:
  feature_extraction_name: 'FFT1024Input_256Feature_1Frame_Full_Bandwidth'
  variables: 1
training:
  model_name: 'CLS_4k_NPU'
  training_epochs: 30
  batch_size: 256
  quantization: 2
  quantization_method: 'QAT'
  quantization_weight_bitwidth: 8
  quantization_activation_bitwidth: 8
testing:
  enable: True
compilation:
  enable: True
  preset_name: 'compress_npu_layer_data'
```
Expected Results:

Float32 model:

- Accuracy: 99.2%
- Size: 1.6 KB

INT8 quantized model:

- Accuracy: 99.0%
- Size: 0.4 KB
- Speedup: 3.5x on NPU
8.2.12. Memory Savings
Quantization reduces memory at multiple levels:
Model Weights:

- Float32: 4 bytes per parameter
- INT8: 1 byte per parameter
- INT4: 0.5 bytes per parameter

Example: a 4000-parameter model

- Float32: 16 KB
- INT8: 4 KB
- INT4: 2 KB

Activations:

Intermediate computations also benefit from reduced precision.

Total Memory:

For memory-constrained devices, quantization may be the difference between fitting and not fitting.
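The weight-size arithmetic above can be checked with a few lines of Python (weights only, ignoring packing overhead and activation buffers):

```python
# Bytes needed to store model weights at a given bit width.
def model_weight_bytes(num_params, bits):
    """Weight storage in bytes, assuming dense packing."""
    return num_params * bits // 8

# The 4000-parameter example from the text:
for bits in (32, 8, 4):
    kb = model_weight_bytes(4000, bits) / 1000
    print(f"{bits}-bit: {kb} KB")
# 32-bit: 16.0 KB, 8-bit: 4.0 KB, 4-bit: 2.0 KB
```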
8.2.13. Performance Impact
NPU Performance (model: CLS_4k_NPU on F28P55):

- Float32 (CPU): ~5000 µs
- INT8 (NPU): ~300 µs
- INT4 (NPU): ~200 µs

CPU Performance:

Even without the NPU, integer operations are faster (model: CLS_4k on F28P65, no NPU):

- Float32: ~5000 µs
- INT8: ~2000 µs
8.2.14. Quantization Wrapper Architecture
Under the hood, Tiny ML Tensorlab uses quantization wrapper classes from
the tinyml-modeloptimization package. Understanding the wrapper
architecture helps when customizing quantization or debugging.
Class Hierarchy:
```text
TinyMLQuantFxBaseModule (base class)
├── TINPUTinyMLQuantFxModule
│   ├── TINPUTinyMLQATFxModule   (quantization: 2, QAT)
│   └── TINPUTinyMLPTQFxModule   (quantization: 2, PTQ)
└── GenericTinyMLQuantFxModule
    ├── GenericTinyMLQATFxModule (quantization: 1, QAT)
    └── GenericTinyMLPTQFxModule (quantization: 1, PTQ)
```
`TINPUTinyML` wrappers (`quantization: 2`) incorporate the constraints of the TI NPU hardware accelerator. They perform extensive graph transformations, including 13+ layer pattern replacements, to produce NPU-compatible integer operations. Key characteristics:

- Enforces power-of-2 scale factors (mandatory for 8-bit quantization)
- Transforms convolution, pooling, linear, and batch normalization layers to NPU-compatible patterns
- Implements the NPU BNORM sequence: Add (bias) → Mul (scale) → Div (2^n, right shift) → Floor → Clip
- All operations stay in the integer domain; there is no dequantization step
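To make the BNORM sequence concrete, here is a scalar integer-only sketch (plain Python, illustrative; the hardware pipeline operates on tensors, not single values):

```python
# BNORM sequence as described above, entirely in the integer domain:
# add bias, multiply by an integer scale, right-shift by n (floor
# division by 2**n), then clip to the signed 8-bit range.
def bnorm(acc, bias, scale, shift, lo=-128, hi=127):
    x = acc + bias                 # Add (bias)
    x = x * scale                  # Mul (scale)
    x = x >> shift                 # Div by 2^n with Floor (arithmetic shift)
    return max(lo, min(hi, x))     # Clip to int8 range

# (1000 + 24) * 3 >> 5  ==  3072 // 32  ==  96, within range
print(bnorm(1000, 24, 3, 5))       # → 96
```

Python's `>>` on negative integers is an arithmetic shift that floors toward negative infinity, matching the Floor step in the sequence.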
`GenericTinyML` wrappers (`quantization: 1`) use standard PyTorch quantization APIs with minimal modifications, relying on ONNX Runtime for optimization. Key characteristics:

- Flexible scaling (no power-of-2 constraint)
- Only one pattern replacement (permute + unsqueeze)
- Uses PyTorch's native quantized operations
- Relies on ONNX Runtime optimization for deployment
Note
When using the toolchain via YAML configs, you do not need to interact
with these wrapper classes directly. Setting quantization: 1 or
quantization: 2 in the config selects the appropriate wrapper
automatically.
8.2.15. NPU Hardware Constraints
When using TI style optimised quantization (quantization: 2), the
following hardware constraints are enforced automatically by the TINPU
wrapper:
Channel Alignment:
Input and output channels must be multiples of 4. The NPU processes data in SIMD fashion with 4-channel vectors.
| Layer Type | Channel Requirement | Notes |
|---|---|---|
| FCONV (First Conv) | Input: exactly 1, Output: multiple of 4 | First layer in the network |
| GCONV (Generic Conv) | Input and Output: multiple of 4 | General convolution layers |
| DWCONV (Depthwise Conv) | Input/Output: multiple of 4 | Depthwise separable layers |
| PWCONV (Pointwise Conv) | Input/Output: multiple of 4 | 1x1 convolution layers |
| FC (Fully Connected) | Input: multiple of 4 | Dense/linear layers |
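A small helper (hypothetical, not part of the toolchain API) can sanity-check these alignment rules before training:

```python
# Hypothetical helper encoding the channel-alignment rules above:
# the first conv takes exactly 1 input channel; all layers need
# channel counts that are multiples of 4 for the NPU's 4-wide SIMD.
def npu_channels_ok(in_ch, out_ch, first_layer=False):
    """Return True if channel counts satisfy the NPU alignment rules."""
    if first_layer:
        return in_ch == 1 and out_ch % 4 == 0
    return in_ch % 4 == 0 and out_ch % 4 == 0

assert npu_channels_ok(1, 8, first_layer=True)   # valid FCONV
assert npu_channels_ok(8, 16)                    # valid GCONV
assert not npu_channels_ok(3, 8)                 # 3 input channels: rejected
```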
Power-of-2 Scaling:
For 8-bit quantization, scale factors must be powers of 2. This enables efficient implementation as bit shifts in hardware, avoiding expensive division operations. For sub-8-bit quantization (4-bit, 2-bit), non-power-of-2 scales are supported and may provide better accuracy.
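Snapping an arbitrary scale to the nearest power of 2 can be sketched in a few lines (illustrative; the TINPU wrapper performs this internally):

```python
import math

# Round a scale factor to the nearest power of 2, so the hardware can
# apply it as a bit shift instead of a division.
def to_power_of_2(scale):
    """Nearest power-of-2 approximation of a positive scale factor."""
    return 2.0 ** round(math.log2(scale))

s = to_power_of_2(0.013)     # snaps to 2**-6 = 0.015625
shift = -int(math.log2(s))   # corresponding right-shift amount: 6
print(s, shift)              # → 0.015625 6
```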
Bitwidth Constraints:
| Parameter | Allowed Values | Notes |
|---|---|---|
| Weight bitwidth | 2, 4, or 8 bits (signed) | Determines model compression ratio |
| Activation bitwidth | 8 bits (signed or unsigned) | Fixed at 8 bits for NPU acceleration |
| Bias | 16-bit (2b/4b weights), 24-bit (8b weights) | Automatically computed |
| Scale | 8-bit unsigned (2b/4b), power-of-2 shift (8b) | Automatically computed |
Supported NPU Layer Patterns:
The NPU accelerates the following layer types: FCONV, GCONV, DWCONV, PWCONV, FC, AVGPOOL, MAXPOOL. Each layer includes a BNORM sequence (bias → scale → shift → floor → clip) that maps directly to NPU hardware units.
Warning
Models with layers that do not meet NPU constraints will fall back to
CPU execution for those layers. Use quantization: 1 (Generic) for
models that cannot satisfy these constraints.
8.2.16. Using Quantization Wrappers Directly
For advanced users who want to use the quantization wrappers outside the Tiny ML Tensorlab toolchain (e.g., in custom PyTorch training scripts), the wrappers can be imported and used directly.
TINPU QAT Example:
```python
import torch
from tinyml_torchmodelopt.quantization import TINPUTinyMLQATFxModule

# Create and pretrain your model
model = MyNeuralNetwork()
model.load_state_dict(torch.load('pretrained.pth'))

# Wrap with TINPU quantization
model = TINPUTinyMLQATFxModule(model, total_epochs=epochs)

# Train the wrapped model (your usual training loop)
model.train()
for e in range(epochs):
    for images, target in train_loader:
        output = model(images)
        # loss, backward(), optimizer step as usual
model.eval()

# Convert to integer operations
model = model.convert()

# Export to ONNX
dummy_input = torch.rand((1, 1, 256, 1))
model.export(dummy_input, 'model_int8.onnx', input_names=['input'])
```
Generic QAT Example:
```python
import torch
from tinyml_torchmodelopt.quantization import GenericTinyMLQATFxModule

# Create and pretrain your model
model = MyNeuralNetwork()
model.load_state_dict(torch.load('pretrained.pth'))

# Wrap with Generic quantization
model = GenericTinyMLQATFxModule(model, total_epochs=epochs)

# Train, convert, and export (same API as TINPU)
# ...
model = model.convert()
model.export(dummy_input, 'model_int8.onnx', input_names=['input'])
```
PTQ (Post-Training Quantization):
For PTQ, replace the QAT module with the PTQ variant. PTQ only requires a calibration pass (forward pass on representative data) instead of full retraining:
```python
import torch
from tinyml_torchmodelopt.quantization import TINPUTinyMLPTQFxModule

model = TINPUTinyMLPTQFxModule(model, total_epochs=1)

# Calibration pass (no backward, no optimizer)
model.eval()
with torch.no_grad():
    for images, _ in calibration_loader:
        model(images)

model = model.convert()
model.export(dummy_input, 'model_int8.onnx', input_names=['input'])
```
Evaluating Exported ONNX Models:
After exporting, you can evaluate the quantized ONNX model using ONNX Runtime:
```python
import onnxruntime as ort

ort_session_options = ort.SessionOptions()
ort_session_options.graph_optimization_level = (
    ort.GraphOptimizationLevel.ORT_ENABLE_EXTENDED
)
ort_session = ort.InferenceSession('model_int8.onnx', ort_session_options)
prediction = ort_session.run(None, {'input': example_input.numpy()})
```
8.2.17. Wrapper API Reference
All quantization wrappers inherit from TinyMLQuantFxBaseModule, which
accepts the following parameters:
| Parameter | Type | Description |
|---|---|---|
| `model` | nn.Module | The PyTorch model to quantize |
| | dict/None | QConfig mapping for quantization |
| | Tensor | Example input with batch size 1 |
| | bool | Toggle between QAT (True) and PTQ (False) |
| | str | Quantization backend |
| `total_epochs` | int | Total number of quantized training epochs |
| `num_batch_norm_update_epochs` | bool/int | BatchNorm freezing control (see below) |
| `num_observer_update_epochs` | bool/int | Observer freezing control (see below) |
| | float | Bias calibration factor (0.0 = disabled) |
| | bool | Enable verbose logging |
| | bool | Use float bias for Conv/Linear layers. Increases accuracy but disables BNORM on TINPU hardware. |
BatchNorm Freezing (``num_batch_norm_update_epochs``):

- `None` (default): Freezes BatchNorm statistics at the midpoint of training
- `False`: Never freezes BatchNorm (may cause overfitting)
- Integer value: Freezes after the specified epoch. Best results with half to 3/4 of total epochs.
Observer Freezing (``num_observer_update_epochs``):

- `False` (default): Observers remain active throughout training
- Integer value: Freezes observers after the specified epoch
Tip
For best QAT results, set num_batch_norm_update_epochs to
approximately half of total_epochs. This allows the model to
learn quantization-aware representations before freezing statistics.
8.2.18. Model Surgery
The tinyml-modeloptimization package includes model surgery utilities
that use torch.fx to replace unsupported modules with efficient
alternatives. This is useful for adapting existing models to meet NPU
constraints.
Basic Usage:
```python
from tinyml_torchmodelopt.surgery import convert_to_lite_fx

# Replace unsupported layers with default replacements
model = convert_to_lite_fx(model)
```
Custom Replacements:
You can define custom replacement rules:
```python
import copy
import torch
from tinyml_torchmodelopt.surgery import (
    convert_to_lite_fx, get_replacement_dict_default
)

# Get and modify the default replacement dictionary
replacement_dict = copy.deepcopy(get_replacement_dict_default())
replacement_dict.update({torch.nn.GELU: torch.nn.ReLU})

# Apply with custom replacements
model = convert_to_lite_fx(model, replacement_dict=replacement_dict)
```
The replacement value can also be a function for complex transformations:
```python
replacement_dict.update({'my_layer': my_replacement_function})
model = convert_to_lite_fx(model, replacement_dict=replacement_dict)
```
Model surgery is applied automatically during the quantization pipeline when needed. Direct usage is only necessary for custom workflows.
8.2.19. Next Steps
Explore Neural Architecture Search for model optimization
Learn about Feature Extraction for input preparation
Deploy quantized models: NPU Device Deployment