8.2. Quantization
Quantization reduces model precision from 32-bit floating point to lower bit widths (8-bit, 4-bit, or 2-bit integers), dramatically reducing model size and improving inference speed.
8.2.1. Overview
Why quantize?
Smaller models: 4x size reduction (float32 → int8); see the sketch below
Faster inference: Integer arithmetic is faster than floating point on embedded CPUs and the NPU
NPU requirement: TI’s NPU requires quantized models
Lower power: Reduced memory bandwidth and simpler arithmetic
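
To make the size numbers concrete, here is a minimal, generic sketch (NumPy, not Tensorlab code) of affine int8 quantization: a float32 tensor is mapped onto 8-bit integers with a scale and zero-point, giving the 4x reduction noted above.

import numpy as np

def quantize_int8(x):
    # Affine (asymmetric) quantization: map the float range onto [-128, 127].
    qmin, qmax = -128, 127
    scale = (x.max() - x.min()) / (qmax - qmin)      # float value of one integer step
    zero_point = int(round(qmin - x.min() / scale))  # integer that represents 0.0
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.randn(4000).astype(np.float32)         # e.g. a 4k-parameter model
q, s, z = quantize_int8(w)
print(w.nbytes, "->", q.nbytes)                      # 16000 -> 4000 bytes (4x smaller)
print("max abs error:", np.abs(w - dequantize(q, s, z)).max())
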
8.2.2. Configuration Parameters
Quantization in Tiny ML Tensorlab is controlled by four parameters in the
training section of the config YAML:
| Option | Values | Description |
|---|---|---|
| quantization | 0, 1, 2 | Quantization mode: 0 = floating point training, 1 = standard PyTorch quantization, 2 = TI NPU-optimised quantization. |
| quantization_method | 'QAT', 'PTQ' | Quantization method. Only applicable when quantization is 1 or 2. |
| quantization_weight_bitwidth | 8, 4, 2 | Bit width for weight quantization. Only applicable when quantization is 1 or 2. |
| quantization_activation_bitwidth | 8, 4, 2 | Bit width for activation quantization. Only applicable when quantization is 1 or 2. |
Note
quantization_method, quantization_weight_bitwidth, and
quantization_activation_bitwidth are only used when quantization
is set to 1 or 2. When quantization is 0 (floating point
training), these parameters have no effect.
8.2.3. Quantization Modes
Floating Point Training (quantization: 0)
Standard float32 training with no quantization applied:
training:
  model_name: 'CLS_4k_NPU'
  quantization: 0
Standard PyTorch Quantization (quantization: 1)
Uses standard PyTorch quantization APIs (GenericTinyMLQATFxModule /
GenericTinyMLPTQFxModule). Suitable for general-purpose CPU deployment:
training:
  model_name: 'CLS_4k'
  quantization: 1
  quantization_method: 'QAT'
  quantization_weight_bitwidth: 8
  quantization_activation_bitwidth: 8
For more details on the underlying wrappers, see the tinyml-modeloptimization documentation.
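
For orientation, the sketch below shows the plain PyTorch FX QAT flow that such wrapper modules typically build on; it is not the Tensorlab wrapper API itself, and the model, backend string, and input shape are placeholders.

import torch
from torch.ao.quantization import get_default_qat_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_qat_fx, convert_fx

# Placeholder float model and example input.
model = torch.nn.Sequential(
    torch.nn.Linear(256, 32), torch.nn.ReLU(), torch.nn.Linear(32, 2))
example_inputs = (torch.randn(1, 256),)

# Insert fake-quant observers, fine-tune, then convert to a real int8 module.
qconfig_mapping = get_default_qat_qconfig_mapping("qnnpack")
prepared = prepare_qat_fx(model.train(), qconfig_mapping, example_inputs)
# ... run the usual training loop on `prepared` (quantization is simulated) ...
quantized = convert_fx(prepared.eval())
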
TI Style Optimised Quantization (quantization: 2)
Uses TI’s NPU-optimised quantization (TINPUTinyMLQATFxModule /
TINPUTinyMLPTQFxModule). This incorporates the constraints of the TI NPU
hardware accelerator and is required for NPU deployment:
training:
  model_name: 'CLS_4k_NPU'
  quantization: 2
  quantization_method: 'QAT'
  quantization_weight_bitwidth: 8
  quantization_activation_bitwidth: 8
8.2.4. Quantization Methods
Post-Training Quantization (PTQ)
Quantizes a trained float model after training:
training:
  model_name: 'CLS_4k_NPU'
  quantization: 2
  quantization_method: 'PTQ'
  quantization_weight_bitwidth: 8
  quantization_activation_bitwidth: 8
Pros: Fast, simple, no retraining required
Cons: May lose accuracy for some models
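
Conceptually, PTQ only needs a short calibration pass to pick quantization ranges, which is why it is fast. A generic sketch of per-tensor range calibration (not Tensorlab internals; the layer and data are stand-ins for a trained model and a representative dataset):

import numpy as np

# Stand-in layer and calibration data.
rng = np.random.default_rng(0)
W = rng.standard_normal((256, 32)).astype(np.float32)
calibration_batches = [rng.standard_normal((16, 256)).astype(np.float32) for _ in range(8)]

# Observe the activation range over the calibration set...
lo, hi = np.inf, -np.inf
for x in calibration_batches:
    act = np.maximum(x @ W, 0.0)                 # one ReLU layer's output
    lo, hi = min(lo, float(act.min())), max(hi, float(act.max()))

# ...then freeze an int8 scale/zero-point from it (asymmetric, per-tensor).
scale = (hi - lo) / 255.0
zero_point = int(round(-128 - lo / scale))
print(f"scale={scale:.4f}, zero_point={zero_point}")
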
Quantization-Aware Training (QAT)
Simulates quantization during training for better accuracy retention:
training:
  model_name: 'CLS_4k_NPU'
  quantization: 2
  quantization_method: 'QAT'
  quantization_weight_bitwidth: 8
  quantization_activation_bitwidth: 8
Pros: Better accuracy retention
Cons: Longer training time
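
QAT gains its accuracy advantage by inserting "fake quantization" into the forward pass so training can adapt to the rounding error. A minimal generic sketch of fake quantization with a straight-through estimator (not the Tensorlab implementation; scale and zero-point are fixed here for brevity):

import torch

def fake_quant(x, scale, zero_point, qmin=-128, qmax=127):
    # Quantize-dequantize in the forward pass; the straight-through estimator
    # lets gradients flow through the rounding as if it were the identity.
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    x_q = (q - zero_point) * scale
    return x + (x_q - x).detach()

w = torch.randn(32, 256, requires_grad=True)
loss = fake_quant(w, scale=0.02, zero_point=0).square().sum()
loss.backward()            # training sees quantized values, gradients still flow
print(w.grad.abs().mean())
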
8.2.5. Bit Widths
8-bit Quantization
Most common choice, good accuracy retention:
training:
  quantization: 2
  quantization_method: 'QAT'
  quantization_weight_bitwidth: 8
  quantization_activation_bitwidth: 8
Model size: 4x smaller than float32
Accuracy loss: Usually <1%
4-bit Quantization
Aggressive compression for size-constrained devices:
training:
  quantization: 2
  quantization_method: 'QAT'
  quantization_weight_bitwidth: 4
  quantization_activation_bitwidth: 4
Model size: 8x smaller than float32
Accuracy loss: 1-5% typical
2-bit Quantization
Maximum compression, limited use cases:
training:
  quantization: 2
  quantization_method: 'QAT'
  quantization_weight_bitwidth: 2
  quantization_activation_bitwidth: 2
Model size: 16x smaller than float32
Accuracy loss: Can be significant
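
The growing accuracy loss at lower bit widths comes from the coarser integer grid. A quick generic experiment (random weights as a stand-in) that round-trips the same tensor through 8-, 4-, and 2-bit symmetric quantization:

import numpy as np

def roundtrip_error(x, bits):
    # Symmetric quantization to a signed `bits`-bit grid, then back to float.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return np.abs(x - q * scale).mean()

w = np.random.default_rng(0).standard_normal(4000).astype(np.float32)
for bits in (8, 4, 2):
    print(f"{bits}-bit mean abs error: {roundtrip_error(w, bits):.4f}")
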
Note
Weight and activation bit widths can be set independently. For example, you can use 8-bit activations with 4-bit weights:
training:
  quantization: 2
  quantization_method: 'QAT'
  quantization_weight_bitwidth: 4
  quantization_activation_bitwidth: 8
8.2.6. NPU Quantization Requirements
TI’s NPU requires TI style optimised quantization (quantization: 2):
common:
  target_device: 'F28P55'
training:
  model_name: 'CLS_4k_NPU'
  quantization: 2
  quantization_method: 'QAT'
  quantization_weight_bitwidth: 8
  quantization_activation_bitwidth: 8
NPU Constraints:
Must use quantization: 2 (TI style optimised)
INT8 or INT4 bit widths recommended
Symmetric quantization preferred
Per-channel quantization for weights
Per-tensor quantization for activations (illustrated in the sketch below)
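
The last two constraints can be pictured as one scale per output channel for weights versus a single scale for a whole activation tensor. A generic int8 sketch of that scheme (illustrative only, not the NPU tooling itself):

import torch

w = torch.randn(16, 32)                          # [out_channels, in_features]
w_scales = w.abs().amax(dim=1) / 127.0           # per-channel weight scales
w_q = torch.clamp(torch.round(w / w_scales[:, None]), -127, 127).to(torch.int8)

a = torch.relu(torch.randn(8, 32))               # activations from one batch
a_scale = a.abs().max() / 127.0                  # one scale for the whole tensor
a_q = torch.clamp(torch.round(a / a_scale), -127, 127).to(torch.int8)
print(w_q.shape, w_scales.shape, a_q.shape)
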
8.2.7. Output Files
After quantization, you’ll find:
.../training/
├── base/
│   └── best_model.pt             # Float32 model
└── quantization/
    ├── best_model.onnx           # Quantized ONNX
    └── quantization_config.yaml
8.2.8. Accuracy Comparison
Typical accuracy retention by bit width:
| Precision | Size Reduction | Speed Improvement | Accuracy Drop |
|---|---|---|---|
| Float32 | 1x (baseline) | 1x (baseline) | 0% |
| INT8 | 4x | 2-4x | <1% |
| INT4 | 8x | 3-6x | 1-5% |
| INT2 | 16x | 4-8x | 5-15% |
Note: Results vary by model and task.
8.2.9. Troubleshooting Accuracy Loss
If quantization hurts accuracy:
1. Try QAT instead of PTQ:
training:
  quantization: 2
  quantization_method: 'QAT'
  quantization_weight_bitwidth: 8
  quantization_activation_bitwidth: 8
2. Use higher bit widths:
If using 4-bit or 2-bit, try 8-bit first:
training:
  quantization_weight_bitwidth: 8
  quantization_activation_bitwidth: 8
3. Keep activations at higher precision:
Use higher activation bit width with lower weight bit width:
training:
  quantization_weight_bitwidth: 4
  quantization_activation_bitwidth: 8
4. Increase model size:
A larger model may tolerate quantization better.
8.2.10. Best Practices
Start with INT8: Best balance of compression and accuracy
Use QAT for critical applications: When accuracy is paramount
Use TI optimised quantization for NPU: Set quantization: 2 for NPU targets
Compare float vs quantized: Always measure accuracy drop (see the sketch below)
Test on target device: Verify behavior matches simulation
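
To compare float and quantized accuracy on the host, one option is to score both artifacts from the output layout in 8.2.7 on the same test data. A hedged sketch using PyTorch and onnxruntime; the paths, test_loader, and the assumption that best_model.pt stores a full nn.Module are placeholders:

import onnxruntime as ort
import torch

# Placeholder paths following the layout in section 8.2.7.
float_model = torch.load(".../training/base/best_model.pt", map_location="cpu")
float_model.eval()                          # assumes the .pt holds a full nn.Module
sess = ort.InferenceSession(".../training/quantization/best_model.onnx")
input_name = sess.get_inputs()[0].name

def accuracy(predict, loader):
    correct = total = 0
    for x, y in loader:                     # `loader` yields (features, labels) batches
        pred = predict(x).argmax(axis=-1)
        correct += int((pred == y.numpy()).sum())
        total += len(y)
    return correct / total

float_acc = accuracy(lambda x: float_model(x).detach().numpy(), test_loader)
quant_acc = accuracy(lambda x: sess.run(None, {input_name: x.numpy()})[0], test_loader)
print(f"float: {float_acc:.3f}  quantized: {quant_acc:.3f}  drop: {float_acc - quant_acc:.3f}")
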
8.2.11. Example: Full Quantization Workflow
common:
  task_type: 'generic_timeseries_classification'
  target_device: 'F28P55'
dataset:
  dataset_name: 'dc_arc_fault_example_dsk'
data_processing_feature_extraction:
  feature_extraction_name: 'FFT1024Input_256Feature_1Frame_Full_Bandwidth'
  variables: 1
training:
  model_name: 'ArcFault_model_400_t'
  training_epochs: 30
  batch_size: 256
  quantization: 2
  quantization_method: 'QAT'
  quantization_weight_bitwidth: 8
  quantization_activation_bitwidth: 8
testing:
  enable: True
compilation:
  enable: True
  preset_name: 'compress_npu_layer_data'
Expected Results:
Float32 Model:
Accuracy: 99.2%
Size: 1.6 KB
INT8 Quantized Model:
Accuracy: 99.0%
Size: 0.4 KB
Speedup: 3.5x on NPU
8.2.12. Memory Savings
Quantization reduces memory at multiple levels:
Model Weights:
Float32: 4 bytes per parameter
INT8: 1 byte per parameter
INT4: 0.5 bytes per parameter
Example: 4000 parameter model
Float32: 16 KB
INT8: 4 KB
INT4: 2 KB
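
These weight sizes follow directly from parameter count × bits per parameter; a quick check (weights only, excluding activations and any framework overhead):

def weight_kb(num_params, bits):
    # Bytes for the weights alone: one `bits`-bit value per parameter.
    return num_params * bits / 8 / 1000

for bits in (32, 8, 4):
    print(f"{bits:>2}-bit: {weight_kb(4000, bits):.0f} KB")   # 16, 4, 2 KB
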
Activations:
Intermediate computations also benefit from reduced precision.
Total Memory:
For memory-constrained devices, quantization may be the difference between fitting and not fitting.
8.2.13. Performance Impact
NPU Performance:
Model: CLS_4k_NPU on F28P55
Float32 (CPU): ~5000 µs
INT8 (NPU): ~300 µs
INT4 (NPU): ~200 µs
CPU Performance:
Even without NPU, integer operations are faster:
Model: CLS_4k on F28P65 (no NPU)
Float32: ~5000 µs
INT8: ~2000 µs
8.2.14. Next Steps
Explore Neural Architecture Search for model optimization
Learn about Feature Extraction for input preparation
Deploy quantized models: NPU Device Deployment