8.6. Post-Training Analysis
After training, Tiny ML Tensorlab provides comprehensive analysis tools to evaluate model performance and understand its behavior.
8.6.1. Overview
Post-training analysis helps you:
Evaluate model accuracy and error patterns
Understand which classes are confused
Select optimal operating thresholds
Verify quantization impact
Generate reports for stakeholders
8.6.2. Enabling Analysis
Post-training analysis is generated automatically when testing is enabled:
testing:
    enable: True
8.6.3. Output Files
After testing, you’ll find analysis outputs:
.../testing/
├── confusion_matrix_test.png                    # Confusion matrix
├── One_vs_Rest_MultiClass_ROC_test.png          # ROC curves
├── Histogram_Class_Score_differences_test.png   # Score distributions
├── fpr_tpr_thresholds.csv                       # Threshold analysis
├── classification_report.txt                    # Per-class metrics
├── error_samples/                               # Misclassified examples
│   ├── error_001.csv
│   └── ...
└── test_results.json                            # Summary statistics
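For programmatic access to the summary statistics, test_results.json can be loaded directly. A minimal sketch (the exact keys depend on your task type and configuration):

import json

# Load the summary statistics written during testing
with open('path/to/testing/test_results.json') as f:
    results = json.load(f)

# Print whatever metrics the run produced
for key, value in results.items():
    print(f'{key}: {value}')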
8.6.4. Confusion Matrix
Shows classification results in matrix form:
                Predicted
                A    B    C
Actual   A     95    3    2
         B      1   97    2
         C      2    1   97
Interpreting:
Diagonal = correct predictions
Off-diagonal = misclassifications
Rows sum to actual class counts
Columns show predicted distribution
Good matrix:
Strong diagonal (high values)
Weak off-diagonal (low values)
Problem indicators:
High off-diagonal values = specific class confusion
Asymmetric confusion = direction-specific errors
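As a worked example, per-class recall and precision can be read directly off the matrix above: recall for class A is 95/(95+3+2) = 0.95 (row-wise), while precision for class A is 95/(95+1+2) ≈ 0.97 (column-wise). A short sketch that computes both from any confusion matrix:

import numpy as np

# Confusion matrix from the example above (rows = actual, columns = predicted)
cm = np.array([[95,  3,  2],
               [ 1, 97,  2],
               [ 2,  1, 97]])

recall = np.diag(cm) / cm.sum(axis=1)      # correct / actual class count
precision = np.diag(cm) / cm.sum(axis=0)   # correct / predicted class count

for name, r, p in zip(['A', 'B', 'C'], recall, precision):
    print(f'Class {name}: recall={r:.2f}, precision={p:.2f}')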
8.6.5. ROC Curves
The Receiver Operating Characteristic (ROC) curve shows the trade-off between:
True Positive Rate (sensitivity)
False Positive Rate (1 - specificity)
Example ROC Curves:
One-vs-Rest Multi-class ROC curves for arc fault detection
The ROC curve shows the trade-off between sensitivity and specificity at different thresholds.
TPR (Sensitivity)
1.0 |        ******
    |      **
    |    **
0.5 |   **
    |  **
0.0 +*--------------  FPR
    0.0    0.5    1.0
Key Metrics:
AUC (Area Under Curve): 1.0 = perfect, 0.5 = random
Operating Point: Where you set the threshold
Multi-Class ROC:
For multi-class problems, one-vs-rest ROC shows each class:
Class A: AUC = 0.98
Class B: AUC = 0.95
Class C: AUC = 0.99
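If you need to reproduce per-class AUC values from raw scores outside the toolchain, a one-vs-rest computation with scikit-learn looks roughly like this (y_true and y_score are placeholder arrays; here they are filled with random data so the sketch runs standalone):

import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

# Placeholder data: y_true (N,) integer labels, y_score (N, n_classes) probabilities
rng = np.random.default_rng(0)
y_true = rng.integers(0, 3, size=200)
y_score = rng.random((200, 3))
y_score /= y_score.sum(axis=1, keepdims=True)

# One-vs-rest: binarize the labels and score each class separately
y_bin = label_binarize(y_true, classes=[0, 1, 2])
for cls in range(3):
    auc = roc_auc_score(y_bin[:, cls], y_score[:, cls])
    print(f'Class {cls}: AUC = {auc:.2f}')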
8.6.6. Class Score Histograms
Shows distribution of model confidence for each class:
Example Class Score Histogram:
Distribution of class score differences showing model confidence
Correct predictions: [=====|=====] centered at high score
Wrong predictions: [==|==] centered at low score
Interpretation:
Well-separated histograms: Model is confident and correct
Overlapping histograms: Model is uncertain
Wrong predictions at high scores: Confident mistakes (investigate)
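One common way to produce a comparable plot from raw predictions is shown below, assuming the "score difference" is the gap between the top score and the runner-up score (the exact definition used by the toolkit may differ):

import numpy as np
import matplotlib.pyplot as plt

# Placeholder arrays: softmax scores (N, n_classes) and integer labels (N,)
rng = np.random.default_rng(0)
scores = rng.random((500, 3))
scores /= scores.sum(axis=1, keepdims=True)
labels = rng.integers(0, 3, size=500)

top2 = np.sort(scores, axis=1)[:, -2:]   # runner-up and top score
diff = top2[:, 1] - top2[:, 0]           # confidence margin
correct = scores.argmax(axis=1) == labels

plt.hist(diff[correct], bins=30, alpha=0.6, label='Correct predictions')
plt.hist(diff[~correct], bins=30, alpha=0.6, label='Wrong predictions')
plt.xlabel('Class score difference')
plt.ylabel('Count')
plt.legend()
plt.savefig('score_difference_histogram.png')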
8.6.7. FPR/TPR Thresholds
CSV file for threshold selection:
threshold,tpr,fpr,precision,recall,f1
0.1,0.99,0.15,0.87,0.99,0.93
0.3,0.97,0.08,0.92,0.97,0.94
0.5,0.95,0.03,0.97,0.95,0.96
0.7,0.90,0.01,0.99,0.90,0.94
0.9,0.80,0.00,1.00,0.80,0.89
Using this data:
Choose your priority (minimize FPR or maximize TPR)
Find the threshold that meets your requirement
Use that threshold in deployment code
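For example, to pick the threshold that keeps the false positive rate under a target value while retaining the highest TPR, the CSV can be filtered with pandas (column names taken from the example above):

import pandas as pd

# Load the threshold analysis produced during testing
df = pd.read_csv('path/to/testing/fpr_tpr_thresholds.csv')

# Requirement: false positive rate must stay below 5%
candidates = df[df['fpr'] <= 0.05]

# Among acceptable rows, keep the one with the highest TPR
best = candidates.sort_values('tpr', ascending=False).iloc[0]
print(f"Selected threshold: {best['threshold']} "
      f"(TPR={best['tpr']}, FPR={best['fpr']}, F1={best['f1']})")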
8.6.8. Classification Report
Per-class performance metrics:
Class           Precision    Recall    F1-Score    Support
Normal             0.98       0.96       0.97        500
Fault_A            0.95       0.97       0.96        480
Fault_B            0.97       0.95       0.96        520

Accuracy:                                0.96
Macro Avg:         0.97       0.96       0.96       1500
Weighted Avg:      0.96       0.96       0.96       1500
Metrics explained:
Precision: Of predicted positives, how many are correct?
Recall: Of actual positives, how many were detected?
F1-Score: Harmonic mean of precision and recall
Support: Number of samples per class
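The same per-class metrics can be reproduced from predictions with scikit-learn when you need them inside a custom script (the labels and class names below are illustrative):

from sklearn.metrics import classification_report

# Placeholder ground-truth and predicted label lists
y_true = [0, 0, 1, 1, 2, 2, 2, 1]
y_pred = [0, 0, 1, 2, 2, 2, 2, 1]

print(classification_report(
    y_true, y_pred, target_names=['Normal', 'Fault_A', 'Fault_B']))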
8.6.9. Error Analysis
Detailed examination of misclassified samples:
testing:
    error_analysis:
        save_errors: True
        max_errors_per_class: 20
Error sample files:
Each saved error includes:
Original input data
True label
Predicted label
Model confidence scores
Using error analysis:
Identify patterns in errors
Check for labeling mistakes
Find data collection issues
Improve feature extraction
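A quick way to review the saved error samples is to loop over the CSV files with pandas; this is only a sketch and assumes nothing about the column layout beyond what is listed above:

import glob
import pandas as pd

# Walk through every saved misclassification and print a short summary
for path in sorted(glob.glob('path/to/testing/error_samples/*.csv')):
    sample = pd.read_csv(path)
    print(path)
    print(sample.head())   # inspect input data, labels, and confidence scores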
8.6.10. Quantized vs Float Comparison
Compare quantized model to float baseline:
testing:
    enable: True
    test_float: True
    test_quantized: True
    compare_results: True
Output:
Float32 Model:
    Accuracy: 99.2%
    F1-Score: 0.992

INT8 Quantized Model:
    Accuracy: 98.8%
    F1-Score: 0.988

Degradation: 0.4%
8.6.11. File-Level Classification Summary
The File-Level Classification Summary provides an overview of how samples from each input file are classified into different classes. It helps users quickly identify if any particular file contains misclassified samples.
While the confusion matrix shows overall counts of correct and incorrect classifications, it does not reveal which specific files contain those misclassified samples. For example, even if the total misclassification count is small, it might come entirely from one problematic file. This feature helps pinpoint such cases instantly.
Output Location:
The summary is written to file_level_classification_summary.log inside the
training/base/ directory of your project run:
.../data/projects/{dataset_name}/run/{date-time}/{model_name}/training/base/
└── file_level_classification_summary.log
The log file contains tables for float train, quantized train, and test data, depending on which stages are enabled in the configuration. Each table shows each file, its true class, and the count of samples from that file classified into each class.
Example: Fan Blade Fault Classification
Consider a fan blade fault classification dataset with four classes: Normal, BladeDamage, BladeImbalance, and BladeObstruction. The confusion matrix for float train best epoch might look like this:
| Ground Truth     | BladeDamage | BladeImbalance | BladeObstruction | Normal |
|------------------|-------------|----------------|------------------|--------|
| BladeDamage      | 1159        | 339            | 0                | 0      |
| BladeImbalance   | 0           | 1301           | 0                | 0      |
| BladeObstruction | 0           | 0              | 962              | 0      |
| Normal           | 0           | 0              | 0                | 2114   |
From this confusion matrix, we can see that while all classes other than BladeDamage are correctly classified, some BladeDamage samples are incorrectly classified as BladeImbalance. However, from the confusion matrix alone, we cannot determine which specific files contain these misclassified samples.
When we inspect the File-Level Classification Summary of FloatTrain, we discover that in file numbers 0, 1, 2, 20, and 21, all samples were classified as BladeImbalance even though their true class is BladeDamage. Similarly, in the test data, file numbers 7 and 8 have all samples misclassified.
Tip
A higher count of samples in the wrong class column for a specific file indicates potential data or labeling issues in that file.
Use Cases:
Identifying labeling issues: Files where all samples are misclassified may have been assigned the wrong label during data collection.
Data quality assessment: Pinpoint specific recordings or data files that contain noisy, corrupted, or otherwise problematic data.
Targeted investigation: Rather than reviewing the entire dataset, focus review efforts on the specific files flagged by this summary.
8.6.12. Regression Analysis
For regression tasks, post-training analysis is generated automatically when testing is enabled. The following metrics are reported:
Example output:
Mean Squared Error (MSE): 0.023
Mean Absolute Error (MAE): 0.12
R² Score: 0.95
Max Error: 0.45
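These metrics can also be computed manually from predictions with scikit-learn, for instance when comparing runs in a notebook (the arrays below are placeholders):

import numpy as np
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             r2_score, max_error)

# Placeholder ground-truth targets and model predictions
y_true = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y_pred = np.array([0.1, 0.45, 1.1, 1.4, 2.05])

print('MSE:', mean_squared_error(y_true, y_pred))
print('MAE:', mean_absolute_error(y_true, y_pred))
print('R2 :', r2_score(y_true, y_pred))
print('Max error:', max_error(y_true, y_pred))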
8.6.13. Anomaly Detection Analysis
For anomaly detection tasks, post-training analysis is generated automatically for all runs. No additional configuration options are needed beyond enabling testing:
testing:
    enable: True
Output Files:
The following analysis outputs are generated in the post_training_analysis/ folder:
reconstruction_error_histogram.png – Histogram showing the distribution of reconstruction errors for normal and anomaly test data. Good separation between the two distributions indicates the model can distinguish anomalies effectively.
threshold_performance.csv – Contains detection metrics (accuracy, precision, recall, F1 score, false positive rate) for each k value from 0 to 4.5. Use this file to select the optimal threshold for your deployment.
Example training log output:
Reconstruction Error Statistics:
Normal training data - Mean: 1.662490, Std: 1.968127
Anomaly test data - Mean: 141.985321, Std: 112.756683
Normal test data - Mean: 2.849831, Std: 1.343052
Threshold for K = 4.5: 10.519060
False positive rate: 0.00%
Anomaly detection rate (recall): 100.00%
Key indicators:
A large gap between normal mean error and anomaly mean error indicates good detection capability.
Check reconstruction_error_histogram.png for visual confirmation of distribution separation.
Use threshold_performance.csv to find the k value that gives the best trade-off between precision and recall for your application.
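The reported threshold is consistent with a mean-plus-k-standard-deviations rule applied to the normal training reconstruction error (1.662490 + 4.5 × 1.968127 ≈ 10.519). Under that assumption, a minimal sketch of sweeping k values looks like this (the error array is a placeholder):

import numpy as np

# Reconstruction errors on normal training data (placeholder array)
normal_errors = np.random.default_rng(0).gamma(2.0, 1.0, size=1000)

mean, std = normal_errors.mean(), normal_errors.std()
for k in np.arange(0, 5.0, 0.5):
    threshold = mean + k * std   # samples with larger error are flagged as anomalies
    print(f'k={k:.1f}: threshold={threshold:.4f}')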
8.6.14. Custom Analysis Scripts
For advanced analysis, use the saved model and data:
import numpy as np
import torch

# Load the trained model (saved as a full model object)
model = torch.load('path/to/best_model.pt')
model.eval()

# Load test data
test_data = np.load('path/to/test_data.npy')
test_labels = np.load('path/to/test_labels.npy')

# Run inference
with torch.no_grad():
    outputs = model(torch.tensor(test_data, dtype=torch.float32))
    predictions = outputs.argmax(dim=1).numpy()

# Custom analysis
# ... your analysis code
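As one example of the custom analysis step, the predictions from the script above could be fed into scikit-learn to rebuild the confusion matrix and overall accuracy:

from sklearn.metrics import accuracy_score, confusion_matrix

# predictions and test_labels come from the script above
print('Accuracy:', accuracy_score(test_labels, predictions))
print('Confusion matrix:')
print(confusion_matrix(test_labels, predictions))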
8.6.15. Generating Reports
For documentation or stakeholder communication:
testing:
    enable: True
    generate_report: True
    report_format: 'pdf'  # or 'html', 'markdown'
Report includes:
Model summary (architecture, parameters)
Training curves
Test metrics
Confusion matrix
ROC curves
Recommendations
8.6.16. Example: Complete Analysis Configuration
common:
    task_type: 'generic_timeseries_classification'
    target_device: 'F28P55'
dataset:
    dataset_name: 'dc_arc_fault_example_dsk'
data_processing_feature_extraction:
    feature_extraction_name: 'FFT1024Input_256Feature_1Frame_Full_Bandwidth'
    variables: 1
training:
    model_name: 'CLS_4k_NPU'
    training_epochs: 30
    quantization: 2
    quantization_method: 'QAT'
    quantization_weight_bitwidth: 8
    quantization_activation_bitwidth: 8
testing:
    enable: True
    test_float: True
    test_quantized: True
    analysis:
        confusion_matrix: True
        roc_curve: True
        class_histograms: True
        error_analysis: True
        save_errors: True
        max_errors_per_class: 10
    compare_results: True
8.6.17. Best Practices
Always review confusion matrix: Understand error patterns
Check ROC curves: Ensure good class separation
Analyze errors: Learn from misclassifications
Compare quantized vs float: Verify the accuracy drop is acceptable
Document findings: Record analysis for future reference
8.6.18. Troubleshooting Low Accuracy
If overall accuracy is low:
Check GoF test results (dataset quality)
Try larger model
Increase training epochs
Improve feature extraction
If specific classes have low accuracy:
Check class balance
Investigate error samples
May need more data for those classes
Classes might be inherently similar
If quantized accuracy drops significantly:
Try QAT instead of PTQ
Use more calibration data
Keep sensitive layers at higher precision
Use larger model (more robust to quantization)
8.6.19. Next Steps
Deploy model: CCS Integration Guide
Optimize further: Neural Architecture Search
Review Common Errors if issues arise