8.6. Post-Training Analysis

After training, Tiny ML Tensorlab provides comprehensive analysis tools to evaluate model performance and understand its behavior.

8.6.1. Overview

Post-training analysis helps you:

  • Evaluate model accuracy and error patterns

  • Understand which classes are confused

  • Select optimal operating thresholds

  • Verify quantization impact

  • Generate reports for stakeholders

8.6.2. Enabling Analysis

Post-training analysis is generated automatically when testing is enabled:

testing:
  enable: True

8.6.3. Output Files

After testing, you’ll find analysis outputs:

.../testing/
├── confusion_matrix_test.png        # Confusion matrix
├── One_vs_Rest_MultiClass_ROC_test.png  # ROC curves
├── Histogram_Class_Score_differences_test.png  # Score distributions
├── fpr_tpr_thresholds.csv           # Threshold analysis
├── classification_report.txt        # Per-class metrics
├── error_samples/                   # Misclassified examples
│   ├── error_001.csv
│   └── ...
└── test_results.json                # Summary statistics

8.6.4. Confusion Matrix

Shows classification results in matrix form:

                 Predicted
                 A    B    C
Actual    A    95    3    2
          B     1   97    2
          C     2    1   97

Interpreting:

  • Diagonal = correct predictions

  • Off-diagonal = misclassifications

  • Rows sum to actual class counts

  • Columns show predicted distribution

Good matrix:

  • Strong diagonal (high values)

  • Weak off-diagonal (low values)

Problem indicators:

  • High off-diagonal values = specific class confusion

  • Asymmetric confusion = direction-specific errors
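
If you want to recompute the confusion matrix yourself from exported labels and predictions, a minimal sketch using scikit-learn is shown below (the .npy paths are placeholders for your own arrays):

import numpy as np
from sklearn.metrics import confusion_matrix

# Placeholder paths: substitute your own exported arrays
y_true = np.load('path/to/test_labels.npy')
y_pred = np.load('path/to/test_predictions.npy')

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Row-normalized view: each row shows how that class was spread across predictions
print(np.round(cm / cm.sum(axis=1, keepdims=True), 2))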

8.6.5. ROC Curves

The Receiver Operating Characteristic (ROC) curve shows the trade-off between:

  • True Positive Rate (sensitivity)

  • False Positive Rate (1 - specificity)

Example: One-vs-Rest multi-class ROC curves for arc fault detection. The ROC curve shows the trade-off between sensitivity and specificity at different thresholds:

TPR (Sensitivity)
1.0 |        ******
    |      **
    |    **
0.5 |  **
    | **
0.0 +*-------------- FPR
    0.0    0.5    1.0

Key Metrics:

  • AUC (Area Under Curve): 1.0 = perfect, 0.5 = random

  • Operating Point: Where you set the threshold

Multi-Class ROC:

For multi-class problems, one-vs-rest ROC shows each class:

Class A: AUC = 0.98
Class B: AUC = 0.95
Class C: AUC = 0.99
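
To recompute one-vs-rest ROC curves and per-class AUC values from saved scores, a sketch with scikit-learn follows (the paths and array shapes are assumptions about your exported data):

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.preprocessing import label_binarize

# Assumed inputs: integer labels, shape (N,), and per-class scores, shape (N, num_classes)
y_true = np.load('path/to/test_labels.npy')
y_scores = np.load('path/to/test_scores.npy')

# Binarize labels for one-vs-rest analysis (three or more classes assumed)
classes = np.unique(y_true)
y_true_bin = label_binarize(y_true, classes=classes)

for i, cls in enumerate(classes):
    fpr, tpr, thresholds = roc_curve(y_true_bin[:, i], y_scores[:, i])
    auc = roc_auc_score(y_true_bin[:, i], y_scores[:, i])
    print(f"Class {cls}: AUC = {auc:.2f}")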

8.6.6. Class Score Histograms

Shows distribution of model confidence for each class:

Example class score histogram (distribution of class score differences showing model confidence):

Correct predictions: [=====|=====] centered at high score
Wrong predictions:   [==|==] centered at low score

Interpretation:

  • Well-separated histograms: Model is confident and correct

  • Overlapping histograms: Model is uncertain

  • Wrong predictions at high scores: Confident mistakes (investigate)
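
A sketch of reproducing this kind of histogram from per-class scores follows. Here the score difference is taken as top-1 minus top-2 score, which is one reasonable definition; adjust it to match the tool's actual output if it differs:

import numpy as np
import matplotlib.pyplot as plt

# Assumed inputs: true labels, shape (N,), and per-class scores, shape (N, num_classes)
y_true = np.load('path/to/test_labels.npy')
y_scores = np.load('path/to/test_scores.npy')

y_pred = y_scores.argmax(axis=1)
top2 = np.sort(y_scores, axis=1)[:, -2:]      # two highest scores per sample
score_diff = top2[:, 1] - top2[:, 0]          # top-1 minus top-2

correct = y_pred == y_true
plt.hist(score_diff[correct], bins=50, alpha=0.6, label='Correct predictions')
plt.hist(score_diff[~correct], bins=50, alpha=0.6, label='Wrong predictions')
plt.xlabel('Class score difference')
plt.ylabel('Count')
plt.legend()
plt.savefig('class_score_histogram.png')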

8.6.7. FPR/TPR Thresholds

CSV file for threshold selection:

threshold,tpr,fpr,precision,recall,f1
0.1,0.99,0.15,0.87,0.99,0.93
0.3,0.97,0.08,0.92,0.97,0.94
0.5,0.95,0.03,0.97,0.95,0.96
0.7,0.90,0.01,0.99,0.90,0.94
0.9,0.80,0.00,1.00,0.80,0.89

Using this data (a selection sketch follows these steps):

  1. Choose your priority (minimize FPR or maximize TPR)

  2. Find the threshold that meets your requirement

  3. Use that threshold in deployment code
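
For example, a minimal sketch of steps 1 and 2 that keeps the highest-TPR threshold whose false positive rate stays below a target (column names follow the CSV above; the path and the 5% requirement are placeholders):

import pandas as pd

# Placeholder path to the generated threshold analysis file
df = pd.read_csv('path/to/fpr_tpr_thresholds.csv')

# Example requirement: false positive rate must stay at or below 5%
candidates = df[df['fpr'] <= 0.05]

# Among the candidates, keep the threshold with the highest TPR
best = candidates.sort_values('tpr', ascending=False).iloc[0]
print(f"Selected threshold: {best['threshold']} (TPR={best['tpr']}, FPR={best['fpr']})")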

8.6.8. Classification Report

Per-class performance metrics:

Class          Precision  Recall  F1-Score  Support
Normal         0.98       0.96    0.97      500
Fault_A        0.95       0.97    0.96      480
Fault_B        0.97       0.95    0.96      520

Accuracy                          0.96
Macro Avg      0.97       0.96    0.96      1500
Weighted Avg   0.96       0.96    0.96      1500

Metrics explained:

  • Precision: Of predicted positives, how many are correct?

  • Recall: Of actual positives, how many were detected?

  • F1-Score: Harmonic mean of precision and recall

  • Support: Number of samples per class
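
A report in this format can be regenerated from saved labels and predictions with scikit-learn; the class names and paths below are placeholders:

import numpy as np
from sklearn.metrics import classification_report

y_true = np.load('path/to/test_labels.npy')
y_pred = np.load('path/to/test_predictions.npy')

# Class names must follow the label encoding order used during training
print(classification_report(y_true, y_pred,
                            target_names=['Normal', 'Fault_A', 'Fault_B']))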

8.6.9. Error Analysis

Detailed examination of misclassified samples:

testing:
  error_analysis:
    save_errors: True
    max_errors_per_class: 20

Error sample files:

Each saved error includes:

  • Original input data

  • True label

  • Predicted label

  • Model confidence scores

Using error analysis (see the sketch after this list):

  1. Identify patterns in errors

  2. Check for labeling mistakes

  3. Find data collection issues

  4. Improve feature extraction
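
To help with step 1, a sketch that tallies the saved error samples by true/predicted label pair is shown below; it assumes each error CSV carries true_label and predicted_label columns, which may differ from the actual file layout:

import glob
import pandas as pd

# Placeholder path and column names; adjust to the actual error CSV layout
files = glob.glob('path/to/error_samples/*.csv')
errors = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)

# Count how often each (true, predicted) pair occurs among the errors
summary = errors.groupby(['true_label', 'predicted_label']).size()
print(summary.sort_values(ascending=False))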

8.6.10. Quantized vs Float Comparison

Compare quantized model to float baseline:

testing:
  enable: True
  test_float: True
  test_quantized: True
  compare_results: True

Output:

Float32 Model:
Accuracy: 99.2%
F1-Score: 0.992

INT8 Quantized Model:
Accuracy: 98.8%
F1-Score: 0.988

Degradation: 0.4%
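
The same comparison can be done by hand once you have the two accuracy numbers (taken here from the example output above; the 1% tolerance is only an illustrative choice):

# Values from the example output above
float_accuracy = 0.992
quantized_accuracy = 0.988

degradation = float_accuracy - quantized_accuracy
print(f"Degradation: {degradation * 100:.1f}%")

# Flag anything beyond an illustrative 1% absolute drop for closer review
if degradation > 0.01:
    print("Quantization cost more than 1% accuracy - consider QAT or a larger model")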

8.6.11. File-Level Classification Summary

The File-Level Classification Summary provides an overview of how samples from each input file are classified into different classes. It helps users quickly identify if any particular file contains misclassified samples.

While the confusion matrix shows overall counts of correct and incorrect classifications, it does not reveal which specific files contain those misclassified samples. For example, even if the total misclassification count is small, it might come entirely from one problematic file. This feature helps pinpoint such cases instantly.

Output Location:

The summary is written to file_level_classification_summary.log inside the training/base/ directory of your project run:

.../data/projects/{dataset_name}/run/{date-time}/{model_name}/training/base/
└── file_level_classification_summary.log

The log file contains tables for float train, quantized train, and test data, depending on which stages are enabled in the configuration. Each table shows each file, its true class, and the count of samples from that file classified into each class.

Example: Fan Blade Fault Classification

Consider a fan blade fault classification dataset with four classes: Normal, BladeDamage, BladeImbalance, and BladeObstruction. The confusion matrix for float train best epoch might look like this:

Confusion Matrix (Float Train)

                                        Predicted
Ground Truth        BladeDamage  BladeImbalance  BladeObstruction  Normal
BladeDamage                1159             339                 0       0
BladeImbalance                0            1301                 0       0
BladeObstruction              0               0               962       0
Normal                        0               0                 0    2114

From this confusion matrix, we can see that while all classes other than BladeDamage are correctly classified, some BladeDamage samples are incorrectly classified as BladeImbalance. However, from the confusion matrix alone, we cannot determine which specific files contain these misclassified samples.

When we inspect the File-Level Classification Summary of FloatTrain, we discover that in file numbers 0, 1, 2, 20, and 21, all samples were classified as BladeImbalance even though their true class is BladeDamage. Similarly, in the test data, file numbers 7 and 8 have all samples misclassified.

Tip

A higher count of samples in the wrong class column for a specific file indicates potential data or labeling issues in that file.

Use Cases:

  • Identifying labeling issues: Files where all samples are misclassified may have been assigned the wrong label during data collection.

  • Data quality assessment: Pinpoint specific recordings or data files that contain noisy, corrupted, or otherwise problematic data.

  • Targeted investigation: Rather than reviewing the entire dataset, focus review efforts on the specific files flagged by this summary.
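
The log itself is generated automatically, but if you want to build a similar per-file breakdown for your own predictions, a sketch with pandas is shown below (the per-sample file identifiers and arrays are placeholders):

import numpy as np
import pandas as pd

# Placeholder arrays: per-sample source file id, true class, and predicted class
file_ids = np.load('path/to/test_file_ids.npy')
y_true = np.load('path/to/test_labels.npy')
y_pred = np.load('path/to/test_predictions.npy')

df = pd.DataFrame({'file': file_ids, 'true': y_true, 'predicted': y_pred})

# One row per (file, true class), one column per predicted class
summary = pd.crosstab([df['file'], df['true']], df['predicted'])
print(summary)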

8.6.12. Regression Analysis

For regression tasks, post-training analysis is generated automatically when testing is enabled, and standard regression metrics are reported.

Example output:

Mean Squared Error (MSE): 0.023
Mean Absolute Error (MAE): 0.12
R² Score: 0.95
Max Error: 0.45
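
These metrics can be recomputed from saved targets and predictions, for example with scikit-learn (paths are placeholders):

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score, max_error

y_true = np.load('path/to/test_targets.npy')
y_pred = np.load('path/to/test_predictions.npy')

print(f"Mean Squared Error (MSE): {mean_squared_error(y_true, y_pred):.3f}")
print(f"Mean Absolute Error (MAE): {mean_absolute_error(y_true, y_pred):.3f}")
print(f"R² Score: {r2_score(y_true, y_pred):.2f}")
print(f"Max Error: {max_error(y_true, y_pred):.2f}")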

8.6.13. Anomaly Detection Analysis

For anomaly detection tasks, post-training analysis is generated automatically for all runs. No additional configuration options are needed beyond enabling testing:

testing:
  enable: True

Output Files:

The following analysis outputs are generated in the post_training_analysis/ folder:

  • reconstruction_error_histogram.png – Histogram showing the distribution of reconstruction errors for normal and anomaly test data. Good separation between the two distributions indicates the model can distinguish anomalies effectively.

  • threshold_performance.csv – Contains detection metrics (accuracy, precision, recall, F1 score, false positive rate) for each k value from 0 to 4.5. Use this file to select the optimal threshold for your deployment.

Example training log output:

Reconstruction Error Statistics:
Normal training data - Mean: 1.662490, Std: 1.968127
Anomaly test data   - Mean: 141.985321, Std: 112.756683
Normal test data    - Mean: 2.849831, Std: 1.343052

Threshold for K = 4.5: 10.519060
False positive rate: 0.00%
Anomaly detection rate (recall): 100.00%

Key indicators:

  • A large gap between normal mean error and anomaly mean error indicates good detection capability.

  • Check the reconstruction_error_histogram.png for visual confirmation of distribution separation.

  • Use threshold_performance.csv to find the k value that gives the best trade-off between precision and recall for your application.
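
The threshold in the example log is consistent with the common rule threshold = mean + k × std computed on the normal training reconstruction errors (1.662490 + 4.5 × 1.968127 ≈ 10.519060). A sketch of evaluating such a threshold on your own error arrays (paths are placeholders):

import numpy as np

# Placeholder arrays of per-sample reconstruction errors
train_normal_err = np.load('path/to/train_normal_errors.npy')
test_normal_err = np.load('path/to/test_normal_errors.npy')
test_anomaly_err = np.load('path/to/test_anomaly_errors.npy')

k = 4.5
threshold = train_normal_err.mean() + k * train_normal_err.std()

false_positive_rate = (test_normal_err > threshold).mean()
detection_rate = (test_anomaly_err > threshold).mean()

print(f"Threshold for K = {k}: {threshold:.6f}")
print(f"False positive rate: {false_positive_rate * 100:.2f}%")
print(f"Anomaly detection rate (recall): {detection_rate * 100:.2f}%")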

8.6.14. Custom Analysis Scripts

For advanced analysis, use the saved model and data:

import torch
import numpy as np

# Load the trained model (saved as a full module) and switch to evaluation mode
model = torch.load('path/to/best_model.pt')
model.eval()

# Load test data and labels
test_data = np.load('path/to/test_data.npy')
test_labels = np.load('path/to/test_labels.npy')

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(torch.from_numpy(test_data).float())
    predictions = outputs.argmax(dim=1).numpy()

# Custom analysis
# ... your analysis code

8.6.15. Generating Reports

For documentation or stakeholder communication:

testing:
  enable: True
  generate_report: True
  report_format: 'pdf'  # or 'html', 'markdown'

Report includes:

  • Model summary (architecture, parameters)

  • Training curves

  • Test metrics

  • Confusion matrix

  • ROC curves

  • Recommendations

8.6.16. Example: Complete Analysis Configuration

common:
  task_type: 'generic_timeseries_classification'
  target_device: 'F28P55'

dataset:
  dataset_name: 'dc_arc_fault_example_dsk'

data_processing_feature_extraction:
  feature_extraction_name: 'FFT1024Input_256Feature_1Frame_Full_Bandwidth'
  variables: 1

training:
  model_name: 'CLS_4k_NPU'
  training_epochs: 30
  quantization: 2
  quantization_method: 'QAT'
  quantization_weight_bitwidth: 8
  quantization_activation_bitwidth: 8

testing:
  enable: True
  test_float: True
  test_quantized: True
  analysis:
    confusion_matrix: True
    roc_curve: True
    class_histograms: True
    error_analysis: True
    save_errors: True
    max_errors_per_class: 10
  compare_results: True

8.6.17. Best Practices

  1. Always review confusion matrix: Understand error patterns

  2. Check ROC curves: Ensure good class separation

  3. Analyze errors: Learn from misclassifications

  4. Compare quantized: Verify acceptable accuracy drop

  5. Document findings: Record analysis for future reference

8.6.18. Troubleshooting Low Accuracy

If overall accuracy is low:

  • Check GoF test results (dataset quality)

  • Try larger model

  • Increase training epochs

  • Improve feature extraction

If specific classes have low accuracy:

  • Check class balance

  • Investigate error samples

  • May need more data for those classes

  • Classes might be inherently similar

If quantized accuracy drops significantly:

  • Try QAT instead of PTQ

  • Use more calibration data

  • Keep sensitive layers at higher precision

  • Use larger model (more robust to quantization)

8.6.19. Next Steps