8.5. Post-Training Analysis

After training, Tiny ML Tensorlab provides comprehensive analysis tools to evaluate model performance and understand its behavior.

8.5.1. Overview

Post-training analysis helps you:

  • Evaluate model accuracy and error patterns

  • Understand which classes are confused

  • Select optimal operating thresholds

  • Verify quantization impact

  • Generate reports for stakeholders

8.5.2. Enabling Analysis

Analysis is enabled through the testing section:

testing:
  enable: True
  analysis:
    confusion_matrix: True
    roc_curve: True
    class_histograms: True
    error_analysis: True

8.5.3. Output Files

After testing, you’ll find analysis outputs:

.../testing/
├── confusion_matrix_test.png        # Confusion matrix
├── One_vs_Rest_MultiClass_ROC_test.png  # ROC curves
├── Histogram_Class_Score_differences_test.png  # Score distributions
├── fpr_tpr_thresholds.csv           # Threshold analysis
├── classification_report.txt        # Per-class metrics
├── error_samples/                   # Misclassified examples
│   ├── error_001.csv
│   └── ...
└── test_results.json                # Summary statistics

8.5.4. Confusion Matrix

Shows classification results in matrix form:

                 Predicted
                 A    B    C
Actual    A    95    3    2
          B     1   97    2
          C     2    1   97

Interpreting:

  • Diagonal = correct predictions

  • Off-diagonal = misclassifications

  • Rows sum to actual class counts

  • Columns show predicted distribution

Good matrix:

  • Strong diagonal (high values)

  • Weak off-diagonal (low values)

Problem indicators:

  • High off-diagonal values = specific class confusion

  • Asymmetric confusion = direction-specific errors
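
A minimal sketch for recomputing and plotting the confusion matrix yourself, assuming you export the true and predicted labels from your test run (the arrays, label names, and output filename below are placeholders):

# Sketch: recompute and plot a confusion matrix from exported labels
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

y_true = np.array([0, 0, 1, 1, 2, 2])   # actual class indices (placeholder data)
y_pred = np.array([0, 1, 1, 1, 2, 0])   # predicted class indices (placeholder data)

cm = confusion_matrix(y_true, y_pred)   # rows = actual, columns = predicted
ConfusionMatrixDisplay(cm, display_labels=['A', 'B', 'C']).plot(cmap='Blues')
plt.savefig('confusion_matrix_custom.png')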

8.5.5. ROC Curves

The Receiver Operating Characteristic (ROC) curve shows the trade-off between:

  • True Positive Rate (sensitivity)

  • False Positive Rate (1 - specificity)

Example: One-vs-Rest multi-class ROC curves for arc fault detection, showing the trade-off between sensitivity and specificity at different thresholds:

TPR (Sensitivity)
1.0 |        ******
    |      **
    |    **
0.5 |  **
    | **
0.0 +*-------------- FPR
    0.0    0.5    1.0

Key Metrics:

  • AUC (Area Under Curve): 1.0 = perfect, 0.5 = random

  • Operating Point: Where you set the threshold

Multi-Class ROC:

For multi-class problems, one-vs-rest ROC shows each class:

Class A: AUC = 0.98
Class B: AUC = 0.95
Class C: AUC = 0.99
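
A hedged sketch of how per-class one-vs-rest AUC values like these can be computed with scikit-learn; the labels and scores below are placeholders for the arrays you export from your own test run:

# Sketch: one-vs-rest ROC/AUC per class from softmax scores
import numpy as np
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize

n_classes = 3
y_true = np.random.randint(0, n_classes, size=200)   # placeholder labels
scores = np.random.rand(200, n_classes)              # placeholder softmax outputs
scores /= scores.sum(axis=1, keepdims=True)

y_onehot = label_binarize(y_true, classes=list(range(n_classes)))
for c in range(n_classes):
    fpr, tpr, _ = roc_curve(y_onehot[:, c], scores[:, c])
    print(f"Class {c}: AUC = {auc(fpr, tpr):.2f}")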

8.5.6. Class Score Histograms

Shows distribution of model confidence for each class:

Example: Distribution of class score differences, showing model confidence:

Correct predictions: [=====|=====] centered at high score
Wrong predictions:   [==|==] centered at low score

Interpretation:

  • Well-separated histograms: Model is confident and correct

  • Overlapping histograms: Model is uncertain

  • Wrong predictions at high scores: Confident mistakes (investigate)
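
The same histogram can be reproduced from exported scores. The sketch below is an assumption about the plotted quantity: it takes the margin between the two highest class scores per sample and splits it by whether the top prediction was correct.

# Sketch: histogram of top-1 vs. top-2 score margins, correct vs. wrong
import numpy as np
import matplotlib.pyplot as plt

scores = np.random.rand(1000, 3)                # placeholder softmax outputs
scores /= scores.sum(axis=1, keepdims=True)
y_true = np.random.randint(0, 3, size=1000)     # placeholder labels

top2 = np.sort(scores, axis=1)[:, -2:]          # two highest scores per sample
margin = top2[:, 1] - top2[:, 0]                # confidence margin
correct = scores.argmax(axis=1) == y_true

plt.hist(margin[correct], bins=30, alpha=0.6, label='Correct')
plt.hist(margin[~correct], bins=30, alpha=0.6, label='Wrong')
plt.xlabel('Score difference (top-1 minus top-2)')
plt.legend()
plt.savefig('class_score_histogram_custom.png')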

8.5.7. FPR/TPR Thresholds

CSV file for threshold selection:

threshold,tpr,fpr,precision,recall,f1
0.1,0.99,0.15,0.87,0.99,0.93
0.3,0.97,0.08,0.92,0.97,0.94
0.5,0.95,0.03,0.97,0.95,0.96
0.7,0.90,0.01,0.99,0.90,0.94
0.9,0.80,0.00,1.00,0.80,0.89

Using this data:

  1. Choose your priority (minimize FPR or maximize TPR)

  2. Find the threshold that meets your requirement

  3. Use that threshold in deployment code
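
To apply these steps programmatically, a minimal sketch using pandas is shown below; it assumes the column names from the CSV sample above and a hypothetical requirement of at most 5% false positives:

# Sketch: pick an operating threshold from fpr_tpr_thresholds.csv
import pandas as pd

df = pd.read_csv('fpr_tpr_thresholds.csv')

max_fpr = 0.05                                   # example requirement: FPR <= 5%
candidates = df[df['fpr'] <= max_fpr]
best = candidates.sort_values('tpr', ascending=False).iloc[0]
print(f"Use threshold {best['threshold']:.2f}: "
      f"TPR={best['tpr']:.2f}, FPR={best['fpr']:.2f}")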

8.5.8. Classification Report

Per-class performance metrics:

Class         Precision  Recall  F1-Score  Support
Normal        0.98       0.96    0.97      500
Fault_A       0.95       0.97    0.96      480
Fault_B       0.97       0.95    0.96      520

Accuracy                          0.96      1500
Macro Avg     0.97       0.96    0.96      1500
Weighted Avg  0.96       0.96    0.96      1500

Metrics explained:

  • Precision: Of predicted positives, how many are correct?

  • Recall: Of actual positives, how many were detected?

  • F1-Score: Harmonic mean of precision and recall

  • Support: Number of samples per class
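
As a quick check of how these metrics relate, the sketch below computes precision, recall, and F1 for a single class directly from placeholder true-positive, false-positive, and false-negative counts:

# Sketch: precision, recall and F1 from raw counts for one class
tp, fp, fn = 480, 25, 15                 # placeholder counts

precision = tp / (tp + fp)               # of predicted positives, fraction correct
recall = tp / (tp + fn)                  # of actual positives, fraction detected
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")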

8.5.9. Error Analysis

Detailed examination of misclassified samples:

testing:
  error_analysis:
    save_errors: True
    max_errors_per_class: 20

Each saved error sample file includes:

  • Original input data

  • True label

  • Predicted label

  • Model confidence scores

Using error analysis:

  1. Identify patterns in errors

  2. Check for labeling mistakes

  3. Find data collection issues

  4. Improve feature extraction
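
The exact layout of the error sample files can vary, so the sketch below is an assumption: it treats each error_*.csv as containing 'true_label' and 'predicted_label' columns and tallies the most frequent confusions. Adjust the path and column names to match the files your run actually produces.

# Sketch: tally the most frequent (true, predicted) confusions among saved errors
import glob
from collections import Counter
import pandas as pd

pairs = Counter()
for path in glob.glob('testing/error_samples/error_*.csv'):
    row = pd.read_csv(path).iloc[0]          # assumed: one error per file
    pairs[(row['true_label'], row['predicted_label'])] += 1

for (true_lbl, pred_lbl), count in pairs.most_common(5):
    print(f"{true_lbl} -> {pred_lbl}: {count} errors")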

8.5.10. Quantized vs Float Comparison

Compare quantized model to float baseline:

testing:
  enable: True
  test_float: True
  test_quantized: True
  compare_results: True

Output:

Float32 Model:
Accuracy: 99.2%
F1-Score: 0.992

INT8 Quantized Model:
Accuracy: 98.8%
F1-Score: 0.988

Degradation: 0.4%
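
If you also export the predictions from both runs, the degradation can be broken down further. The sketch below is an assumption about file names; it compares float and INT8 predictions on the same test set and reports how often the two models disagree:

# Sketch: compare float and quantized predictions on the same test set
import numpy as np

y_true = np.load('test_labels.npy')          # placeholder paths
pred_float = np.load('pred_float.npy')
pred_int8 = np.load('pred_int8.npy')

acc_float = (pred_float == y_true).mean()
acc_int8 = (pred_int8 == y_true).mean()
disagree = (pred_float != pred_int8).mean()  # fraction of differing predictions

print(f"Float: {acc_float:.1%}  INT8: {acc_int8:.1%}  "
      f"Degradation: {acc_float - acc_int8:.1%}  Disagreement: {disagree:.1%}")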

8.5.11. Regression Analysis

For regression tasks, different metrics apply:

testing:
  enable: True
  regression_metrics:
    mse: True
    mae: True
    r2: True
    scatter_plot: True

Output:

Mean Squared Error (MSE): 0.023
Mean Absolute Error (MAE): 0.12
R² Score: 0.95
Max Error: 0.45
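
These metrics can also be recomputed from exported predictions; a minimal sketch with scikit-learn, using placeholder arrays:

# Sketch: regression metrics from targets and predictions
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score, max_error

y_true = np.array([0.10, 0.50, 0.90, 1.40])   # placeholder targets
y_pred = np.array([0.12, 0.45, 1.00, 1.35])   # placeholder predictions

print(f"MSE: {mean_squared_error(y_true, y_pred):.3f}")
print(f"MAE: {mean_absolute_error(y_true, y_pred):.3f}")
print(f"R2:  {r2_score(y_true, y_pred):.3f}")
print(f"Max error: {max_error(y_true, y_pred):.3f}")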

8.5.12. Anomaly Detection Analysis

For anomaly detection tasks, reconstruction-error metrics apply:

testing:
  enable: True
  anomaly_metrics:
    reconstruction_error: True
    threshold_analysis: True

Output:

Normal Data:
Mean reconstruction error: 0.05
Std reconstruction error: 0.02

Anomaly Data:
Mean reconstruction error: 0.35
Std reconstruction error: 0.15

Recommended threshold: 0.15
At threshold 0.15:
TPR: 0.92
FPR: 0.05
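
A hedged sketch of the underlying threshold analysis: given per-sample reconstruction errors on normal and anomalous data, a common heuristic is to place the threshold a few standard deviations above the normal mean and then read off TPR and FPR (the error distributions below are placeholders):

# Sketch: reconstruction-error threshold analysis for an autoencoder
import numpy as np

err_normal = np.random.normal(0.05, 0.02, 500)    # placeholder normal-data errors
err_anomaly = np.random.normal(0.35, 0.15, 100)   # placeholder anomaly errors

threshold = err_normal.mean() + 5 * err_normal.std()   # heuristic, not the tool's rule

tpr = (err_anomaly > threshold).mean()   # anomalies correctly flagged
fpr = (err_normal > threshold).mean()    # normal samples incorrectly flagged
print(f"threshold={threshold:.3f}, TPR={tpr:.2f}, FPR={fpr:.2f}")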

8.5.13. Custom Analysis Scripts

For advanced analysis, use the saved model and data:

import torch
import numpy as np

# Load the trained model and switch it to inference mode
model = torch.load('path/to/best_model.pt')
model.eval()

# Load test data and labels saved as NumPy arrays
test_data = np.load('path/to/test_data.npy')
test_labels = np.load('path/to/test_labels.npy')

# Run inference without tracking gradients; cast inputs to float32 to match the model
with torch.no_grad():
    outputs = model(torch.tensor(test_data, dtype=torch.float32))
    predictions = outputs.argmax(dim=1)

# Custom analysis, e.g. compare predictions against test_labels
# ... your analysis code

8.5.14. Generating Reports

For documentation or stakeholder communication:

testing:
  enable: True
  generate_report: True
  report_format: 'pdf'  # or 'html', 'markdown'

Report includes:

  • Model summary (architecture, parameters)

  • Training curves

  • Test metrics

  • Confusion matrix

  • ROC curves

  • Recommendations

8.5.15. Example: Complete Analysis Configuration

common:
  task_type: 'generic_timeseries_classification'
  target_device: 'F28P55'

dataset:
  dataset_name: 'dc_arc_fault_example_dsk'

data_processing_feature_extraction:
  feature_extraction_name: 'FFT1024Input_256Feature_1Frame_Full_Bandwidth'
  variables: 1

training:
  model_name: 'ArcFault_model_400_t'
  training_epochs: 30
  quantization: 2
  quantization_method: 'QAT'
  quantization_weight_bitwidth: 8
  quantization_activation_bitwidth: 8

testing:
  enable: True
  test_float: True
  test_quantized: True
  analysis:
    confusion_matrix: True
    roc_curve: True
    class_histograms: True
    error_analysis: True
    save_errors: True
    max_errors_per_class: 10
  compare_results: True

8.5.16. Best Practices

  1. Always review confusion matrix: Understand error patterns

  2. Check ROC curves: Ensure good class separation

  3. Analyze errors: Learn from misclassifications

  4. Compare quantized vs. float: Verify the accuracy drop is acceptable

  5. Document findings: Record analysis for future reference

8.5.17. Troubleshooting Low Accuracy

If overall accuracy is low:

  • Check goodness-of-fit (GoF) test results (dataset quality)

  • Try larger model

  • Increase training epochs

  • Improve feature extraction

If specific classes have low accuracy:

  • Check class balance

  • Investigate error samples

  • May need more data for those classes

  • Classes might be inherently similar

If quantized accuracy drops significantly:

  • Try QAT instead of PTQ

  • Use more calibration data

  • Keep sensitive layers at higher precision

  • Use larger model (more robust to quantization)

8.5.18. Next Steps