5.4. Data Splitting

This guide explains how Tiny ML Tensorlab splits data into training, validation, and test sets.

5.4.1. Split Methods

There are two methods for splitting data:

1. amongst_files (Default)

Files are divided into different sets:

  • File A goes to training

  • File B goes to validation

  • File C goes to testing

Good when each file represents a distinct experiment or session.

2. within_files

Each file is split internally:

  • First 60% of File A → Training

  • Next 30% of File A → Validation

  • Last 10% of File A → Testing

Good when files contain long continuous sequences.

5.4.2. Configuration

Using split_factor (Auto-split)

dataset:
  dataset_name: 'my_data'
  input_data_path: '/path/to/data'
  split_type: 'amongst_files'     # or 'within_files'
  split_factor: [0.6, 0.3, 0.1]   # train, val, test ratios

Ratios must sum to 1.0.
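A bad ratio list is easy to catch before training starts. The sketch below is a hypothetical helper (`validate_split_factor` is not a Tensorlab API), assuming the three-element train/val/test form shown above:

```python
import math

def validate_split_factor(split_factor):
    """Sanity-check a [train, val, test] ratio list."""
    if len(split_factor) != 3:
        raise ValueError('expected three ratios: train, val, test')
    if any(r <= 0 for r in split_factor):
        raise ValueError('every ratio must be positive')
    # Compare with a tolerance: 0.6 + 0.3 + 0.1 is not exactly 1.0 in floats
    if not math.isclose(sum(split_factor), 1.0, abs_tol=1e-9):
        raise ValueError(f'ratios sum to {sum(split_factor)}, not 1.0')
    return True

validate_split_factor([0.6, 0.3, 0.1])  # passes
```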

Using Annotation Files (Manual Split)

Create files in annotations/ folder:

my_dataset/
├── classes/ (or files/)
│   └── ...
└── annotations/
    ├── instances_train_list.txt
    ├── instances_val_list.txt
    └── instances_test_list.txt    # Optional

If annotations exist, they override split_factor.

5.4.3. Annotation File Format

List file paths relative to the data directory, one per line:

instances_train_list.txt:

class_A/sample1.csv
class_A/sample2.csv
class_B/sample1.csv
class_B/sample2.csv

instances_val_list.txt:

class_A/sample3.csv
class_B/sample3.csv

5.4.4. Split Examples

Example 1: 10 Files, amongst_files, [0.6, 0.3, 0.1]

Files: file1.csv through file10.csv

Training (60%):   file1-6.csv (6 files)
Validation (30%): file7-9.csv (3 files)
Testing (10%):    file10.csv (1 file)

Each file retains all its rows.
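The file assignment above can be sketched in a few lines (a hypothetical helper, not part of Tensorlab; note that whatever remains after train and val becomes the test set):

```python
def split_amongst_files(files, ratios):
    """Assign whole files to train/val/test; each file keeps all its rows."""
    n_train = int(len(files) * ratios[0])
    n_val = int(len(files) * ratios[1])
    return (files[:n_train],
            files[n_train:n_train + n_val],
            files[n_train + n_val:])

files = [f'file{i}.csv' for i in range(1, 11)]
train, val, test = split_amongst_files(files, [0.6, 0.3, 0.1])
# train: file1-6 (6 files), val: file7-9 (3 files), test: file10 (1 file)
```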

Example 2: 10 Files, within_files, [0.6, 0.3, 0.1]

Assume each file has 100 rows:

Training:   Rows 0-59 from all 10 files
Validation: Rows 60-89 from all 10 files
Testing:    Rows 90-99 from all 10 files

All files appear in all sets, but different portions.
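The per-file row slicing can be sketched the same way (again a hypothetical helper, not a Tensorlab API; in practice it would run once per file):

```python
def split_within_file(rows, ratios):
    """Split one file's rows into contiguous train/val/test slices."""
    n_train = int(len(rows) * ratios[0])
    n_val = int(len(rows) * ratios[1])
    return (rows[:n_train],
            rows[n_train:n_train + n_val],
            rows[n_train + n_val:])

rows = list(range(100))  # one file with 100 rows
train, val, test = split_within_file(rows, [0.6, 0.3, 0.1])
# train: rows 0-59, val: rows 60-89, test: rows 90-99
```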

5.4.5. When to Use Each Method

Use amongst_files when:

  • Each file is a separate experiment

  • Files have different conditions (different subjects, machines)

  • You want to test generalization to new experiments

Use within_files when:

  • Files are very long continuous recordings

  • You want maximum data utilization

  • The data is homogeneous throughout

5.4.6. Best Practices

1. Keep Test Data Truly Held-Out

Never use test data during model development. Evaluate on the test set only once you've finalized your model.

2. Use Stratified Splits for Classification

For classification, keep the class proportions similar in each split. The auto-split attempts to preserve them.
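A stratified split can be sketched by splitting each class's file list independently, so the ratios hold per class (a hypothetical helper; `files_by_class` maps class names to file paths and is an assumed input format):

```python
import random

def stratified_split(files_by_class, ratios, seed=42):
    """Split each class's files separately so class balance carries over."""
    rng = random.Random(seed)
    train, val, test = [], [], []
    for class_name in sorted(files_by_class):
        files = sorted(files_by_class[class_name])
        rng.shuffle(files)  # deterministic, thanks to the seeded Random
        n_train = int(len(files) * ratios[0])
        n_val = int(len(files) * ratios[1])
        train += files[:n_train]
        val += files[n_train:n_train + n_val]
        test += files[n_train + n_val:]
    return train, val, test
```

Because each class is cut with the same ratios, a 60/30/10 split of two 10-file classes yields 6/3/1 files from each class in train/val/test.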

3. Consider Temporal Ordering

For time series, consider whether random splitting makes sense. Sometimes you want earlier data for training, later data for testing.
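A chronological split is just a sorted, unshuffled split (a sketch; the `(timestamp, row)` tuple format is an assumption for illustration):

```python
def chronological_split(samples, train_ratio=0.6, val_ratio=0.3):
    """Oldest data trains, newest data tests -- no shuffling."""
    ordered = sorted(samples, key=lambda s: s[0])  # s = (timestamp, row)
    n_train = int(len(ordered) * train_ratio)
    n_val = int(len(ordered) * val_ratio)
    return (ordered[:n_train],
            ordered[n_train:n_train + n_val],
            ordered[n_train + n_val:])
```

This avoids leaking future information into training, at the cost of the test set covering only the most recent conditions.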

4. Use Annotation Files for Reproducibility

Manual annotation files ensure the same splits across runs.

5.4.7. Creating Annotation Files

Automatically Generated

Run training once without annotations, then find the generated split files in the output directory.

Manually Created

Use a script to create deterministic splits:

import os
import random

def create_splits(data_dir, train_ratio=0.6, val_ratio=0.3):
    """Write train/val/test annotation files for a classes/-style dataset."""
    # List all files. Sort both listings: os.listdir order is not guaranteed,
    # and an unsorted list would make the seeded shuffle non-reproducible.
    files = []
    for class_name in sorted(os.listdir(os.path.join(data_dir, 'classes'))):
        class_dir = os.path.join(data_dir, 'classes', class_name)
        for f in sorted(os.listdir(class_dir)):
            files.append(f'{class_name}/{f}')

    # Shuffle deterministically
    random.seed(42)
    random.shuffle(files)

    # Split; whatever remains after train and val becomes the test set
    n_train = int(len(files) * train_ratio)
    n_val = int(len(files) * val_ratio)

    train_files = files[:n_train]
    val_files = files[n_train:n_train + n_val]
    test_files = files[n_train + n_val:]

    # Write annotation files, one relative path per line
    ann_dir = os.path.join(data_dir, 'annotations')
    os.makedirs(ann_dir, exist_ok=True)

    with open(os.path.join(ann_dir, 'instances_train_list.txt'), 'w') as f:
        f.write('\n'.join(train_files) + '\n')

    with open(os.path.join(ann_dir, 'instances_val_list.txt'), 'w') as f:
        f.write('\n'.join(val_files) + '\n')

    with open(os.path.join(ann_dir, 'instances_test_list.txt'), 'w') as f:
        f.write('\n'.join(test_files) + '\n')

5.4.8. Verifying Splits

After training, check the log for split information:

Dataset loaded:
- Training samples: 1200
- Validation samples: 400
- Test samples: 200
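Beyond reading the log, a short script can confirm that the annotation lists are disjoint, i.e. that no file leaks between splits (a hypothetical check, not built into Tensorlab):

```python
import os

def check_no_overlap(ann_dir):
    """Verify no file path appears in more than one annotation list."""
    seen = {}
    for name in ('instances_train_list.txt',
                 'instances_val_list.txt',
                 'instances_test_list.txt'):
        path = os.path.join(ann_dir, name)
        if not os.path.exists(path):  # the test list is optional
            continue
        with open(path) as f:
            for line in f:
                entry = line.strip()
                if not entry:
                    continue
                if entry in seen:
                    raise ValueError(
                        f'{entry} appears in both {seen[entry]} and {name}')
                seen[entry] = name
    return True
```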