Correctness

Correctness matters more than speed. A trajectory database is not useful if it silently changes actions, states, frame counts, or task labels.

The final benchmark history reports:

| Check | Result |
| --- | --- |
| Datasets checked | 15 |
| Exact datasets | 15/15 |
| action_max_abs_diff | 0.0 on every dataset |
| robomimic issue | fixed by numeric demo sorting in the benchmark |
| Image payload status | present after image-ingest fixes; one remaining “Images: no” label was a reporting bug |

For each dataset, the benchmark compared native reads against .kdb reads for sampled episodes.

| Field | Comparison |
| --- | --- |
| Actions | elementwise max absolute difference |
| States | elementwise max absolute difference when available |
| Episode length | native frame count vs .kdb frame count |
| Images | payload presence, shape, and camera accounting |
| Metadata | task, embodiment, action dimension, frame count |
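
As a rough illustration of the action and state comparison, the sketch below computes an elementwise max absolute difference with NumPy. The function name `max_abs_diff` and the choice to report a shape mismatch as `inf` are assumptions made for this example; they are not taken from the benchmark code.

```python
import numpy as np


def max_abs_diff(native, kdb):
    """Return the elementwise max absolute difference between two arrays.

    Assumption of this sketch: a shape mismatch is reported as inf instead
    of raising, so one bad episode does not abort a whole benchmark run.
    """
    a = np.asarray(native, dtype=np.float64)
    b = np.asarray(kdb, dtype=np.float64)
    if a.shape != b.shape:
        return float("inf")
    if a.size == 0:
        return 0.0  # two empty arrays of the same shape are trivially equal
    return float(np.max(np.abs(a - b)))


native_actions = [[0.1, -0.2], [0.3, 0.4]]
print(max_abs_diff(native_actions, native_actions))  # identical inputs
print(max_abs_diff([1.0, 2.0], [1.0, 2.5]))          # one element differs
```

An exact dataset in the sense of the table above is one where this value is 0.0 for every sampled episode.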

The biggest correctness scare was robomimic reporting inf action differences. The root cause was not corrupted data. It was ordering:

lexicographic: demo_0, demo_1, demo_10, demo_100, demo_2
numeric: demo_0, demo_1, demo_2, demo_3, demo_4

kinodb ingests HDF5 demo groups in numeric order. The benchmark originally read the native HDF5 groups in lexicographic order, so the two sequences agreed only on demo_0 and demo_1. From the third position onward (demo_10 against demo_2) it was comparing different episodes.

Fix: sort demo keys by the integer after demo_.
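
A minimal sketch of that fix, assuming every key has the form `demo_<integer>` as in the listing above. The helper name `demo_index` is made up for this example.

```python
def demo_index(key: str) -> int:
    """Extract the integer after the 'demo_' prefix, e.g. 'demo_10' -> 10."""
    prefix = "demo_"
    if not key.startswith(prefix):
        raise ValueError(f"unexpected demo key: {key!r}")
    return int(key[len(prefix):])


keys = ["demo_0", "demo_1", "demo_10", "demo_100", "demo_2"]

lexicographic = sorted(keys)               # plain string sort
numeric = sorted(keys, key=demo_index)     # sort by the integer suffix

print(lexicographic)  # ['demo_0', 'demo_1', 'demo_10', 'demo_100', 'demo_2']
print(numeric)        # ['demo_0', 'demo_1', 'demo_2', 'demo_10', 'demo_100']
```

Pairing the two lists position by position shows the mismatch: only the first two entries line up.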

HDF5 observation groups often contain many low-dimensional state keys:

obs/
robot0_eef_pos
robot0_eef_quat
robot0_gripper_qpos
object

kinodb concatenates all 2D state keys in sorted order. Native benchmark code must do the same to compare state vectors fairly.
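
The native side of that comparison could look roughly like the sketch below. It assumes that "2D" means arrays shaped (frames, features), that concatenation runs along the feature axis, and that "sorted order" is a plain string sort of the key names. The function name `concat_state` and the image key `agentview_image` are hypothetical and only serve the example.

```python
import numpy as np


def concat_state(obs):
    """Concatenate all 2D observation arrays along axis 1, keys sorted by name."""
    keys = sorted(k for k, v in obs.items() if np.ndim(v) == 2)
    if not keys:
        raise ValueError("no 2D state keys found")
    return np.concatenate([np.asarray(obs[k]) for k in keys], axis=1)


T = 3  # number of frames in this toy episode
obs = {
    "robot0_eef_pos": np.zeros((T, 3)),
    "robot0_eef_quat": np.ones((T, 4)),
    "robot0_gripper_qpos": np.full((T, 2), 2.0),
    "object": np.full((T, 10), 3.0),
    "agentview_image": np.zeros((T, 8, 8, 3)),  # 4D, so it is not part of the state
}

state = concat_state(obs)
print(state.shape)  # (3, 19): object, eef_pos, eef_quat, gripper_qpos
```

If the native loader concatenates in insertion order instead, the state vectors contain the same numbers in different columns and the comparison reports a spurious difference.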

For image datasets, the storage story changed during development:

  1. LeRobot image struct columns were initially skipped.
  2. Image extraction was added.
  3. Raw RGB storage caused huge files.
  4. JPEG/PNG pass-through fixed storage parity.
  5. The benchmark image detector still had a reporting issue in one summary, even though data was present.

Current reader behavior:

  • raw image payloads are returned as raw bytes,
  • compressed JPEG/PNG payloads are decoded to raw RGB on read_episode(),
  • decode failures skip that frame image rather than crashing the entire episode read.
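
The skip-on-failure behavior can be pictured with the sketch below. It is not kinodb's reader: `toy_decode` is a stand-in that only checks the JPEG and PNG magic bytes, and representing a skipped frame image as `None` is an assumption of this example.

```python
from typing import Callable, List, Optional

JPEG_MAGIC = b"\xff\xd8"   # JPEG start-of-image marker
PNG_MAGIC = b"\x89PNG"     # first four bytes of the PNG signature


def toy_decode(payload: bytes) -> bytes:
    """Stand-in decoder: accepts payloads with JPEG/PNG magic bytes only."""
    if payload.startswith(JPEG_MAGIC) or payload.startswith(PNG_MAGIC):
        return b"\x00\x00\x00"  # pretend this is one decoded RGB pixel
    raise ValueError("unrecognized image payload")


def decode_episode_images(
    payloads: List[bytes], decode: Callable[[bytes], bytes]
) -> List[Optional[bytes]]:
    """Decode every frame; a failing frame yields None instead of raising."""
    images: List[Optional[bytes]] = []
    for payload in payloads:
        try:
            images.append(decode(payload))
        except Exception:
            images.append(None)  # skip this frame's image, keep the episode
    return images


frames = [b"\xff\xd8jpeg-bytes", b"corrupt", b"\x89PNGpng-bytes"]
decoded = decode_episode_images(frames, toy_decode)
print([img is not None for img in decoded])  # [True, False, True]
```

Training code that consumes such an episode has to tolerate missing images, so it is worth counting skipped frames after a read.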

Use the CLI validator before expensive training:

kino validate data.kdb
kino validate data.kdb --verbose

It checks:

  • header and index parse,
  • header episode count vs index length,
  • total frame count vs index entries,
  • per-episode metadata decode,
  • full episode decode,
  • action dimension consistency,
  • state dimension consistency across frames,
  • NaN/Inf warnings,
  • image byte length vs dimensions,
  • terminal-frame warning.
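
To make two of these checks concrete, the sketch below flags non-finite action values and compares an image's byte length against its dimensions. It is an illustration only, not the implementation behind `kino validate`; the function name `check_frame` and the assumption of raw images at one byte per channel are choices made for this example.

```python
import math


def check_frame(action, image_bytes, height, width, channels=3):
    """Return a list of warning strings for a single frame."""
    warnings = []
    # NaN/Inf check on the action vector.
    if not all(math.isfinite(x) for x in action):
        warnings.append("action contains NaN or Inf")
    # Raw image payloads should hold exactly height * width * channels bytes
    # (assumes uint8 pixels, one byte per channel).
    expected = height * width * channels
    if len(image_bytes) != expected:
        warnings.append(
            f"image has {len(image_bytes)} bytes, expected {expected}"
        )
    return warnings


ok = check_frame([0.1, -0.2], bytes(2 * 2 * 3), height=2, width=2)
bad = check_frame([0.1, float("nan")], bytes(5), height=2, width=2)
print(ok)
print(bad)
```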

The launch docs preserve the conclusions, but a paper-ready repo should commit:

  • raw benchmark JSON,
  • exact dataset versions and HuggingFace revisions,
  • native loader code,
  • .kdb ingest commands,
  • correctness comparison code,
  • environment metadata,
  • generated tables and plots.

That makes the correctness claim auditable instead of anecdotal.
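
For the environment metadata item, a small standard-library sketch such as the following could be run next to the benchmark and its JSON output committed. The field names and the package list (`numpy`, `h5py`) are assumptions for illustration; a package that is not installed is recorded as `null`.

```python
import json
import platform
from importlib import metadata


def package_version(name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def environment_metadata(packages=("numpy", "h5py")):
    """Collect interpreter, OS, and package versions as a plain dict."""
    return {
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": {name: package_version(name) for name in packages},
    }


meta = environment_metadata()
print(json.dumps(meta, indent=2, sort_keys=True))
```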