kinodb
Robot Trajectory Database
High-performance trajectory data pipeline
from HDF5 to GPU-saturating training batches.
The Problem
Robot learning has a data layer problem. The models are becoming general, but datasets are still trapped in incompatible storage formats, slow Python loaders, memory-heavy conversions, and benchmark scripts that quietly encode schema assumptions.
Every serious robot learning stack eventually becomes a data systems project. HDF5 stores robomimic and LIBERO trajectories. LeRobot uses Parquet plus media files. RLDS and Open X-Embodiment use TFRecord and TensorFlow conventions. Labs add Zarr, raw folders, sidecar metadata, and custom loaders because none of the existing formats is the shared trajectory database robotics actually wants.
The failure mode is not cosmetic. It shows up as training stalls, out-of-memory loads, duplicated conversions, format drift, collection bottlenecks, and correctness bugs that only appear after someone tries to mix datasets.
| Signal | What it exposed | Why it matters |
|---|---|---|
| LeRobot issue #1623 | Training spending more time in the dataloader than backprop | The data path can dominate wall-clock training |
| LeRobot issue #1346 | Whole-dataset memory pressure and image-heavy fine-tuning | "Load everything into RAM" is not a robotics-scale plan |
| LeRobot issue #2446 | Many LIBERO variants across format versions | Format drift creates duplicated datasets and brittle loaders |
| LeRobot issue #1434 | Recording-time video encoding bottlenecks | Storage choices affect data collection, not only training |
| Robo-DM, ICRA 2025 | RLDS bloat and slower loading than optimized paths | Robotics data overhead is measurable systems debt |
| RLDS / OXE practice | TensorFlow dependency and underdocumented conventions | PyTorch-heavy labs pay an integration tax |
The Thesis
kinodb turns trajectory data into an embedded database. Ingest once, write one
indexed .kdb file, and then query, validate, mix, serve, and train through the
same API regardless of whether the source came from HDF5, LeRobot, or RLDS.
- Ingest: Convert HDF5, LeRobot Parquet, and RLDS TFRecord into one episode-first format.
- Index: Store contiguous episode payloads plus an end-of-file index for direct random access.
- Query: Filter metadata with KQL instead of hand-writing one-off Python dataset scans.
- Train: Read from Python, NumPy, PyTorch datasets, weighted mixtures, or the gRPC server.
- Measure: Validate exactness, storage size, metadata scans, random reads, and training throughput.
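To make the "contiguous payloads plus an end-of-file index" idea concrete, here is a minimal sketch of that general layout in plain Python. It is not the actual `.kdb` format: the JSON index, the 16-byte footer, and the helper names `write_toy_db`, `read_index`, and `read_episode` are all assumptions made up for illustration. The point is only that a reader can find any episode with two seeks, without scanning the episodes before it.

```python
import io
import json
import struct

# Hypothetical footer: index offset and index length, little-endian u64 each.
FOOTER = struct.Struct("<QQ")


def write_toy_db(buf, episodes):
    """Write payloads back to back, then a JSON index, then a fixed-size footer."""
    index = []
    for payload in episodes:
        index.append({"offset": buf.tell(), "length": len(payload)})
        buf.write(payload)
    index_bytes = json.dumps(index).encode("utf-8")
    index_offset = buf.tell()
    buf.write(index_bytes)
    buf.write(FOOTER.pack(index_offset, len(index_bytes)))


def read_index(buf):
    """Locate the index by reading the fixed-size footer at the end of the file."""
    buf.seek(-FOOTER.size, io.SEEK_END)
    index_offset, index_length = FOOTER.unpack(buf.read(FOOTER.size))
    buf.seek(index_offset)
    return json.loads(buf.read(index_length))


def read_episode(buf, index, i):
    """Seek straight to episode i without touching earlier episodes."""
    entry = index[i]
    buf.seek(entry["offset"])
    return buf.read(entry["length"])


# An in-memory buffer stands in for a file on disk.
buf = io.BytesIO()
write_toy_db(buf, [b"episode-0", b"ep-1", b"episode-two"])
index = read_index(buf)
print(read_episode(buf, index, 2))
```

Putting the index at the end lets a writer stream episodes out in one pass and record offsets as it goes, which is one common reason formats choose a trailing index over a header.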
Read This In Order
The left sidebar is organized as a launch path. Start with the problem, build a database, train from it, then use the benchmark and reference sections when you need proof or implementation details.
- … .kdb, inspect it, query it, and read it from Python.
- 04 Ingesting Data: Bring in robomimic, LIBERO, LeRobot, RLDS, and image-heavy datasets.
- 05 KQL Queries: Use metadata filters for selection, merge, and training subsets.
- 06 Benchmark Results: Read the experiment report for training curves, scaling, storage, and correctness.

First Minute
    cargo build --release

    # Create a small synthetic database.
    target/release/kino create-test demo.kdb -n 20 --frames 50 --compress 85

    # Inspect and query it.
    target/release/kino info demo.kdb
    target/release/kino schema demo.kdb
    target/release/kino query demo.kdb "success = true AND num_frames > 25"

Then read it from Python:

    import kinodb

    db = kinodb.open("demo.kdb")
    print(db.summary())

    episode = db.read_episode(0)
    print(episode["actions"].shape)
    print(episode["states"].shape)

Launch Scorecard
These are the headline results from the latest launch experiment log. The important part is not only that kinodb is faster, but that it gives robotics teams one data layer for formats that otherwise need separate loaders and scripts.
| Area | Result |
|---|---|
| Training curves | PushT image CNN/MLP: 6.6-6.8x end-to-end; LIBERO spatial CNN/MLP: 7.1-7.7x; ViT remains 2.2-2.4x because compute dominates |
| Interoperability | 4 mixed kinodb sources with equal weights; loader code drops from 26 native LOC to 8 kinodb LOC; mixed run reaches loss 33.6219 |
| Scaling | At 50K episodes: kinodb opens in 1.2ms, sequentially reads in 1.26s, and runs KQL in 31.7ms |
| Storage | State-only 1K episodes: kinodb is 4.52 MB and writes in 18ms; image-heavy runs land at native-size parity |
| Correctness | Prior correctness sweep still records 15/15 datasets exact, action_max_abs_diff = 0.0; keep raw JSON in-repo before paper submission |
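The interoperability row mixes four sources with equal weights. As a rough illustration of what weighted mixing means at the sampling level, the sketch below draws (source, episode index) pairs in plain Python. It does not use kinodb's mixture API: the function `mixture_indices`, the source names, and the episode counts are hypothetical, chosen only to show the technique.

```python
import random
from collections import Counter


def mixture_indices(source_sizes, weights, num_samples, seed=0):
    """Yield (source_name, episode_index) pairs drawn from a weighted mixture.

    source_sizes: dict of hypothetical source name -> number of episodes.
    weights: dict of source name -> non-negative sampling weight.
    """
    rng = random.Random(seed)
    names = list(source_sizes)
    w = [weights[name] for name in names]
    for _ in range(num_samples):
        # Pick a source in proportion to its weight, then an episode uniformly.
        name = rng.choices(names, weights=w, k=1)[0]
        yield name, rng.randrange(source_sizes[name])


# Hypothetical sources of very different sizes, mixed with equal weights.
sizes = {"robomimic": 200, "libero": 500, "lerobot": 50, "rlds": 1000}
equal = {name: 1.0 for name in sizes}
draws = list(mixture_indices(sizes, equal, num_samples=4000))
print(Counter(name for name, _ in draws))
```

With equal weights each source contributes roughly a quarter of the draws regardless of its size, so a small source is revisited far more often per episode than a large one. Whether that is desirable depends on the experiment.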
What Exists Today
kinodb currently ships a Rust workspace with core storage, ingestion, CLI, gRPC serving, and Python bindings. Some original blueprint items, such as video-segment storage, shared-memory serving, and hardware decode, are future work. The docs mark those as roadmap items instead of pretending they are already implemented.
The docs are written to keep that line clear: what exists today, what was measured, and what is still roadmap.