
Dataset Mixtures

Modern VLA training often mixes demonstrations from different robots, tasks, and source formats. kinodb gives you two paths:

  • kino merge creates one physical .kdb file.
  • kino mix and from_mixture() create weighted virtual mixtures.

Use merge when you want a single file to ship, archive, upload, or train sequentially.

kino merge lift.kdb pusht.kdb aloha.kdb --output combined.kdb

Filter while merging:

kino merge lift.kdb pusht.kdb \
  --output successful.kdb \
  --filter "success = true"

This reads each input episode, applies the optional KQL filter, then writes matching episodes to a new database.
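Conceptually, a filtered merge is a single pass over every input. The sketch below models that flow with plain Python dicts standing in for episodes. `merge_episodes` and the `success` field layout are illustrative assumptions, not the kinodb API or KQL semantics.

```python
from typing import Callable, Iterable, List, Optional

Episode = dict  # stand-in for a stored episode; real .kdb episodes are richer


def merge_episodes(
    sources: Iterable[Iterable[Episode]],
    keep: Optional[Callable[[Episode], bool]] = None,
) -> List[Episode]:
    """Concatenate episodes source by source, keeping those that pass `keep`."""
    merged = []
    for source in sources:
        for episode in source:
            if keep is None or keep(episode):
                merged.append(episode)
    return merged


lift = [{"task": "lift", "success": True}, {"task": "lift", "success": False}]
pusht = [{"task": "pusht", "success": True}]

# Rough analogue of --filter "success = true" (hypothetical predicate form).
successful = merge_episodes([lift, pusht], keep=lambda ep: ep["success"])
```

Without a predicate, every episode is copied through, which corresponds to the unfiltered merge.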

Use mix when you want training-time sampling proportions.

kino mix \
  --source bridge.kdb:0.4 \
  --source aloha.kdb:0.3 \
  --source libero.kdb:0.3

Sample a distribution:

kino mix \
  --source bridge.kdb:0.4 \
  --source aloha.kdb:0.3 \
  --source libero.kdb:0.3 \
  --sample 1000 \
  --seed 42

Weights are relative. 4:3:3 and 0.4:0.3:0.3 are equivalent.
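To see why the two forms are equivalent, divide each weight by the sum of all weights. The sketch below uses only the Python standard library. `normalize` and `sample_sources` are hypothetical helpers, and how kinodb draws samples internally is not specified here.

```python
import random
from collections import Counter


def normalize(weights):
    """Scale relative weights so they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return {name: w / total for name, w in weights.items()}


def sample_sources(weights, n, seed):
    """Draw n source names with probability proportional to weight."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[k] for k in names], k=n)


ratios = normalize({"bridge.kdb": 4, "aloha.kdb": 3, "libero.kdb": 3})
fractions = normalize({"bridge.kdb": 0.4, "aloha.kdb": 0.3, "libero.kdb": 0.3})

# Tally how often each source is picked over 1000 seeded draws.
counts = Counter(sample_sources(ratios, n=1000, seed=42))
```

Both `ratios` and `fractions` come out as the same probabilities (up to floating-point rounding), and the tallies in `counts` should land near the 40/30/30 split without matching it exactly.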

from kinodb.torch import from_mixture
from torch.utils.data import DataLoader

dataset = from_mixture(
    {
        "bridge.kdb": 0.4,
        "aloha.kdb": 0.3,
        "libero.kdb": 0.3,
    },
    seed=42,
    image_size=(224, 224),
)
loader = DataLoader(dataset, batch_size=8)

Each source can come from a different original format. The training code only sees .kdb.

use kinodb_core::Mixture;

let mut mix = Mixture::builder()
    .add("bridge.kdb", 0.4)
    .add("aloha.kdb", 0.3)
    .add("libero.kdb", 0.3)
    .seed(42)
    .build()?;

let episode = mix.sample()?;
let global_episode = mix.read_global(10)?;
let order = mix.weighted_epoch(1000);

Different datasets can have different action and state dimensions. This came up directly in the project's experiments when mixing PushT (action_dim = 2) with ALOHA (action_dim = 14): a raw torch.stack over the batch fails because the tensors have different shapes, unless the collate function pads them or batches by schema.
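One way to avoid the stacking failure is to zero-pad each vector to the widest dimension in the batch and carry a mask marking the real entries. The sketch below is framework-agnostic plain Python. `pad_with_mask` is a hypothetical helper, not part of kinodb, and in a real collate function you would likely convert its outputs with torch.tensor.

```python
def pad_with_mask(vectors, pad_value=0.0):
    """Right-pad variable-length vectors to a common width.

    Returns (padded, mask), where mask[i][j] is 1.0 for a real entry
    and 0.0 for padding.
    """
    width = max(len(v) for v in vectors)
    padded, mask = [], []
    for v in vectors:
        extra = width - len(v)
        padded.append(list(v) + [pad_value] * extra)
        mask.append([1.0] * len(v) + [0.0] * extra)
    return padded, mask


pusht_action = [0.1, -0.2]   # action_dim = 2
aloha_action = [0.05] * 14   # action_dim = 14

actions, action_mask = pad_with_mask([pusht_action, aloha_action])
```

After padding, every row has width 14, so the batch stacks cleanly. The mask lets the loss ignore the padded dimensions of the PushT sample.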

Common strategies:

| Strategy | When to use |
| --- | --- |
| Pad state/action vectors to max dimension | One model with source-aware masks |
| Bucket by schema | Multi-embodiment training with separate heads |
| Train separate adapters | Different embodiments have genuinely different action spaces |
| Physical merge only same-schema data | Simplest archival/distribution path |

kinodb preserves dimensions and metadata; it does not hide schema differences from your model.
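Bucketing by schema can be sketched as grouping samples by their (state_dim, action_dim) pair before batching, so that every batch contains only one shape. This is plain Python again. The sample layout with `state` and `action` keys and the dimensions used are assumptions for illustration, not the structure from_mixture() returns.

```python
from collections import defaultdict


def bucket_by_schema(samples, batch_size):
    """Yield (schema, batch) pairs where every sample in a batch shares a schema."""
    buckets = defaultdict(list)
    for sample in samples:
        key = (len(sample["state"]), len(sample["action"]))
        buckets[key].append(sample)
        if len(buckets[key]) == batch_size:
            # Emit the full bucket and start a fresh one for this schema.
            yield key, buckets.pop(key)
    # Flush partially filled buckets at the end of the pass.
    for key, leftover in buckets.items():
        if leftover:
            yield key, leftover


small = {"state": [0.0] * 5, "action": [0.0] * 2}     # illustrative dims
large = {"state": [0.0] * 14, "action": [0.0] * 14}   # illustrative dims

batches = list(bucket_by_schema([small, large, small, large, small], batch_size=2))
```

Each yielded schema key can then be used to route the batch to the matching model head.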

| Use case | Pick |
| --- | --- |
| Publish one converted dataset | kino merge |
| Build a filtered release split | kino merge --filter |
| Match OpenVLA-style source proportions | kino mix or from_mixture() |
| Change ratios between runs | virtual mixture |
| Maximize sequential reads from one file | physical merge |