Training an imitation learning policy on open robot data

Most robot manipulation policies today are not programmed but learned from demonstrations: a person teleoperates a robot through a task many times, and a model is trained to reproduce that behavior from the robot's own observations. This is imitation learning, and in its simplest form, behavior cloning, it is supervised learning: states in, expert actions out. I wanted to work through the whole loop myself, training, evaluation, and comparison against a reference, rather than just read about it, so this page walks through each step and the decisions behind it. The code is at github.com/gradientsj/robot-imitation-lab.

The system at a glance

Two checkpoints meet in one evaluator: the policy I train from scratch, and the published reference, both rolled out under identical conditions.

Step 1: choosing the task

The task is PushT: a round pusher has to slide a T-shaped block onto a target outline, viewed from above. An episode counts as a success when the block covers at least 95% of the goal region. The per-step reward tracks goal coverage, normalized so that a reward of 1.0 means success, which keeps partial progress visible even when an episode fails.
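As a minimal sketch of that reward definition: coverage is divided by the 95% threshold and clipped to [0, 1]. The function names and the clipping are illustrative assumptions, not the simulator's API.

```python
SUCCESS_COVERAGE = 0.95  # fraction of the goal region the block must cover


def normalized_reward(coverage: float, threshold: float = SUCCESS_COVERAGE) -> float:
    """Map raw goal coverage in [0, 1] to a reward where 1.0 means success."""
    return min(max(coverage / threshold, 0.0), 1.0)


def is_success(coverage: float, threshold: float = SUCCESS_COVERAGE) -> bool:
    """An episode succeeds once coverage reaches the threshold."""
    return coverage >= threshold
```

Under this reading, an episode that stalls at about half the required coverage still reports a reward near 0.5 rather than a bare failure.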

Why this task: it looks like a toy and is not one. Pushing is contact-rich, the block rotates in ways that are hard to model, and there are usually several valid ways to solve a given starting position, which makes the demonstration data multimodal (the same situation can correctly lead to different actions). It also has an open dataset of 206 human demonstrations on the Hugging Face hub (lerobot/pusht), a standard simulator, and a published reference checkpoint, all small enough to train on the single consumer GPU I have. Open data, matched references, and reproducible scale beat an impressive-sounding task I could not evaluate properly.

Step 2: choosing the policy

The model is a diffusion policy. Instead of predicting one action, it learns to start from random noise and iteratively denoise it into a short sequence of future actions, conditioned on what the robot currently sees. Three terms worth defining:

Action chunking: the policy predicts a window of future actions (16 steps here), executes the first 8, then replans, which makes behavior smoother and tolerant of slow inference.
Observation history: the policy conditions on the last 2 camera frames and positions, not just the current one, so it can see motion.
Denoising: the same idea as image diffusion models, applied to action trajectories, and it represents multimodal behavior naturally because different noise samples can denoise into different valid solutions.

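Put together, one cycle of the control loop runs as follows: the policy reads the last 2 observations, denoises random noise into a 16-step action chunk, executes the first 8 of those actions, and then replans from the newest observations.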
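A rough sketch of that cycle in code, under stated assumptions: `reset`, `step`, and `predict_chunk` are hypothetical callables standing in for the simulator and the policy, and the history is padded with the first frame at the start of an episode. It illustrates the chunking logic only and is not the project's rollout code.

```python
from collections import deque

HORIZON = 16         # actions predicted per chunk
N_ACTION_STEPS = 8   # actions executed before replanning
N_OBS_STEPS = 2      # observations the policy conditions on


def run_episode(reset, step, predict_chunk, max_steps=300):
    """Roll out one episode with receding-horizon action chunking.

    reset() -> obs
    step(action) -> (obs, done)
    predict_chunk(list_of_obs) -> sequence of HORIZON actions
    Returns the number of environment steps taken.
    """
    obs = reset()
    # Pad the history with the first observation so it is always full.
    history = deque([obs] * N_OBS_STEPS, maxlen=N_OBS_STEPS)
    steps = 0
    while steps < max_steps:
        chunk = predict_chunk(list(history))
        assert len(chunk) == HORIZON
        # Execute only the first part of the chunk, then replan.
        for action in chunk[:N_ACTION_STEPS]:
            obs, done = step(action)
            history.append(obs)
            steps += 1
            if done or steps >= max_steps:
                return steps
    return steps
```

With these numbers the expensive denoising call happens once every 8 environment steps, which is where the tolerance to slow inference comes from.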

Why this architecture: the multimodality argument above is the textbook reason. My other reason is that diffusion and its successor, flow matching, are what the current generation of large vision-language-action models (GR00T, pi0, RDT2) use as their action heads. Training one from scratch at 262M parameters is the cheapest way to build real intuition for what those models do at 3B. I wrote a separate survey of the open VLA landscape that maps that connection model by model.

Step 3: the training loop

Training uses Hugging Face's LeRobot library for the dataset and policy classes, but the loop itself is about a hundred readable lines I wrote rather than the framework's trainer command.

Why hand-write the loop: visibility into the places where these pipelines actually fail. The loop is where the dataset schema gets wired to the policy's expected inputs, where observation windows and action horizons are expressed as timestamps, and where normalization happens, and those are exactly the spots that fail without raising an error. The next section covers two such failure modes I had to catch in this project, both of which a framework command would have hidden.
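To make "windows expressed as timestamps" concrete, here is a small sketch that turns step counts into time offsets relative to the current frame. The 10 Hz frame rate, the feature key names, and the choice to start the action window at the oldest observation step are assumptions modelled on the LeRobot PushT convention; check them against the dataset metadata before relying on them.

```python
def window_timestamps(n_obs_steps: int = 2, horizon: int = 16, fps: float = 10.0) -> dict:
    """Build per-feature time offsets (seconds) relative to the current frame."""
    # Observations: the current frame and the ones just before it.
    obs_offsets = range(1 - n_obs_steps, 1)
    # Actions: assumed to start at the oldest observation step and span `horizon` steps.
    action_offsets = range(1 - n_obs_steps, 1 - n_obs_steps + horizon)
    obs_times = [i / fps for i in obs_offsets]
    return {
        "observation.image": obs_times,        # assumed key name
        "observation.state": list(obs_times),  # assumed key name
        "action": [i / fps for i in action_offsets],
    }
```

A mapping of this shape is what gets handed to the dataset so that each sample arrives with 2 stacked observations and a 16-step action target. An off-by-one here still trains without complaint, which is the kind of silent failure the hand-written loop makes visible.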

Step 4: normalization, where things go wrong silently

Normalization means rescaling inputs and outputs to ranges neural networks train well on: pixel coordinates in [0, 512] become [-1, 1], images get standardized channel by channel. It sounds like bookkeeping, but it is where imitation pipelines fail silently, and both of the hard-won lessons in this project came from there.
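The two rescalings mentioned above can be sketched in a few lines of NumPy. The function names are illustrative, and the inverse mapping is included because predicted actions have to be mapped back to pixel coordinates before they are sent to the simulator.

```python
import numpy as np


def minmax_to_unit(x, lo, hi):
    """Rescale values from [lo, hi] to [-1, 1]."""
    return 2.0 * (np.asarray(x, dtype=np.float64) - lo) / (hi - lo) - 1.0


def unit_to_minmax(y, lo, hi):
    """Inverse of minmax_to_unit: map [-1, 1] back to [lo, hi]."""
    return (np.asarray(y, dtype=np.float64) + 1.0) / 2.0 * (hi - lo) + lo


def standardize_image(img, mean, std):
    """Standardize a (C, H, W) image channel by channel."""
    mean = np.asarray(mean, dtype=np.float32).reshape(-1, 1, 1)
    std = np.asarray(std, dtype=np.float32).reshape(-1, 1, 1)
    return (img - mean) / std
```

For example, a position of 256 on a 512-pixel board maps to 0.0, and the board edges map to -1 and 1.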

Lesson one: the loss can lie. Libraries move, and the LeRobot version I used had relocated normalization out of the policy into separate processor pipelines, leaving the older style of passing dataset statistics to the policy constructor silently ignored rather than rejected. A training run wired the old way learns on raw pixel coordinates, and the treacherous part is that the training loss still goes down, because the network can fit noise prediction at any consistent scale. Only rollout behavior exposes it. The takeaway shaped the design: every batch goes through the normalization pipeline explicitly, and that pipeline is saved next to the weights so the checkpoint is a complete, self-describing artifact.
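A cheap guard in the same spirit is to check value ranges on a batch before it reaches the policy. This is a hypothetical sketch for illustration, not a description of the repo's code, and the 1.5 limit is an arbitrary assumption that suits min-max normalized actions and states.

```python
import numpy as np


def assert_normalized(name, values, limit=1.5):
    """Raise if a tensor that should be in roughly [-1, 1] clearly is not."""
    peak = float(np.abs(np.asarray(values, dtype=np.float64)).max())
    if peak > limit:
        raise ValueError(
            f"{name}: max |value| = {peak:.1f} exceeds {limit}; "
            "the batch looks un-normalized"
        )
```

Raw pixel coordinates in the hundreds trip this immediately, whereas the training loss alone would never reveal the problem.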

Lesson two: statistics are part of the checkpoint. When I first evaluated the published reference checkpoint, it scored 4% success against its reported 65%, while still pushing the block to 0.66 average coverage, which is exactly what makes this failure mode dangerous: it looks like a mediocre policy, not a broken pipeline. The cause turned out to be provenance: that checkpoint normalizes images with ImageNet statistics (mean 0.485/0.456/0.406) rather than PushT dataset statistics (mean around 0.97, since the board is mostly white), so rebuilding normalization from the dataset shifted every pixel by several standard deviations. The evaluator now recovers the exact statistics embedded in the checkpoint file, and treats normalization provenance as part of a checkpoint's contract, because "degraded but functional" is what a statistics mismatch looks like from the outside.
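A back-of-the-envelope check of the size of that mismatch, as a sketch: it measures how far the PushT mean sits from each ImageNet mean, in units of the ImageNet standard deviation. The ImageNet standard deviations (0.229/0.224/0.225) are the commonly used values and are not quoted above, and 0.97 is applied to all three channels as an approximation.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)  # standard values, not quoted in the text
PUSHT_MEAN_APPROX = 0.97              # "around 0.97", applied to every channel


def mean_offset_in_std_units(other_mean, expected_mean, expected_std):
    """Per-channel distance between two means, in units of the expected std."""
    return [(other_mean - m) / s for m, s in zip(expected_mean, expected_std)]


offsets = mean_offset_in_std_units(PUSHT_MEAN_APPROX, IMAGENET_MEAN, IMAGENET_STD)
```

Each channel comes out a little over two ImageNet standard deviations, so a mostly white board lands in a very different input range depending on which statistics are applied.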

Step 5: evaluation you can trust

Evaluation rolls the policy out in the simulator for 50 episodes and reports the success rate with a Wilson 95% confidence interval, a way of computing the uncertainty range for a proportion that behaves sensibly with small samples. With 50 episodes, a success rate of 60% really means "probably between 46% and 72%", and pretending otherwise is how people fool themselves about a few points of difference.
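For reference, the Wilson score interval fits in a few lines of standard-library Python. This is a generic sketch of the formula with z fixed at the 95% normal quantile, not a copy of the evaluator's code.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964):
    """Wilson score interval for a binomial proportion (default: 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half
```

Calling `wilson_interval(34, 50)` gives roughly (0.54, 0.79). Unlike the naive normal approximation, the interval stays inside [0, 1] even at 0 or 50 successes.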

Some of the design decisions are worth spelling out. The from-scratch policy and the pretrained reference are evaluated under identical conditions, the same 50 random seeds, the same step limit, the same success criterion, because success rates do not transfer across protocols and only matched comparisons mean much. The evaluator also records every episode's maximum coverage, not just the binary outcome, since two policies with the same success rate can fail very differently (always-almost vs sometimes-collapsing), and saves rollout videos so behavior can be seen, not just scored.
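The shape of such an evaluator can be sketched as follows. `rollout` is a hypothetical callable that runs one seeded episode and yields per-step goal coverage; the record fields are illustrative, and success is taken to be "maximum coverage reached the threshold", which is an assumption about how the outcome is derived.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class EpisodeResult:
    seed: int
    success: bool
    max_coverage: float


def evaluate(rollout: Callable[[int], Iterable[float]],
             seeds: Iterable[int],
             threshold: float = 0.95) -> List[EpisodeResult]:
    """Run one episode per seed and keep the per-episode maximum coverage."""
    results = []
    for seed in seeds:
        max_cov = max(rollout(seed), default=0.0)
        results.append(EpisodeResult(seed, max_cov >= threshold, max_cov))
    return results


def summarize(results: List[EpisodeResult]) -> dict:
    """Aggregate success rate and mean max coverage over all episodes."""
    n = len(results)
    return {
        "episodes": n,
        "success_rate": sum(r.success for r in results) / n,
        "mean_max_coverage": sum(r.max_coverage for r in results) / n,
    }
```

Passing the same `seeds` list to `evaluate` for both checkpoints is what makes the comparison matched, and the per-episode records are what distinguish an always-almost policy from a sometimes-collapsing one.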

Results

Both policies were rolled out on the same 50 seeds with a 300-step limit. The pretrained reference, trained for 200,000 steps, succeeded in 34 of 50 episodes: a 68% success rate with a Wilson 95% interval of 54 to 79% and a mean max reward of 0.957. That reproduces its model card number (65.4% over 500 episodes), which validates the evaluation protocol before it is used to compare anything.

My from-scratch policy, trained for 25,000 steps, about two hours on the one GPU, succeeded in 5 of 50: 10% with an interval of 4 to 21% and a mean max reward of 0.445. The two runs share the architecture, the data, and the evaluation, and they differ by a factor of eight in optimization budget, which at first looked like the whole explanation. The per-episode distribution makes the gap legible: the undertrained policy shows the full spectrum from clean successes to complete misses, while the reference sits almost entirely at the top.

The natural follow-up was to grant my run the same budget, so I trained a second policy for the full 200,000 steps with the reference's warmup schedule, saving checkpoints along the way and evaluating each one on the same 50 seeds. Success climbed from 10% to 28% by 150,000 steps and then fell back to 14% at the end, even as the training loss kept improving the whole way down. Steps were clearly not the missing ingredient, and a diff of the two checkpoints' configurations pointed at what was: the reference trained with random 84 by 84 crop augmentation, while the library version I used had quietly changed that default to no cropping at all. With only 206 demonstrations, a vision encoder without augmentation is free to memorize the training frames, and a success curve that peaks mid-run and then regresses while the loss improves is what that looks like from the outside.
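For readers who have not used it, random-crop augmentation is only a few lines. This NumPy sketch assumes 96 by 96 input frames in (C, H, W) layout and pairs the random training crop with a deterministic center crop at evaluation time; both are assumptions about the setup, not a transcription of the library's implementation.

```python
import numpy as np


def random_crop(img: np.ndarray, size: int = 84, rng=None) -> np.ndarray:
    """Take a random size x size window from a (C, H, W) image (training)."""
    if rng is None:
        rng = np.random.default_rng()
    _, h, w = img.shape
    top = int(rng.integers(0, h - size + 1))
    left = int(rng.integers(0, w - size + 1))
    return img[:, top:top + size, left:left + size]


def center_crop(img: np.ndarray, size: int = 84) -> np.ndarray:
    """Take the central size x size window from a (C, H, W) image (evaluation)."""
    _, h, w = img.shape
    top, left = (h - size) // 2, (w - size) // 2
    return img[:, top:top + size, left:left + size]
```

Because every training step sees a slightly shifted view of the same frame, the encoder cannot key on exact pixel positions, which is the memorization route that opens up when the crop is disabled.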

So I ran the experiment that the diagnosis demanded: the identical 200,000-step recipe with the single missing flag restored. The effect was unambiguous. Every milestone gained thirty or more points (38% at 50,000 steps, 58% at 100,000, 68% at 150,000, 56% at the end), and from 100,000 steps onward the curve sits inside the reference's confidence band, with the final checkpoints at 56% versus 68%, a difference that 50 episodes cannot statistically separate. One library default was worth a factor of four in task success, and nothing on the training side ever hinted at it, which is as clean a case as I have seen for evaluating policies by rolling them out rather than by watching the loss. The repo README carries the full scaling curves for both runs, the comparison figures, rollout clips of the policies on the same seed, and the training logs.
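One way to put a number on "cannot statistically separate" is a pooled two-proportion z-test. That test is chosen here purely for illustration; the comparison above rests on the Wilson intervals.

```python
import math


def two_proportion_p_value(s1: int, n1: int, s2: int, n2: int) -> float:
    """Two-sided p-value of a pooled two-proportion z-test."""
    p1, p2 = s1 / n1, s2 / n2
    pooled = (s1 + s2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0.0:
        return 1.0  # identical all-success or all-failure samples
    z = (p1 - p2) / se
    # 2 * (1 - Phi(|z|)) expressed through the complementary error function.
    return math.erfc(abs(z) / math.sqrt(2))
```

For 28 of 50 against 34 of 50 the p-value is well above 0.05, while 7 of 50 (the 14% run without cropping) against 28 of 50 is far below it. That matches the reading above: the augmentation effect is real, and the remaining gap to the reference is within noise at this sample size.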

Limitations

This is a single task and a single embodiment in simulation, learned from 206 demonstrations. It says nothing about sim-to-real transfer, and a 50-episode evaluation bounds how confidently any two policies can be separated. What it does demonstrate, end to end, is dataset handling, a correct training loop, checkpoint hygiene, and an evaluation methodology that states its uncertainty.