Imitation learning vs RL for control

Notes · learning control policies · Oct 2023

There are two broad ways to learn a control policy: copy an expert, or learn from reward. They fail in different ways, and knowing how each one fails drives many of the design decisions in a robotics project.

Imitation learning

The simplest form, behavior cloning, treats control as supervised learning: given states and the expert's actions, fit a policy that maps one to the other. It is fast, stable, and needs no reward function. Its weakness is distribution shift: the moment the policy drifts off the expert's trajectory, it sees states it never trained on, makes a worse choice, drifts further, and errors compound.
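A minimal sketch of behavior cloning as plain regression may make this concrete. Everything here is an assumption chosen for illustration: the expert is a made-up linear controller, the policy class is linear, and the helper names fit_bc_policy and bc_policy are hypothetical.

```python
import numpy as np

def fit_bc_policy(states, actions):
    """Least-squares fit of a linear policy a = [s, 1] @ W to expert pairs."""
    X = np.hstack([states, np.ones((len(states), 1))])  # append a bias column
    W, *_ = np.linalg.lstsq(X, actions, rcond=None)
    return W

def bc_policy(W, state):
    """Apply the fitted linear policy to a single state vector."""
    return np.append(state, 1.0) @ W

# Hypothetical expert, used only to generate demonstration data: a = -K s.
rng = np.random.default_rng(0)
K = np.array([[2.0, 0.5]])
demo_states = rng.normal(size=(200, 2))
demo_actions = demo_states @ -K.T  # shape (200, 1)

W = fit_bc_policy(demo_states, demo_actions)
print(bc_policy(W, np.array([1.0, 0.0])))  # close to the expert's action, -2.0
```

Note that the fit is only checked on states drawn from the same distribution as the demonstrations, which is exactly the setting where behavior cloning looks best.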

DAgger addresses this by iterating: run the current policy, ask the expert what it should have done in the states the policy actually visited, add those labeled states to the dataset, and retrain. Many recent imitation systems use a diffusion policy, which models the distribution of good actions rather than a single best guess and so handles multi-modal behavior well.
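The DAgger loop can be sketched on a toy problem. This is an illustration under stated assumptions, not a reference implementation: the scalar dynamics x' = x + a, the saturating expert, the linear policy fitted with np.polyfit, and the function names are all hypothetical.

```python
import numpy as np

def expert_action(x):
    # Hypothetical expert: a saturating proportional controller toward x = 0.
    return -np.clip(0.5 * x, -1.0, 1.0)

def rollout(act_fn, x0, steps=20):
    # Assumed toy dynamics: x_{t+1} = x_t + a_t. Returns the visited states.
    xs, x = [], x0
    for _ in range(steps):
        xs.append(x)
        x = x + act_fn(x)
    return np.array(xs)

def dagger(iterations=5, rollouts=5, seed=0):
    rng = np.random.default_rng(seed)
    # Round 0 is plain behavior cloning on states the expert itself visits.
    xs = np.concatenate([rollout(expert_action, x0) for x0 in rng.uniform(-1, 1, rollouts)])
    acts = expert_action(xs)
    coeffs = np.polyfit(xs, acts, 1)  # linear policy: a = coeffs[0] * x + coeffs[1]
    for _ in range(iterations):
        policy = lambda x, c=coeffs: np.polyval(c, x)
        # Run the learner, so the new data covers states the learner actually reaches.
        visited = np.concatenate([rollout(policy, x0) for x0 in rng.uniform(-4, 4, rollouts)])
        # Ask the expert for labels on those states and aggregate with the old data.
        xs = np.concatenate([xs, visited])
        acts = np.concatenate([acts, expert_action(visited)])
        coeffs = np.polyfit(xs, acts, 1)  # retrain on the aggregated dataset
    return coeffs, len(xs)

coeffs, n_samples = dagger()
print(coeffs, n_samples)
```

The key difference from behavior cloning is which states get labeled: the learner chooses the states by acting, and the expert only supplies the actions.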

Reinforcement learning

RL needs no demonstrations, only a reward signal, and can discover behavior no expert demonstrated. The cost is poor sample efficiency: it can take millions of interactions to converge, which is why so much RL for robotics happens in simulation. The hard part is rarely the algorithm; it is reward design. A sloppy reward gets gamed in ways that are obvious only in hindsight.
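A small worked example of a reward being gamed, under assumptions made up for illustration: a 1-D corridor, a proximity reward of +1 per step spent within one cell of the goal, and an episode that ends on reaching the goal. The two hand-written policies, finish and hover, are hypothetical.

```python
def sloppy_reward(distance):
    # Assumed reward: +1 for every step spent within one cell of the goal.
    return 1.0 if distance <= 1 else 0.0

def discounted_return(policy, start=5, horizon=50, gamma=0.99):
    distance, total = start, 0.0
    for t in range(horizon):
        distance = max(0, distance - policy(distance))  # policy returns cells moved
        total += gamma**t * sloppy_reward(distance)
        if distance == 0:  # reaching the goal ends the episode
            break
    return total

def finish(distance):
    # Intended behavior: walk straight to the goal.
    return 1

def hover(distance):
    # Gaming behavior: stop one cell short and collect reward until the horizon.
    return 1 if distance > 1 else 0

print(discounted_return(finish), discounted_return(hover))
```

Under this reward the hovering policy earns a higher return than the one that completes the task, so an optimizer has every reason to prefer it.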

When to use which

  • Have good demonstrations and want results fast? Start with imitation learning.
  • No demos, or need to beat the expert? Reach for RL, usually in sim.
  • Best of both: pretrain with imitation, then fine-tune with RL, so you start from competent behavior instead of random flailing.
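The pretrain-then-fine-tune idea in the last bullet can be sketched in tabular form. This is a simplified stand-in, not how it is done with neural policies: the demonstrations only bias a Q table toward the expert's actions, and Q-learning updates then adjust it. The names q_from_demos and fine_tune and the bonus parameter are hypothetical.

```python
import numpy as np

def q_from_demos(n_states, n_actions, demos, bonus=1.0):
    """Tabular stand-in for imitation pretraining: prefer demonstrated actions."""
    Q = np.zeros((n_states, n_actions))
    for s, a in demos:
        Q[s, a] = bonus
    return Q

def fine_tune(Q, transitions, alpha=0.1, gamma=0.99):
    """Apply Q-learning updates to (s, a, r, s2, done) tuples, starting from Q."""
    Q = Q.copy()
    for s, a, r, s2, done in transitions:
        target = r if done else r + gamma * Q[s2].max()
        Q[s, a] += alpha * (target - Q[s, a])
    return Q

demos = [(0, 1), (1, 1)]                  # expert took action 1 in states 0 and 1
Q0 = q_from_demos(3, 2, demos)
print(Q0.argmax(axis=1))                  # greedy policy copies the expert where it has data
Q1 = fine_tune(Q0, [(1, 1, 5.0, 2, True)])
print(Q1[1, 1])                           # value moves from the bonus toward the observed reward
```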

Either way, the last mile is sim-to-real: a policy that works in simulation still has to survive real sensors, latency, and dynamics it never saw in training.

Step by step: tabular Q-learning

  1. Initialize a table Q[state, action] to zeros.
  2. Each step, pick an action epsilon-greedily (explore with probability epsilon, otherwise exploit the best known action).
  3. Take the action and observe the next state and reward.
  4. Move Q toward reward plus the discounted best next value (the TD update).
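  5. Repeat across episodes; provided every state-action pair keeps being visited and the learning rate is decayed appropriately, Q converges to the optimal action values.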
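Written out, the TD update in step 4 is the standard Q-learning rule, with learning rate \alpha and discount factor \gamma:

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]
```

As a quick check with illustrative numbers: if Q(s_t, a_t) = 0, r_t = 1, the best next value is 0.5, \alpha = 0.1, and \gamma = 0.99, the new value is 0.1 × (1 + 0.495 − 0) = 0.1495.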

Behavior cloning, the imitation counterpart, skips reward entirely and just fits a policy to expert state-action pairs with supervised learning.

import numpy as np

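def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, eps=0.2):
    # env is assumed to be a small tabular environment exposing n_states, n_actions,
    # reset() -> state, step(a) -> (next_state, reward, done), and sample_action().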
    Q = np.zeros((env.n_states, env.n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy: explore sometimes, else take the best known action
            a = env.sample_action() if np.random.rand() < eps else int(Q[s].argmax())
            s2, r, done = env.step(a)
            # temporal-difference update toward reward + discounted future value
            Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
            s = s2
    return Q
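To try the function, it needs an environment with the interface it assumes (n_states, n_actions, reset, step, sample_action). The ChainEnv class here is a hypothetical toy written to match that interface, not part of any library: a short 1-D chain with a reward of 1 for reaching the last state and a step limit so episodes always end.

```python
import numpy as np

class ChainEnv:
    """Hypothetical 1-D chain: start at state 0, reward 1 on reaching the last state."""

    def __init__(self, n_states=5, max_steps=200):
        self.n_states, self.n_actions = n_states, 2  # actions: 0 = left, 1 = right
        self.max_steps = max_steps

    def reset(self):
        self.s, self.t = 0, 0
        return self.s

    def sample_action(self):
        # Uniform random action, used for the exploration branch.
        return int(np.random.randint(self.n_actions))

    def step(self, a):
        self.t += 1
        if a == 1:
            self.s = min(self.s + 1, self.n_states - 1)
        else:
            self.s = max(self.s - 1, 0)
        reached = self.s == self.n_states - 1
        done = reached or self.t >= self.max_steps  # goal reached or step limit hit
        return self.s, float(reached), done
```

Calling q_learning(ChainEnv()) returns a table with one row per chain state; once the goal has been found a few times, taking argmax over each row gives the greedy policy, which on this chain would be expected to favor action 1 (right).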
