SLAM: mapping while localizing

Notes · simultaneous localization and mapping · Mar 2024

SLAM solves a chicken-and-egg problem: you need a map to know where you are, and you need to know where you are to build a map. SLAM estimates both at once, from the robot's own sensors, and it is what lets a robot operate somewhere it has never been.

Front-end: perception

The front-end turns raw sensor data into geometric constraints. For visual SLAM that means detecting and matching features across frames; for lidar it means aligning point clouds. The key, error-prone step is data association, deciding that a feature in this frame is the same landmark seen earlier.

Data association in code

Data association, deciding which observation corresponds to which earlier landmark, is the step that makes or breaks SLAM, because a single wrong match corrupts the whole optimization. In feature-based visual SLAM you match each frame's feature descriptors (vectors such as ORB or SIFT) to the previous frame's by nearest neighbor, then keep only confident matches with Lowe's ratio test: accept a match only when the nearest descriptor is much closer than the second nearest, which discards ambiguous matches in repetitive texture. A mutual check (i picks j and j picks i) and a geometric RANSAC verification on the surviving pairs remove the remaining outliers.

import numpy as np

def match_descriptors(desc1, desc2, ratio=0.75):
    """Nearest-neighbor matching of descriptor rows with Lowe's ratio test."""
    # pairwise squared L2 distances, shape (N1, N2)
    d = (np.sum(desc1**2, axis=1)[:, None]
         + np.sum(desc2**2, axis=1)[None, :]
         - 2 * desc1 @ desc2.T)
    matches = []
    for i in range(len(desc1)):
        order = np.argsort(d[i])
        best, second = order[0], order[1]
        # keep only if the nearest is much closer than the 2nd nearest
        if d[i, best] < (ratio ** 2) * d[i, second]:
            matches.append((i, int(best)))
    return matches

# two clear matches survive; an ambiguous query is rejected:
# match_descriptors(queries, prev_frame)  ->  [(0, 2), (1, 4)]

Back-end: optimization

The back-end fuses those constraints into a consistent estimate of the trajectory and map. Early systems used filters (an EKF over poses and landmarks); modern systems pose it as a graph and run nonlinear optimization, bundle adjustment in the visual case, which scales better and is more accurate.

Bundle adjustment in depth

Bundle adjustment is the back-end's heavy hitter: it jointly refines all the camera poses and all the 3D point positions at once so they best explain every observed feature. What it minimizes is the total reprojection error, the squared pixel distance between where each 3D point was actually observed and where the current estimate says it should project to:

$$E = \sum_{i}\sum_{j} v_{ij}\,\big\| x_{ij} - \pi(K,\,R_i,\,t_i,\,X_j) \big\|^2,$$

where $X_j$ is a 3D point, $(R_i, t_i)$ is camera $i$'s pose, $K$ its intrinsics, $\pi$ the projection of a point into pixels, and $v_{ij}$ is 1 if camera $i$ saw point $j$. This is a large nonlinear least-squares problem, solved iteratively with Levenberg-Marquardt (a blend of Gauss-Newton and gradient descent). It scales because of sparsity: each point is seen by only a few cameras, so the Jacobian is mostly zeros, and the Schur complement exploits that block structure to solve each step efficiently. The same computation is called structure-from-motion when done offline, and it is the accuracy backbone of modern visual SLAM.

Loop closure

Small errors accumulate, so the estimate drifts. Loop closure recognizes a previously visited place and adds a constraint that snaps the whole trajectory back into alignment. It is the difference between a map that slowly warps and one that stays globally consistent.

Visual SLAM

Systems like ORB-SLAM run this whole pipeline from a single camera in real time. It connects directly to the monocular depth and visual odometry work: estimating motion and structure from images is the perception layer SLAM is built on.

Step by step: ICP (scan matching)

  1. For each source point, find its nearest destination point.
  2. Compute the best rigid rotation and translation for those pairs (SVD / Kabsch).
  3. Apply the transform to the source points.
  4. Repeat until it converges.
  5. ICP is the scan-matching core of a SLAM front-end.
import numpy as np

def best_fit_transform(A, B):
    """Rigid R, t that best aligns paired points A onto B (Kabsch / SVD)."""
    ca, cb = A.mean(0), B.mean(0)
    Hm = (A - ca).T @ (B - cb)
    U, _, Vt = np.linalg.svd(Hm)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:          # fix reflection
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, cb - R @ ca

def icp(src, dst, iters=20):
    src = src.copy()
    for _ in range(iters):
        # 1. nearest-neighbor correspondence
        idx = np.array([np.argmin(((dst - p) ** 2).sum(1)) for p in src])
        # 2. best rigid transform for those pairs, then apply it
        R, t = best_fit_transform(src, dst[idx])
        src = src @ R.T + t
    return src

Aligning point sets: Kabsch and SVD

The best_fit_transform above is the Kabsch algorithm: given two sets of paired points, it finds the rigid rotation $R$ and translation $t$ that minimize the summed squared distance $\sum_k \| (R a_k + t) - b_k \|^2$. The recipe is short: center both sets on their centroids, form the cross-covariance $H = A^\top B$, and take its singular value decomposition.

The singular value decomposition (SVD) factors any matrix as $H = U \Sigma V^\top$, where $U$ and $V$ are orthogonal (pure rotations or reflections) and $\Sigma$ is diagonal with nonnegative singular values that measure how much $H$ stretches along each axis. Geometrically, then, every linear map is a rotation, an axis-aligned scaling, and another rotation. Kabsch keeps only the rotational parts: the optimal rotation is $R = V U^\top$, with a sign flip on the last column of $V$ when $\det(R) < 0$ to rule out a mirror reflection, and the translation is $t = \bar b - R\,\bar a$ from the two centroids. That is exactly the four lines in the code above.

References