How convolutional networks see images

Convolutional networks are the architecture that made deep learning work on images, and a CNN or one of its descendants still sits behind most production vision, including the classifier in my visual defect inspection project. Their core idea is to bake the structure of images into the network: nearby pixels are related, and a useful feature, an edge or a texture, is useful wherever it appears.

Those two priors, locality and translation equivariance, are exactly what a convolution encodes, which is why CNNs typically need far less data and compute than a fully-connected network to reach the same accuracy on images.

Convolution

A convolution slides a small learnable filter (kernel) across the image and, at each position, computes a dot product between the kernel and the patch beneath it: $\text{out}[i,j] = \sum_{u,v} \text{image}[i+u,\,j+v]\,\cdot\,\text{kernel}[u,v]$. (Strictly this is cross-correlation, since the kernel is not flipped, but deep learning calls it convolution and the distinction does not matter for learned weights.) The same kernel weights are reused at every location, which buys two things. Weight sharing cuts the parameter count from one weight per input-output pixel pair down to just the kernel size, and translation equivariance means shifting the input shifts the output the same way, so a feature is detected wherever it occurs rather than having to be relearned per location.
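Translation equivariance is easy to check numerically. The sketch below is a minimal illustration, not a production implementation: `correlate2d_valid` is a hypothetical helper name for the sliding dot product in the formula above (no padding, stride 1), and shifting is modelled by cropping the image so its content moves up and left.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def correlate2d_valid(image, kernel):
    """Sliding dot product of `kernel` over `image` (no padding, stride 1)."""
    windows = sliding_window_view(image, kernel.shape)   # (out_h, out_w, kh, kw)
    return np.einsum("ijuv,uv->ij", windows, kernel)

rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))
kernel = rng.standard_normal((3, 3))

dy, dx = 2, 1
shifted = image[dy:, dx:]          # same content, moved up by dy and left by dx

full = correlate2d_valid(image, kernel)
moved = correlate2d_valid(shifted, kernel)

# Shifting the input shifts the output by the same amount.
equivariant = np.allclose(moved, full[dy:, dx:])
print(equivariant)   # True
```

The same nine kernel weights produce every output value, which is the weight sharing described above.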

Channels, stride, and padding

Real layers stack many kernels to produce many output channels, each a different learned feature map, and they read across the input's channels too, so a layer's weight tensor has shape (out channels, in channels, kernel height, kernel width). Stride is how far the kernel hops between positions, and a stride above one downsamples; padding adds a border so the output can keep the input size. For input side length $H$, kernel size $K$, padding $P$ per side, and stride $S$, the output side length is $\left\lfloor (H - K + 2P)/S \right\rfloor + 1$, a formula you end up deriving on a whiteboard surprisingly often.
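The output-size formula is short enough to keep as a helper. This is a small sketch; `conv_output_size` is a name chosen here for illustration, and the three configurations are arbitrary examples rather than settings taken from any particular network.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """floor((H - K + 2P) / S) + 1 for one spatial dimension."""
    return (size - kernel + 2 * padding) // stride + 1

print(conv_output_size(32, kernel=3, stride=1, padding=1))    # 32  ("same" padding)
print(conv_output_size(224, kernel=7, stride=2, padding=3))   # 112 (stride 2 halves the side)
print(conv_output_size(5, kernel=3, stride=2, padding=0))     # 2
```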

Pooling and the receptive field

Pooling, taking the max or average over a small window, and strided convolutions both downsample the feature maps. The reason to do this is twofold: it cuts compute as the network deepens, and it grows each later unit's receptive field, the region of the original image it can see, so deep layers integrate global context out of purely local operations. Early layers fire on edges and textures; deeper layers compose those into parts and whole objects.
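How quickly the receptive field grows can be tracked with a standard recurrence: each layer adds (kernel - 1) times the current spacing between adjacent units, measured in input pixels, and each stride multiplies that spacing. The sketch below assumes a simple chain of layers described only by (kernel size, stride) pairs; `receptive_field` is a hypothetical helper name.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of one unit after a chain of layers.

    `layers` is a list of (kernel_size, stride) pairs, ordered input to output.
    """
    rf, jump = 1, 1                  # jump = input-pixel distance between adjacent units
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

conv, pool = (3, 1), (2, 2)
print(receptive_field([conv, conv, conv]))               # 7
print(receptive_field([conv, pool, conv, pool, conv]))   # 18
```

The same three convolutions see more than twice as far once two downsampling steps sit between them.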

Why stack small kernels

VGG (Simonyan & Zisserman 2014) made a now-standard observation: two stacked 3x3 convolutions have the same receptive field as one 5x5 but use fewer parameters and add an extra nonlinearity, so depth built from small kernels is both cheaper and more expressive. That is why 3x3 became the default kernel size.
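The parameter saving is simple arithmetic. As a quick check, assuming the same channel count C in and out of every layer and ignoring biases (C = 64 is an arbitrary choice for illustration):

```python
def conv_params(c_in, c_out, kernel):
    """Weights in one convolutional layer, biases ignored."""
    return c_in * c_out * kernel * kernel

C = 64
two_3x3 = 2 * conv_params(C, C, 3)   # 2 * 9 * C^2  = 18 C^2
one_5x5 = conv_params(C, C, 5)       # 25 * C^2

print(two_3x3, one_5x5)              # 73728 102400
print(two_3x3 / one_5x5)             # 0.72
```

The ratio 18/25 holds for any C, so the stacked version always uses 28% fewer weights for the same 5x5 receptive field.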

Batch normalization

Batch normalization (Ioffe & Szegedy 2015) normalizes each channel's activations to zero mean and unit variance over the batch, then rescales with a learned scale and shift. The original explanation for why it helps is that it keeps the distribution of inputs to each layer stable as training shifts the weights; later analyses suggest that a smoother optimization landscape matters at least as much. Either way, in practice it permits much higher learning rates and faster, more robust convergence. An easily-forgotten detail: at training time it uses batch statistics, but at inference it must use the running averages accumulated during training, so forgetting to switch to eval mode quietly corrupts single-sample predictions.
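The train/eval difference is easiest to see in a toy version. The class below is a simplified sketch for inputs of shape (batch, features), not any framework's implementation: `BatchNormSketch`, its `momentum` default, and the `training` flag are assumptions made for illustration, and the scale and shift are left at their initial values instead of being learned.

```python
import numpy as np

class BatchNormSketch:
    """Minimal batch norm over the batch axis for inputs of shape (batch, features)."""

    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.gamma = np.ones(num_features)     # scale (would be learned)
        self.beta = np.zeros(num_features)     # shift (would be learned)
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum, self.eps = momentum, eps
        self.training = True

    def __call__(self, x):
        if self.training:
            # Training: normalize with this batch's statistics and update the running averages.
            mean, var = x.mean(axis=0), x.var(axis=0)
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            # Inference: normalize with the accumulated running averages.
            mean, var = self.running_mean, self.running_var
        return self.gamma * (x - mean) / np.sqrt(var + self.eps) + self.beta

rng = np.random.default_rng(0)
bn = BatchNormSketch(num_features=4)
for _ in range(200):                                   # simulated training batches
    bn(rng.normal(loc=5.0, scale=2.0, size=(32, 4)))

sample = np.array([[5.0, 7.0, 3.0, 9.0]])

bn.training = False
eval_out = bn(sample)     # roughly (sample - 5) / 2

bn.training = True
train_out = bn(sample)    # batch of one: mean == sample and var == 0, so every output is beta

print(eval_out)
print(train_out)          # all zeros: the input has been erased
```

In training mode a batch of one is normalized against itself, so the output carries no information about the input, which is the silent corruption described above.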

Going deep: residual connections

Naively stacking more layers eventually makes accuracy worse, not from overfitting but because very deep plain networks are hard to optimize. ResNet (He et al. 2015) fixed this with residual connections, $y = x + F(x)$, so a block only has to learn a correction to the identity and gradients flow straight through the skip. It is the same gradient-flow argument as in transformers, and it is what made hundreds of layers trainable, unlocking the depth that drove the ImageNet era.
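A toy forward pass shows why the identity path matters. This is only an illustration under assumed conditions: small random weights, a two-layer F with a ReLU, and 50 blocks, all chosen here to make the effect obvious. The identity term that preserves the signal in the forward direction is the same term that appears in each block's Jacobian, $I + \partial F / \partial x$, on the way back.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def plain_block(x, w1, w2):
    """F(x): two linear maps with a ReLU in between."""
    return w2 @ relu(w1 @ x)

def residual_block(x, w1, w2):
    """y = x + F(x)."""
    return x + plain_block(x, w1, w2)

rng = np.random.default_rng(0)
dim, depth = 16, 50
x = rng.standard_normal(dim)

h_plain, h_res = x.copy(), x.copy()
for _ in range(depth):
    w1 = 0.01 * rng.standard_normal((dim, dim))   # deliberately small weights
    w2 = 0.01 * rng.standard_normal((dim, dim))
    h_plain = plain_block(h_plain, w1, w2)
    h_res = residual_block(h_res, w1, w2)

plain_ratio = np.linalg.norm(h_plain) / np.linalg.norm(x)
res_ratio = np.linalg.norm(h_res) / np.linalg.norm(x)
print(plain_ratio)   # vanishingly small: the signal has collapsed
print(res_ratio)     # close to 1: the skip path carries x through

# With F's weights at zero, a residual block is exactly the identity.
zeros = np.zeros((dim, dim))
identity_at_zero = np.array_equal(residual_block(x, zeros, zeros), x)
```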

Efficiency: separable convolutions and scaling

For deployment, depthwise-separable convolutions (MobileNet, Howard et al. 2017) factor a standard convolution into a per-channel spatial filter followed by a 1x1 cross-channel mix. The cost drops to a fraction $1/C_\text{out} + 1/K^2$ of the standard convolution, a saving that approaches a factor of $K^2$ when there are many output channels (roughly 8 to 9 times for 3x3 kernels), at little accuracy loss. EfficientNet (Tan & Le 2019; EfficientNetV2, 2021) then showed that depth, width, and input resolution should be scaled together by a single compound coefficient rather than one at a time.
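Counting multiply-accumulates makes the saving concrete. The layer sizes below (a 56x56 feature map with 128 channels in and out) are assumed for illustration and are not taken from a specific model.

```python
def standard_conv_macs(h, w, c_in, c_out, k):
    """Multiply-accumulates for a standard k x k convolution on an h x w output."""
    return h * w * c_in * c_out * k * k

def separable_conv_macs(h, w, c_in, c_out, k):
    """Depthwise k x k filter per input channel, then a 1x1 pointwise mix."""
    depthwise = h * w * c_in * k * k
    pointwise = h * w * c_in * c_out
    return depthwise + pointwise

h = w = 56
c_in = c_out = 128
k = 3

std = standard_conv_macs(h, w, c_in, c_out, k)
sep = separable_conv_macs(h, w, c_in, c_out, k)
speedup = std / sep
print(round(speedup, 2))   # 8.41, just under K^2 = 9
```

The speedup equals $1 / (1/C_\text{out} + 1/K^2)$, so it only approaches $K^2$ as the number of output channels grows.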

CNNs, transformers, and ConvNeXt

Vision transformers (Dosovitskiy et al. 2020) split an image into patches and apply attention, and match or beat CNNs given enough data, because they trade the convolution's built-in locality prior for flexibility. ConvNeXt (Liu et al. 2022) replied by modernizing a pure CNN with the transformers' training recipe and design choices, matching them while keeping convolution's efficiency. The practical upshot is that the gap is mostly about scale and training, so the locality prior still earns its keep when data is limited, which is the regime most applied projects live in.
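The "split an image into patches" step is just a reshape. The sketch below covers only that step, assuming an (H, W, C) array whose sides are divisible by the patch size; `patchify` is a hypothetical helper name, and the learned linear projection, position embeddings, and attention layers that follow in a real vision transformer are omitted.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into non-overlapping patch x patch tiles, each flattened."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "sketch assumes the patch size divides H and W"
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)            # (rows, cols, patch, patch, C)
    return grid.reshape(-1, patch * patch * c)      # (num_patches, patch * patch * C)

image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
tokens = patchify(image, 16)
print(tokens.shape)   # (196, 768): a 14 x 14 grid of patches, 16 * 16 * 3 values each
```

Each row is one token. Nothing in this step tells the model that neighbouring patches are related, which is the locality prior a convolution builds in and a transformer has to learn from data.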

Step by step: 2D convolution

Optionally pad the image.
Compute the output size from the kernel size and stride.
Slide the kernel over every position.
At each position, multiply element-wise and sum.
Real layers add channels, many filters, and a bias; the core is this slide, multiply, sum.

import numpy as np

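def conv2d(image, kernel, stride=1, padding=0):
    """Single-channel 2D convolution (cross-correlation: the kernel is not flipped)."""
    if padding:
        image = np.pad(image, padding)            # zero border of `padding` pixels on every side
    kh, kw = kernel.shape
    H, W = image.shape                            # size after padding
    out_h = (H - kh) // stride + 1                # floor((H - K) / S) + 1, with H already padded
    out_w = (W - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # patch under the kernel at output position (i, j)
            region = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(region * kernel)   # element-wise multiply, sum
    return out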
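The last step mentions channels, many filters, and a bias. One way to extend the same slide, multiply, sum to a full layer is sketched below. It assumes a (channels, height, width) input and the (out channels, in channels, kernel height, kernel width) weight layout described earlier; `conv2d_multichannel` is a name invented here, and vectorizing with `sliding_window_view` and `einsum` is one choice among several.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_multichannel(x, weight, bias=None, stride=1, padding=0):
    """x: (c_in, H, W); weight: (c_out, c_in, kh, kw); returns (c_out, out_h, out_w)."""
    c_out, c_in, kh, kw = weight.shape
    if padding:
        # Pad only the two spatial axes, not the channel axis.
        x = np.pad(x, ((0, 0), (padding, padding), (padding, padding)))
    # Every kh x kw patch of every input channel: (c_in, H-kh+1, W-kw+1, kh, kw)
    windows = sliding_window_view(x, (kh, kw), axis=(1, 2))
    windows = windows[:, ::stride, ::stride]        # keep every stride-th position
    # Sum over input channels and kernel offsets, once per output filter.
    out = np.einsum("cijuv,ocuv->oij", windows, weight)
    if bias is not None:
        out = out + bias[:, None, None]
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))
weight = rng.standard_normal((4, 3, 3, 3))
bias = rng.standard_normal(4)

out = conv2d_multichannel(x, weight, bias, stride=2, padding=1)
print(out.shape)   # (4, 4, 4)
```

Each output channel is the sum of per-input-channel sliding dot products plus one bias, so the core operation is unchanged.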

Complexity (time and space)

A convolutional layer costs $O(H \cdot W \cdot C_\text{in} \cdot C_\text{out} \cdot K^2)$ time, where $H \times W$ is the output feature-map size, with $C_\text{in} C_\text{out} K^2$ parameters (plus $C_\text{out}$ biases). Crucially, the parameter count is independent of image size thanks to weight sharing, which is the central advantage over a dense layer. Activation memory still scales with $H \cdot W \cdot C_\text{out}$. Depthwise-separable convolutions cut the compute by a factor approaching $K^2$ when $C_\text{out}$ is large. Pooling and stride shrink $H$ and $W$ for later layers, keeping deep networks affordable.
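To put numbers on the comparison with a dense layer, the sketch below counts weights for a 3x3 convolution from 3 to 16 channels against a fully-connected layer mapping the same input volume to the same output volume. The sizes are arbitrary examples and biases are ignored.

```python
def conv_weight_count(c_in, c_out, k):
    """Weights in a k x k convolution: independent of image size."""
    return c_in * c_out * k * k

def dense_weight_count(h, w, c_in, c_out):
    """Weights in a dense layer from an (h, w, c_in) volume to an (h, w, c_out) volume."""
    return (h * w * c_in) * (h * w * c_out)

for side in (32, 64):
    print(side, conv_weight_count(3, 16, 3), dense_weight_count(side, side, 3, 16))
# 32 432 50331648
# 64 432 805306368
```

Doubling the image side multiplies the dense layer's weights by 16 and leaves the convolution's unchanged.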

Worked example

A two-wide kernel that subtracts adjacent columns acts as a vertical-edge detector. On an image with a sharp edge between columns 1 and 2, the response spikes exactly at the boundary:

import numpy as np

img = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1],
                [0, 0, 1, 1],
                [0, 0, 1, 1]], float)        # vertical edge between columns 1 and 2

print(conv2d(img, np.array([[-1.0, 1.0]])))  # difference of adjacent columns
# [[0. 1. 0.]
#  [0. 1. 0.]
#  [0. 1. 0.]
#  [0. 1. 0.]]    the column of 1s marks exactly where the edge is

Follow-up questions

Why do CNNs need less data than dense nets on images? Weight sharing and locality encode the priors that nearby pixels relate and features are position-independent, so the network does not have to learn those from data.
Why prefer two 3x3 convolutions over one 5x5? Same receptive field, fewer parameters, and an extra nonlinearity, so more expressive at lower cost.
What problem do residual connections solve? Very deep plain networks are hard to optimize and degrade; the skip lets a block learn a correction to the identity and lets gradients flow, enabling great depth.
Why switch BatchNorm to eval mode at inference? Training uses per-batch statistics; inference must use the running averages, or single-sample predictions are corrupted.
How do depthwise-separable convolutions save compute? They split a convolution into a per-channel spatial filter and a 1x1 channel mix, cutting cost by a factor that approaches the square of the kernel size when there are many output channels.
When do CNNs still beat transformers? With limited data, where the convolution's locality prior is an advantage the data cannot supply.

References

LeCun et al., Gradient-Based Learning Applied to Document Recognition (LeNet, 1998).
Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks (AlexNet, 2012).
Simonyan & Zisserman, Very Deep Convolutional Networks (VGG, 2014).
Ioffe & Szegedy, Batch Normalization (2015).
He et al., Deep Residual Learning (ResNet, 2015).
Howard et al., MobileNets (2017).
Tan & Le, EfficientNet (2019).
Dosovitskiy et al., An Image is Worth 16x16 Words (ViT, 2020).
Tan & Le, EfficientNetV2 (2021).
Liu et al., A ConvNet for the 2020s (ConvNeXt, 2022).