Interpretable Control Systems
This work was done under the guidance and direction of Alice Rigg. Alice is a passionate researcher with a strong intuition for interpretability. I am fortunate to have had the opportunity to learn from her.
related works and background
This post builds on prior work by Rigg et al. (2024), which demonstrated that replacing standard MLP activations with bilinear layers makes a model’s computations directly expressible in terms of its weights.
bilinear layers
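With bilinear layers, every logit expands into a clean quadratic form that we can interpret directly. The hidden step is:
$$h = (W x + b) \odot (V x + c)$$
where $\odot$ is elementwise multiplication. Instead of applying an elementwise activation function, a bilinear layer multiplies two different linear projections of the state $x$. Each coordinate of $h$ has the form
$$h_j = (w_j^\top x + b_j)(v_j^\top x + c_j)$$
where $w_j^\top$ and $v_j^\top$ are the $j$-th rows of $W$ and $V$.
This already encodes a pairwise interaction between weighted sums of the input features.
When these hidden features are combined in the output head, each logit is a linear combination of the hidden features $h_j$:
$$y = P h + d$$
Expanding one logit $y_a$ explicitly:
$$y_a = \sum_j P_{aj}\,(w_j^\top x + b_j)(v_j^\top x + c_j) + d_a = x^\top \Big(\sum_j P_{aj}\, w_j v_j^\top\Big) x + \Big(\sum_j P_{aj}\,(c_j w_j + b_j v_j)\Big)^\top x + \Big(\sum_j P_{aj}\, b_j c_j + d_a\Big)$$
The expansion isolates quadratic interactions, linear terms, and constants. It can be written compactly as
$$y_a = x^\top Q_a x + \ell_a^\top x + \kappa_a$$
where $Q_a = \sum_j P_{aj}\, w_j v_j^\top$ is the quadratic interaction matrix, $\ell_a = \sum_j P_{aj}\,(c_j w_j + b_j v_j)$ is the linear term, and $\kappa_a = \sum_j P_{aj}\, b_j c_j + d_a$ is the constant.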
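The expansion can be checked numerically. The following is a minimal sketch, assuming the layer has the form h = (Wx + b) ⊙ (Vx + c) followed by a linear head y = Ph + d. The weight names, the random weights, and the sizes are all illustrative assumptions, not the trained policy from this post.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 6, 3  # hypothetical sizes, chosen only for the demo

# Random stand-ins for the two input projections, their biases, and the output head.
W = rng.normal(size=(n_hidden, n_in))
V = rng.normal(size=(n_hidden, n_in))
b = rng.normal(size=n_hidden)
c = rng.normal(size=n_hidden)
P = rng.normal(size=(n_out, n_hidden))
d = rng.normal(size=n_out)

def forward(x):
    """Bilinear layer followed by a linear output head."""
    h = (W @ x + b) * (V @ x + c)  # elementwise product of two projections
    return P @ h + d

def quadratic_form(a):
    """Return (Q, lin, const) so that logit a equals x @ Q @ x + lin @ x + const."""
    Q = np.einsum("j,ji,jk->ik", P[a], W, V)  # sum_j P[a, j] * outer(W[j], V[j])
    lin = (P[a] * c) @ W + (P[a] * b) @ V     # sum_j P[a, j] * (c[j] W[j] + b[j] V[j])
    const = P[a] @ (b * c) + d[a]             # sum_j P[a, j] * b[j] * c[j] + d[a]
    return Q, lin, const

x = rng.normal(size=n_in)
direct = forward(x)
expanded = np.array([
    x @ Q @ x + lin @ x + const
    for Q, lin, const in (quadratic_form(a) for a in range(n_out))
])
print(np.allclose(direct, expanded))
```

The point of the check is that the quadratic, linear, and constant pieces are computed from the weights alone, with no forward passes needed to obtain them.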
interaction matrices
The interaction matrix $Q_a$ tells us how pairs of features work together to influence logit $a$. Only the symmetric portion of $Q_a$ contributes to the quadratic form*, so we analyze
$$Q_a^{\mathrm{sym}} = \tfrac{1}{2}\left(Q_a + Q_a^\top\right)$$
With this, we can directly see which pairs of state features the policy uses.
*See section 3 of the original paper for a detailed explanation.
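A short numerical illustration of why only the symmetric part matters: any square matrix splits into a symmetric and an antisymmetric part, and the antisymmetric part contributes nothing to a quadratic form. This is a sketch in which a random matrix `Q` stands in for one logit's interaction matrix; it is not taken from a trained model.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5  # hypothetical number of state features

Q = rng.normal(size=(n, n))   # stand-in for one logit's interaction matrix
Q_sym = 0.5 * (Q + Q.T)       # symmetric part
Q_anti = 0.5 * (Q - Q.T)      # antisymmetric part, so Q = Q_sym + Q_anti

x = rng.normal(size=n)
full = x @ Q @ x
sym_only = x @ Q_sym @ x
anti_only = x @ Q_anti @ x    # vanishes up to floating-point error

# Strongest pairwise interaction, read directly from the symmetric matrix.
i, j = np.unravel_index(np.argmax(np.abs(Q_sym)), Q_sym.shape)
print(full, sym_only, anti_only, (i, j))
```

When reading entries, note that for i ≠ j the total coefficient on the product of features i and j is `2 * Q_sym[i, j]`, since the symmetric matrix stores half of it in each of the two mirrored positions.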
modes and decomposition
Because $Q_a^{\mathrm{sym}}$ is real and symmetric, the spectral theorem guarantees that it can be decomposed into eigenvalues and eigenvectors:
$$Q_a^{\mathrm{sym}} = \sum_i \lambda_i\, u_i u_i^\top$$
where each $u_i$ is an orthonormal eigenvector (a mode) and $\lambda_i$ is its corresponding eigenvalue (the mode's strength). This makes the quadratic form easy to interpret:
$$x^\top Q_a^{\mathrm{sym}} x = \sum_i \lambda_i\, (u_i^\top x)^2$$
Each term is the squared projection of $x$ onto one mode, weighted by $\lambda_i$. A mode is a pattern of input features.
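The decomposition can be sketched with `np.linalg.eigh`, which is intended for symmetric matrices and returns real eigenvalues with orthonormal eigenvectors as columns. The matrix below is a random symmetric stand-in, used only to show that the mode-wise sum reproduces the quadratic form.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5  # hypothetical number of state features

A = rng.normal(size=(n, n))
Q_sym = 0.5 * (A + A.T)  # random symmetric stand-in for an interaction matrix

# eigh returns eigenvalues in ascending order; column i of eigvecs is mode i.
eigvals, eigvecs = np.linalg.eigh(Q_sym)

# Rebuild the matrix as a sum of eigenvalue-weighted outer products.
reconstructed = (eigvecs * eigvals) @ eigvecs.T

x = rng.normal(size=n)
projections = eigvecs.T @ x             # projection of x onto each mode
mode_terms = eigvals * projections**2   # per-mode contribution to the quadratic form

print(np.isclose(mode_terms.sum(), x @ Q_sym @ x))
```

Inspecting `mode_terms` for a particular state shows which modes account for most of the quadratic part of the logit on that input.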
The eigenvalue $\lambda_i$ tells us how strongly that pattern influences the logit:
- If $\lambda_i > 0$, the mode amplifies the score when present.
- If $\lambda_i < 0$, the mode suppresses it.
In practice, only a few modes tend to have large-magnitude eigenvalues, so the policy can often be summarized by a handful of dominant patterns.
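One way to make "a handful of dominant patterns" concrete is to sort modes by the magnitude of their eigenvalues and measure how well the top few reconstruct the matrix. The sketch below uses a synthetic matrix with two planted strong modes plus small noise. That construction is an assumption made for illustration and says nothing about how the spectrum of a real trained policy looks.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 8  # hypothetical number of state features

# Synthetic symmetric matrix: two strong planted modes plus small symmetric noise.
basis, _ = np.linalg.qr(rng.normal(size=(n, n)))  # random orthonormal basis
planted = np.array([3.0, -2.0] + [0.0] * (n - 2))
noise = rng.normal(scale=0.05, size=(n, n))
Q_sym = (basis * planted) @ basis.T + 0.5 * (noise + noise.T)

eigvals, eigvecs = np.linalg.eigh(Q_sym)
order = np.argsort(-np.abs(eigvals))              # strongest modes first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

def top_k_approx(k):
    """Rebuild the matrix from its k largest-magnitude modes."""
    return (eigvecs[:, :k] * eigvals[:k]) @ eigvecs[:, :k].T

# Relative Frobenius error when keeping k = 0, 1, ..., n modes.
rel_error = [
    np.linalg.norm(Q_sym - top_k_approx(k)) / np.linalg.norm(Q_sym)
    for k in range(n + 1)
]

for k, err in enumerate(rel_error):
    print(k, round(float(err), 4))
```

Here the error drops sharply after the first two modes, one amplifying (positive eigenvalue) and one suppressing (negative eigenvalue). Sorting by absolute value rather than signed value matters, because strongly suppressing modes are as informative as strongly amplifying ones.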