Interpretable Control Systems

9/16/2025
By Arya Prakash, Alice Rigg

This work was done under the guidance and direction of Alice Rigg. Alice is a passionate researcher with a strong intuition for interpretability. I am fortunate to have had the opportunity to learn from her.

This post builds on prior work by Rigg et al. (2024), which showed that replacing standard MLP activations with bilinear layers makes a model’s computations expressible directly in terms of its weights.

bilinear layers

With bilinear layers, every logit expands into a clean quadratic form that we can interpret directly. The hidden step is:

$$h = (W x + b_W) \odot (V x + b_V)$$

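where $\odot$ is elementwise multiplication. Instead of applying an elementwise activation function, a bilinear layer multiplies two different linear projections of the state (each with its own bias). Writing $w_k^\top$ and $v_k^\top$ for the $k$-th rows of $W$ and $V$, each coordinate of $h$ has the form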

$$h_k = (w_k^\top x + b_{W,k}) \cdot (v_k^\top x + b_{V,k}).$$

This already encodes a pairwise interaction between weighted sums of the input features.
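As a minimal sketch of the hidden step above, the following NumPy snippet computes $h$ for random weights and checks one coordinate against the per-coordinate formula. The sizes, the random parameters, and the function name `bilinear_hidden` are assumptions made for illustration, not values or code from the post.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden = 4, 8  # illustrative sizes (assumption)

# Hypothetical, randomly initialised parameters of one bilinear layer.
W = rng.normal(size=(d_hidden, d_in))
V = rng.normal(size=(d_hidden, d_in))
b_W = rng.normal(size=d_hidden)
b_V = rng.normal(size=d_hidden)

def bilinear_hidden(x):
    """h = (W x + b_W) * (V x + b_V), with * the elementwise product."""
    return (W @ x + b_W) * (V @ x + b_V)

x = rng.normal(size=d_in)
h = bilinear_hidden(x)

# Coordinate k is the product of two affine functions of x,
# built from the k-th rows of W and V.
k = 3
h_k = (W[k] @ x + b_W[k]) * (V[k] @ x + b_V[k])
```

Note that no activation function appears anywhere: the only source of nonlinearity is the product of the two projections.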

These hidden features are then combined linearly in the output head:

$$z = U h + b$$

Output head: linear combination of hidden features $h$.

Expanding one logit $z_a(x)$ explicitly:

$$
\begin{aligned}
z_a(x) &= \sum_k U_{a,k}\,(w_k^\top x+b_{W,k})(v_k^\top x+b_{V,k}) + b_a \\
&= x^\top \Big(\sum_k U_{a,k}\, w_k v_k^\top\Big)x \\
&\quad+ \Big(\sum_k U_{a,k}[b_{W,k} v_k + b_{V,k} w_k]\Big)^\top x \\
&\quad+ \sum_k U_{a,k}\, b_{W,k} b_{V,k} + b_a
\end{aligned}
$$

Expansion of logit $z_a$: isolates quadratic interactions, linear terms, and constants.

This can be simplified neatly as

$$z_a(x) = x^\top B_a x + \beta_a^\top x + c_a,$$

Quadratic form: $B_a$ captures pairwise interactions, $\beta_a$ the linear term, and $c_a$ the constant.

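where $B_a = \sum_k U_{a,k}\, w_k v_k^\top$ is the quadratic interaction matrix, $\beta_a = \sum_k U_{a,k}\,(b_{W,k} v_k + b_{V,k} w_k)$ is the linear term, and $c_a = \sum_k U_{a,k}\, b_{W,k} b_{V,k} + b_a$ is the constant.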
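The expansion can be checked numerically. The sketch below builds the interaction matrix, linear term, and constant for every logit directly from the weights and confirms that the quadratic form reproduces the ordinary forward pass. All sizes, random parameters, and function names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, n_actions = 4, 8, 3  # illustrative sizes (assumption)

# Hypothetical random parameters: bilinear layer (W, V) and output head (U, b).
W = rng.normal(size=(d_hidden, d_in))
V = rng.normal(size=(d_hidden, d_in))
b_W = rng.normal(size=d_hidden)
b_V = rng.normal(size=d_hidden)
U = rng.normal(size=(n_actions, d_hidden))
b = rng.normal(size=n_actions)

def logits_forward(x):
    """Ordinary forward pass: z = U h + b with h = (Wx + b_W) * (Vx + b_V)."""
    h = (W @ x + b_W) * (V @ x + b_V)
    return U @ h + b

# B[a] = sum_k U[a, k] * outer(w_k, v_k), shape (n_actions, d_in, d_in)
B = np.einsum("ak,ki,kj->aij", U, W, V)
# beta[a] = sum_k U[a, k] * (b_W[k] * v_k + b_V[k] * w_k), shape (n_actions, d_in)
beta = U @ (b_W[:, None] * V + b_V[:, None] * W)
# c[a] = sum_k U[a, k] * b_W[k] * b_V[k] + b[a], shape (n_actions,)
c = U @ (b_W * b_V) + b

def logits_quadratic(x):
    """Same logits, computed as x^T B_a x + beta_a^T x + c_a for each a."""
    return np.einsum("i,aij,j->a", x, B, x) + beta @ x + c

x = rng.normal(size=d_in)
z_forward = logits_forward(x)
z_quadratic = logits_quadratic(x)
```

Because `B`, `beta`, and `c` depend only on the weights, they can be computed once and inspected without running the model on any data.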

interaction matrices

The interaction matrix $B_a$ tells us how pairs of features $(x_i, x_j)$ work together to influence logit $a$. Only the symmetric portion of $B_a$ contributes to the quadratic form*, so we analyze

$$S_a = \tfrac{1}{2}(B_a + B_a^\top)$$

Symmetric interaction matrix $S_a$ is the portion that affects the quadratic form.

With this, we can directly see which pairs of state features the policy uses.

*See section 3 of the original paper for a detailed explanation.
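A short numerical sketch of the symmetrization step: the antisymmetric remainder $A_a = \tfrac{1}{2}(B_a - B_a^\top)$ satisfies $x^\top A_a x = -x^\top A_a x$, so it contributes nothing, and the symmetric part alone reproduces the quadratic form. The matrix below is a random stand-in for one logit's interaction matrix, and the pair ranking at the end is just one possible way to read off strong interactions, not a procedure taken from the post.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # illustrative state dimension (assumption)

# Random stand-in for one logit's interaction matrix B_a.
B_a = rng.normal(size=(d, d))

S_a = 0.5 * (B_a + B_a.T)  # symmetric part
A_a = 0.5 * (B_a - B_a.T)  # antisymmetric remainder

x = rng.normal(size=d)
quad_B = x @ B_a @ x
quad_S = x @ S_a @ x
quad_A = x @ A_a @ x  # zero up to floating-point error

# Rank feature pairs (i <= j) by interaction strength. For i != j the
# product x_i * x_j appears twice in x^T S_a x, so its coefficient is 2 * S_a[i, j].
i_idx, j_idx = np.triu_indices(d)
coef = np.where(i_idx == j_idx, 1.0, 2.0) * S_a[i_idx, j_idx]
order = np.argsort(-np.abs(coef))
top_pairs = [(int(i_idx[n]), int(j_idx[n]), float(coef[n])) for n in order[:3]]
```

Each entry of `top_pairs` is `(i, j, coefficient)`, so a large positive coefficient means the logit rises when $x_i$ and $x_j$ share a sign.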

modes and decomposition

Because $S_a$ is real and symmetric, the spectral theorem guarantees that it can be decomposed into eigenvalues and eigenvectors:

$$S_a = \sum_{i=1}^d \lambda_i\, v_i v_i^\top$$

Spectral decomposition: modes $v_i$ with strengths (eigenvalues) $\lambda_i$.

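where each $v_i$ is a unit-norm eigenvector of $S_a$ (not to be confused with the rows $v_k$ of $V$ above), the eigenvectors are mutually orthogonal, and $\lambda_i$ is the corresponding eigenvalue. This makes the quadratic form easy to interpret: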

$$x^\top S_a x = \sum_{i=1}^d \lambda_i\, (v_i^\top x)^2$$

Mode-wise contribution: squared projection on each mode weighted by $\lambda_i$.

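Each term here corresponds to a mode: a pattern of input features given by the direction $v_i$.
The eigenvalue $\lambda_i$ tells us how strongly that pattern influences the logit. A positive $\lambda_i$ raises the logit whenever the state has a large projection onto $v_i$, a negative $\lambda_i$ lowers it, and because the projection is squared, the sign of the projection itself does not matter.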

Generally, only a few eigenvalues have large magnitude, meaning the policy can often be summarized by a handful of dominant modes.
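The decomposition can be sketched with `numpy.linalg.eigh`, which is intended for symmetric matrices and returns real eigenvalues together with orthonormal eigenvectors as columns. The matrix here is again a random stand-in rather than a trained policy's interaction matrix, so it will not necessarily show the few-dominant-modes structure described above.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # illustrative state dimension (assumption)

# Random stand-in for a symmetric interaction matrix S_a.
B_a = rng.normal(size=(d, d))
S_a = 0.5 * (B_a + B_a.T)

# eigvals[i] is lambda_i; eigvecs[:, i] is the matching unit-norm mode v_i.
eigvals, eigvecs = np.linalg.eigh(S_a)

x = rng.normal(size=d)
proj = eigvecs.T @ x         # v_i^T x for every mode
contrib = eigvals * proj**2  # lambda_i * (v_i^T x)^2

# Rebuild S_a from its modes: sum_i lambda_i * v_i v_i^T.
S_rebuilt = (eigvecs * eigvals) @ eigvecs.T

# List modes from strongest to weakest by |lambda_i|.
order = np.argsort(-np.abs(eigvals))
for i in order:
    print(f"mode {i}: lambda={eigvals[i]:+.3f}, contribution={contrib[i]:+.3f}")
```

Sorting by absolute eigenvalue, rather than by signed value, keeps strongly negative modes (patterns that suppress the logit) alongside strongly positive ones.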