Interpretable Control Systems

9/16/2025

By Arya Prakash, Alice Rigg


This work was done under the guidance and direction of Alice Rigg. Alice is a passionate researcher with a strong intuition for interpretability. I am fortunate to have had the opportunity to learn from her.

This post builds on prior work by Rigg et al. (2024), which demonstrated that replacing standard MLP activations with bilinear layers makes a model’s computations expressible directly in terms of its weights.

bilinear layers

With bilinear layers, every logit expands into a clean quadratic form that we can interpret directly. The hidden step is:

h = (W x+b_W)\odot(V x+b_V)

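Here $\odot$ denotes elementwise multiplication. Instead of passing a single linear projection through an elementwise activation function, a bilinear layer multiplies two different linear projections of the state, so the layer is still nonlinear (quadratic) in $x$. Each coordinate of $h$ has the form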

h_k = (w_k^\top x+b_{W,k}) \cdot (v_k^\top x+b_{V,k}).

This already encodes a pairwise interaction between weighted sums of the input features.
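To make the hidden step concrete, here is a minimal NumPy sketch. The function name `bilinear_hidden`, the sizes `d` and `m`, and the random weights are placeholders chosen for illustration, not values from the original work.

```python
import numpy as np

def bilinear_hidden(x, W, b_W, V, b_V):
    """Hidden step h = (W x + b_W) * (V x + b_V), with * elementwise."""
    return (W @ x + b_W) * (V @ x + b_V)

rng = np.random.default_rng(0)
d, m = 4, 3  # hypothetical state and hidden sizes
W, V = rng.normal(size=(m, d)), rng.normal(size=(m, d))
b_W, b_V = rng.normal(size=m), rng.normal(size=m)
x = rng.normal(size=d)

h = bilinear_hidden(x, W, b_W, V, b_V)

# Coordinate-wise form: h_k = (w_k^T x + b_{W,k}) * (v_k^T x + b_{V,k}),
# where w_k and v_k are the k-th rows of W and V.
k = 1
h_k = (W[k] @ x + b_W[k]) * (V[k] @ x + b_V[k])
assert np.isclose(h[k], h_k)
```

With the biases set to zero, doubling `x` quadruples `h`, which is one quick way to see that the layer is quadratic rather than linear in the state.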

These hidden features are then combined linearly in the output head:

z = Uh + b

Output head: linear combination of hidden features $h$.

Expanding one logit $z_a(x)$ explicitly:

\begin{aligned} z_a(x) &= \sum_k U_{a,k}\,(w_k^\top x+b_{W,k})(v_k^\top x+b_{V,k}) + b_a \\ &= x^\top \Big(\sum_k U_{a,k}\, w_k v_k^\top\Big)x \\ &\quad+ \Big(\sum_k U_{a,k}[b_{W,k} v_k + b_{V,k} w_k]\Big)^\top x \\ &\quad+ \sum_k U_{a,k}\, b_{W,k} b_{V,k} + b_a \end{aligned}

Expansion of logit $z_a$: isolates quadratic interactions, linear terms, and constants.

This can be simplified neatly as

z_a(x) = x^\top B_a x + \beta_a^\top x + c_a,

Quadratic form: $B_a$ captures pairwise interactions, $\beta_a$ the linear term, and $c_a$ the constant.

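where $B_a = \sum_k U_{a,k}\, w_k v_k^\top$ is the quadratic interaction matrix, $\beta_a = \sum_k U_{a,k}\,(b_{W,k} v_k + b_{V,k} w_k)$ is the linear term, and $c_a = \sum_k U_{a,k}\, b_{W,k} b_{V,k} + b_a$ is the constant.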
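The expansion can be checked numerically. The sketch below builds $B_a$, $\beta_a$, and $c_a$ from the weights and compares the quadratic form with a direct forward pass. The helper name `quadratic_form_params`, the layer sizes, and the random weights are assumptions made for illustration only.

```python
import numpy as np

def quadratic_form_params(a, U, b, W, b_W, V, b_V):
    """Return (B_a, beta_a, c_a) for logit a, read directly off the weights."""
    u = U[a]                                   # U_{a,k} for every hidden unit k
    B_a = np.einsum("k,ki,kj->ij", u, W, V)    # sum_k U_{a,k} w_k v_k^T
    beta_a = (u * b_W) @ V + (u * b_V) @ W     # sum_k U_{a,k} (b_{W,k} v_k + b_{V,k} w_k)
    c_a = u @ (b_W * b_V) + b[a]               # sum_k U_{a,k} b_{W,k} b_{V,k} + b_a
    return B_a, beta_a, c_a

rng = np.random.default_rng(0)
d, m, n_actions = 4, 6, 3  # hypothetical sizes
W, V = rng.normal(size=(m, d)), rng.normal(size=(m, d))
b_W, b_V = rng.normal(size=m), rng.normal(size=m)
U, b = rng.normal(size=(n_actions, m)), rng.normal(size=n_actions)
x = rng.normal(size=d)

# Direct forward pass: z = U h + b with h = (W x + b_W) * (V x + b_V).
z = U @ ((W @ x + b_W) * (V @ x + b_V)) + b

# Same logit from the quadratic form x^T B_a x + beta_a^T x + c_a.
a = 0
B_a, beta_a, c_a = quadratic_form_params(a, U, b, W, b_W, V, b_V)
z_a = x @ B_a @ x + beta_a @ x + c_a
assert np.isclose(z[a], z_a)
```

Note that `B_a`, `beta_a`, and `c_a` depend only on the weights, not on `x`, which is what lets us inspect them without running the model on data.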

interaction matrices

The interaction matrix $B_a$ tells us how pairs of features $(x_i, x_j)$ work together to influence logit $a$. Because $x^\top B_a x$ is a scalar, it equals its own transpose $x^\top B_a^\top x$, so only the symmetric part of $B_a$ contributes to the quadratic form*. We therefore analyze

S_a = \tfrac{1}{2}(B_a+B_a^\top)

Symmetric interaction matrix: $S_a$ is the portion of $B_a$ that affects the quadratic form.

With this, we can directly see which pairs of state features the policy uses.

*See Section 3 of the original paper for a detailed explanation.
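As a rough illustration of reading feature pairs off the symmetric matrix, the sketch below symmetrizes a matrix and ranks pairs by the size of their coefficient in the quadratic form. In $x^\top S x$ the coefficient of $x_i x_j$ for $i \neq j$ is $2 S_{ij}$, and the coefficient of $x_i^2$ is $S_{ii}$. The helpers `symmetrize` and `top_pairs` and the random matrix are hypothetical, and ranking by absolute coefficient is just one reasonable choice.

```python
import numpy as np

def symmetrize(B):
    """Symmetric part S = (B + B^T) / 2."""
    return 0.5 * (B + B.T)

def top_pairs(S, n=3):
    """Return the n pairs (i, j, coeff), i <= j, with the largest |coeff| in x^T S x."""
    i, j = np.triu_indices(S.shape[0])
    # Off-diagonal pairs appear twice in x^T S x, diagonal entries once.
    coeff = np.where(i == j, 1.0, 2.0) * S[i, j]
    order = np.argsort(-np.abs(coeff))[:n]
    return [(int(i[t]), int(j[t]), float(coeff[t])) for t in order]

rng = np.random.default_rng(0)
d = 4  # hypothetical number of state features
B = rng.normal(size=(d, d))  # stand-in for an interaction matrix B_a
x = rng.normal(size=d)

S = symmetrize(B)
# The antisymmetric part of B drops out of the quadratic form.
assert np.isclose(x @ B @ x, x @ S @ x)

pairs = top_pairs(S, n=3)  # strongest feature pairs for this logit
```

Each returned tuple names two feature indices and the weight their product carries in the logit, so a large positive entry for `(i, j)` means the logit rises when features `i` and `j` have the same sign.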

modes and decomposition

Because $S_a$ is real and symmetric, the spectral theorem guarantees that it can be decomposed into eigenvalues and eigenvectors:

S_a = \sum_{i=1}^d \lambda_i\, v_i v_i^\top

Spectral decomposition: modes $v_i$ with strengths (eigenvalues) $\lambda_i$.

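Here each $v_i$ is a unit-norm eigenvector of $S_a$ and $\lambda_i$ is its corresponding eigenvalue. The $v_i$ are mutually orthogonal, and they are unrelated to the rows $v_k$ of $V$ used earlier. This makes the quadratic form easy to interpret: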

x^\top S_a x = \sum_{i=1}^d \lambda_i\, (v_i^\top x)^2

Mode-wise contribution: squared projection on each mode weighted by $\lambda_i$.

Each term here corresponds to a mode: a pattern of input features.
The eigenvalue $\lambda_i$ tells us how strongly that pattern influences the logit. Since $(v_i^\top x)^2 \ge 0$, the sign of $\lambda_i$ alone determines the direction of the effect:

If $\lambda_i > 0$, the mode increases the logit whenever it is present in the input.
If $\lambda_i < 0$, the mode decreases it.

Typically, only a few modes have eigenvalues of large magnitude, so the policy can often be summarized by a handful of dominant patterns.
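The decomposition can be sketched with `numpy.linalg.eigh`, which is intended for symmetric matrices and returns orthonormal eigenvectors as columns. The helper names, the random symmetric matrix, and the cutoff `k` below are assumptions for illustration. How many modes count as "dominant" for a real policy has to be judged from its actual spectrum.

```python
import numpy as np

def modes(S):
    """Eigenvalues and eigenvectors of symmetric S, sorted by |eigenvalue| descending."""
    lam, vecs = np.linalg.eigh(S)       # columns of vecs are orthonormal eigenvectors
    order = np.argsort(-np.abs(lam))
    return lam[order], vecs[:, order]

def mode_contributions(S, x):
    """Per-mode terms lambda_i * (v_i^T x)^2 of the quadratic form x^T S x."""
    lam, vecs = modes(S)
    return lam * (vecs.T @ x) ** 2

rng = np.random.default_rng(0)
d = 5  # hypothetical number of state features
A = rng.normal(size=(d, d))
S = 0.5 * (A + A.T)  # stand-in for a symmetric interaction matrix S_a
x = rng.normal(size=d)

lam, vecs = modes(S)
contrib = mode_contributions(S, x)
# The mode-wise terms add up to the full quadratic form.
assert np.isclose(contrib.sum(), x @ S @ x)

# Keep only the k strongest modes as a low-rank summary of S.
k = 2
S_k = (vecs[:, :k] * lam[:k]) @ vecs[:, :k].T
# The spectral-norm error equals the largest discarded |eigenvalue|.
err = np.linalg.norm(S - S_k, 2)
```

If the discarded eigenvalues are small relative to the kept ones, `S_k` reproduces the logit's quadratic behavior closely, and the `k` retained eigenvectors are the patterns worth inspecting.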
