
From HiPPO to Mamba: A Beginner’s Guide to State-Space Models

It was around May 2024 when I first started to have interest in State-Space Models (SSMs).

Like many other research labs around the world, my lab also requires students to give a seminar every week. This time, it was my turn, and I again found myself facing the familiar concern:

😩 Which paper should I present this time?

One of my labmates (Seungwook) told me Mamba was getting a lot of hype these days and suggested I present this paper so we could all get a better sense of what it’s about.

It took me a whole week to thoroughly understand how it works, but once I did, I found it very interesting and thought it might end up shaping the direction of my Ph.D. research in a big way.

So in this post, I would like to share what I’ve learned — not just about Mamba, but also about the broader family of models it belongs to: State-Space Models (SSMs). I might get a few things wrong, so if you spot any mistakes, feel free to let me know in the comments!


How SSMs have evolved

I genuinely believe that no one can truly understand how Mamba works — or why it performs so well on tasks requiring long-term context — without knowing how it has evolved.

In fact, Mamba is the product of a 4-year journey led by Albert Gu and his group. Once you trace the core ideas developed over time, everything starts to make a lot more sense. Here’s the progression that helped me the most.

So let’s start from the beginning — HiPPO.


HiPPO: Recurrent Memory with Optimal Polynomial Projections

HiPPO is basically a framework that tries to memorize streaming input as a set of coefficients.

It is NOT a learnable neural network, but rather a mathematical method designed to compress and update the history of inputs over time in a way that’s friendly to machine learning models.

Source: coolessay.net

You can think of it like a smart notepad that doesn’t write down every single word you hear, but instead constantly updates a running summary as new information comes in. The interesting point of this “summary” is that it can recover every single word you heard with relatively small error, thanks to the mathematical properties of basis functions that often appear in approximation theory.

Before we start, let’s define the problem setup in mathematical language.

Let the inputs streamed up to timestamp $t$ be expressed as $f_0, f_1, f_2, \cdots, f_t$.
Then the problem HiPPO faces can be stated as:
  • how the “summary” $c(t)$ of $\{f_k\}_{k=1,2,\cdots,t}$ should be defined in the first place so that the input can be reconstructed with small error, and
  • how it is updated from $c(t)$ to $c(t+1)$ as new information $f_{t+1}$ comes in.

Keep in mind that the first and second points should be considered together.

For example, one could try an existing compression technique such as MP3 or JPEG to define $c(t)$, exploiting their lossy-but-effective reconstruction ability, but this makes the second part difficult: you might need to decompress $c(t)$, append $f_{t+1}$ to the input, and then compress everything into $c(t+1)$, which is tedious and inefficient. Conversely, one can ease the second part by setting $c(t)$ to the running mean $\frac{1}{t}\sum_{k=1}^{t}f_k$ and updating it via $c(t+1) = \frac{t}{t+1}c(t) + \frac{1}{t+1}f_{t+1}$, but then the first condition fails, since it is not possible to recover the input sequence solely from its mean.
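To make the trade-off concrete, here is a tiny sketch (my own illustration, not from the paper) of the running-mean option: the update is trivially cheap, but the summary is far too lossy to recover the sequence from.

```python
import numpy as np

# Running-mean "summary": cheap to update, but impossible to invert.
# c(t+1) = t/(t+1) * c(t) + 1/(t+1) * f_{t+1}
def update_mean(c, t, f_new):
    """Incrementally update the mean of the first t inputs with a new input."""
    return t / (t + 1) * c + 1 / (t + 1) * f_new

f = np.random.randn(100)          # a toy input stream
c = 0.0
for t, f_new in enumerate(f):
    c = update_mean(c, t, f_new)  # after the loop, c equals f.mean()

print(np.isclose(c, f.mean()))    # True, but f itself cannot be recovered from c
```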

To address both points together, the authors came up with an elegant idea: use orthogonal polynomials, a family of basis functions commonly used in differential equations and approximation theory.

Here’s how the authors defined $c(t)$:

  • First, treat the $f_k$’s as samples of a continuous function $f(x)$: $f_k = f(k\Delta) := f(x_k)$.
  • Then, let $c(t) \in \mathbb{R}^{N}$ be the coefficients obtained by projecting the truncated function $f_{x\leq x_t}$ onto orthogonal polynomials $P_1, P_2, \cdots, P_N$.

These orthogonal polynomials $\{P_n\}$ can serve as basis functions, meaning that one can approximate any (reasonably nice) function $f$ with a linear combination of the $P_n$’s. If you allow more basis functions (i.e., larger $N$), the approximation becomes more accurate — potentially as accurate as you need.

If you are familiar with Taylor series or Fourier transform, this idea will probably feel familiar: it’s a similar concept, just with a different set of basis functions.
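As a quick illustration of “summarize $f$ as $N$ coefficients” (my own sketch, not HiPPO itself, which performs this projection online over a growing or sliding window), here is a static projection onto the Legendre basis using numpy’s built-in tools:

```python
import numpy as np
from numpy.polynomial import legendre

N = 16                                   # number of basis functions / coefficients
x = np.linspace(-1, 1, 1000)
f = np.sin(4 * x) + 0.3 * np.cos(9 * x)  # a toy "history" on the window [-1, 1]

# Fit f with N Legendre polynomials: the N coefficients are the "summary"
coeffs = legendre.legfit(x, f, deg=N - 1)

# Reconstruct the whole history from just N numbers
f_hat = legendre.legval(x, coeffs)
print("max reconstruction error:", np.abs(f - f_hat).max())
```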

Then, why orthogonal polynomials?

This is where the second point comes in: orthogonal polynomials offer a very simple update rule, thanks to their favorable properties related to derivatives.

The authors derived the coefficients’ derivative $\frac{dc(t)}{dt}$ as follows (from Appendix C):

$$
\begin{aligned}
c_n(t) &= \langle f_{x\leq{x_t}}, P_n^{(t)} \rangle_{\nu^{(t)}} \\
&\qquad\qquad \ldots \\
&= \zeta(t)^{-\frac{1}{2}} \lambda_n \int f \, p_n^{(t)} \frac{\omega^{(t)}}{\chi^{(t)}}. \\
\frac{d}{dt} c_n(t) &= \zeta(t)^{-\frac{1}{2}} \lambda_n \int f(x) \left( \frac{\partial}{\partial t} p_n(t,x) \right) \frac{\omega}{\chi}(t,x) \, dx \\
&\quad + \int f(x) \left( \zeta^{-\frac{1}{2}} \lambda_n p_n(t,x) \right) \left( \frac{\partial}{\partial t} \frac{\omega}{\chi}(t,x) \right) dx \\
&\qquad\qquad \ldots \text{(now apply classic measures)} \ldots \\
&= -\mathbf{a}_n^{\top}c(t) + b_n f(t) \\
\Longrightarrow \frac{d}{dt}c(t) &= -Ac(t) + Bf(t) \qquad \bigl(A\in\mathbb{R}^{N\times N},\; B\in \mathbb{R}^{N\times 1} \bigr)
\end{aligned}
$$

Yes, it involves somewhat difficult derivations and higher-level math — but the surprising point is that
it ends up yielding a very simple linear update rule.

What does this tell us?

It means that if we have $c(t)$, which summarizes the input up to time $t$, we can describe how it changes within an infinitesimally small time window (by the definition of the derivative) using the current summary $c(t)$ and the function value $f(t)$.

Then we can “discretize” this equation, which means extending the update rule from an infinitesimally small time window to a visible, practically sized time step while allowing some error. You’ll see very soon why we need this “discretization” step.

One well-known discretization method is the Euler method, shown below.

Assume you only know the function value at point $\text{A}_0$ and the derivative of the blue function, while having to evaluate its value at some distant point, say, $\text{A}_4$.

One might take a very small step from $\text{A}_0$ following its direction (which we know from the derivative), arriving at point $\text{A}_1$. Although erroneous, this takes you somewhere not far from the point you want to approximate, and repeating the process brings you close to the desired point $\text{A}_4$.

Source: Wikipedia

In our case, this concept can be applied to evaluate $c(t+\Delta)$ using $c(t)$ and $\frac{dc(t)}{dt}$.

$$
\begin{aligned}
c(t+\Delta) &\approx c(t) + \Delta \frac{dc(t)}{dt} \qquad \text{(Euler method)} \\
&= c(t) + \Delta\bigl(-Ac(t) + Bf(t)\bigr) \\
&= \bigl(I-\Delta A\bigr)c(t) + \Delta Bf(t) \\
&:= \bar{A}c(t) + \bar{B}f(t)
\end{aligned}
$$

It was a long way around, but this linear update is how HiPPO handles the second point.
(In Appendix B, you can see the authors explored various discretization strategies apart from the Euler method.)
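As a quick sketch of how this looks in code (my own illustration, assuming $A \in \mathbb{R}^{N\times N}$ and $B \in \mathbb{R}^{N}$ have already been constructed, for example by the HiPPO-LegS helper sketched a bit later):

```python
import numpy as np

def discretize_euler(A, B, delta):
    """Euler discretization: c(t + delta) ~ (I - delta*A) c(t) + delta*B f(t)."""
    N = A.shape[0]
    A_bar = np.eye(N) - delta * A
    B_bar = delta * B
    return A_bar, B_bar

def hippo_scan(A, B, f, delta):
    """Run the discretized HiPPO update over a stream of scalar inputs f."""
    A_bar, B_bar = discretize_euler(A, B, delta)
    c = np.zeros(A.shape[0])
    history = []
    for f_t in f:
        c = A_bar @ c + B_bar * f_t   # the linear update rule
        history.append(c.copy())
    return np.stack(history)          # the coefficients c(t) at every step
```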

With this update rule, HiPPO can approximate a continuous function $f$ that passes through the accumulated input points $\bigl(k\Delta, f(k\Delta)\bigr)$.

The slides below briefly illustrate how $c(t)$ is updated, along with the corresponding function $f$.

A few slides from my seminar that might help your understanding.

The authors ran a reconstruction experiment: with the number of coefficients $N$ allowed up to 256, HiPPO took $10^6$ elements from a continuous-time band-limited white noise process and succeeded in reconstructing the whole input with relatively small error compared to other RNN baselines.

This is quite intriguing, since the parameters $A$ and $B$ are NOT learnable: they are simply a matrix and a vector filled with constants determined by which orthogonal polynomial you choose. So we can conclude that this math-based update offers a real advantage over letting a neural network autonomously learn which update is optimal for remembering every token it has seen.

There are various types of orthogonal polynomials, e.g., Legendre polynomials, Laguerre polynomials, or Chebyshev polynomials, and the authors explicitly derived the $A$ and $B$ that correspond to each of them.
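For concreteness, here is a small sketch of the commonly used HiPPO-LegS pair, derived from Legendre polynomials. This is my transcription of the formula as I understand it from the paper and follow-up implementations, so treat it as illustrative rather than authoritative:

```python
import numpy as np

def make_hippo_legs(N):
    """HiPPO-LegS matrices (as I understand them from the paper):
       A[n, k] = sqrt(2n+1)*sqrt(2k+1) if n > k,  n+1 if n == k,  0 if n < k
       B[n]    = sqrt(2n+1)
       These plug into  dc/dt = -A c(t) + B f(t)  (up to LegS's 1/t time scaling)."""
    q = np.sqrt(2 * np.arange(N) + 1)
    A = np.tril(np.outer(q, q))             # sqrt(2n+1)*sqrt(2k+1) for n >= k
    np.fill_diagonal(A, np.arange(N) + 1)   # diagonal entries become n + 1
    B = q.copy()
    return A, B

A, B = make_hippo_legs(8)   # fixed, non-learnable constants
```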

Then how can this actually be used in a neural network?

The authors integrated the HiPPO update into RNNs, which typically maintain a hidden state $h_t$ that summarizes the sequence of inputs up to time $t$. In this setup, the hidden state $h_t$ is linearly projected to produce the input $f_t$ for HiPPO, and the corresponding coefficients $c_t$ are updated over time.

Then, instead of feeding just the current input $x_t$ into the RNN’s update function $\tau$, they used the concatenated input $[c_{t-1}, x_t]$. This modification led to performance improvements on tasks that require learning long-term dependencies, showing that the HiPPO summary provides useful additional context for remembering history.
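A rough sketch of how that wiring might look, using a GRU cell as the RNN and a plain linear projection; these are stand-ins for illustration, not the authors’ exact architecture:

```python
import torch
import torch.nn as nn

class HiPPORNNCell(nn.Module):
    """Sketch: an RNN cell whose input is augmented with the HiPPO summary c_{t-1}."""
    def __init__(self, input_dim, hidden_dim, n_coeffs, A_bar, B_bar):
        super().__init__()
        self.rnn = nn.GRUCell(input_dim + n_coeffs, hidden_dim)  # tau([c_{t-1}, x_t])
        self.to_f = nn.Linear(hidden_dim, 1)                     # h_t -> scalar f_t
        self.register_buffer("A_bar", A_bar)                     # (N, N), fixed
        self.register_buffer("B_bar", B_bar)                     # (N,),   fixed

    def forward(self, x_t, h, c):
        h = self.rnn(torch.cat([c, x_t], dim=-1), h)   # update hidden state
        f_t = self.to_f(h)                             # project h_t to the HiPPO input f_t
        c = c @ self.A_bar.T + f_t * self.B_bar        # HiPPO coefficient update
        return h, c
```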


I spent quite a bit of time explaining HiPPO, but that’s because a solid understanding of HiPPO is crucial for grasping the rest of the state-space models that follow.

In the next paper (which I’ll refer to as LSSL for short), the authors take things a step further:
they now generalize $A$ and $B$, turning them into learnable parameters.


Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers (LSSL)

Albert and his group clearly had a strong desire to extend the strengths of HiPPO into a neural network — not just by plugging it into existing architectures, but by building an entirely new kind of network around it.

They propose the Linear State-Space Layer (LSSL), which packages HiPPO’s core ideas as a standalone neural architecture, something that hadn’t really been explored before.

To begin with, LSSL is presented as a sequence-to-sequence layer,
which transforms a sequence $\mathbf{u}\in \mathbb{R}^{L}$ into a sequence of the same length $\mathbf{y} \in \mathbb{R}^{L}$.

Its overview is illustrated well in Figure 1:

Figure 1 of LSSL.
Mind the notation change from HiPPO’s formulation:
  • The input $f_t$ is now replaced with $u_t$.
  • The coefficient $c(t)$ is now expressed as $x(t)$.

Their main idea is to keep track of a cumulative summary $\{x_k\}$ and use it to generate the output sequence $\{y_k\}$, expecting the output representation $\mathbf{y}$ to have stronger memory capabilities compared to those produced by typical sequence-to-sequence models.

LSSL formulates the sequence-to-sequence layer as follows:

$$
\begin{aligned}
x_t &= \bar{A}x_{t-1} + \bar{B}u_t, \\
y_t &= Cx_t + Du_t.
\end{aligned}
$$

The first row is quite familiar: it’s exactly the same as the HiPPO update, except now $A$ and $B$ are learnable parameters — we’ll come back to that shortly.

The other parameters, $C \in \mathbb{R}^{1 \times N}$ and $D \in \mathbb{R}$, are a linear projection and a scalar weight that combine $\mathbf{x}$ and $\mathbf{u}$ into $\mathbf{y}$. This could be the simplest possible way to utilize the summary $\mathbf{x}$ and the input $\mathbf{u}$, but interestingly, it is also a choice whose properties are well studied in the state-space representation from control theory.


A block diagram illustration of linear state-space equations. (Source: Wikipedia)


🤔
When I first saw this formulation, I thought the second equation could be something else; there are many ways to produce $\mathbf{y}$ from $\mathbf{x}$, and it isn’t strictly necessary to involve $\mathbf{u}$.

But as I read on, it turned out that this choice makes many things fall into place, largely owing to well-established results in control theory.

I’m not really sure whether the HiPPO update was inspired by the formulations in control theory in the first place — but if not, it is quite interesting that two seemingly unrelated theories meet in the middle and one provides theoretical grounding for the other. I’ve searched a little, but I haven’t seen any state-space representation from control theory that uses a matrix $A$ like the ones derived in HiPPO.

There are three properties that this LSSL formulation provides:

  • LSSL is recurrent.
  • LSSL is convolutional.
  • LSSL is continuous-time.

From what we studied in HiPPO, we can accept that the formulation is recurrent and continuous-time in nature — it clearly has a recurrent form, and HiPPO estimates a continuous function $f$ that generates the discrete sequence $f_t$.

But where is this “convolutional” property coming from?

The convolutional layout can be found by unrolling the equation $x_t = \bar{A}x_{t-1} + \bar{B}u_t$:

$$
\begin{aligned}
x_{-1} &= 0, \\
x_0 &= \bar{A} x_{-1} + \bar{B} u_0 = \bar{B} u_0, \\
x_1 &= \bar{A} x_0 + \bar{B} u_1 = \bar{A} \bar{B} u_0 + \bar{B} u_1, \\
&\qquad\qquad \vdots \\
x_{L-1} &= \bar{A}^{L-1} \bar{B} u_0 + \bar{A}^{L-2} \bar{B} u_1 + \cdots + \bar{A} \bar{B} u_{L-2} + \bar{B} u_{L-1} \\
&= \left[ \bar{A}^{L-1} \bar{B} \;\; \bar{A}^{L-2} \bar{B} \;\; \cdots \;\; \bar{A} \bar{B} \;\; \bar{B} \right] \left[u_0 \; u_1 \cdots u_{L-1} \right]^\top.
\end{aligned}
$$

If you look at this unrolling process carefully, you might notice that each $x_k$ is obtained by convolving a kernel $\mathbf{K} = \left[ \bar{A}^{L-1} \bar{B} \;\; \bar{A}^{L-2} \bar{B} \;\; \cdots \;\; \bar{A} \bar{B} \;\; \bar{B} \right]$ over a zero-padded sequence $\mathbf{u}$.

An illustration of the convolving kernel that generates $\mathbf{x}$.

Since $\mathbf{y} = C\mathbf{x} + D\mathbf{u}$, we can also say (note that $\mathbf{y} \in \mathbb{R}^{L}$, $\mathbf{x} \in \mathbb{R}^{N\times L}$, $C \in \mathbb{R}^{1\times N}$, $D \in \mathbb{R}$):

$$
\begin{aligned}
\mathbf{y} &= C (\mathbf{K} \ast \mathbf{u}) + D\mathbf{u} \\
&= (C\mathbf{K}) \ast \mathbf{u} + D \mathbf{u} \\
&= \left[ C\bar{A}^{L-1} \bar{B} \quad C\bar{A}^{L-2} \bar{B} \quad \cdots \quad C\bar{A} \bar{B} \quad C\bar{B} \right] \ast \mathbf{u} + D\mathbf{u} \\
&:= \mathcal{K} \ast \mathbf{u} + D \mathbf{u}.
\end{aligned}
$$

Now it becomes clear what the rightmost diagram of Figure 1 illustrates (although it omits the residual connection).
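Here is a small numerical sketch (all parameters are random placeholders) verifying that the recurrent view and the convolutional view compute the same output $\mathbf{y}$. Note that the kernel below is indexed as $K_l = C\bar{A}^{l}\bar{B}$, which is the same kernel as above written in the opposite order:

```python
import numpy as np

np.random.seed(0)
N, L = 4, 32
A_bar = np.random.randn(N, N) * 0.1      # placeholder discretized parameters
B_bar = np.random.randn(N)
C = np.random.randn(N)
D = np.random.randn()
u = np.random.randn(L)

# Recurrent view: x_t = A_bar x_{t-1} + B_bar u_t,  y_t = C x_t + D u_t
x = np.zeros(N)
y_rec = np.zeros(L)
for t in range(L):
    x = A_bar @ x + B_bar * u[t]
    y_rec[t] = C @ x + D * u[t]

# Convolutional view: y = K * u + D u, with K_l = C A_bar^l B_bar
K = np.zeros(L)
A_pow = np.eye(N)
for l in range(L):
    K[l] = C @ A_pow @ B_bar
    A_pow = A_pow @ A_bar
y_conv = np.convolve(u, K)[:L] + D * u

print(np.allclose(y_rec, y_conv))   # True: both views agree
```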

In fact, this concept is not new in control theory, where it goes by the name impulse response $h$:

In signal processing and control theory, the impulse response, or impulse response function (IRF), of a dynamic system is its output when presented with a brief input signal, called an impulse (δ(t)).

the impulse response describes the reaction of the system as a function of time.

Roughly speaking, it’s a system’s response to an impulse input (a sudden high-magnitude input).

In a linear state-space system (or an LTI system, one that follows the formulation above), the system output $y(t)$ is called the impulse response $h$ when the input $u(t)$ is an impulse (a Dirac delta $\delta(t)$).

This concept allows us to compute the output of a linear state-space system for any arbitrary input $u(t)$ by convolving the impulse response $h(t)$ with the input. And as you might guess,
the discrete convolution kernel that represents the impulse response is exactly $\mathcal{K}$.

The convolutional formulation of LSSL already benefits from ideas rooted in control theory —
but one particularly nice insight borrowed from that field is the following (quoted directly from the LSSL paper):

a convolutional filter $h$ that is a rational function of degree $N$ can be represented by a state-space model of size $N$. Thus, an arbitrary convolutional filter $h$ can be approximated by a rational function (e.g., by Padé approximants) and represented by an LSSL.

This strongly supports the expressive power of LSSL: it not only captures a broad range of behaviors that convolution kernels can model, but also the sequential modeling capabilities (and gating mechanisms, as the authors show with lemmas) of RNNs.


In the meantime, I hope you’re still keeping track of our goal of obtaining generalized $A$ and $B$.

As LSSL inherits the strengths of CNNs and RNNs, it also inherits their weaknesses:
both handle long-term dependencies poorly.

That’s where the authors bring in HiPPO’s parameters $A$ and $B$.

Since using $A$ and $B$ derived from orthogonal polynomials already showed benefits in learning long-term dependencies, the authors hypothesized that learning parameterized versions of $A$ and $B$, while staying within the space defined by those derivations, might be even more effective.

The idea is this: if we can identify the common structure shared by the $A$ and $B$ derived from orthogonal polynomials, we can embed that structure as a constraint. Then the network is free to learn the rest — adapting the parameters as needed, but always within the “solid shape” imposed by the orthogonal polynomial framework.

So, what do $A$ and $B$ look like?

At this point, it is worth noting how orthogonal polynomial bases are created, since we need to deal with the structures of $A$ and $B$ derived from arbitrary orthogonal polynomials.

How orthogonal polynomials are formed

Every family of orthogonal polynomials is tied to a specific weight function $\omega(x):[x_1, x_2] \rightarrow \mathbb{R}$, which determines its characteristics. For example, choosing a uniform $\omega$ yields the Legendre polynomials, and $\omega(x) = \sqrt{1-x^2}$ on $x \in [-1, 1]$ yields the Chebyshev polynomials (of the second kind).

Given a certain $\omega$, you can repeat the Gram-Schmidt process to obtain orthogonal polynomials $\{P_k\}$ that satisfy

$$
\int_{x_1}^{x_2} P_n(x)P_m(x)\,\omega(x)\,dx = \begin{cases} 0 & n\neq m \\ c_n & \text{otherwise.} \end{cases}
$$
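To make this concrete, here is a small numerical sketch that runs Gram-Schmidt on the monomials $1, x, x^2, \ldots$ under a given weight $\omega$ on $[-1, 1]$, approximating the integrals with Gauss-Legendre quadrature. With a uniform weight it recovers the Legendre polynomials up to scaling; none of this is from the paper, it just shows how a weight function induces a basis:

```python
import numpy as np
from numpy.polynomial import polynomial as P

# Quadrature nodes/weights on [-1, 1] for <p, q>_w = integral of p(x) q(x) w(x) dx
nodes, quad_w = np.polynomial.legendre.leggauss(64)

def inner(p, q, w):
    """Weighted inner product of two polynomials given as coefficient arrays."""
    return np.sum(quad_w * P.polyval(nodes, p) * P.polyval(nodes, q) * w(nodes))

def gram_schmidt_polys(n_basis, w):
    """Orthogonalize 1, x, x^2, ... with respect to the weight function w."""
    basis = []
    for k in range(n_basis):
        p = np.zeros(k + 1)
        p[k] = 1.0                                   # the monomial x^k
        for q in basis:                              # subtract projections onto earlier P_j
            coef = inner(p, q, w) / inner(q, q, w)
            p = P.polysub(p, coef * np.pad(q, (0, len(p) - len(q))))
        basis.append(p)
    return basis

legendre_like   = gram_schmidt_polys(4, w=lambda x: np.ones_like(x))   # uniform weight
chebyshev2_like = gram_schmidt_polys(4, w=lambda x: np.sqrt(1 - x**2))
```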

So, by deriving $A$ and $B$ from an arbitrary weight function $\omega$, we can move toward our goal of identifying “the structure”.

This derivation is in fact a very tedious job (it’s in Appendix D.2), since it is impossible to write down $\{P_k\}$ explicitly, and you instead need to rely on their properties to carry out the derivation.

I will not walk through this process — but the authors ended up with two remarks.

  • While the vector $B$ has no specific structure (it varies highly with the measure $\omega$),
    the matrix $A$ is a low recurrence-width (LRW) matrix.
  • For some weight functions (that are well known to produce good properties — specifically,
    $\omega(x) = e^{-x^2}$ on $(-\infty, \infty)$, corresponding to the Hermite polynomials;
    $\omega(x) = (1-x)^{\alpha}(1+x)^{\beta}$ on $[-1,1]$, corresponding to the Jacobi polynomials; and
    $\omega(x) = x^\alpha e^{-x}$ on $[0, \infty)$, corresponding to the Laguerre polynomials),
    the matrix $A$ becomes 3-quasiseparable.

I know — all these unfamiliar math terms can feel overwhelming. But don’t worry, here’s a very quick breakdown:

  • An LRW matrix is a special kind of square matrix related to orthogonal polynomials.
    LRW matrices are a subset of the square matrices that admit near-linear matrix-vector multiplication, so they have an advantage over general matrices in computational efficiency.
  • A quasiseparable matrix is a matrix whose off-diagonal blocks have low rank. The number in front (like "1-quasiseparable") constrains the ranks of those submatrices, which directly impacts how efficiently we can compute with them.

These structural properties not only reveal something fundamental about how $A$ is built, but also bring a major computational advantage. In particular, they enable efficient construction of the kernel $\mathcal{K}$, which involves computing powers of $A$ — typically an expensive operation that becomes much faster thanks to this structure.

It is quite a fascinating breakthrough: we usually trade efficiency for better performance,
but in this case, the structure of $A$ gives us both.

As the authors note: “We tested that LSSLs with random $A$ matrices (normalized appropriately) perform very poorly (e.g., 62% on pMNIST).”

In my opinion, this part is the most beautiful aspect of state-space models:
their ability to model long-term dependencies and maintain computational efficiency — all grounded in elegant mathematical principles.

Performance table shown in LSSL. They perform REALLY well.

One more impressive point is that LSSL-fixed — a variant that uses the fixed $A$ and $B$ from HiPPO — already surpasses the other models. Making them learnable boosts performance further.

Efficiently Modeling Long Sequences with Structured State Spaces (S4)

Although LSSL already looks like a fantastic model, the authors themselves had concerns
(quoted from LSSL):

Sections 1 and 3 and Fig. 1 mention that a potential benefit of having the recurrent representation of LSSLs may endow it with efficient inference. While this is theoretically possible, this work did not experiment on any applications that leverage this. Follow-up work showed that it is indeed possible in practice to speed up some applications at inference time.

A follow-up to this paper found that it is not numerically stable and thus not usable on hardware.

In fact, LSSL was NOT really enjoying the efficiency derived from the structure of $A$, due to numerical instability.

The S4 paper shows that the faster algorithm based on the structure of $A$ involves terms that grow exponentially with $N$, the number of state dimensions
(quoted from S4):

Lemma B.1. When $A = I$, the fast LSSL algorithm requires computing terms exponentially large in $N$.

the largest coefficient is $\frac{(L-1)(L-2)\cdots (L-N+1)}{(N-1)!}$. When $L=N-1$ this is ${2(N-1)\choose N-1} \approx \frac{2^{2N}}{\sqrt{\pi N}}$.

This creates a frustrating paradox — we know there's an efficient algorithm in theory, but it becomes practically unusable due to the numerical (finite-precision) limitations of real-world computers.

The authors came up with a new model, S4, that addresses this challenge.

In this case, I’ll skip most of the model details, since they involve quite a lot of advanced math. The good news is, you don’t need to understand all the technical details to follow the main idea. A very brief summary of the algorithm is as follows:

  • To simplify the problem, the authors restrict the scope of the matrix $A$ to only a few cases considered in HiPPO: the ones obtained from the Laguerre and Legendre polynomials.
    They figured out that these matrices share a common structure of
    (normal matrix + low-rank matrix), NPLR (Normal Plus Low Rank) for short.
  • The main problem in generating the convolution kernel $\mathcal{K} = \left[ C\bar{A}^{L-1} \bar{B} \quad C\bar{A}^{L-2} \bar{B} \quad\cdots \quad C\bar{A} \bar{B} \quad C\bar{B} \right]$ was computing powers of the matrix $A$.
    Under the assumption that $A$ is NPLR, the authors computed the kernel values by finding an equivalent operation in frequency space.
    If you want a bit more detailed steps, check these out (a small numerical sketch follows this list).
    • Step 1: Move to the frequency domain.
      Represent the kernel $\mathcal{K}$ with its generating function (a.k.a. the “z-transform” in the control-theory literature).
      Intuitively, this maps $\mathcal{K}$ from the time domain to the frequency domain, and it can be reverted back to the time domain via an inverse Fourier transform.
      Roughly speaking,
      $\mathcal{K}$ = inverse_fast_fourier_transform(generating_function($\mathcal{K}$)).
    • Step 2: Replace matrix powers with inverses.
      The modified representation lets us substitute the “powers of $A$” with an “inverse of $A$”. In other words, computing $A^{-1}$ conveniently offers us the $A^l$’s.
    • Step 3: Exploit the NPLR structure.
      Since $A$ is NPLR, we can write $A = V\Lambda V^* - PQ^{\top}$. This can be conjugated into DPLR (Diagonal Plus Low Rank) form over $\mathbb{C}$, say, $\Lambda - PQ^*$.
    • Step 4: Apply the Woodbury identity and a Cauchy kernel computation algorithm.
      This identity reduces the problem of computing the inverse of $\Lambda - PQ^*$ to computing the inverse of $\Lambda$, which now involves the computation of a Cauchy kernel — a well-studied problem that can be solved with near-linear algorithms.
    • This blog might help for more details.
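To make Steps 1 and 2 a bit more tangible, here is a naive numerical sketch of the generating-function trick (my own illustration; it skips the DPLR/Woodbury/Cauchy machinery, so it is not fast): instead of forming powers of $\bar{A}$, evaluate $\hat{\mathcal{K}}(z) = \tilde{C}(I - \bar{A}z)^{-1}\bar{B}$ with $\tilde{C} = C(I - \bar{A}^L)$ at the $L$-th roots of unity and apply an inverse FFT.

```python
import numpy as np

def ssm_kernel_naive(A_bar, B_bar, C, L):
    """Direct kernel: K_l = C @ A_bar^l @ B_bar (the thing S4 wants to avoid)."""
    K, A_pow = np.zeros(L), np.eye(A_bar.shape[0])
    for l in range(L):
        K[l] = C @ A_pow @ B_bar
        A_pow = A_pow @ A_bar
    return K

def ssm_kernel_genfunc(A_bar, B_bar, C, L):
    """Same kernel via the truncated generating function evaluated at roots of unity."""
    N = A_bar.shape[0]
    C_tilde = C @ (np.eye(N) - np.linalg.matrix_power(A_bar, L))
    z = np.exp(-2j * np.pi * np.arange(L) / L)          # L-th roots of unity
    vals = np.array([
        C_tilde @ np.linalg.solve(np.eye(N) - A_bar * zk, B_bar) for zk in z
    ])                                                   # one linear solve per frequency
    return np.fft.ifft(vals).real                        # back to the time domain

# quick consistency check with small random parameters
rng = np.random.default_rng(0)
A_bar = 0.9 * np.eye(3) + 0.05 * rng.standard_normal((3, 3))
B_bar, C = rng.standard_normal(3), rng.standard_normal(3)
print(np.allclose(ssm_kernel_naive(A_bar, B_bar, C, 16),
                  ssm_kernel_genfunc(A_bar, B_bar, C, 16)))   # True
```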



With such sophisticated mathematical machinery, the authors finally made LSSL efficient in practice.

Efficiency comparison: LSSL vs S4
It almost catches up with the efficiency of the Linear Transformer.
Complexity comparison between various sequence models

More importantly, it still maintains its strength: learning long-range dependencies.

Performance comparison on Long Range Arena

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Now here comes Mamba. To briefly summarize the story so far:

  • HiPPO established a method to summarize and recover long sequences using orthogonal polynomials, and first introduced the notion of the matrix $A$.
  • LSSL formulated state-space models (SSMs), generalized the matrix $A$, and succeeded in turning it into a learnable structure.
  • S4 enhanced LSSL by genuinely exploiting the efficient structure of $A$.
What happened next?

There were a lot of advances in SSMs between S4 and Mamba.
The notable models are the following:

  • DSS (NeurIPS 2022):
    Showed that a diagonal approximation of the structured matrix $A$ can achieve performance comparable to S4’s DPLR formulation.
  • S4D (NeurIPS 2022):
    Suggested a simpler, faster, and theoretically more sound diagonal approximation of $A$, with minimal expressivity loss compared to S4’s $A$.
  • S4ND (NeurIPS 2022):
    Proposed an extension of S4 that can handle multi-dimensional data.
  • S5 (ICLR 2023):
    Established a Multi-Input-Multi-Output (MIMO) framework for SSMs, and adopted the Blelloch parallel scan for faster, parallelizable computation of the recurrence (a rough sketch of the scan idea follows this list).
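To give a feel for the parallel-scan idea, note that the recurrence $x_t = \bar{A}x_{t-1} + \bar{B}u_t$ can be expressed with an associative operator on $(A, b)$ pairs, so prefix states can be combined in a tree instead of strictly left to right. Below is a rough sketch of that operator and a recursive prefix scan (not S5’s actual implementation, which uses a diagonal $\bar{A}$ and runs on accelerators):

```python
import numpy as np

def combine(e1, e2):
    """Associative operator on (A, b) pairs, each representing the map x -> A x + b."""
    A1, b1 = e1
    A2, b2 = e2
    return A2 @ A1, A2 @ b1 + b2

def parallel_scan(elems):
    """Inclusive prefix scan by recursive halving (illustrating the idea behind Blelloch's scan)."""
    if len(elems) == 1:
        return elems
    half = len(elems) // 2
    left, right = parallel_scan(elems[:half]), parallel_scan(elems[half:])
    total = left[-1]                                  # prefix of the whole left half
    right = [combine(total, e) for e in right]        # these combines can run in parallel
    return left + right

def ssm_states(A_bar, B_bar, u):
    elems = [(A_bar, B_bar * u_t) for u_t in u]       # one (A, b) element per timestep
    return np.stack([b for _, b in parallel_scan(elems)])   # x_0, ..., x_{L-1}
```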

In the two years after the advent of S4, researchers in state-space models succeeded in simplifying the structure of $A$ to a diagonal matrix, establishing its initialization, and enabling even faster computation for constructing the kernel $\mathcal{K}$.

Mamba’s motivation was somewhat apart from these problems. It instead tackled a foundational problem of SSMs: “content-awareness”.

Copying task example: move the first $N$ blocks in the input sequence to the tail.
Selective copying task example: move the first $N$ colored blocks in the input sequence to the tail.

The authors observed that SSMs are good at the Copying task, which simply requires memorizing and reproducing the entire sequence of input tokens in order. However, they perform much worse on the Selective Copying task, a slightly more challenging variant that requires content-awareness: the model must still preserve the order of tokens, but only for a specific subset (e.g., the “colored” tokens).

This really highlights both the strength and the limitation of SSMs. On the one hand, they’re fantastic at memorization — if you ask them to store a sequence, they’ll remember every element faithfully. On the other hand, that’s also the problem: they remember everything with equal importance. This is quite obvious from the original derivation — orthogonal polynomials are used to reconstruct all of the inputs (the weight function $\omega$ does govern where the SSM should focus more to lower the approximation error, but it does not change depending on the input). There’s no built-in mechanism to filter out noise or emphasize what really matters.

And that’s exactly where the gap shows up when we think about scaling SSMs toward LLM-like reasoning. Human-like reasoning doesn’t come from replaying every detail word-for-word — it comes from blurring out the unimportant stuff, compressing the input into abstract concepts, and then reasoning from those abstractions. Without that ability, SSMs risk getting stuck as great memorization machines, but not great reasoning machines.

The idea for making SSMs aware of the input contents was quite simple and straightforward: the authors let the model parameters be determined by the input itself.



Above is one ablation from the manuscript that shows the performance gain from input-dependency (selectivity). Depending on the context, the input can “choose” a slightly different version of the model to pass through. The effect is quite notable: instead of rigidly applying the same dynamics everywhere, the model adapts its parameters on the fly, shaping itself to better process whatever information is currently in front of it.

You can see the parameters are now determined by the input $x$ and have a batch dimension. This means that the inputs encounter model parameters specifically tailored for them.
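In code, the change looks roughly like this: a simplified sketch of a selective SSM layer with a diagonal $A$ and a plain sequential loop in place of the fused hardware-aware scan. The layer sizes, projections, and initialization here are placeholders of my own, not Mamba’s exact recipe:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveSSM(nn.Module):
    """Sketch of a selective (input-dependent) SSM layer: Delta, B, C are functions of x."""
    def __init__(self, d_model, d_state):
        super().__init__()
        self.A_log = nn.Parameter(torch.log(torch.arange(1, d_state + 1).float()))
        self.to_delta = nn.Linear(d_model, d_model)
        self.to_B = nn.Linear(d_model, d_state)
        self.to_C = nn.Linear(d_model, d_state)

    def forward(self, x):                       # x: (batch, length, d_model)
        A = -torch.exp(self.A_log)              # (d_state,), negative => stable dynamics
        delta = F.softplus(self.to_delta(x))    # (B, L, D)  input-dependent step size
        Bx = self.to_B(x)                       # (B, L, N)  input-dependent B
        Cx = self.to_C(x)                       # (B, L, N)  input-dependent C
        A_bar = torch.exp(delta.unsqueeze(-1) * A)            # (B, L, D, N)
        B_bar = delta.unsqueeze(-1) * Bx.unsqueeze(2)          # (B, L, D, N)
        h = torch.zeros(x.shape[0], x.shape[2], A.shape[0], device=x.device)
        ys = []
        for t in range(x.shape[1]):             # sequential scan (the fused kernel replaces this)
            h = A_bar[:, t] * h + B_bar[:, t] * x[:, t].unsqueeze(-1)
            ys.append((h * Cx[:, t].unsqueeze(1)).sum(-1))     # y_t = C_t . h_t per channel
        return torch.stack(ys, dim=1)           # (batch, length, d_model)
```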

At first glance, making parameters input-dependent feels like a pretty straightforward upgrade. In fact, letting a model adapt to its context is not exactly a new idea in machine learning.

But within the world of SSMs, it was actually a big deal.

Why? Remember that SSMs went through a lot of work to fix their speed on top of their elegant theory: LSSL was too slow, S4 came up with a sophisticated fast algorithm, and subsequent works devised diagonal $A$ to compute the kernel $\mathcal{K}$ even faster. On top of this flow, adopting input-dependent parameters adds a heavy computational burden: once the parameters depend on the input, the system is no longer time-invariant, so the efficient convolutional view no longer applies and the model must fall back on a recurrent computation. This couldn’t be justified unless the authors invented another breakthrough to keep things efficient.

This time, the authors took an engineering-flavored approach: a hardware-aware algorithm.

I am not an expert in this field, so I will lay out my understanding of this algorithm in plain words.

  • Apparently there are two parts of the GPU that are mainly involved in the forward pass:
    • High-bandwidth memory (HBM)
    • Static Random-Access Memory (SRAM)
  • HBM offers much more capacity but slower access, while SRAM is the opposite.
  • The biggest efficiency bottleneck is memory I/O on the GPU. A straightforward implementation materializes the intermediate states ($x$ in the recurrence above) in HBM. If this computation is moved into SRAM (this time without saving the intermediate states), those memory read/write operations are bypassed, and the computation itself is faster.
  • For backpropagation, one needs to re-compute the intermediate states, since they are not stored during the forward pass (just like gradient checkpointing).
  • For sequence lengths $L$ too long for the sequence to fit into SRAM (which is much smaller than HBM), the sequence is split into chunks and the fused scan is performed on each chunk.

In short, they moved the slow, tricky part of the SSM computation into fast SRAM while working around SRAM’s limited capacity.

This almost feels like a cheat code; if we were aware of such mechanisms and had the ability to implement them, applying this type of algorithm could enhance most of the networks we want to design. In fact, this was done with sophisticated CUDA-level programming with the aid of Albert’s former labmate, Tri Dao, who is known for efficient hardware-aware algorithms and is also the author of the renowned FlashAttention.

Albert and Tri had already worked on fast computation using structured matrices, so it isn’t surprising that they worked on an efficient algorithm to construct $\mathcal{K}$ with a structured matrix $A$. Combining these matrix structures with hardware-aware design might well have been their ultimate recipe for improving the algorithm.

Anyway, it is actually not as simple as it sounds. You’ll need serious CUDA-level coding skills and a deep understanding of GPUs… or a smart labmate willing to handle that part for you.

On top of this algorithm, Mamba additionally introduced interesting insights related to the Mamba block design and the initialization of $A$, but I won’t describe all of them in this post.

From the broader view of SSM evolution, Mamba marks a real milestone — it’s the first time SSMs stood toe-to-toe with Transformers, matching their performance on tough benchmarks while using less computation.

Mamba is the first attention-free model to match the performance of a very strong Transformer recipe (Transformer++) that has now become standard, particularly as the sequence length grows.


Closing thoughts

Source: Cobra life cycle figure

Tracing the evolution of SSMs has been fascinating to me because it really shows how theory, computation, and creativity keep bouncing off each other.

LSSL started as a mathematically elegant idea but hit the wall of numerical infeasibility. S4 then came in like a clever hack, finding a mathematically equivalent operation that works in practice. From there, diagonal approximations (DSS, S4D), multidimensional extensions (S4ND), and efficient scans (S5) kept pushing the boundaries. Along the way, we saw SSMs shine at memorizing long-term context, but also bump into the limits of content-awareness — a reminder that remembering everything isn’t the same as understanding.

And now, with Mamba, it feels like SSMs have finally stepped into the big leagues. They stand next to Transformers, but with greater efficiency and better long-term context understanding.

I hope this journey made the seemingly opaque connection between Mamba’s structure and its performance much clearer, and maybe even sparked a few ideas for your own research!