Site header image Jinsung Lee’s Personal Homepage

Towards 2-Dimensional State-Space Models (3): State-Space Regularization


This post continues from the previous post: Towards 2-Dimensional State-Space Models (2): WHippo.


Let’s recap the principles of state-space models (SSMs) that we settled on in the previous post.

  • The main equation of an SSM is determined by two components:
    the compression function $\mathbf{c}(\cdot)$ and the input transformation $\theta(\cdot)$.
  • The goal of an SSM is to let the network distill the properties of the compression function $\mathbf{c}(\cdot)$ by mimicking its behavior under the input transformation $\theta(\cdot)$.
  • We can manually derive the compression function’s dynamics $\frac{\textrm{d}\mathbf{c}(\mathbf{I}_t)}{\textrm{d}t}$ and let the hidden state follow these dynamics so that it behaves as a “compressed” state.
  • SSMs often let the compression function $\mathbf{c}(\cdot)$ be a projection onto 1D orthogonal basis functions and define $\theta(\cdot)$ as token concatenation, which yields compression function dynamics in the form of the following differential equation:
    $$\frac{\textrm{d}\mathbf{c}(\mathbf{I}_t)}{\textrm{d}t} = \mathbf{A}\mathbf{c}(\mathbf{I}_t) + \mathbf{B}u_t \quad \xRightarrow[]{\text{induced update rule}} \quad h_t = \overline{\mathbf{A}}h_{t-1} + \overline{\mathbf{B}}u_t$$

    where $u_t$ is the token that appears at the $t$-th timestep.

  • WHippo suggests letting $\mathbf{c}(\cdot)$ be a 2D orthogonal polynomial projection and $\theta(\cdot)$ be a Gaussian blurring, which ends up yielding:
    \begin{equation} \frac{\textrm{d}\mathbf{c}(\mathbf{I}_t)}{\textrm{d}t} = \mathbf{A}\mathbf{c}(\mathbf{I}_t) \quad \xRightarrow[]{\text{induced update rule}} \quad h_t = \overline{\mathbf{A}}h_{t-1} \end{equation}
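
To make the contrast between the two update rules concrete, here is a minimal sketch in plain Python. The values of `A_bar`, `B_bar`, and `h0` are made-up toy numbers, not actual HiPPO or WHippo matrices, and the helper names `matvec` and `ssm_step` are hypothetical.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def ssm_step(A_bar, h_prev, B_bar=None, u_t=None):
    """One discrete update: h_t = A_bar h_{t-1}, plus B_bar u_t when an input exists."""
    h_t = matvec(A_bar, h_prev)
    if B_bar is not None and u_t is not None:
        h_t = [h + b * u_t for h, b in zip(h_t, B_bar)]
    return h_t


# Toy values, chosen only for illustration (assumptions, not real SSM matrices).
A_bar = [[0.9, 0.0], [0.1, 0.8]]
B_bar = [1.0, 0.5]
h0 = [1.0, 2.0]

h_1d = ssm_step(A_bar, h0, B_bar, u_t=2.0)  # 1D SSM: a new token feeds the state
h_whippo = ssm_step(A_bar, h0)              # WHippo: the state evolves with no input
```

With no input term, the WHippo-style state at every step is fully determined by the previous state, which is the point the following discussion builds on.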

The most noticeable difference is whether the input term $u_t$ exists: it is absent from the WHippo update. This difference emerges because we do not define the input transformation $\theta(\cdot)$ in an information-adding manner; rather, it is an information-losing process, since the information in a blurred image $\mathbf{I}_t$ is strictly a subset of the information in the sharper image $\mathbf{I}_{t-1}$.

🤔
Wait, then couldn’t we have developed a more natural extension by assuming $\theta(\cdot)$ to be Gaussian deblurring instead of blurring, so that it similarly becomes an information-gaining process and is more analogous to the original 1D SSM, with $u_t$ as the missing information?

In fact, that was my original attempt. We can relate the two coefficient vectors $\mathbf{c}(\mathbf{I}_t)$ and $\mathbf{c}(\mathbf{I}_{t-1})$ through the information gap between the two images, $u_t := \mathbf{I}_{t} - \mathbf{I}_{t-1}$ (the detail that deblurring restores), which gives:

$$\mathbf{c}(\mathbf{I}_{t}) = \mathbf{c}(\mathbf{I}_{t-1}) + \mathbf{c}(u_t).$$

Because the basis projection $\mathbf{c}(\cdot)$ is linear, the resulting equation ends up feeling a bit underwhelming: it is nothing more than a statement of linearity. Note that the whole point of an SSM is to get the network to distill the properties of the compression function $\mathbf{c}(\cdot)$. This formulation guides the network toward linearity, which is indeed a property of $\mathbf{c}(\cdot)$, but hardly its most distinctive one.
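
The linearity argument can be checked numerically. The sketch below uses a toy two-vector orthonormal basis over four pixels as a stand-in for the real basis; the basis, the images, and the helper name `project` are all assumptions made for illustration.

```python
def project(basis, image):
    """Coefficients of a flattened image on each basis vector (inner products)."""
    return [sum(b * x for b, x in zip(vec, image)) for vec in basis]


# Two orthonormal basis vectors over 4 pixels (toy stand-in for the real basis).
basis = [
    [0.5, 0.5, 0.5, 0.5],    # constant component (lowest frequency)
    [0.5, 0.5, -0.5, -0.5],  # one sign change
]

I_prev = [2.0, 2.0, 3.0, 3.0]               # blurrier image
u_t = [-1.0, 1.0, 0.5, 0.5]                 # made-up detail to be restored
I_t = [p + u for p, u in zip(I_prev, u_t)]  # sharper image = blurrier image + detail

lhs = project(basis, I_t)
rhs = [a + b for a, b in zip(project(basis, I_prev), project(basis, u_t))]
```

Here `lhs` and `rhs` agree for any choice of images, because the identity holds for every linear map. That is exactly why it says little about what makes this particular compression function special.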

You might be able to derive a non-trivial equation if you choose a $\mathbf{c}(\cdot)$ other than a basis projection (such as PNG or H.264), but I leave that as future exploration for now.

As a result, we can no longer stick to the traditional way of sequence modeling, where the model updates the hidden state based on incoming tokens.

State-space modeling as a regularization

Thus, WHippo’s state update has to take the form of a regularization rather than an explicit update as in 1D SSMs.

For instance, let us define a parameterized feature extractor $f$ and say we want it to produce the hidden states: $h_t := f(\mathbf{I}_t)$. Then we can regularize $f$ to follow the proposed dynamics in Eq. (1) with the following loss $\mathcal{L}$:

\begin{equation} \mathcal{L} = d\bigl(\overline{\mathbf{A}}f(\mathbf{I}_{t-1}), f(\mathbf{I}_{t})\bigr), \end{equation}

where $d(\cdot, \cdot)$ is a distance measure.
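
A minimal sketch of this loss in plain Python, assuming mean squared error as the distance $d$ and a hand-written toy extractor in place of a neural network. The names `toy_f` and `mse` and the values of `A_bar` are hypothetical and chosen only so the arithmetic is easy to follow; in practice $f$ would be a trainable network and the loss would be backpropagated through it.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def mse(a, b):
    """Mean squared error, used here as one possible distance measure d."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def state_space_regularization(f, A_bar, I_prev, I_t, d=mse):
    """L = d(A_bar f(I_{t-1}), f(I_t)): penalize features that break the dynamics."""
    predicted = matvec(A_bar, f(I_prev))
    return d(predicted, f(I_t))


def toy_f(image):
    """Toy feature extractor (assumption): [mean, first-minus-last contrast]."""
    return [sum(image) / len(image), image[0] - image[-1]]


A_bar = [[1.0, 0.0], [0.0, 0.5]]  # toy dynamics: keep the mean, halve the contrast
I_prev = [0.0, 4.0]               # sharper two-pixel "image"
I_t = [1.0, 3.0]                  # blurrier version with half the contrast

loss_consistent = state_space_regularization(toy_f, A_bar, I_prev, I_t)
loss_violating = state_space_regularization(toy_f, A_bar, I_prev, I_prev)
```

The loss vanishes when the features of the blurrier image match the evolved features of the sharper one, and grows when they do not. Nothing in the function depends on how `f` is built.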

Enabling SSM principles in this way gives a somewhat intriguing advantage:

Now we can adopt state-space models without even employing SSM architectures.

An SSM enabled through the regularization loss $\mathcal{L}$ is independent of the architecture of the model $f$. Hence, we can adopt SSM principles without explicitly using well-known SSM blocks such as S4, S4D, or S5. I omit Mamba here since we are dealing with linear time-invariant (LTI) SSMs, at least for now.

However, this is also where WHippo becomes harder to compare with other SSMs, since its performance varies widely with the architecture we choose for $f$.

An interesting usage of the state-space regularization

From now on, I’ll call the regularization loss $\mathcal{L}$ from Eq. (2) the state-space regularization.

I was wondering where this regularization could become useful, and discovered that it can provide a rather niche value in image tokenization.

What is image tokenization?

Well, in the modern era of AI, everything happens in a latent space—onto which the input is projected in order to be handled in a lighter, interpretable way.

An image of a tokenizer. A tokenizer encodes the input into a lower-dimensional latent space to produce a “latent feature”, and decodes it back to the original input.

An image tokenizer provides a way to represent RGB images as tokens, which are easier for neural networks to handle than raw RGB images and often encode image semantics that pixel values cannot provide.

The most common use case of tokenization is generation: it greatly reduces the number of tokens to process, which can accelerate the resource-demanding generation process by an order of magnitude.

Research on world models and representation learning has also been shifting toward latent modeling these days, so the value of tokenization keeps rising. One of my recent publications also features this, so check it out!

Anyway, in this post, let’s focus only on tokenization for generation, as this area is where state-space regularization becomes valuable.

A good image tokenization

What defines a good image tokenization? There could be several criteria, but at this point the following two are the most important:

  1. Compactness: an image tokenizer should recover images sufficiently well from the latent tokens.
  2. Generation-friendliness: an image tokenizer should provide a generation-friendly latent space, where the latent distribution is easily modeled by existing generative models.

Compactness being a good property is quite obvious: the tokens should hold the image information accurately in a fixed token budget in order to properly represent a pixel-heavy image.

Generation-friendliness matters if you plan to use the tokenizer for any type of generation task, which may include image generation/editing, planning, or manipulation, all of which sit at the center of AI research right now.

The problem is that nobody actually knows what makes a tokenizer "generation-friendly" yet. Right now, the only way to find out is to train a full generative model from scratch and see how it performs. As you can guess, that burns through massive amounts of time and money, and that is exactly why researchers are trying to find a shortcut—a way to measure a tokenizer's generation capability before the generative model training. Below are some notable efforts:

  • EQ-VAE, FT-SE: enforcing rotation- or scale-equivariance between latent and pixel spaces helps
  • MAETok: latent space represented with fewer Gaussian modes helps
  • VA-VAE, RAE, ReDi: involving a vision foundation model helps
  • UL: directly involving a diffusion prior helps
🤔
This problem is one of my biggest interests these days, and my colleagues and I are building insights into what constitutes such a generation-friendly latent space. I’ll share our ideas at some point once they are well organized and proven.

A noteworthy theory in this field is that compactness and generation-friendliness are hard to achieve together.

A figure from VA-VAE. We often observe that a tokenizer that achieves better reconstruction quality performs worse at generation.

The reconstruction quality in the figure above tells us how well a tokenizer preserves the original image detail (see this figure for reference; the detail of the castle has changed significantly). In general, a more compact tokenizer packs more information into each token, which often leads to higher reconstruction quality.

However, the more information is packed into a token, the more generation quality you have to give up, according to this reconstruction-generation trade-off. Intuitively, compact tokens often have to encode pixel-perfect details of an image, which creates immense complexity in the latent distribution and makes it hard for generative models to learn that distribution during training.

Why would state-space regularization help?

We thought our state-space regularization might help achieve both at the same time, for two reasons:

  1. Our regularization nudges the hidden state (here, the latent feature) to resemble the output of the compression function, so there is a chance of it producing more compact latent features.
  2. Choosing basis projections as the compression function enables the latent to capture spectral components of the input, which helps generative models synthesize images in a structured order.

The second claim is primarily based on a well-known insight from Generative Modelling with Inverse Heat Dissipation and Diffusion is Spectral Autoregression: diffusion models generate images by first generating the low-frequency components and then the high-frequency details. Recent studies in this field often explicitly design diffusion to proceed in this order, achieving better generation-friendliness.

A figure from Sander Dieleman’s blog post, Diffusion is Spectral Autoregression. The noising process removes image content in high-to-low frequency order, and the diffusion process does exactly the opposite: it synthesizes the low-frequency content first and then generates high-frequency details based on the generated layout.
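
As background for why blurring strips high frequencies first, consider the standard heat-equation view of Gaussian blur. This is a textbook fact rather than something derived in this series, and the symbols $\boldsymbol{\omega}$ (spatial frequency) and $\mathcal{F}$ (Fourier transform) are notation introduced only for this remark:

$$\frac{\partial \mathbf{I}_t}{\partial t} = \Delta \mathbf{I}_t \quad \Longrightarrow \quad \mathcal{F}[\mathbf{I}_t](\boldsymbol{\omega}) = e^{-\lVert \boldsymbol{\omega} \rVert^2 t}\, \mathcal{F}[\mathbf{I}_0](\boldsymbol{\omega})$$

The attenuation factor shrinks fastest for large $\lVert \boldsymbol{\omega} \rVert$, so fine details vanish first while the coarse layout survives longest. This is the blur-based picture, and it applies directly to the Gaussian blurring used as $\theta(\cdot)$ in WHippo.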

State-space regularization applied to image tokenizers

We designed our regularization to be applied to an image tokenizer as follows:

The network is trained to match $\hat{\mathbf{I}}$ to $\mathbf{I}$. $\tau_k$ is a randomly sampled blur level.

The regularization strength $\alpha$ indicates how often the tokenizer should focus on relating two latents rather than on reconstructing the input. In this way, we can make the latents of a sharper image, $\mathbf{z}_1$, and a blurrier image, $\mathbf{z}_2$, follow the predefined dynamics of the basis coefficients. Hence, we can expect each channel of a latent to behave as if it were a coefficient of a 2D orthogonal polynomial.
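
One way to read the role of $\alpha$ is as the probability of spending a training iteration on the regularization objective instead of reconstruction. The sketch below encodes that reading; it is my assumption about the schedule, not a description of the actual training code, and the function names are hypothetical.

```python
import random


def pick_objective(alpha, rng):
    """Return 'regularization' with probability alpha, else 'reconstruction'."""
    return "regularization" if rng.random() < alpha else "reconstruction"


def count_objectives(alpha, num_iters, seed=0):
    """Count how many iterations each objective receives over a training run."""
    rng = random.Random(seed)  # seeded so the schedule is reproducible
    counts = {"regularization": 0, "reconstruction": 0}
    for _ in range(num_iters):
        counts[pick_objective(alpha, rng)] += 1
    return counts


counts = count_objectives(alpha=0.25, num_iters=10000)
```

Under this reading, roughly a fraction $\alpha$ of the iterations goes to the latent-relationship objective and the remaining $(1-\alpha)$ goes to reconstruction.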

When we applied this regularization to fine-tune tokenizers, it turned out that our assumptions were half wrong and half right.

Reconstruction and generation experiment results on ImageNet-1K when the state-space regularization (Eq. (2)) is applied.

  1. Reconstruction quality is not enhanced. Although the regularization does guide the tokenizer’s encoder to divide the input image information in an orthogonal manner that enables compact compression (that’s how basis coefficients store image information), it cuts the share of training iterations spent on reconstruction to $(1-\alpha)$. Apparently, tokenizers prefer to reconstruct images with strategies of their own rather than being guided by the compression function’s prior. Yet it did not significantly harm reconstruction performance either, even compared to other regularizers that train the tokenizer to reconstruct images in every training iteration.
  2. Generation quality is enhanced. Our regularization does guide the latent feature’s channels to embed the hierarchical nature of frequency components, as we expected. Generation performance metrics indicate that the latent structures we enforced helped the downstream generative models synthesize images. We also believe that this regularization encourages latent representations of images with similar spatial layouts to cluster together. Since such images naturally share similar low-frequency components, their corresponding latents are likely to agree on at least the low-frequency channels. This behavior may help shape a smoother latent distribution. Intuitively, placing structurally similar images far apart in latent space introduces unnecessary complexity into the representation. By encouraging these latents to remain close, the model can organize the latent space more coherently, potentially making it easier to model and interpolate.
    Since the channels encode low-to-high frequency components of an image, we can validate whether each channel really contains such information by partially masking latent channels and decoding the result back to pixel space. The visualization shows the progression as we unmask channels from low to high frequencies. Since low-frequency channels encode the general layout of an image, you can see that using only the 3 lowest-frequency channels already enables a pretty good image reconstruction.
    Generation examples from LightningDiT-B/2, using the Flux tokenizer
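
The channel-masking check described above can be sketched as follows. The latent here is a toy list of channels, assumed to be ordered from low to high frequency, and the decoder that would turn each masked latent back into an image is deliberately left out; `mask_channels` and `progressive_unmasking` are hypothetical helper names.

```python
def mask_channels(latent, keep):
    """Keep the first `keep` channels of a latent and zero out the rest."""
    return [
        list(channel) if i < keep else [0.0] * len(channel)
        for i, channel in enumerate(latent)
    ]


def progressive_unmasking(latent):
    """Latents with 1, 2, ..., C channels unmasked, lowest frequency first."""
    return [mask_channels(latent, k) for k in range(1, len(latent) + 1)]


# Toy latent with 3 channels of 2 values each (assumption for illustration).
latent = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
steps = progressive_unmasking(latent)
```

Decoding each entry of `steps` in turn would give the kind of progression shown in the visualization, with the last entry equal to the unmasked latent.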

State-space regularization applied to image restoration

Given that state-space regularization produces features that encode different frequency bands separately, I thought it might benefit image restoration tasks such as denoising or super-resolution.

I did not dig deeper into these tasks, but I made a prototype attempt to see whether our regularization can benefit image denoising (denoising is actually what a generative model does, so it is worth trying!).

I chose Restormer, a well-known image restoration model, and trained it for 30K iterations to denoise real-world images with and without state-space regularization.

Image denoising results on the real-world denoising dataset SIDD. The brightness of the images is slightly adjusted to better display the qualitative results.

The regularized model scored higher on both metrics, by about 0.31 dB in PSNR and 0.0017 in SSIM:

Model        PSNR (dB)   SSIM
Restormer    39.5803     0.9111
w/ Ours      39.8939     0.9128
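
For readers less familiar with the metric, PSNR is defined as $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$. The sketch below implements that definition on flat pixel lists and converts a PSNR gain into the equivalent MSE ratio. It is a generic illustration with hypothetical helper names, not the evaluation code behind the table, and the conversion is exact only for a single image rather than for a dataset average.

```python
import math


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two flat pixel lists."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_from_gain(gain_db):
    """MSE ratio (new / old) implied by a PSNR gain of `gain_db` decibels."""
    return 10.0 ** (-gain_db / 10.0)


example_psnr = psnr([0.0, 0.0, 0.0, 0.0], [0.1, 0.1, 0.1, 0.1])
ratio = mse_ratio_from_gain(39.8939 - 39.5803)
```

Read this way, the gain in the table corresponds to a single-image error reduction of roughly 7% in MSE terms.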

I initially considered exploring this direction further, but soon figured out that training image restoration models typically takes several days even for a single experiment. Given that this was only meant to be a proof-of-concept exploration, the required effort felt a bit too heavy, so I decided to leave it as an interesting future direction instead.

Closing thoughts



This series began with a simple goal: to come up with a 2-dimensional state-space model that makes more sense for visual data.

Interestingly, the journey did not end where I expected it to. The outcome was neither a new SSM module nor yet another Mamba-based vision backbone that achieves state-of-the-art results across benchmarks. Instead, it led me to a different way of applying SSMs to computer vision: through regularization.

While searching for a suitable application, the story took another unexpected deviation and drifted into the field of image tokenization—a topic that honestly deserves an entire series of its own. Apologies if some of the transitions felt abrupt or the explanations were a bit rushed 😅. Nevertheless, the resulting idea ended up making a surprisingly meaningful contribution: SSMs have rarely been used as a feature regularizer, nor have they explicitly been employed to improve frequency-awareness to enhance image generation and restoration.

What I found particularly exciting is that this exploration also uncovered some theoretically appealing connections. Along the way, we arrived at a mathematically well-defined structure for the $\mathbf{A}$ matrix without explicitly engineering it to be diagonal, and stumbled upon a perspective that hints at a more generalized view of state-space models.

Perhaps this was the most bent and crooked research I’ve ever done: I set out to solve one problem and ended up finding something entirely different. But in fact, it was a very enjoyable journey, and I felt quite satisfied with the lessons I learned (although it is somewhat hard to frame them as a single, well-organized paper 😂).

I hope some of the ideas in this series resonate with fellow SSM enthusiasts. If you have thoughts, questions, or completely different interpretations, I'd love to hear them in the comments.

With that, this concludes the "Towards 2-Dimensional State-Space Models" series.
Next time, I'll return with something a little more casual: a post about my time in Brazil while attending ICLR 2026! Tchau 🤙🏼