Sequential data is everywhere:

  • text
  • dynamics
  • protein synthesis
  • genome
  • music
  • biological signals

Basics, remarks, history, and motivation around new sequential models

Finding Structure in Time - Elman, 1990.

  • Option 1: “One obvious way to represent time explicitly is by associating the serial order of the pattern with the dimensionality of the pattern vector. The first temporal event is represented by the first element in the pattern vector, the second temporal event is represented by the second position in the pattern vector, and so on.”

Input $x \in \mathbb{R}^{L \times d}$, where $L$ is the sequence length (fixed) and $d$ is the feature dimension. The number of parameters of a model acting on the whole flattened input is on the order of $L \cdot d$ and thus very big;
so we move from Option 1 to Option 2.
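
A minimal back-of-the-envelope sketch of this blow-up, with illustrative (assumed) sizes: a dense layer acting on the flattened length-$L$ sequence has a parameter count that grows with $L$, while a recurrent cell reused at every step does not.

```python
# Illustrative sizes (assumptions, not from the notes).
L, d, h = 1024, 64, 128          # sequence length, feature dim, hidden size

params_option1 = (L * d) * h     # dense layer on the flattened input: grows with L
params_rnn_cell = d * h + h * h  # recurrent cell reused at every step: independent of L

print(params_option1)            # 8388608
print(params_rnn_cell)           # 24576
```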

  • Option 2: “There is another very different possibility: allow time to be represented by the effect it has on processing. ”

You learn to construct a memory of the “input seen so far”. The memory should capture the joint data distribution useful for a particular task.

Most natural model? RNNs (LSTM, GRU, etc.).

Until very recently RNNs were the default: in 2016, “Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation” was published and used LSTMs.

Then came “Attention Is All You Need”, which is more of a smart Option 1.

LSTMs, RNNs

Inference over a length-$L$ sequence takes $O(L)$ sequential steps, and the nonlinear sequential computation cannot easily be parallelized. RNNs are also hard to train, suffering from vanishing/exploding gradients.
Variants of RNNs, like LSTMs and GRUs, are easier to train, but still not parallelizable.
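
To make the bottleneck concrete, here is a minimal sketch (hypothetical weights and shapes) of a vanilla RNN forward pass: because each hidden state depends nonlinearly on the previous one, the $L$ steps must be computed one after another.

```python
import numpy as np

def rnn_forward(x, W_h, W_x, b):
    L, _ = x.shape
    h = np.zeros(W_h.shape[0])
    hs = []
    for t in range(L):                         # inherently sequential loop over time
        h = np.tanh(W_h @ h + W_x @ x[t] + b)  # the nonlinearity couples consecutive steps
        hs.append(h)
    return np.stack(hs)

x = np.random.randn(16, 4)                     # (L=16, d=4)
W_h, W_x, b = np.random.randn(8, 8), np.random.randn(8, 4), np.zeros(8)
print(rnn_forward(x, W_h, W_x, b).shape)       # (16, 8)
```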

Transformers, powered by attention

They’re easy to scale (deeper, wider) and to train on modern hardware.
They are not natural on **long sequences**:

  • Inference is slow; long context is possible, but only with massive compute or tricks.

Recent progress in RNNs

Some new models came on the RNN side, like S4, S5 and the LRU.
S4 started it all with “Efficiently Modeling Long Sequences with Structured State Spaces”.
With near-linear ($O(L \log L)$) complexity, S4 is an RNN that runs fast on GPUs.
S4 outperformed Transformers on long-range sequence benchmarks (where LSTMs and vanilla RNNs still performed very poorly).

“Pretraining Without Attention” and H3 followed, motivated mainly by linear inference time.

SOTA efficient LMs based on SSMs are all RNNs at test time: they are trained in parallel like Transformers and deployed as RNNs.

Linear Attention Transformers can be described as RNNs. Maybe.
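
A minimal sketch of that view (the feature map and shapes here are illustrative assumptions): replacing softmax attention with a kernel feature map $\phi$ lets the running sums $S_t = \sum_{s \le t} \phi(k_s) v_s^\top$ and $z_t = \sum_{s \le t} \phi(k_s)$ act as an RNN state.

```python
import numpy as np

def linear_attention_rnn(q, k, v, phi=lambda u: np.maximum(u, 0) + 1e-6):
    """Causal linear attention computed recurrently; S and z are the RNN state."""
    L, d = q.shape
    d_v = v.shape[1]
    S = np.zeros((d, d_v))                     # running sum of phi(k_t) v_t^T
    z = np.zeros(d)                            # running sum of phi(k_t) (normalizer)
    out = np.zeros((L, d_v))
    for t in range(L):
        S += np.outer(phi(k[t]), v[t])
        z += phi(k[t])
        out[t] = phi(q[t]) @ S / (phi(q[t]) @ z)
    return out

L, d = 8, 4
q, k, v = (np.random.randn(L, d) for _ in range(3))
print(linear_attention_rnn(q, k, v).shape)     # (8, 4)
```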

Inference time and memory scale much better, and better performance is often reported in language modelling (e.g., Mamba-2 at 8B parameters trained on 3.5T tokens).

Some things are just possible now

Like DNA data: sequences of 1M nucleotides are now processable (“HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution”).

Basic Design Principle in SSMs (S4)

An S4 layer is a discretized linear state-space model: $\dot h(t) = A h(t) + B x(t)$, $y(t) = C h(t)$, with learnable $A$, $B$, $C$ and a step-size $\Delta$ parameter, giving the recurrence $h_t = \bar{A} h_{t-1} + \bar{B} x_t$, $y_t = C h_t$.
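
A minimal sketch of this recipe, assuming a diagonal continuous-time $A$ and zero-order-hold discretization (values below are illustrative, not the actual S4 initialization):

```python
import numpy as np

def discretize_zoh(A_diag, B, delta):
    """Zero-order-hold discretization for a diagonal A: A_bar = exp(Delta*A)."""
    A_bar = np.exp(delta * A_diag)
    B_bar = (A_bar - 1.0) / A_diag * B          # (exp(Delta*A) - I) A^{-1} B, elementwise
    return A_bar, B_bar

def ssm_recurrence(A_bar, B_bar, C, x):
    """Run h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t over a scalar input sequence."""
    h = np.zeros_like(A_bar)
    ys = []
    for x_t in x:
        h = A_bar * h + B_bar * x_t
        ys.append(C @ h)
    return np.array(ys)

n, L = 16, 100
A_diag = -np.linspace(0.1, 1.0, n)              # stable (negative real) diagonal A
B, C = np.random.randn(n), np.random.randn(n)
A_bar, B_bar = discretize_zoh(A_diag, B, delta=0.1)
print(ssm_recurrence(A_bar, B_bar, C, np.random.randn(L)).shape)  # (100,)
```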

Years ago the goal was for the hidden state itself to solve the task; today the state is just asked to compress the input seen so far: the SSM is one block, one piece of a larger architecture.

Can’t we match the performance and efficiency of deep continuous-time SSMs using simpler deep RNNs?

  • Step-by-step theory-inspired modification of vanilla RNNs;

They replaced attention with linear RNNs; dropping the recurrent nonlinearity improves accuracy.
Nonlinearities can hurt optimization.
An important advantage is that linear dynamics can be parallelized: you can parallelize linear RNNs.
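
A minimal sketch of why (shapes and values are illustrative): each step $h_t = a_t h_{t-1} + b_t$ is an affine map, and composing affine maps is associative, so prefix states can be computed with a parallel scan. Below, only the associative combine and a sequential reference check are shown.

```python
import numpy as np
from functools import reduce

def combine(left, right):
    """Compose two affine maps h -> a*h + b (left applied first, then right)."""
    a_l, b_l = left
    a_r, b_r = right
    return a_r * a_l, a_r * b_l + b_r

L = 6
a = np.random.rand(L) * 0.9
b = np.random.randn(L)

# Sequential reference: h_t = a_t * h_{t-1} + b_t, starting from h_{-1} = 0.
h, hs = 0.0, []
for t in range(L):
    h = a[t] * h + b[t]
    hs.append(h)

# Composing the first four affine maps (in any association order) gives h_3 directly.
a3, b3 = reduce(combine, zip(a[:4], b[:4]))
print(np.isclose(b3, hs[3]))                    # True
```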

Linear RNNs can be learned in complex-diagonal form, making the computation very fast. Diagonal linear RNNs are both more efficient and work better in practice.
Also, the parametrization is better aligned with the processing → Adam works better.

Normalization is needed to balance long/short range.
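
A minimal LRU-style sketch of both points (the parametrization below follows the Linear Recurrent Unit; shapes and initializations are illustrative): the eigenvalues are learned in complex-diagonal form inside the unit disk, and a $\gamma = \sqrt{1-|\lambda|^2}$ factor normalizes the input so long-memory and short-memory channels keep comparable state magnitudes.

```python
import numpy as np

n, d, L = 8, 4, 32
nu, theta = np.random.randn(n), np.random.rand(n) * np.pi
lam = np.exp(-np.exp(nu) + 1j * theta)          # complex-diagonal eigenvalues, |lam| < 1
gamma = np.sqrt(1.0 - np.abs(lam) ** 2)         # balances long-range vs short-range channels
B = np.random.randn(n, d) / np.sqrt(d)

x = np.random.randn(L, d)
h = np.zeros(n, dtype=complex)
for t in range(L):                              # elementwise (diagonal) linear update
    h = lam * h + gamma * (B @ x[t])
print(np.abs(h))                                # state magnitudes stay well scaled
```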


RNNs have training problems: two models can be equivalent at test time, yet behave drastically differently during training.

Recurrent Neural Networks: vanishing and exploding gradients are not the end of the story.

Why are linear RNNs effective?

Linear RNNs can perform lossless compression: unrolling a linear RNN can be written in matrix form.
In principle you can invert that matrix and retrieve the input. In practice you often can’t, because the resulting Vandermonde matrix is ill-conditioned; it can be made well-conditioned depending on the location of the nodes in the complex plane.
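
A minimal numerical sketch of this conditioning claim (sizes and node choices are illustrative): the map from the last $L$ inputs to the final state of a diagonal linear RNN is Vandermonde in the eigenvalues, and its condition number depends dramatically on where the nodes sit in the complex plane.

```python
import numpy as np

L = 16
real_nodes = np.linspace(0.5, 0.99, L)                  # eigenvalues on the real line
circle_nodes = np.exp(2j * np.pi * np.arange(L) / L)    # eigenvalues spread on the unit circle

V_real = np.vander(real_nodes, L, increasing=True)
V_circle = np.vander(circle_nodes, L, increasing=True)

print(np.linalg.cond(V_real))    # astronomically large: inputs cannot be recovered in practice
print(np.linalg.cond(V_circle))  # ~1: the unrolled map is numerically invertible
```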

In one layer they can approximate non-linear dependencies.


S4/LRU are expressive when combined with MLPs, but alone cannot go beyond linear filtering.

What are the new RNNs we use today?

Mamba, Griffin, GLA, HGRN2, mLSTM: the recurrence closely resembles a GRU, but is linear in the state. Removing the hidden-state dependency in the forget gate allows for parallelization on GPU: the forget gate does not depend on the state, only on the input.

We need to do content-dependent filtering (forgetting).
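
A minimal sketch of this kind of gating (illustrative weights; the exact gate form differs across Mamba/Griffin/GLA/HGRN2): the forget gate is a function of the current input only, so all gates can be precomputed in parallel and the remaining recurrence is linear in the state.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n, d, L = 8, 4, 32
W_f, W_i = np.random.randn(n, d), np.random.randn(n, d)

x = np.random.randn(L, d)
f = sigmoid(x @ W_f.T)          # all forget gates at once: no hidden-state dependency
u = x @ W_i.T                   # all candidate updates at once

h = np.zeros(n)
for t in range(L):              # the remaining scan is linear in h (and parallelizable)
    h = f[t] * h + (1.0 - f[t]) * u[t]
print(h.shape)                  # (8,)
```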

Theoretical Foundations of Deep Selective State-Space Models.

What’s so special about it? It basically is an iterated MLP and can approximate any nonlinear dynamical system.

New RNNs, e.g. SSMs like Mamba.

It’s not an iterated MLP anymore: it is linear and diagonal in the state and nonlinear in the input (schematically, $h_t = \mathrm{diag}(a(x_t))\, h_{t-1} + b(x_t)$).

The illusion of State in State-Space Models.

Diagonal SSMs and attention both struggle at state tracking. Also, SSMs can struggle at copying, while attention with induction heads can remember perfectly.

“An Empirical Study of Mamba-based Language Models”: Mamba and Mamba-2 work well on text, but they are not good at tasks that involve copying.
There seem to be expressivity issues at small scale, but not at larger scale, where they work well and are faster than Transformers.

Gated Delta Networks: Improving Mamba2 with Delta Rule.

DeltaProduct: Increasing the Expressivity of DeltaNet Through Products of Householders.
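
A minimal sketch of the delta-rule state update these models build on (a simplified reading: no gating, illustrative shapes and normalization): each step multiplies the matrix-valued state by a Householder-like factor $(I - \beta_t k_t k_t^\top)$ and then writes $\beta_t v_t k_t^\top$; DeltaProduct applies several such Householder factors per step.

```python
import numpy as np

d_k, d_v, L = 4, 4, 16
k = np.random.randn(L, d_k)
k /= np.linalg.norm(k, axis=1, keepdims=True)     # unit-norm keys
v = np.random.randn(L, d_v)
q = np.random.randn(L, d_k)
beta = np.random.rand(L)                          # per-step write strength in (0, 1)

S = np.zeros((d_v, d_k))                          # matrix-valued recurrent state
outs = []
for t in range(L):
    # Householder-like erase of the k_t direction, then a rank-1 write of v_t k_t^T.
    S = S @ (np.eye(d_k) - beta[t] * np.outer(k[t], k[t])) + beta[t] * np.outer(v[t], k[t])
    outs.append(S @ q[t])                         # read the state with the query
print(np.stack(outs).shape)                       # (16, 4)
```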

Fixed-Point RNNs: from diagonal to dense in a few iterations.
Most of these effects are due to latent SSM optimization problems. SSMs can actually copy and recall; one just needs to train very carefully (“Revisiting Associative Recall in Modern Recurrent Models”): with good learning rates it just works.

Lecture: Nonconvex Optimization for Deep Learning.