
Reformer: The Efficient Transformer for Long Sequences

Reformer (The Efficient Transformer) · Also known as: Efficient Transformer, LSH Transformer, Locality-Sensitive Hashing Transformer, Verimli Dönüştürücü

The Reformer is an efficient variant of the Transformer architecture introduced by Kitaev, Kaiser, and Levskaya at ICLR 2020. It addresses the prohibitive O(L²) memory and computational cost of standard self-attention on sequences of length L. The key innovations are locality-sensitive hashing (LSH) attention, which approximates full attention in O(L log L) time, and reversible residual layers, which let training store activations for a single layer instead of for every layer.


When to use it

Use the Reformer when your sequences are very long (tens of thousands of time steps or tokens) and standard Transformer training is infeasible due to memory constraints. It suits univariate or multivariate time-series forecasting tasks with extended context windows. It assumes that the input can be represented as a sequence of tokens and that approximate attention suffices. On short sequences, where full attention is affordable, the LSH approximation may hurt accuracy. For moderate-length sequences consider the vanilla Transformer or Informer; for extreme-length tasks Pyraformer or Autoformer may also be relevant.

Strengths & limitations

Strengths

Reduces self-attention complexity from O(L²) to O(L log L), enabling very long sequence modeling
Reversible residual layers cut activation memory during training from O(depth × L) to O(L), making deep models feasible on limited hardware
Chunked feed-forward layers further reduce peak memory with no change to model outputs
Compatible with standard Transformer training pipelines and pre-training paradigms
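The claim that chunked feed-forward layers leave model outputs unchanged can be checked with a small sketch. The NumPy code below is an illustration under stated assumptions, not the Reformer implementation: the two-layer ReLU network, the weight names W1, b1, W2, b2, and the chunk_size parameter are all hypothetical. Because the feed-forward layer acts on each position independently, processing the sequence in slices gives the same result while only one slice of the wide hidden activation exists at a time.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed-forward layer: ReLU(x W1 + b1) W2 + b2."""
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

def chunked_feed_forward(x, W1, b1, W2, b2, chunk_size):
    """Same computation, applied to chunk_size positions at a time."""
    out = np.empty((x.shape[0], W2.shape[1]), dtype=x.dtype)
    for start in range(0, x.shape[0], chunk_size):
        stop = start + chunk_size
        # Only this slice's hidden activation is alive at once.
        out[start:stop] = feed_forward(x[start:stop], W1, b1, W2, b2)
    return out

# Hypothetical sizes chosen only for illustration.
rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 1000, 16, 64
x = rng.standard_normal((seq_len, d_model))
W1 = rng.standard_normal((d_model, d_ff))
b1 = rng.standard_normal(d_ff)
W2 = rng.standard_normal((d_ff, d_model))
b2 = rng.standard_normal(d_model)

full = feed_forward(x, W1, b1, W2, b2)
chunked = chunked_feed_forward(x, W1, b1, W2, b2, chunk_size=128)
print(np.allclose(full, chunked))
```

The chunk size trades peak memory against the number of smaller matrix multiplications; it does not affect the result beyond floating-point rounding.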

Limitations

LSH attention is an approximation; rare but important long-range dependencies may be missed if relevant tokens fall into different buckets
Multiple hash rounds improve recall but add computational overhead and implementation complexity
Reversible layers complicate gradient checkpointing strategies and are harder to integrate with some existing frameworks
For short sequences the overhead of LSH bucketing can make the Reformer slower than a standard Transformer

Frequently asked

How does LSH attention differ from standard multi-head attention?

Standard multi-head attention computes dot products between every query-key pair, giving O(L²) cost. LSH attention first hashes queries and keys into buckets using random projections, then restricts attention to within-bucket pairs. The Reformer uses the same vectors for queries and keys, so a position's query and key always hash to the same bucket. Because similar vectors tend to land in the same bucket, most relevant interactions are preserved while cost drops to O(L log L). Multiple hash rounds increase recall at proportional cost.
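The bucketing step can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the paper's implementation: it uses a single hash round, shares one matrix qk for queries and keys, and loops over buckets for readability instead of sorting and chunking. It also omits causal masking. The function names lsh_buckets and bucketed_attention are hypothetical. The hash is the angular scheme based on random projections: project onto random directions R and take the argmax over the concatenation of xR and its negation.

```python
import numpy as np

def lsh_buckets(x, n_buckets, rng):
    """Assign each row of x a bucket id via argmax over [xR, -xR]."""
    assert n_buckets % 2 == 0, "n_buckets must be even"
    R = rng.standard_normal((x.shape[-1], n_buckets // 2))
    proj = x @ R
    return np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)

def bucketed_attention(qk, v, buckets):
    """Softmax attention restricted to positions sharing a bucket."""
    d = qk.shape[-1]
    out = np.zeros_like(v)
    for b in np.unique(buckets):
        idx = np.where(buckets == b)[0]
        scores = qk[idx] @ qk[idx].T / np.sqrt(d)
        scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        out[idx] = weights @ v[idx]
    return out

# Hypothetical sizes chosen only for illustration.
rng = np.random.default_rng(0)
L, d = 64, 16
qk = rng.standard_normal((L, d))  # shared query/key vectors
v = rng.standard_normal((L, d))
buckets = lsh_buckets(qk, n_buckets=8, rng=rng)
out = bucketed_attention(qk, v, buckets)
print(out.shape, np.bincount(buckets, minlength=8))
```

Each bucket only computes a small score matrix over its own members, which is where the saving over the full L × L matrix comes from. Putting every position in a single bucket recovers ordinary full attention.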

What are reversible residual layers and why do they save memory?

Reversible layers, adapted from RevNet, split the residual stream into two halves updated alternately. Given the output of a layer, its input can be exactly recomputed during backpropagation, so intermediate activations do not need to be stored. This collapses activation memory from O(depth × L) to O(L), which is the dominant saving that lets deep Reformers run on long sequences.
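A minimal sketch of the reversible update may make this concrete. It follows the RevNet-style coupling y1 = x1 + F(x2), y2 = x2 + G(y1); in the Reformer, F is the attention sublayer and G the feed-forward sublayer. Here both are replaced by small tanh maps purely for illustration, and the names rev_forward, rev_inverse, W_f, and W_g are hypothetical.

```python
import numpy as np

def rev_forward(x1, x2, F, G):
    """Reversible block: update the two halves alternately."""
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def rev_inverse(y1, y2, F, G):
    """Recover the block's inputs from its outputs alone."""
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

# Stand-in sublayers (assumptions; not attention or a real feed-forward net).
rng = np.random.default_rng(0)
d = 8
W_f = 0.1 * rng.standard_normal((d, d))
W_g = 0.1 * rng.standard_normal((d, d))
F = lambda x: np.tanh(x @ W_f)
G = lambda x: np.tanh(x @ W_g)

x1 = rng.standard_normal((4, d))
x2 = rng.standard_normal((4, d))
y1, y2 = rev_forward(x1, x2, F, G)
r1, r2 = rev_inverse(y1, y2, F, G)
print(np.allclose(r1, x1), np.allclose(r2, x2))
```

During backpropagation each layer's inputs can be rebuilt this way from the layer above, so only the final layer's activations need to be kept. In floating-point arithmetic the reconstruction matches the original inputs up to rounding error.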

Is the Reformer suitable for short time-series tasks?

Generally no. For sequences of a few hundred to a few thousand steps, standard Transformers or simpler architectures such as N-BEATS or PatchTST are faster and often more accurate. The Reformer's efficiency gains become meaningful only when sequence length is large enough that standard attention is a bottleneck, roughly L > 4,000 in practice.
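A back-of-the-envelope calculation shows why the bottleneck appears only at large L. The sketch below assumes a single attention head, batch size 1, and 32-bit floats, and counts only the dense L × L score matrix; the function name attention_scores_bytes is hypothetical.

```python
def attention_scores_bytes(seq_len, bytes_per_value=4):
    """Bytes needed to hold one dense seq_len x seq_len score matrix."""
    return seq_len * seq_len * bytes_per_value

for seq_len in (512, 4096, 65536):
    mib = attention_scores_bytes(seq_len) / 2**20
    print(f"L={seq_len:>6}: {mib:,.0f} MiB")
```

Under these assumptions the matrix takes 1 MiB at L = 512, 64 MiB at L = 4096, and 16 GiB at L = 65536, before multiplying by the number of heads and the batch size.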

Sources

Kitaev, N., Kaiser, Ł., & Levskaya, A. (2020). Reformer: The efficient transformer. ICLR.

How to cite this page

ScholarGate. (2026, June 2). Reformer (The Efficient Transformer). ScholarGate. https://scholargate.app/en/deep-learning/reformer


Referenced by

Pyraformer

Related reference concepts

Sequence-to-Sequence Models and Transformers; Convolutional and Sequence Models; Self-Supervised and Representation Learning; Backpropagation and Optimization; Neural Network Architectures; Deep Learning


