Longformer / BigBird

Long-Sequence Transformers with Sparse Attention (Longformer / BigBird)
Also known as: long-sequence Transformer, long-document Transformer, sparse-attention Transformer

Long-sequence Transformers such as Longformer (Beltagy, Peters & Cohan, 2020) and BigBird (Zaheer et al., 2020) replace the standard Transformer's dense self-attention, whose cost grows as O(n²) in the sequence length n, with sparse attention patterns whose cost grows linearly, O(n). This lets a single model attend over thousands of tokens (full documents, legal texts, or genomic sequences) that would not fit in a conventional Transformer's input window.
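
The gap between quadratic and linear scaling is easier to see with numbers. The sketch below counts how many query-key scores each scheme computes for one attention head. The function names and the window size of 512 are illustrative assumptions, not values taken from this page, and the count ignores global tokens.

```python
def dense_attention_pairs(n: int) -> int:
    """Query-key scores in dense attention: every token against every token."""
    return n * n


def sliding_window_pairs(n: int, window: int) -> int:
    """Scores when each token attends only to neighbours within window // 2 on each side."""
    half = window // 2
    total = 0
    for i in range(n):
        lo = max(0, i - half)          # window is clipped at the start of the sequence
        hi = min(n - 1, i + half)      # ... and at the end
        total += hi - lo + 1
    return total


# Assumed window size of 512 tokens, chosen only for illustration.
for n in (512, 1024, 2048, 4096):
    dense = dense_attention_pairs(n)
    sparse = sliding_window_pairs(n, window=512)
    print(f"n={n:5d}  dense={dense:>12,d}  window={sparse:>12,d}  ratio={dense / sparse:.1f}x")
```

Doubling n quadruples the dense count but only about doubles the windowed count, which is the whole point of the sparse pattern.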

When to use it

Use long-sequence Transformers when your documents are long (typically beyond 512 tokens) and you need classification or explanation over the full document: long reports, legal texts, or genomic sequences. Three practical conditions apply: the inputs must be genuinely long, a GPU is strongly recommended, and global token positions must be chosen sensibly. The models also need a reasonable amount of data. As a rule of thumb, below about 500 documents the model tends to overfit, and below about 100 documents training is not meaningful and classical ML such as Random Forest or XGBoost is preferable.
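
The guidance above can be written down as a small decision helper. This is only a sketch of the page's rules of thumb: the function name `suggest_approach` and the choice of the median token count as the measure of document length are hypothetical, and the thresholds (512 tokens, 100 and 500 documents) are approximate.

```python
def suggest_approach(median_tokens: int, n_documents: int) -> str:
    """Map corpus size and document length to the rough guidance on this page."""
    if n_documents < 100:
        # Too little data for training to be meaningful.
        return "classical ML (e.g. Random Forest or XGBoost)"
    if median_tokens <= 512:
        # Inputs are not genuinely long, so sparse attention brings no benefit.
        return "long-sequence model not needed"
    if n_documents < 500:
        # Long inputs, but a small corpus: expect overfitting.
        return "long-sequence Transformer, with overfitting risk"
    return "long-sequence Transformer"


print(suggest_approach(median_tokens=3000, n_documents=50))
print(suggest_approach(median_tokens=3000, n_documents=300))
print(suggest_approach(median_tokens=3000, n_documents=5000))
```

Treat the output as a starting point for discussion rather than a hard rule, since the cut-offs are approximate.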

Strengths & limitations

Strengths

Processes thousands of tokens — whole documents — that exceed a standard Transformer's window.
Linear O(n) attention scaling instead of the quadratic O(n²) cost of dense attention.
Combines local sliding-window context with global tokens for long-range information flow.
Well suited to long-document classification in law, genomics, and long-form text.

Limitations

A GPU is effectively required; training and inference are computationally heavy.
Overfits on small document sets — below about 500 documents results are unreliable.
Below about 100 documents training is meaningless and classical ML should be used instead.
Global token positions must be chosen correctly, or long-range information flow degrades.

Frequently asked

How is this different from a standard Transformer like BERT?

A standard Transformer uses dense attention, comparing every token with every other token at O(n²) cost. Together with fixed-length position embeddings, this in practice caps models such as BERT at 512 tokens. Longformer and BigBird use sparse attention, mostly local sliding windows plus a few global tokens (BigBird also adds a small number of random connections), so cost grows linearly and inputs of several thousand tokens become feasible.
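
To make the pattern concrete, here is a minimal sketch that builds the boolean attention mask for a sliding window plus global tokens and prints it as a grid. It is a toy illustration in plain Python, not either library's implementation: the function names are hypothetical, the sizes are tiny so the grid is readable, and BigBird's random connections are left out.

```python
def sparse_attention_mask(n, window, global_positions):
    """mask[i][j] is True when token i is allowed to attend to token j."""
    half = window // 2
    global_set = set(global_positions)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            local = abs(i - j) <= half                      # sliding-window band
            touches_global = i in global_set or j in global_set  # global row/column
            mask[i][j] = local or touches_global
    return mask


def render(mask):
    """Draw allowed pairs as '#' and blocked pairs as '.'."""
    return "\n".join("".join("#" if cell else "." for cell in row) for row in mask)


# Toy sizes, assumed for readability: 12 tokens, window of 4, token 0 is global.
mask = sparse_attention_mask(n=12, window=4, global_positions=[0])
print(render(mask))
allowed = sum(sum(row) for row in mask)
print(f"{allowed} of {12 * 12} pairs allowed")
```

The printed grid shows a diagonal band (the local window) plus one full row and one full column (the global token); a dense model would fill the entire square.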

When should I prefer Longformer or BigBird over classical ML?

Use them when your text is genuinely long (beyond 512 tokens) and you have enough documents. With fewer than about 500 documents the model tends to overfit, and below roughly 100 documents training is not meaningful — classical methods such as Random Forest or XGBoost are the safer choice there.

What are global tokens and why do they matter?

Global tokens are a small set of positions allowed to attend to, and be attended by, the entire sequence. They are the channel through which long-range information moves across the document, so choosing their positions correctly is important for performance. In classification, for example, the [CLS] token is typically made global; in question answering, the question tokens are.
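
One way to see why global tokens matter is to treat attention as a graph and count how many layers information needs to travel between two distant positions. The sketch below does this with a breadth-first search, under the simplifying assumption that each layer moves information by exactly one allowed attention hop. All names and sizes are hypothetical and chosen for illustration.

```python
from collections import deque


def can_attend(i, j, half_window, global_positions):
    """True if positions i and j are connected by local or global attention."""
    return abs(i - j) <= half_window or i in global_positions or j in global_positions


def layers_needed(n, source, target, half_window, global_positions=frozenset()):
    """Fewest attention hops (one per layer, by assumption) from source to target."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        i = queue.popleft()
        if i == target:
            return dist[i]
        for j in range(n):
            if j not in dist and can_attend(i, j, half_window, global_positions):
                dist[j] = dist[i] + 1
                queue.append(j)
    return None  # target unreachable


# Assumed toy setting: 1,024 tokens, each attending 32 positions to either side.
local_only = layers_needed(1024, source=100, target=1000, half_window=32)
with_global = layers_needed(1024, source=100, target=1000, half_window=32,
                            global_positions=frozenset({0}))
print("window only:", local_only, "layers")
print("with a global token at position 0:", with_global, "layers")
```

With only local windows the hop count grows with the distance between the two positions; with a global token, any two positions are at most two hops apart, whatever the sequence length.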

Do I need special hardware?

A GPU is strongly recommended. These models still process long sequences with sizeable networks, so training and inference are computationally demanding even with the linear-scaling attention.

Sources

Beltagy, I., Peters, M. E. & Cohan, A. (2020). Longformer: The Long-Document Transformer. arXiv.
Zaheer, M. et al. (2020). Big Bird: Transformers for Longer Sequences. NeurIPS.

How to cite this page

ScholarGate. (2026, June 1). Long-Sequence Transformers with Sparse Attention (Longformer / BigBird). ScholarGate. https://scholargate.app/en/deep-learning/longformer-bigbird

Referenced by

Knowledge Distillation
Neural Architecture Search
Visual Contrastive Learning

Related reference concepts

Sequence-to-Sequence Models and Transformers
Convolutional and Sequence Models
Neural Language Models and Word Embeddings
Statistical and Neural NLP
Part-of-Speech Tagging and Sequence Labeling
Question Answering and Dialogue Systems
