Long Short-Term Memory (LSTM)

Long Short-Term Memory Network (LSTM) · Also known as: LSTM, LSTM network, LSTM-RNN, long short-term memory RNN

Long Short-Term Memory (LSTM) is a gated recurrent neural network architecture introduced by Hochreiter and Schmidhuber in 1997. It was designed to learn dependencies across long sequences. The standard modern formulation uses a dedicated memory cell and three learned gates (forget, input, and output) that control what information is retained, updated, or passed forward at each time step; the forget gate was added in follow-up work shortly after the original paper.
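
The gate mechanics described above can be sketched as a single time step in NumPy. This is a minimal illustration, not a reference implementation: the function name `lstm_step`, the stacked weight layout (forget, input, candidate, output), and the toy sizes are assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step (hypothetical helper for illustration).

    x_t: (input_size,); h_prev, c_prev: (hidden_size,)
    W: (4 * hidden_size, input_size + hidden_size); b: (4 * hidden_size,)
    Row blocks of W are ordered forget, input, candidate, output.
    This ordering is a convention chosen here, not a standard.
    """
    n = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b
    f = sigmoid(z[0 * n:1 * n])   # forget gate: how much of the old cell state to keep
    i = sigmoid(z[1 * n:2 * n])   # input gate: how much new content to write
    g = np.tanh(z[2 * n:3 * n])   # candidate content
    o = sigmoid(z[3 * n:4 * n])   # output gate: how much of the cell to expose
    c_t = f * c_prev + i * g      # updated cell state
    h_t = o * np.tanh(c_t)        # updated hidden state
    return h_t, c_t

# Toy usage: run five random inputs through the cell.
rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
W = rng.normal(scale=0.1, size=(4 * hidden_size, input_size + hidden_size))
b = np.zeros(4 * hidden_size)
h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):
    h, c = lstm_step(x_t, h, c, W, b)
```

The additive update of `c_t` is the part that lets information and gradients persist across many steps when the forget gate stays close to 1.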

When to use it

LSTM is well-suited for tasks where the order of observations matters and relevant context may span many steps: text classification, sentiment analysis, named-entity recognition, machine translation, time-series forecasting, and speech processing. Choose LSTM when your sequences are moderately long (tens to a few hundred steps) and you have enough labeled data (typically thousands of examples) to learn gate parameters. For very long sequences or when parallelism is critical, the Transformer is usually preferable. For shorter sequences or when simplicity matters, a GRU (fewer parameters) or a fine-tuned pretrained model (BERT, RoBERTa) may be more appropriate.

Strengths & limitations

Strengths

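Explicitly designed to capture long-range dependencies by mitigating the vanishing-gradient problem of plain RNNs; exploding gradients can still occur and are usually handled with gradient clipping.
Flexible architecture: works for sequence classification, token labeling, generation, and forecasting.
Gate activations (forget, input, and output) can be inspected directly, which offers some interpretability relative to black-box alternatives.
Bidirectional extension (BiLSTM) gives each position access to both past and future context by processing the sequence in both directions.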
Large body of proven applications across NLP, speech, and time-series domains.

Limitations

Slower to train than Transformers because sequential computation cannot be fully parallelized.
Requires substantial labeled data; on small datasets it tends to overfit despite dropout.
Very long sequences (many hundreds of steps or more) still challenge LSTMs, because all past context must be compressed into a fixed-size cell state.
Hyperparameter tuning — hidden units, layers, dropout, learning rate — can be time-consuming.
For many NLP tasks, pretrained Transformer models now outperform LSTMs trained from scratch.

Frequently asked

What is the difference between LSTM and GRU?

A GRU merges the forget and input gates into a single update gate and has no separate cell state, giving it fewer parameters and faster training. LSTMs tend to slightly outperform GRUs on tasks requiring very precise long-term memory, but the difference is often small and dataset-dependent.
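
The parameter difference can be made concrete with a rough count. The sketch below assumes one weight matrix over the concatenated input and hidden state plus a single bias vector per block; the function names are hypothetical, and some framework implementations store additional bias vectors, so their reported counts can differ slightly.

```python
def gated_rnn_params(input_size, hidden_size, num_blocks):
    # Each block: a weight matrix over [x_t, h_{t-1}] plus one bias vector.
    per_block = hidden_size * (input_size + hidden_size) + hidden_size
    return num_blocks * per_block

def lstm_params(input_size, hidden_size):
    # Four blocks: forget gate, input gate, output gate, candidate.
    return gated_rnn_params(input_size, hidden_size, 4)

def gru_params(input_size, hidden_size):
    # Three blocks: update gate, reset gate, candidate.
    return gated_rnn_params(input_size, hidden_size, 3)

lstm_n = lstm_params(128, 256)
gru_n = gru_params(128, 256)
print(lstm_n, gru_n, gru_n / lstm_n)
```

Under these assumptions a single GRU layer has three quarters of the parameters of an LSTM layer with the same input and hidden sizes.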

Should I use a bidirectional LSTM?

If your task allows access to the full sequence before making predictions — classification, NER, or offline forecasting — a BiLSTM typically improves accuracy by incorporating both past and future context. If predictions must be made in real time (online forecasting, streaming), use a unidirectional LSTM.
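
The reason a bidirectional model needs the full sequence can be seen in a small sketch. To keep it short, the code uses a plain tanh recurrence as a stand-in for the LSTM cell; `run_direction`, `bidirectional`, and `make_step` are hypothetical helpers, and the sizes are arbitrary.

```python
import numpy as np

def run_direction(xs, step, h0):
    """Apply `step` over xs in order and return the hidden state at every position."""
    h, states = h0, []
    for x_t in xs:
        h = step(x_t, h)
        states.append(h)
    return np.stack(states)

def bidirectional(xs, step_fwd, step_bwd, h0):
    fwd = run_direction(xs, step_fwd, h0)              # position t has seen x_0..x_t
    bwd = run_direction(xs[::-1], step_bwd, h0)[::-1]  # position t has seen x_t..x_{T-1}
    return np.concatenate([fwd, bwd], axis=-1)         # features per position: 2 * hidden

def make_step(rng, input_size, hidden_size):
    # Stand-in recurrence (simple tanh RNN), NOT an LSTM cell.
    W = rng.normal(scale=0.5, size=(hidden_size, input_size + hidden_size))
    return lambda x_t, h: np.tanh(W @ np.concatenate([x_t, h]))

rng = np.random.default_rng(0)
T, input_size, hidden_size = 6, 3, 4
xs = rng.normal(size=(T, input_size))
step_fwd = make_step(rng, input_size, hidden_size)
step_bwd = make_step(rng, input_size, hidden_size)
h0 = np.zeros(hidden_size)
out = bidirectional(xs, step_fwd, step_bwd, h0)

# Perturb only the last input and recompute.
xs2 = xs.copy()
xs2[-1] += 1.0
out2 = bidirectional(xs2, step_fwd, step_bwd, h0)

# Forward features before the last step cannot see the change; backward features can.
forward_unchanged = np.allclose(out[:-1, :hidden_size], out2[:-1, :hidden_size])
backward_changed = not np.allclose(out[:, hidden_size:], out2[:, hidden_size:])
```

Because the backward pass starts from the final element, its features cannot be computed until the whole sequence has arrived, which is why streaming settings call for a unidirectional model.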

When should I choose a Transformer over an LSTM?

For most modern NLP tasks with sufficient compute, pretrained Transformers (BERT, RoBERTa) outperform LSTMs trained from scratch. Choose an LSTM when compute or latency is constrained, when your sequences are numerical time-series rather than text, or when you have too little data to fine-tune a large pretrained model.

How do I handle variable-length sequences?

Pad shorter sequences to a common length, create a mask to ignore padding tokens in loss computation, and use packed sequences (PyTorch) or masking layers (Keras/TensorFlow) so that padded positions do not affect the hidden state.
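
The padding and masking steps can be sketched in NumPy. This illustrates the idea behind packed sequences and masking layers rather than how those library features are implemented; the helper names are hypothetical, and a plain tanh recurrence stands in for the LSTM cell.

```python
import numpy as np

def pad_sequences(seqs, pad_value=0.0):
    """Pad (length_i, features) arrays to (batch, max_len, features) and build a mask."""
    lengths = np.array([len(s) for s in seqs])
    max_len, features = int(lengths.max()), seqs[0].shape[1]
    padded = np.full((len(seqs), max_len, features), pad_value)
    for i, s in enumerate(seqs):
        padded[i, :len(s)] = s
    mask = np.arange(max_len)[None, :] < lengths[:, None]  # True at real time steps
    return padded, mask

def masked_recurrence(padded, mask, step, hidden_size):
    """Run `step` over time, freezing a sequence's hidden state once its padding starts."""
    batch, max_len, _ = padded.shape
    h = np.zeros((batch, hidden_size))
    for t in range(max_len):
        h_new = step(padded[:, t], h)
        h = np.where(mask[:, t][:, None], h_new, h)  # keep the old state at padded steps
    return h

def masked_mean(per_step_loss, mask):
    """Average a (batch, max_len) loss over real time steps only."""
    return (per_step_loss * mask).sum() / mask.sum()

rng = np.random.default_rng(0)
features, hidden_size = 3, 4
Wx = rng.normal(scale=0.5, size=(features, hidden_size))
Wh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
step = lambda x, h: np.tanh(x @ Wx + h @ Wh)  # stand-in recurrence, not an LSTM

seqs = [rng.normal(size=(2, features)), rng.normal(size=(4, features))]
padded, mask = pad_sequences(seqs)
h_final = masked_recurrence(padded, mask, step, hidden_size)

# The short sequence run on its own, without any padding, for comparison.
h_alone = masked_recurrence(seqs[0][None], np.ones((1, 2), dtype=bool), step, hidden_size)
```

With the mask applied, the final state of the short sequence matches the state obtained by running it alone, so the padding has no influence on the result.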

How many LSTM layers and hidden units should I use?

For most tasks one to two stacked LSTM layers with 64–512 hidden units is a sensible starting range. More layers and units increase capacity but also training time and the risk of overfitting; use dropout between layers and validate with a held-out set or cross-validation.

Sources

Hochreiter, S. & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780. DOI: 10.1162/neco.1997.9.8.1735
Graves, A., Mohamed, A.-R. & Hinton, G. (2013). Speech recognition with deep recurrent neural networks. Proceedings of ICASSP 2013, pp. 6645–6649. IEEE. DOI: 10.1109/ICASSP.2013.6638947

How to cite this page

ScholarGate. (2026, June 3). Long Short-Term Memory Network (LSTM). ScholarGate. https://scholargate.app/en/deep-learning/long-short-term-memory

Which method?

Set this method beside its closest kin and read them side by side — the library lays the books on the table; the choice is yours.

BERT-based Classification (Deep learning)
Gated Recurrent Unit (Deep learning)
Recurrent Neural Network (Deep learning)
Sentence Embeddings (Deep learning)

Related reference concepts

Sequence-to-Sequence Models and Transformers
Convolutional and Sequence Models
Deep Learning
Part-of-Speech Tagging and Sequence Labeling
Automatic Speech Recognition
Language Modeling
