Multimodal LSTM

Multimodal LSTM extends the standard Long Short-Term Memory (LSTM) network to jointly process sequential data from multiple input modalities, such as text, audio, and video, within a single recurrent architecture. Representations from the different sources are fused either before they enter the LSTM or inside its cells, so the network can capture temporal dependencies both within each modality and across modalities. This makes it a foundational approach for tasks such as sentiment analysis, video captioning, and affective computing.
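As a rough illustration of fusing modalities before the LSTM (often called early fusion), the sketch below concatenates the per-timestep feature vectors of each modality and runs one standard LSTM cell over the fused sequence. It is a minimal sketch under stated assumptions, not the method of any cited paper: the function names (`init_lstm`, `lstm_step`, `early_fusion_lstm`), the feature sizes, and the random untrained weights are all hypothetical, and the modalities are assumed to be already aligned to the same number of timesteps.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_lstm(input_size, hidden_size, seed=0):
    """Random, untrained weights for one LSTM cell (illustrative only)."""
    rng = random.Random(seed)
    rows = 4 * hidden_size              # input, forget, output, candidate
    cols = input_size + hidden_size     # acts on [x_t ; h_{t-1}]
    W = [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    b = [0.0] * rows
    return W, b


def lstm_step(x, h, c, W, b):
    """One standard LSTM update on input x with previous state (h, c)."""
    z = x + h                           # list concatenation [x_t ; h_{t-1}]
    n = len(h)
    pre = [sum(w * v for w, v in zip(row, z)) + bias
           for row, bias in zip(W, b)]
    i = [sigmoid(p) for p in pre[0:n]]            # input gate
    f = [sigmoid(p) for p in pre[n:2 * n]]        # forget gate
    o = [sigmoid(p) for p in pre[2 * n:3 * n]]    # output gate
    g = [math.tanh(p) for p in pre[3 * n:4 * n]]  # candidate cell state
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new


def early_fusion_lstm(modalities, hidden_size, seed=0):
    """Fuse aligned modalities by concatenation, then run a single LSTM.

    modalities: list of sequences; each sequence is a list of T feature
    vectors (lists of floats). Returns the final hidden state.
    """
    T = len(modalities[0])
    if any(len(m) != T for m in modalities):
        raise ValueError("all modalities must have the same number of timesteps")
    # Early fusion: one joint feature vector per timestep.
    fused = [[v for m in modalities for v in m[t]] for t in range(T)]
    W, b = init_lstm(len(fused[0]), hidden_size, seed)
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    for x in fused:
        h, c = lstm_step(x, h, c, W, b)
    return h


# Hypothetical toy inputs: 3 timesteps of text (2-d), audio (1-d), video (3-d).
text = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
audio = [[1.0], [0.5], [0.0]]
video = [[0.2, 0.1, 0.0], [0.2, 0.1, 0.0], [0.2, 0.1, 0.0]]
final_state = early_fusion_lstm([text, audio, video], hidden_size=4)
```

Concatenation is the simplest fusion choice and leaves all cross-modal interaction to the shared gates. Fusing inside the cell, for example with separate memory per modality, would require changing `lstm_step` itself and is not shown here.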

Sources

  1. Rajagopalan, S. S., Morency, L.-P., Baltrušaitis, T., & Goecke, R. (2016). Extending Long Short-Term Memory for Multi-View Structured Learning. In Proceedings of ECCV 2016. Springer.
  2. Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735–1780. DOI: 10.1162/neco.1997.9.8.1735

ScholarGate. Multimodal LSTM (Multimodal Long Short-Term Memory Network). Retrieved 2026-06-04 from https://scholargate.app/tr/deep-learning/multimodal-lstm