Multimodal Vision Transformer

A Vision Transformer splits an image into patches and treats them like words in a sentence, passing them through self-attention layers. The multimodal extension adds a second stream for another modality, most commonly text, and lets the two streams attend to each other through cross-attention. Just as BERT learns relationships between words, a Multimodal ViT learns relationships between visual patches and linguistic tokens, so the model can answer questions about an image, retrieve a matching caption, or ground a phrase in a specific region of the image.
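The mechanism described above can be illustrated with a minimal NumPy sketch. It is not a reference implementation: the helper names (`patchify`, `attention`), the 32x32 image, the patch size of 16, the embedding width of 32, and the random stand-in text embeddings are all assumptions made for illustration. The learned query/key/value projections, positional embeddings, and multi-head structure of a real transformer are omitted for brevity.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (gh, gw, p, p, c)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    """Scaled dot-product attention; returns the output and the weights."""
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ values, weights

rng = np.random.default_rng(0)
d_model = 32                                  # hypothetical embedding width

# Visual stream: image -> patches -> linear projection to d_model.
image = rng.random((32, 32, 3))               # toy image
patches = patchify(image, patch_size=16)      # 4 patches of 16*16*3 values
w_patch = rng.normal(scale=0.02, size=(patches.shape[1], d_model))
visual_tokens = patches @ w_patch

# Text stream: random stand-in for 5 embedded text tokens.
text_tokens = rng.normal(size=(5, d_model))

# Self-attention: patches attend to each other, like words in a sentence.
visual_self, _ = attention(visual_tokens, visual_tokens, visual_tokens)

# Cross-attention in both directions between the two streams.
text_to_image, t2i_weights = attention(text_tokens, visual_self, visual_self)
image_to_text, i2t_weights = attention(visual_self, text_tokens, text_tokens)
```

Each row of `t2i_weights` is a distribution over image patches for one text token, which is the kind of signal a trained model could use to relate a phrase to an image region. Here the weights are meaningless because nothing has been trained.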

Sources

  1. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations (ICLR).
  2. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML), PMLR 139.

How to cite this page

ScholarGate. (2026, June 3). Multimodal Vision Transformer (Multimodal ViT). ScholarGate. https://scholargate.app/sr/deep-learning/multimodal-vision-transformer

Cited in

ScholarGate. Multimodal Vision Transformer (Multimodal ViT). Retrieved 2026-06-15 from https://scholargate.app/sr/deep-learning/multimodal-vision-transformer · Dataset: https://doi.org/10.5281/zenodo.20539026