
Multimodal Semantic Segmentation

Multimodal semantic segmentation assigns a semantic class label to every pixel in a scene by fusing information from two or more input modalities, most commonly RGB images paired with depth maps (RGB-D), LiDAR point clouds, thermal images, or text descriptions. Deep encoder-decoder networks learn to align and fuse the complementary cues from each modality, which typically yields more accurate segmentation than a model restricted to a single modality.
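The fusion idea can be illustrated with a deliberately small sketch. It assumes each modality has already produced a spatially aligned map of per-pixel class scores, and it combines them by a weighted element-wise sum before taking the highest-scoring class at each pixel. The function names (`fuse_scores`, `label_map`), the `depth_weight` parameter, and the toy scores are hypothetical and chosen for illustration only. This is a simple score-level fusion, not the learned feature-level fusion of an encoder-decoder network.

```python
from typing import List

# Per-pixel class scores, indexed as [row][column][class].
Grid = List[List[List[float]]]


def fuse_scores(rgb: Grid, depth: Grid, depth_weight: float = 1.0) -> Grid:
    """Weighted element-wise sum of two spatially aligned score maps."""
    # Only height and width are checked; class counts are assumed to match.
    if len(rgb) != len(depth) or any(len(r) != len(d) for r, d in zip(rgb, depth)):
        raise ValueError("modalities must be spatially aligned")
    return [
        [
            [r + depth_weight * d for r, d in zip(rgb_px, depth_px)]
            for rgb_px, depth_px in zip(rgb_row, depth_row)
        ]
        for rgb_row, depth_row in zip(rgb, depth)
    ]


def label_map(scores: Grid) -> List[List[int]]:
    """Assign each pixel the index of its highest-scoring class."""
    return [
        [max(range(len(px)), key=px.__getitem__) for px in row]
        for row in scores
    ]


# Toy 1x2 image with two hypothetical classes (0 = floor, 1 = wall).
rgb_scores = [[[0.6, 0.4], [0.5, 0.5]]]
depth_scores = [[[0.2, 0.8], [0.9, 0.1]]]

rgb_only = label_map(rgb_scores)
labels = label_map(fuse_scores(rgb_scores, depth_scores))
```

In this toy case the depth scores change the label of the first pixel and break the tie at the second, which is the kind of complementary cue the fusion is meant to exploit. Real systems fuse intermediate feature maps inside the network rather than final scores.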

Sources

  1. Hazirbas, C., Ma, L., Domokos, C., & Cremers, D. (2016). FuseNet: Incorporating Depth into Semantic Segmentation via Fusion-based CNN Architecture. In Proceedings of the Asian Conference on Computer Vision (ACCV). Springer.
  2. Zhang, J., Liu, H., Yang, K., Hu, X., Liu, R., & Stiefelhagen, R. (2023). CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers. IEEE Transactions on Intelligent Transportation Systems, 24(12), 14801–14813. DOI: 10.1109/TITS.2023.3300537

ScholarGate. Multimodal Semantic Segmentation (Multi-Sensor Pixel-Level Scene Understanding). Retrieved 2026-06-04 from https://scholargate.app/tr/deep-learning/multimodal-semantic-segmentation