Deep learning techniques are usually developed to reason over one specific type of data. Natural language processing (NLP) techniques typically process text, while computer vision algorithms process images. Widely used deep learning techniques tend to rely on distinct architectures to handle the different structures in these data formats. The human mind, on the other hand, can process different types of information simultaneously for a given task, using information from one domain to improve its understanding of another. For example, a person primarily perceives the hardness of an object through touch, but can often recognize that an object is hard from sight and sound alone. Multimodal learning research focuses on developing models that combine multiple data formats with different structures, such as the sequential relationships between words in natural language and the spatial relationships between pixels in images. These models seek to create joint representations of the input data that provide richer features for downstream tasks than models that use a single data modality. In this post, we present approaches to multimodal learning as well as potential applications.

The need for suitable multimodal representations in deep learning

Multimodal data sources are very common. As Gao et al. (2020) note, a sports news article about a particular match uses images to capture specific moments of excitement and text to describe the sequence of events. Presenting these two raw data types together gives the reader a better idea of the match. By extension, handling multiple data sources should give machine learning models a similar advantage over their unimodal counterparts, especially when a single data source is incomplete.

A primary deep learning task that could benefit from multimodal data fusion is feature extraction. Modern deep learning techniques typically involve developing and training deep neural network architectures for a discriminative task such as classification. Once trained on large amounts of data, these models can be adapted to other downstream tasks. Intermediate model outputs, or embeddings, can be extracted and used to represent the input data. These embeddings are effective because they carry richer information than the rule-based or one-hot encoded vectors typically used to represent categorical data in traditional machine learning algorithms, which can improve performance and reduce the amount of additional modeling required. The values of these embeddings encode high-level information about the text or images used to create them (e.g., object classes for images). Embeddings from large pretrained models such as BERT (Devlin et al. 2018) are widely used in industry and academia for retrieval, clustering, and classification. Although all of these widely used embeddings are unimodal, progress has been made in producing multimodal embeddings. Multimodal embeddings, like their unimodal counterparts, can be used in a variety of downstream classification, retrieval, and clustering tasks, but can provide a richer representation. For example, a classifier trained on images and captions can draw on a stronger representation of the context, even if one of the inputs is missing or incomplete at inference time. Dense unimodal embeddings of text and images have had a significant impact on the field, so multimodal embeddings may well have a similar effect.
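To make the contrast between one-hot vectors and dense embeddings concrete, here is a minimal sketch in plain Python. The three-word vocabulary and the dense vector values are made up for illustration and do not come from any real model; in practice the dense vectors would be extracted from a pretrained network.

```python
import math


def one_hot(index, size):
    """Return a one-hot vector of length `size` with a 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


vocab = ["cat", "kitten", "truck"]

# One-hot encoding: every category gets its own axis, so no two
# distinct categories share any information.
one_hot_vectors = {word: one_hot(i, len(vocab)) for i, word in enumerate(vocab)}

# Hypothetical dense embeddings (illustrative values only).
dense_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.85, 0.9, 0.05],
    "truck": [0.1, 0.05, 0.95],
}

one_hot_sim = cosine(one_hot_vectors["cat"], one_hot_vectors["kitten"])
dense_sim_related = cosine(dense_vectors["cat"], dense_vectors["kitten"])
dense_sim_unrelated = cosine(dense_vectors["cat"], dense_vectors["truck"])

print(one_hot_sim, dense_sim_related, dense_sim_unrelated)
```

With one-hot vectors, any two distinct categories are orthogonal, so "cat" is exactly as far from "kitten" as it is from "truck". Dense embeddings can place related inputs close together, which is the kind of high-level information that downstream retrieval, clustering, and classification models rely on.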

Creating multimodal embeddings directly

One approach to creating multimodal representations is to train models from scratch using multiple data sources and the architectures suited to those modalities. Many supervised and unsupervised architectures have been proposed. Many of these models are tailored to a specific task and to the data formats available, which limits how well they generalize to other tasks. Ngiam et al. (2011) presented an unsupervised approach in which they stacked Boltzmann machines as an autoencoder to reconstruct the raw data. In these models, text or images are used as input and passed through a deep Boltzmann network. Because these models are unsupervised, they do not require labeled training data. Similarly, there are examples that use a transformer architecture with an attention mechanism, so that the importance of particular parts of the image is weighed relative to the corresponding tokens (Pramanik et al. 2019). Gao et al. (2020) provide a comprehensive survey of such architectures and their applications.

Incorporating and fusing pretrained unimodal embeddings

While creating multimodal embeddings from scratch can be useful, fusing rich unimodal embeddings lets models take advantage of powerful feature extractors trained on large amounts of unimodal data to create multimodal representations. Fusion allows the outputs of multiple models (e.g., a text embedding and an image embedding) to be incorporated into new models for downstream tasks. One commonly used pretrained model is BERT (Devlin et al. 2018). Training BERT (or a similarly large model) from scratch is not only expensive and time consuming, but also requires a great deal of training data. Using state-of-the-art pretrained image and text embeddings is therefore cost-effective and allows higher quality information to be compressed into the embedding vectors. One limitation, however, is that unimodal embeddings contain no features learned from multiple data formats jointly, so the final multimodal features may not be as rich.

As Zhang et al. (2019) describe, there are many approaches to fusing embeddings. In all of these approaches, the fusion model takes the unimodal embeddings as input and returns an output for some multimodal task. A simple method is to use basic operations such as concatenation or a weighted sum. Alternatively, a neural network can be trained on the fused unimodal outputs to find a pseudo-optimal configuration (i.e., optimal without altering the upstream unimodal feature extractors). Another option is to pass the unimodal embeddings through a purely attention-based transformer so that the resulting vector is a joint representation of the modalities. Attention mechanisms can also be combined with recurrent neural layers, which limits the number of features that must be attended to. Zhang et al. also discuss the use of bilinear pooling functions in convolutional neural network structures to obtain more expressive embeddings. Since this kind of fusion corresponds to the outer product of the unimodal vectors, the result of the operation is a large M x N matrix, which greatly increases computational complexity and training cost. In all of these approaches, the last layer of the network before the output layer can typically be used as the multimodal embedding.
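The simplest fusion operators mentioned above can be sketched in a few lines of plain Python. This is an illustrative sketch only: the function names are hypothetical, the input vectors are toy values, and a real system would use a tensor library and embeddings produced by pretrained encoders.

```python
def concat_fusion(text_emb, image_emb):
    """Concatenate two embeddings; the output has len(text) + len(image) entries."""
    return list(text_emb) + list(image_emb)


def weighted_sum_fusion(text_emb, image_emb, text_weight=0.5):
    """Weighted sum of two embeddings that share the same dimension."""
    if len(text_emb) != len(image_emb):
        raise ValueError("weighted sum requires embeddings of equal length")
    image_weight = 1.0 - text_weight
    return [text_weight * t + image_weight * v for t, v in zip(text_emb, image_emb)]


def outer_product_fusion(text_emb, image_emb):
    """Outer product of an M-vector and an N-vector, giving an M x N matrix."""
    return [[t * v for v in image_emb] for t in text_emb]


# Toy embeddings (illustrative values only).
text_emb = [1.0, 2.0, 3.0]   # M = 3
image_emb = [0.5, -1.0]      # N = 2

concatenated = concat_fusion(text_emb, image_emb)
outer = outer_product_fusion(text_emb, image_emb)
averaged = weighted_sum_fusion([1.0, 2.0], [3.0, 4.0], text_weight=0.25)

print(len(concatenated))           # M + N entries
print(len(outer), len(outer[0]))   # M rows, N columns
print(averaged)
```

The size of each output shows the trade-off: concatenation grows linearly (M + N), a weighted sum keeps the original dimension but requires the two embeddings to live in spaces of the same size, and the outer product grows multiplicatively (M x N). For embeddings with hundreds or thousands of dimensions, that multiplicative growth is the source of the computational and training cost noted above.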


Multimodal embeddings can be valuable wherever multiple input modalities can inform a downstream task, especially when some of those modalities are incomplete. One obvious example is image captioning when the images are accompanied by text. In captioning, a short text description is generated from an image. This has many uses, such as helping the visually impaired or creating accurate, searchable descriptions for the ever-growing volume of visual media available online. For this task, we can use not only the image itself but also any accompanying text, such as a news article or a social media post reacting to it. In this example, the image and the associated text would first be encoded with an image embedding model and a text embedding model. The resulting embeddings could then be fused and passed through a purpose-trained recurrent model to generate a caption or alternative text. Ideally, this model would cope with incomplete data and would be expected to perform better than its unimodal counterpart because its input data is richer. This example is just one of many opportunities for multimodal deep learning. Visual question answering and visual reasoning (i.e., given an image and a question, produce a text answer) are some of the more methodologically interesting applications that researchers are working on. Every application has its own challenges, but learning to create multimodal embeddings and developing suitable architectures are important steps forward. As deep learning continues to permeate technology in modern society, it is increasingly important that these models be able to handle multiple, often incomplete, sources of information.
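The fusion step in a pipeline like this has to cope with inputs where one modality is absent. The sketch below shows one possible strategy, which is an assumption on our part rather than a method described in the cited papers: fill the missing embedding with zeros and append a presence flag per modality, so that the downstream model can learn to tell "missing" apart from "all zeros". The function name and the dimensions are hypothetical.

```python
from typing import List, Optional


def fuse_with_missing(
    image_emb: Optional[List[float]],
    text_emb: Optional[List[float]],
    image_dim: int,
    text_dim: int,
) -> List[float]:
    """Concatenate image and text embeddings, tolerating one missing modality.

    A missing modality (None) is replaced by a zero vector, and two
    presence flags are appended: [image_present, text_present].
    """
    if image_emb is None and text_emb is None:
        raise ValueError("at least one modality must be provided")

    image_part = list(image_emb) if image_emb is not None else [0.0] * image_dim
    text_part = list(text_emb) if text_emb is not None else [0.0] * text_dim

    if len(image_part) != image_dim or len(text_part) != text_dim:
        raise ValueError("embedding length does not match the declared dimension")

    presence = [
        1.0 if image_emb is not None else 0.0,
        1.0 if text_emb is not None else 0.0,
    ]
    return image_part + text_part + presence


# Image with accompanying text, and the same image with no text available.
both = fuse_with_missing([0.2, 0.4], [0.1, 0.3, 0.5], image_dim=2, text_dim=3)
image_only = fuse_with_missing([0.2, 0.4], None, image_dim=2, text_dim=3)

print(both)
print(image_only)
```

Because the fused vector always has the same length (image_dim + text_dim + 2), the same downstream model can be applied whether or not the accompanying text is available. Whether this particular strategy works well depends on how often each modality was missing during training.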


J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Ng, “Multimodal Deep Learning,” in Proc. ICML, 2011

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” arXiv:1810.04805, 2018

Chao Zhang, Zichao Yang, Xiaodong He, and Li Deng, “Multimodal Intelligence: Representation Learning, Information Fusion, and Applications,” arXiv:1911.03977, 2019

Jing Gao, Peng Li, Zhikui Chen, and Jianing Zhang, “A Survey on Deep Learning for Multimodal Data Fusion,” Neural Computation, 32(5), 829-864, 2020

