The computer vision field has been dominated for years by Convolutional Neural Networks (CNNs). Through filters, these networks create simplified versions of the input image by generating feature maps that highlight its most relevant parts. A multilayer perceptron then uses these features to perform the desired classification.

But recently this field has been revolutionized by the Vision Transformer (ViT) architecture, which has proven to achieve excellent results in many tasks.

In this article, some basic aspects of Vision Transformers are taken for granted. If you want to go deeper into the topic, I recommend reading my previous overview of the architecture.

While Transformers have proven to be excellent replacements for CNNs, one major limitation makes their application quite challenging: the need for large datasets. In fact, CNNs are able to learn even from a relatively small amount of data, mainly thanks to their inductive biases [1, 8]. These are like hints that help models learn faster and generalize better. In particular, CNNs have two biases directly related to how the architecture operates, namely:

  • Neighboring pixels in an image are related to each other;
  • Different parts of an image must be processed in the same way, regardless of their absolute position.

However, these biases are not present in the Transformer architecture, so Transformers need more data to fully understand the problem, but at the same time they are able to do so more freely. Thus, it could be said that Transformers are able to learn more but require more data, while Convolutional Neural Networks reach a more limited understanding of the task but can do so with smaller datasets.
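
As a rough illustration of these two biases, here is a minimal NumPy sketch (not taken from any of the cited papers). The helper `conv2d_valid` and the toy image are assumptions made for this example. Each output value depends only on a small neighborhood (locality), and because the same kernel is reused everywhere, shifting the input simply shifts the response.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Plain 2D cross-correlation without padding (hypothetical helper)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Locality: each output only sees a small kh x kw neighborhood.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
kernel = rng.normal(size=(3, 3))

# A small blob on an empty 12x12 image, and the same blob moved by (2, 3).
image = np.zeros((12, 12))
image[2:5, 2:5] = rng.normal(size=(3, 3))
shifted = np.roll(image, shift=(2, 3), axis=(0, 1))

response = conv2d_valid(image, kernel)
response_shifted = conv2d_valid(shifted, kernel)

# Weight sharing: the response to the shifted image is the shifted response.
same_response = np.allclose(
    np.roll(response, shift=(2, 3), axis=(0, 1)), response_shifted
)
```

A plain self-attention layer has no such constraint built in, which is why it has to discover these regularities from data.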

Image by the author

But isn’t there a way to get the most out of both architectures? Fortunately for us, these two architectures, based on two very different concepts, can be combined in many different ways to get something that is able to take advantage of the positive sides of both!

Using CNNs as patch extractors

The first possible approach is to change the way the patches are extracted before they are passed as input to the Vision Transformer. These patches are usually obtained by splitting the input image into many small parts.

To understand how to go from an image to patches through a convolutional network, it is sufficient to look at how it works internally:
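
For reference, the usual splitting step can be sketched in a few lines of NumPy. The function name `split_into_patches`, the 224x224 input, and the 16-pixel patch size are assumptions chosen for illustration.

```python
import numpy as np

def split_into_patches(image, patch_size):
    """Split an (H, W, C) image into non-overlapping, flattened patches."""
    height, width, channels = image.shape
    assert height % patch_size == 0 and width % patch_size == 0
    rows, cols = height // patch_size, width // patch_size
    # Separate each axis into (block index, offset inside the block).
    patches = image.reshape(rows, patch_size, cols, patch_size, channels)
    # Bring the two block indices together, then flatten every block.
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows * cols, patch_size * patch_size * channels)

image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
patches = split_into_patches(image, patch_size=16)
# 14 x 14 = 196 patches, each with 16 * 16 * 3 = 768 values
```

Each row of `patches` is what a standard Vision Transformer would linearly project into a token.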

Image by the author

When a large image is fed to a CNN, the convolutional layers transform it from a three-channel RGB image into an N-channel image. At the same time, its spatial size is greatly reduced and the content of the image is transformed.
If, at the end of the convolution process, the N-channel image is considered as a set of N small images, we have obtained the patches needed by the Vision Transformer. The new architecture of a possible Convolutional Vision Transformer is therefore as follows:

This technique has proven to be particularly effective in many cases and can also be applied using pre-trained convolutional networks, such as EfficientNet, as patch extractors. One application of this approach was developed by me and researchers at the CNR in Pisa to detect deepfake videos; see [2] if you want to learn more about it.
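
The patch extraction described above can be sketched as follows. This is a toy example under stated assumptions, not the architecture from [2]: the two-layer random-weight CNN, the helper `conv_layer`, and all sizes are hypothetical, and a real system would use a trained (or pre-trained) network.

```python
import numpy as np

def conv_layer(x, weights, stride):
    """Strided convolution followed by ReLU (hypothetical helper).

    x: (C_in, H, W), weights: (C_out, C_in, k, k) -> (C_out, H_out, W_out)
    """
    c_out, _, k, _ = weights.shape
    h_out = (x.shape[1] - k) // stride + 1
    w_out = (x.shape[2] - k) // stride + 1
    out = np.zeros((c_out, h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            window = x[:, i * stride:i * stride + k, j * stride:j * stride + k]
            # Contract the (C_in, k, k) window with every output filter.
            out[:, i, j] = np.tensordot(weights, window, axes=3)
    return np.maximum(out, 0.0)

rng = np.random.default_rng(0)
image = rng.random((3, 64, 64))                   # three-channel RGB input
w1 = rng.normal(scale=0.1, size=(16, 3, 4, 4))    # assumed toy weights
w2 = rng.normal(scale=0.1, size=(32, 16, 2, 2))

# 3 x 64 x 64  ->  16 x 16 x 16  ->  32 x 8 x 8
features = conv_layer(conv_layer(image, w1, stride=4), w2, stride=2)

# Treat each of the N = 32 channels as one small 8 x 8 "image" and flatten it.
tokens = features.reshape(features.shape[0], -1)

# Linear projection to the Transformer embedding size (128 is an assumption).
w_embed = rng.normal(scale=0.1, size=(tokens.shape[1], 128))
embedded = tokens @ w_embed
```

The rows of `embedded` would then be fed to the Transformer encoder in place of the usual raw-pixel patches.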

From self-attention to gated positional self-attention (GPSA)

To bring the benefits of convolutional networks into Transformers, this approach builds on the intuition that self-attention layers can act as convolutional layers. We have previously pointed out that Vision Transformers lack inductive biases. The goal of the Facebook researchers was therefore to modify the architecture to introduce a soft convolutional inductive bias. The new network must be able to act as a convolutional network if necessary.

To achieve this goal, Gated Positional Self-Attention (GPSA) [1] was introduced: a form of positional self-attention with an additional parameter, lambda. This parameter balances whether the layer acts as a convolutional layer or as classical self-attention. During training, the network calibrates this parameter and, if necessary, at the end of the process some of these layers will act as convolutional layers.

In addition to the GPSA layers, used in the early part of the network to capture local information in the input, there are also classical self-attention layers that form the non-local part of the network. This architecture is called the Convolutional Vision Transformer (ConViT).
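
The gating idea can be sketched in NumPy as a blend of a content-based attention map and a position-based one, weighted by a sigmoid of lambda. This is a simplified single-head sketch, not the reference implementation of [1]: the positional scores here are an assumed stand-in (negative squared distance between patch positions), and the function and variable names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_attention(queries, keys, values, positional_scores, lam):
    """Blend content and positional attention through a learned gate."""
    gate = 1.0 / (1.0 + np.exp(-lam))                      # sigmoid(lambda)
    content = softmax(queries @ keys.T / np.sqrt(queries.shape[-1]))
    positional = softmax(positional_scores)
    attention = (1.0 - gate) * content + gate * positional
    return attention @ values, attention

# A 4 x 4 grid of patches, each with an 8-dimensional embedding.
grid, dim = 4, 8
coords = np.array([(r, c) for r in range(grid) for c in range(grid)], dtype=float)
sq_dist = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
positional_scores = -2.0 * sq_dist   # assumed: nearby patches score higher

rng = np.random.default_rng(0)
tokens = rng.normal(size=(grid * grid, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))
q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v

# Gate close to 1: purely positional, convolution-like attention.
_, conv_like = gated_attention(q, k, v, positional_scores, lam=50.0)
# Gate close to 0: classical content-based self-attention.
_, content_like = gated_attention(q, k, v, positional_scores, lam=-50.0)
```

In this sketch `lam` is passed by hand; in a trained network it would be a learned parameter of each layer.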

CMT: Convolutional Neural Networks Meet Vision Transformers

Another recent proposal comes from Huawei’s laboratories, which present an even more advanced architecture than those seen so far, introducing the so-called CMT block [3]. Many of these blocks are used in the new architecture; they combine the self-attention mechanism with convolution and also introduce some performance optimizations.

Each CMT block consists of three basic parts:

  • Local Perception Unit: Addresses the limitations imposed by classical positional encoding and the inability of classical Vision Transformers to capture local relationships and structural information within individual patches. The Local Perception Unit (LPU) extracts local information through a simple depth-wise convolution;
  • Lightweight Multi-Head Self-Attention: To reduce the computational load of the attention calculation, this component reduces the spatial size of the K and V matrices using a k x k depth-wise convolution with stride k. In this way, the number of self-attention computations is reduced by working on the smaller matrices produced by the convolution;
  • Inverted Residual Feed-Forward Network: This is the last layer of each block and replaces the classic multilayer perceptron of Vision Transformers with an expansion layer followed by a depth-wise convolution and a projection layer.
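
The three parts listed above can be sketched in NumPy as follows. This is a simplified single-head sketch, not the implementation from [3]: normalization layers, residual connections around the attention and feed-forward parts, and any positional bias are omitted; ReLU is an assumed activation; all function names, weights, and sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def depthwise_conv3x3(x, kernels):
    """3x3 depth-wise convolution, zero padded. x: (H, W, C), kernels: (3, 3, C)."""
    h, w, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for di in range(3):
        for dj in range(3):
            # One filter per channel: no mixing across channels.
            out += padded[di:di + h, dj:dj + w, :] * kernels[di, dj, :]
    return out

def depthwise_conv_stride_k(x, kernels):
    """k x k depth-wise convolution with stride k. kernels: (k, k, C)."""
    k = kernels.shape[0]
    h, w, c = x.shape
    windows = x.reshape(h // k, k, w // k, k, c)
    return np.einsum("ikjlc,klc->ijc", windows, kernels)

def local_perception_unit(x, kernels):
    # Local information is added on top of the input features.
    return x + depthwise_conv3x3(x, kernels)

def lightweight_attention(x, w_q, w_k, w_v, reduce_kernels):
    h, w, c = x.shape
    # K and V are computed from a spatially reduced copy of the input.
    reduced = depthwise_conv_stride_k(x, reduce_kernels).reshape(-1, c)
    q = x.reshape(h * w, c) @ w_q
    k = reduced @ w_k
    v = reduced @ w_v
    attention = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return (attention @ v).reshape(h, w, -1), attention

def inverted_residual_ffn(x, w_expand, dw_kernels, w_project):
    hidden = np.maximum(x @ w_expand, 0.0)                   # expansion layer
    hidden = np.maximum(depthwise_conv3x3(hidden, dw_kernels), 0.0)
    return hidden @ w_project                                # projection layer

rng = np.random.default_rng(0)
channels, stride = 32, 4
x = rng.normal(size=(16, 16, channels))
lpu_out = local_perception_unit(x, rng.normal(scale=0.1, size=(3, 3, channels)))
w_q, w_k, w_v = (rng.normal(scale=0.1, size=(channels, channels)) for _ in range(3))
attended, attention = lightweight_attention(
    lpu_out, w_q, w_k, w_v, rng.normal(scale=0.1, size=(stride, stride, channels))
)
out = inverted_residual_ffn(
    attended,
    rng.normal(scale=0.1, size=(channels, 4 * channels)),
    rng.normal(scale=0.1, size=(3, 3, 4 * channels)),
    rng.normal(scale=0.1, size=(4 * channels, channels)),
)
```

With a 16 x 16 feature map and stride 4, the attention matrix has shape (256, 16) instead of (256, 256), which is where the saving in self-attention computation comes from.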

The resulting architecture is thus able to take advantage of the best of both networks and does so efficiently thanks to the different properties of the different layers.


The idea of combining convolutional networks and Vision Transformers seems not only feasible in many ways, but also incredibly effective. To date, these variants have achieved excellent results on key datasets such as ImageNet, and CMT is currently the state-of-the-art network in terms of accuracy on this dataset. As if this were not enough, the experiments performed show that these networks are also considerably lighter and smaller than both classical approaches based solely on convolutional networks and those based on Vision Transformers.

Many have seen Vision Transformers as the successor to Convolutional Neural Networks, but today it seems that the combination of the two approaches has tremendous power.

We can certainly say: “Cooperation is power!”.

[1] d’Ascoli et al., “ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases”

[2] Coccomini et al., “Combining EfficientNet and Vision Transformers for Video Deepfake Detection”

[3] Guo et al., “CMT: Convolutional Neural Networks Meet Vision Transformers”

[4] Davide Coccomini, “On Transformers, TimeSformers, and Attention”

[5] Davide Coccomini, “On DINO, Self-Distillation with no labels”

[6] Davide Coccomini, “Is Attention what you really need in Transformers?”

[7] Louis Bouchard, “Will Transformers Replace CNNs in Computer Vision?”

[8] Victor Perez, “Transformers in Computer Vision: Farewell Convolutions!”

This article was originally published on Towards Data Science and republished on TOPBOTS with the permission of the author.
