Silent Revolution and the New Wild West in Computer Vision
It would seem that there has already been a revolution in Computer Vision. In 2012, algorithms based on convolutional neural networks made a breakthrough. Since 2014 they have been in production, and since 2016 they have been everywhere. But at the end of 2020 a new round took place, this time not in four years but in one. Let's talk about Transformers in Computer Vision.

Transformers are a type of neural network created in 2017. Initially they were used for machine translation. But as it turned out, they work simply as a universal model of language, and one thing led to another: the famous GPT-3 is a product of Transformers.
What about Computer Vision?
This is where it gets interesting. I wouldn't say that Transformers are obviously well suited for such tasks: they were designed for one-dimensional sequences. But they work too well elsewhere to ignore. Here I will go through the key works and the interesting points in their application, and try to cover the different ways Transformers have been crammed into CV.
It's 2020. A breakthrough. Why? It's hard to say. But I think we should start with DETR (End-to-End Object Detection with Transformers), released in May 2020. Here the Transformer is applied not to the image itself but to the features extracted by a convolutional network.
This approach is not particularly new: ReInspect did something similar in 2015, feeding the output of a backbone network into a recurrent neural network. But ReInspect lost to DETR by roughly the margin by which recurrent networks are worse than Transformers. The accuracy and the convenience of training improved significantly.
Of course, there are a couple of clever things that no one did before DETR (for example, how positional encoding is implemented, which a Transformer needs). I described my impressions here.
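To give a feel for that positional encoding: the feature map is 2D, so half of the channels encode the row and half the column, each with the classic sinusoidal scheme. Below is a rough numpy sketch of the idea; the sizes and the temperature constant are illustrative, and the real DETR implementation differs in details.

```python
import numpy as np

def sine_embed(pos, dim, temperature=10000.0):
    """Classic 1D sinusoidal embedding for integer positions, shape (len(pos), dim)."""
    i = np.arange(dim // 2)
    freqs = temperature ** (2 * i / dim)          # geometric range of frequencies
    angles = pos[:, None] / freqs[None, :]        # (positions, dim/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def positional_encoding_2d(h, w, dim):
    """DETR-style 2D encoding: half the channels encode y, half encode x."""
    y = sine_embed(np.arange(h, dtype=float), dim // 2)   # (h, dim/2)
    x = sine_embed(np.arange(w, dtype=float), dim // 2)   # (w, dim/2)
    y = np.repeat(y[:, None, :], w, axis=1)               # broadcast over columns
    x = np.repeat(x[None, :, :], h, axis=0)               # broadcast over rows
    return np.concatenate([y, x], axis=-1)                # (h, w, dim)

pe = positional_encoding_2d(7, 7, 256)
print(pe.shape)   # (7, 7, 256)
```

Each cell of the feature map thus gets a unique, smooth positional vector that is added to the features before they enter the Transformer.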
I can only add that DETR opened the way for Computer Vision to use Transformers. Is it used in practice? Does it work now? I don't think so:
  1. Its main problem is complex and long training. This was partially solved by Deformable DETR.
  2. DETR is not universal. There are problems where other approaches work better, for example the same IterDet. But in some tasks it (or its derivatives) still holds the leadership.
Immediately after DETR, Visual Transformer (article + a good review) came out for classification. Here the Transformer also takes the output feature map from a standard backbone.
I would not call Visual Transformer a big step, but it is a typical idea for that time: apply the Transformer to features extracted by a backbone.
Let's go further. The next big step is ViT.

It was published in early December 2020 (implementation). And here everything is done in an adult way: the Transformer as it is. The picture is divided into 16×16 patches, and each patch is fed into the Transformer as a "word", supplemented by a positional encoding.
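The patching step is simple enough to sketch in numpy. Below is a toy version of it; the projection matrix and positional embeddings are random stand-ins for what ViT learns during training, and I omit the class token.

```python
import numpy as np

def image_to_patches(img, patch=16):
    """Split an image (H, W, C) into flattened patch vectors, shape (N, patch*patch*C)."""
    h, w, c = img.shape
    assert h % patch == 0 and w % patch == 0
    grid = img.reshape(h // patch, patch, w // patch, patch, c)
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

def embed_patches(patches, W_proj, pos_emb):
    """Linear projection of each patch plus a positional embedding per token."""
    tokens = patches @ W_proj     # (N, d_model)
    return tokens + pos_emb       # position is added, as in ViT

rng = np.random.default_rng(0)
img = rng.standard_normal((224, 224, 3))
patches = image_to_patches(img)                  # 14*14 = 196 patches of 16*16*3 = 768 values
W = rng.standard_normal((768, 64)) * 0.02        # stand-in for the learned projection
pos = rng.standard_normal((196, 64)) * 0.02      # stand-in for learned positional embeddings
tokens = embed_patches(patches, W, pos)
print(patches.shape, tokens.shape)               # (196, 768) (196, 64)
```

From here on the 196 tokens are processed by a completely standard Transformer encoder, exactly as if they were words.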
Suddenly it all worked. Never mind that training takes a long time (and the accuracy is not state-of-the-art), and that on datasets smaller than about 14 million images it did not work too well.
But these problems were solved by an analog. This time it was DeiT from Facebook, which greatly simplified training and inference.
On large datasets this approach still holds first place in almost all classification benchmarks.
In practice we tried to use it in one task. But with a dataset of roughly 2-3 thousand pictures it did not work very well, and a classic ResNet was much better and more stable.
This is a very interesting application of Transformers from a completely different side. In CLIP the task has been reformulated: not to recognize the image, but to find the closest textual description for it. Here the Transformer learns the linguistic embedding and the convolutional network learns the visual embedding:
Such a model takes a very long time to train, but it turns out to be universal. It does not degrade when the dataset is changed, and the network can recognize things it saw in a completely different form:
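At inference time this scheme reduces to nearest-neighbour search in the shared embedding space: embed the image, embed a few candidate text prompts, and pick the prompt with the highest cosine similarity. A toy numpy sketch with made-up three-dimensional embeddings (real CLIP vectors are produced by the two encoders and are far larger):

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs):
    """Return the index of the text prompt closest (cosine) to the image embedding."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img                       # cosine similarity per prompt
    return int(np.argmax(sims)), sims

# Toy embeddings: the second prompt is deliberately aligned with the image.
prompts = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
image_emb = np.array([0.1, 0.9, 0.2])
text_embs = np.array([[0.9, 0.1, 0.0],
                      [0.2, 0.8, 0.1],    # close to image_emb
                      [0.0, 0.1, 0.9]])
idx, sims = zero_shot_classify(image_emb, text_embs)
print(prompts[idx])                        # "a photo of a cat"
```

No classifier head is trained: changing the set of classes means nothing more than changing the list of prompts, which is exactly why CLIP transfers between datasets.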

Sometimes it works surprisingly well:

But while this works well on some datasets, it is not a universal approach:
Here is a comparison with a linear probe on ResNet50 features. It should be noted that on some datasets CLIP works much worse than a model trained on just 100 pictures.
Out of interest we tried to test it on several tasks, for example recognition of actions and clothes, and everywhere CLIP worked very badly. In general, one can talk about CLIP for a very long time. There is a good article on Habr, and I made a video where I spoke about it.
Vision Transformers for Dense Prediction
In my opinion the next significant network is "Vision Transformers for Dense Prediction", which came out a month ago. Here you can switch between ViT variants and use either convolutions or Transformers for the first level.
In this case the network is used not for detection/classification but for segmentation and depth estimation, and it gives state-of-the-art results in several categories at once, moreover in real time.

In general the network is nice and runs out of the box, but I haven't tried it anywhere yet. If anyone is interested, I did a more detailed review here.
All of the above are the most striking examples of the main approaches to using Transformers:
  1. Transformers used to process the output of a convolutional network.
  2. Transformers used to find logic over the network output.
  3. Transformers applied directly to the image.
  4. A hybrid of approaches 1-2.
Everything below shows how the same Transformers and approaches are used for other tasks.
Pose3D. The Transformer can also be applied to explicit features extracted by a ready-made network, for example to skeletons.
In this article the Transformer is used to restore a 3D model of a person from a series of frames. At CherryLabs we did this (and more complex reconstructions) three years ago, but without Transformers, only with embeddings. Of course, Transformers allow you to do this faster and more stably. The result is quite good and stable 3D without retraining.
The advantage of Transformers in this case is the ability to work with data that has no local correlation, unlike neural networks (especially convolutional ones). This allows the Transformer to be trained on complex and varied examples.
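The reason is the attention mechanism itself: every token attends to every other token, with no built-in locality prior. A minimal numpy sketch of scaled dot-product self-attention over a set of skeleton joints (the 17 joints and the feature size are arbitrary stand-ins):

```python
import numpy as np

def self_attention(Q, K, V):
    """Scaled dot-product attention: each token mixes information from ALL
    other tokens, regardless of spatial distance (unlike a convolution)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                          # (N, N) pairwise affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ V

# 17 COCO-style skeleton joints as tokens, each with a small feature vector.
rng = np.random.default_rng(1)
joints = rng.standard_normal((17, 32))
out = self_attention(joints, joints, joints)               # self-attention over the skeleton
print(out.shape)                                           # (17, 32)
```

A left wrist can directly exchange information with a right ankle in one step, which is exactly what skeleton data needs and what a convolution's local receptive field makes awkward.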

This idea came to many people at the same time. Here's another approach/implementation of the same idea.
If you look at where convolutional networks fail precisely because of the embedded internal logic of the image, pose estimation immediately comes to mind. TransPose is a network that recognizes a pose by running a Transformer over convolutional features.
Compare it with classic approaches to pose estimation (an old enough version of OpenPose):
There were up to ten such refinement stages in different works. They have now been replaced by a single Transformer, and the result is much better than comparable modern networks.
We have already mentioned one Transformer-based segmentation network from Intel. Swin from Microsoft shows better results, but not in real time. In fact, it is an improved and expanded ViT/DeiT redesigned for segmentation.
This affects the speed, but gives impressive quality and leadership in many categories.
There are problems in which convolutional networks don't work very well, for example matching two images. About a year and a half ago we often used the classic pipeline via SIFT/SURF + RANSAC for this (a good guide on the topic + a video that I recorded a year ago). SuperGlue, which came out a year ago, is the only cool Graph Neural Network application I've seen in Computer Vision, but it only solved the matching step. And now there is a Transformer-based solution, LoFTR, which is practically end-to-end.
It looks cool.
Activity detection
In general, Transformers are good wherever there are sequences and complex logical structures, or where their analysis is required. There are already several networks where actions are analyzed by Transformers (Video Transformer Network, ActionBERT). They are promised to be added to MMAction in the near future.
A year ago I wrote a huge article on Habr about tracking and how to track objects. There are many approaches and a lot of complex logic. Only a year has passed, and according to many benchmarks there is already an absolute leader: STARK.
Of course, it does not solve all cases, and Transformers haven't won everywhere yet. But most likely that will not last long. For example, here is eye tracking on Transformers. Here's a Siamese-style tracker on Transformers done a couple of weeks ago. Here is BBOX tracking + features from one group, and here from another, with almost the same names.



And all have good scores.
Re-identification can be split out of tracking, as you remember. Twenty days ago a Transformer for ReID recognition was released; it could boost tracking quite well.
Face recognition on Transformers from a week ago seems to have appeared as well.
If you look for more specific applications, there are also a lot of interesting things. ViT is already being used for CT scan and MRI analysis (1, 2).
And for segmentation (1, 2).
What surprises me is that I don't see a good OCR implementation on Transformers. There are several examples, but according to benchmarks they are at the bottom.
All the state-of-the-art is still based on classical approaches. But people are trying to do something about it. Even I tried something about two years ago, but it didn't give any result.
More interesting things
I never would have thought it, but Transformers have already been used for colorizing pictures. That's probably a good fit for them.
What's next
It seems to me that Transformers should reach the top in almost all Computer Vision categories, and certainly in any video analytics.

Transformers consume input data as a linear sequence and store spatial information in various tricky ways. But it seems that sooner or later someone will come up with a more natural implementation, perhaps one where the Transformer and the 2D convolution are combined. And people are already trying.
Now we are watching the world change every day.
Written by Anton Maltsev
RemBrain © 2021