

"A larger portion of human knowledge is learned in a self-supervised way, because we don't always get supervision signals, and we want to enable the machine-learning model to have the same ability," says Yuan Gong, an MIT postdoc in the Computer Science and Artificial Intelligence Laboratory (CSAIL).

"So, another way to put it is that self-supervised learning often forms the foundation of an initial model, because it can learn on vast amounts of unlabeled data. And then you can use classical, supervised learning or reinforcement learning to fine-tune the model to something particular if you want to," says Jim Glass, an MIT senior research scientist and member of the MIT-IBM Watson AI Lab.

The technique, called the contrastive audio-visual masked autoencoder (CAV-MAE), is a type of neural network that can learn to extract and map meaningful latent representations into high-dimensional space from acoustic and visual data by training on large YouTube datasets of 10-second audio and video clips. The researchers say the technique is more effective than previous approaches because it explicitly models the relationships between audio and visual data in a way that other methods do not.

Joining Gong and Glass on the study are graduate students Andrew Rouditchenko and Alexander H. Liu, David Harwath PhD '18 of the University of Texas at Austin, and MIT-IBM Watson AI Lab members Leonid Karlinsky and Hilde Kuehne. Kuehne is also affiliated with Goethe University Frankfurt. The method was recently presented at the International Conference on Learning Representations.
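To make the setup concrete, here is a minimal, illustrative sketch (not the authors' code) of how a 10-second clip could be turned into the two inputs such a model sees: a log-mel spectrogram for the audio and a handful of sampled RGB frames for the video. The sample rate, mel-bin count, and frame-sampling scheme below are assumptions for illustration only.

```python
import torch
import torchaudio

# Assumed preprocessing parameters; the paper's exact values may differ.
SAMPLE_RATE = 16000          # assumed audio sample rate
CLIP_SECONDS = 10

def clip_to_inputs(waveform: torch.Tensor, frames: torch.Tensor):
    """waveform: (1, SAMPLE_RATE * CLIP_SECONDS) mono audio
       frames:   (T, 3, H, W) RGB video frames from the same clip."""
    # Audio -> log-mel spectrogram, a 2-D "image" an audio encoder can patchify.
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=SAMPLE_RATE, n_fft=400, hop_length=160, n_mels=128
    )(waveform)                                   # (1, 128, ~1000 time steps)
    log_mel = torch.log(mel + 1e-6)

    # Video -> a small number of evenly spaced frames (assumed sampling scheme).
    idx = torch.linspace(0, frames.shape[0] - 1, steps=8).long()
    sampled = frames[idx]                         # (8, 3, H, W)
    return log_mel, sampled

# Example with random stand-in data: 10 s of audio and 25 fps video.
wav = torch.randn(1, SAMPLE_RATE * CLIP_SECONDS)
vid = torch.randn(250, 3, 224, 224)
spec, frames8 = clip_to_inputs(wav, vid)
```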

The CAV-MAE works by "learning by prediction" and "learning by comparison," says Gong. The masked data modeling, or the prediction method, takes a video along with its coordinated audio waveform, converts the audio to a spectrogram, and masks 75 percent of both. The unmasked data is tokenized, then fed into separate audio and visual encoders before entering a joint encoder/decoder, where the model is asked to recover the missing data. The difference (reconstruction loss) between the resulting reconstructed prediction and the original audio-visual combination is then used to train the model for better performance. An example of this would be covering part of a video of a piano and part of a spectrogram of piano music, and then asking the model to try to determine the masked inputs. Unfortunately, this method may not capture the association between the video and audio pair, whereas contrastive learning leverages this, but may discard some modality-unique information, like the background in a video.
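As a rough illustration of the prediction objective, the sketch below masks 75 percent of audio and video patch tokens, encodes what remains, and scores reconstruction only on the hidden patches. The module sizes, the linear "encoders," and the zero-filled masking are simplifications and assumptions, not the actual CAV-MAE architecture (which uses transformer encoders and MAE-style token dropping).

```python
import torch
import torch.nn as nn

# Hypothetical dimensions; the real model's patch sizes and widths differ.
EMBED_DIM, MASK_RATIO = 256, 0.75

class MaskedAVReconstruction(nn.Module):
    """Minimal sketch of 'learning by prediction': mask 75 percent of audio and
    video tokens, encode the rest, and reconstruct the missing parts."""
    def __init__(self, audio_patch_dim, video_patch_dim):
        super().__init__()
        self.audio_enc = nn.Linear(audio_patch_dim, EMBED_DIM)  # stand-in for a transformer
        self.video_enc = nn.Linear(video_patch_dim, EMBED_DIM)
        self.joint = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(EMBED_DIM, nhead=4, batch_first=True),
            num_layers=2)
        self.audio_dec = nn.Linear(EMBED_DIM, audio_patch_dim)
        self.video_dec = nn.Linear(EMBED_DIM, video_patch_dim)

    def forward(self, audio_tokens, video_tokens):
        # audio_tokens: (B, Na, audio_patch_dim); video_tokens: (B, Nv, video_patch_dim)
        a_mask = torch.rand(audio_tokens.shape[:2], device=audio_tokens.device) < MASK_RATIO
        v_mask = torch.rand(video_tokens.shape[:2], device=video_tokens.device) < MASK_RATIO

        # Zero out masked patches (a simplification; MAE-style models drop them instead).
        a_in = audio_tokens.masked_fill(a_mask.unsqueeze(-1), 0.0)
        v_in = video_tokens.masked_fill(v_mask.unsqueeze(-1), 0.0)

        # Separate modality encoders, then a joint encoder over both token streams.
        joint_in = torch.cat([self.audio_enc(a_in), self.video_enc(v_in)], dim=1)
        joint_out = self.joint(joint_in)
        a_lat, v_lat = joint_out.split(
            [audio_tokens.shape[1], video_tokens.shape[1]], dim=1)

        # Reconstruction loss only on the patches that were hidden from the model.
        a_loss = ((self.audio_dec(a_lat) - audio_tokens) ** 2)[a_mask].mean()
        v_loss = ((self.video_dec(v_lat) - video_tokens) ** 2)[v_mask].mean()
        return a_loss + v_loss
```

Computing the loss only over masked positions mirrors the masked-autoencoder recipe: the model is graded on what it could not see, not on copying visible patches.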

Contrastive learning aims to map representations that are similar close to each other. For example, the model will attempt to place video and audio data of different parrots close to each other and farther away from pairs of video and audio of guitars playing. In a similar fashion to masked autoencoding, audio-visual pairs are passed into separate modality encoders; however, the audio and visual components are kept separate within the joint encoder before the model performs pooling and computes the contrastive loss. In this way, contrastive learning tries to identify the parts of each audio or video clip that are most relevant to the other. For example, if a video shows someone speaking and the corresponding audio clip contains speech, the autoencoder will learn to associate the mouth movements of the speaker with the words being spoken.
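The comparison objective can be sketched as a standard symmetric, InfoNCE-style contrastive loss over pooled clip embeddings; the temperature value and function names here are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def audio_visual_contrastive_loss(audio_emb, video_emb, temperature=0.07):
    """audio_emb, video_emb: (B, D) pooled clip embeddings, where row i of each
    tensor comes from the same clip. Matching pairs are pulled together;
    mismatched pairs within the batch are pushed apart."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = a @ v.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(a.shape[0], device=a.device)
    # Symmetric cross-entropy: audio->video and video->audio directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Example: a batch of 4 clips (say, two parrot clips and two guitar clips).
loss = audio_visual_contrastive_loss(torch.randn(4, 256), torch.randn(4, 256))
```

Each row of the similarity matrix treats the true audio-video pair as the positive and every other clip in the batch as a negative, which is what pulls parrot video toward parrot audio and pushes it away from guitar audio.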
