| Paper | https://arxiv.org/pdf/2305.05665.pdf |
|---|---|
| Code | https://github.com/facebookresearch/ImageBind |
1. Motivation
- Methods such as CLIP only align embeddings across two modalities (text and image). ImageBind, proposed in this paper, learns a single joint embedding space across six modalities: images, text, audio, depth, thermal, and IMU data.
- ImageBind does not require a purpose-built dataset in which every sample carries all six modalities; collecting such data would be unrealistic and far too expensive. Instead, it aligns every modality to the image embedding space and trains only on pairs of modalities (I, M), where I represents images and M is another modality.
2. Method
2.1 Data construction
- (image, text) pairs from web-scale (image, text) paired data; see "Learning transferable visual models from natural language supervision" (CLIP);
- (video, audio) pairs from the AudioSet dataset;
- (image, depth) pairs from the SUN RGB-D dataset;
- (image, thermal) pairs from the LLVIP dataset;
- (video, IMU) pairs from the Ego4D dataset;
Since SUN RGB-D and LLVIP are relatively small, we follow [21] and replicate them 50× for training.
2.2 Aligning pairs of modalities to images
Given a pair $(I_i, M_i)$, where $I_i$ is an image and $M_i$ is the corresponding observation from another modality:
The loss function is the InfoNCE loss:

$$L_{I,M} = -\log \frac{\exp\left(\mathbf{q}_i^{\top}\mathbf{k}_i / \tau\right)}{\exp\left(\mathbf{q}_i^{\top}\mathbf{k}_i / \tau\right) + \sum_{j \neq i} \exp\left(\mathbf{q}_i^{\top}\mathbf{k}_j / \tau\right)}$$

where $\mathbf{q}_i = f(I_i)$ and $\mathbf{k}_i = g(M_i)$ are the normalized embeddings from the image encoder $f$ and the other modality's encoder $g$, $\tau$ is a temperature controlling the softmax sharpness, and the negatives $\mathbf{k}_j$ come from other samples in the batch.
In practice, we use a symmetric loss $L_{I,M} + L_{M,I}$.
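A minimal PyTorch sketch of this symmetric InfoNCE loss (the function name, batch layout, and temperature value here are illustrative assumptions, not the paper's released code):

```python
import torch
import torch.nn.functional as F

def symmetric_infonce(img_emb: torch.Tensor, mod_emb: torch.Tensor,
                      temperature: float = 0.07) -> torch.Tensor:
    """L_{I,M} + L_{M,I} over a batch of paired, L2-normalized embeddings.

    img_emb, mod_emb: (batch, d) tensors; row i of each tensor forms a positive
    pair, and all other rows in the batch act as negatives.
    """
    logits = img_emb @ mod_emb.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    loss_i2m = F.cross_entropy(logits, targets)           # L_{I,M}: images as queries
    loss_m2i = F.cross_entropy(logits.t(), targets)       # L_{M,I}: other modality as queries
    return loss_i2m + loss_m2i
```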
Finally, we observe an emergent behavior in the embedding space: two modalities $(M_1, M_2)$ become aligned even though we only train using the pairs $(I, M_1)$ and $(I, M_2)$.
2.3 Model details
- Image Encoder: Vision Transformer (ViT)
- Video Encoder: Vision Transformer (ViT); temporally inflate the patch projection layer of the ViT and use 2-frame video clips sampled from 2 seconds. See "OmniMAE: Single Model Masked Pretraining on Images and Videos".
- Audio Encoder: ViT-B; convert 2 seconds of audio sampled at 16 kHz into a spectrogram with 128 mel-spectrogram bins. Since the spectrogram is a 2D signal like an image, we use a ViT with a patch size of 16 and a stride of 10 (a rough preprocessing sketch follows this list). See "AST: Audio Spectrogram Transformer".
- Thermal and Depth Encoder: ViT-S; treat thermal images and depth images as one-channel images.
- IMU Encoder: extract the IMU signal consisting of accelerometer and gyroscope measurements across the X, Y, and Z axes. We use 5 second clips resulting in 2K time step IMU readings, which are projected using a 1D convolution with a kernel size of 8. The resulting sequence is encoded using a Transformer.
- Text Encoder: follow the text encoder design from CLIP.
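As a rough illustration of the audio path above: the spectrogram is treated as a one-channel image and patchified with overlapping patches (patch size 16, stride 10). This is only a sketch; the `n_fft`/`hop_length` values and the 768-dim token size are assumptions, since the text above only specifies 2 s clips at 16 kHz, 128 mel bins, and the ViT patch size and stride.

```python
import torch
import torchaudio

# 2-second clip sampled at 16 kHz -> (1, 32000) waveform
waveform = torch.randn(1, 16000 * 2)

# 128-bin mel-spectrogram; n_fft / hop_length are assumed values, not from the paper
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=128
)(waveform)                                    # (1, 128, ~201)

# Treat the spectrogram as a one-channel image and patchify it for a ViT:
# patch size 16 with stride 10 gives overlapping patches, as in AST.
patchify = torch.nn.Conv2d(in_channels=1, out_channels=768,   # 768 = ViT-B token dim
                           kernel_size=16, stride=10)
patches = patchify(mel.unsqueeze(0))           # (1, 768, H', W')
tokens = patches.flatten(2).transpose(1, 2)    # (1, num_patches, 768) ViT input tokens
```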
Every modality has its own encoder, and a linear projection head is attached after each encoder to map each modality's embedding into the same d-dimensional space.
To reduce training complexity, the authors use the pretrained vision (ViT-H, 630M params) and text encoders (302M params) from OpenCLIP.
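A minimal sketch of this overall design, with one encoder and one linear projection head per modality mapping into a shared d-dimensional space (the class name, the `encoders`/`encoder_dims` arguments, and d = 1024 are placeholders, not the released implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageBindLike(nn.Module):
    """One encoder plus one linear projection head per modality; all outputs share dim d."""

    def __init__(self, encoders: dict[str, nn.Module],
                 encoder_dims: dict[str, int], d: int = 1024):
        super().__init__()
        self.encoders = nn.ModuleDict(encoders)
        self.heads = nn.ModuleDict({
            name: nn.Linear(encoder_dims[name], d) for name in encoders
        })

    def forward(self, modality: str, x: torch.Tensor) -> torch.Tensor:
        features = self.encoders[modality](x)   # modality-specific trunk
        emb = self.heads[modality](features)    # project into the shared d-dim space
        return F.normalize(emb, dim=-1)         # L2-normalize before the InfoNCE loss
```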
3. Experiments
3.1 Emergent zero-shot classification
The paper repeatedly emphasizes that ImageBind has CLIP-like zero-shot classification ability, but unlike CLIP it works across modalities. Paired with task-specific downstream models, it can also support open-vocabulary detection.
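For illustration, a sketch of cross-modal zero-shot classification, e.g. classifying an audio clip against text prompts (it reuses the hypothetical `ImageBindLike` model sketched in Section 2.3 and assumes a CLIP-style `tokenize` helper; the prompt template is also an assumption):

```python
import torch

@torch.no_grad()
def zero_shot_classify(model, audio: torch.Tensor, class_names: list[str], tokenize) -> int:
    """Classify an audio clip by comparing its embedding against text-prompt embeddings."""
    audio_emb = model("audio", audio)                                  # (1, d)
    prompts = tokenize([f"a photo of a {c}." for c in class_names])    # CLIP-style text prompts
    text_emb = model("text", prompts)                                  # (num_classes, d)
    logits = audio_emb @ text_emb.t()                                  # cosine similarity (embeddings are normalized)
    return int(logits.argmax(dim=-1).item())                          # index of the predicted class
```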
3.2 Embedding space arithmetic
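In the paper, embeddings from different modalities can be composed by simple addition: for example, summing the embedding of a fruit image with the embedding of bird sounds retrieves images that contain both birds and fruit. A minimal retrieval sketch (again using the hypothetical `ImageBindLike` model; re-normalizing the summed query is an assumed detail):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def retrieve_by_arithmetic(model, image, audio, gallery_embs: torch.Tensor, top_k: int = 5):
    """Add an image embedding and an audio embedding, then retrieve the nearest gallery images."""
    query = model("image", image) + model("audio", audio)   # compose two modalities by addition
    query = F.normalize(query, dim=-1)                       # re-normalize the combined query (assumed step)
    scores = query @ gallery_embs.t()                        # similarity against precomputed image embeddings
    return scores.topk(top_k, dim=-1).indices                # indices of the top-k retrieved images
```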
3.3 Cross-modal alignment improves as the vision model gets larger