To quantify the impact of visual biases on the performance of VLMs, we compare VLMs with different vision encoders on commonly evaluated VQA and captioning tasks. For this, we develop a pre-training, fine-tuning and evaluation setup, as explained in the paper. To the best of our knowledge, this is the first unified and comprehensive evaluation of different vision encoders for vision-language understanding and generation tasks.
We consider eight recently introduced vision encoders: CLIP, OpenCLIP, EVA-CLIP, SIGLIP, SILC, ViT-e, ViT-G, and DINOv2. While they all use ViT-based backbones, they differ in 1) training data (e.g. LAION, WebLI, LVD-142M), 2) training objective (e.g. image-text contrastive learning, masked image modelling, classification), and 3) model size (roughly 300M to 4B parameters). Due to this diversity, they incorporate different biases and thus potentially capture different aspects of the depicted scene. The results are summarized in the table below, and a schematic sketch of the evaluation setup follows it.
| Vision encoder | COCO Cap. ↑ (Karpathy val) | VQAv2 ↑ (Karpathy val) | OKVQA ↑ (val) | GQA ↑ (test-dev) | MMVP ↑ (test) |
| --- | --- | --- | --- | --- | --- |
| CLIP-L/14 | 133.0 | 74.4 | 61.0 | 48.7 | 15.3 |
| OpenCLIP-G | 128.3 | 73.3 | 60.6 | 48.0 | 22.0 |
| EVA-CLIP-g | 140.9 | 77.0 | 63.0 | 50.1 | 27.3 |
| SIGLIP-G/14 | 133.0 | 74.7 | 62.5 | 48.6 | 24.0 |
| SILC-G/16 | 141.1 | 77.0 | 63.4 | 49.7 | 24.0 |
| ViT-e | 137.8 | 75.6 | 61.9 | 49.1 | 25.3 |
| ViT-G | 133.8 | 74.2 | 61.2 | 48.3 | 20.7 |
| DINOv2-L/14 | 127.6 | 71.3 | 59.0 | 48.0 | 22.0 |
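To make the comparison protocol more concrete, below is a minimal PyTorch sketch of a swap-one-encoder harness, assuming a frozen vision encoder, a trainable bridge module, and a frozen language model as placeholders. The names (`FrozenEncoderVLM`, `benchmark_encoders`), the callables, and the shapes in the comments are illustrative assumptions; the actual pre-training, fine-tuning, and evaluation recipe is described in the paper.

```python
import torch
import torch.nn as nn


class FrozenEncoderVLM(nn.Module):
    """Skeleton of a VLM in which only the frozen vision encoder is swapped.

    `encoder`, `bridge`, and `language_model` are placeholders for any modules
    with the assumed shapes noted in the comments; they are not the paper's
    exact components.
    """

    def __init__(self, encoder, bridge, language_model):
        super().__init__()
        self.encoder = encoder.eval().requires_grad_(False)   # frozen vision backbone
        self.bridge = bridge                                   # trainable adapter / resampler
        self.lm = language_model.eval().requires_grad_(False)  # frozen language model

    def forward(self, images, text_tokens):
        with torch.no_grad():
            feats = self.encoder(images)          # (B, N, D) patch features
        soft_prompt = self.bridge(feats)          # (B, Q, lm_dim) soft prompt
        return self.lm(soft_prompt, text_tokens)  # task logits / loss


def benchmark_encoders(encoders, make_bridge, language_model, train_fn, eval_fn):
    """Run the identical training and evaluation recipe for every encoder.

    encoders: dict mapping encoder name -> (frozen module, feature dimension).
    Score differences can then be attributed to the encoders' visual biases.
    """
    scores = {}
    for name, (encoder, feat_dim) in encoders.items():
        model = FrozenEncoderVLM(encoder, make_bridge(feat_dim), language_model)
        train_fn(model)                # same pre-training / fine-tuning recipe for all
        scores[name] = eval_fn(model)  # e.g. CIDEr on COCO, accuracy on VQAv2
    return scores
```

The intent of the sketch is only that the vision encoder is the single varying component: the bridge, language model, data, and training schedule stay fixed across the rows of the table above.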
While we defer the detailed results and takeaways to the paper for brevity, we note two key observations: no single encoder performs best across all tasks (e.g. SILC-G/16 leads on COCO captioning and OKVQA, while EVA-CLIP-g leads on GQA and MMVP), and encoders trained with different data and objectives appear to capture complementary information. This raises the question of whether such complementary visual biases can be combined in a single VLM. We answer positively and introduce BRAVE to achieve this efficiently, as explained next.
Motivated by the benchmarking results, we propose using multiple vision encoders with different, and potentially complementary, visual biases to create more capable VLMs. To do this, we introduce BRAVE, a method that combines the strengths of different vision encoders while remaining efficient in terms of the number of trainable parameters. Below, we provide a summary of how BRAVE works; please see the paper for details.
As shown in the animation below, we introduce the multi-encoder querying transformer, or MEQ-Former, which combines visual features from an arbitrary set of encoders and outputs a fixed-length visual representation that is given as a soft prompt to a frozen LM. Its input is a sequence consisting of a fixed number of learnable queries and embedded text tokens that describe the task. MEQ-Former interacts with visual features through cross-attention layers: the features from different encoders are first linearly projected to a common dimension and then concatenated sequence-wise, and the concatenated features serve as the key and value pairs that are cross-attended by the MEQ-Former's query sequence. This resampling enables efficient processing of a large number of visual features, since it bypasses the quadratic cost of self-attention over the full feature sequence. It also acts as a "bottleneck" that keeps the VLM's total parameter count low compared to a naive ensembling of VLMs. For the main results, we combine five vision encoders, namely EVA-CLIP-g, CLIP-L/14, SILC-G/16, ViT-e, and DINOv2-L/14, to cover all training datasets and objectives from the benchmarking study.
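To make the resampling step concrete, below is a minimal PyTorch sketch of the MEQ-Former idea described above. The dimensions, the number of layers, and the exact block structure are illustrative assumptions rather than the configuration used in the paper.

```python
import torch
import torch.nn as nn


class MEQFormerSketch(nn.Module):
    """Minimal sketch of a multi-encoder querying transformer (MEQ-Former).

    Learnable queries, together with embedded task text, cross-attend to the
    concatenated, linearly projected features of several vision encoders and
    are returned as a fixed-length soft prompt for a frozen language model.
    All sizes below are illustrative assumptions.
    """

    def __init__(self, encoder_dims, d_model=768, num_queries=32,
                 num_layers=4, num_heads=12, lm_dim=2048):
        super().__init__()
        # One linear projection per encoder, mapping its features to d_model.
        self.projections = nn.ModuleList([nn.Linear(d, d_model) for d in encoder_dims])
        # Fixed number of learnable queries, shared across all inputs.
        self.queries = nn.Parameter(0.02 * torch.randn(1, num_queries, d_model))
        # Blocks with self-attention over (queries + text), cross-attention of
        # the queries over the visual features, and a feed-forward layer.
        self.blocks = nn.ModuleList([
            nn.ModuleDict({
                "self_attn": nn.MultiheadAttention(d_model, num_heads, batch_first=True),
                "cross_attn": nn.MultiheadAttention(d_model, num_heads, batch_first=True),
                "ffn": nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                     nn.Linear(4 * d_model, d_model)),
                "norm_sa": nn.LayerNorm(d_model),
                "norm_ca": nn.LayerNorm(d_model),
                "norm_ffn": nn.LayerNorm(d_model),
            })
            for _ in range(num_layers)
        ])
        # Maps the resampled queries to the frozen LM's embedding width.
        self.to_lm = nn.Linear(d_model, lm_dim)

    def forward(self, visual_feats, text_embeds):
        """visual_feats: list of (B, N_k, D_k) tensors, one per encoder.
        text_embeds: (B, T, d_model) embeddings of the task description."""
        batch = text_embeds.shape[0]
        num_q = self.queries.shape[1]
        # Project each encoder's features to a shared width, then concatenate
        # them sequence-wise to form the cross-attention keys and values.
        kv = torch.cat(
            [proj(f) for proj, f in zip(self.projections, visual_feats)], dim=1)
        # Learnable queries and text tokens form the MEQ-Former input sequence.
        x = torch.cat([self.queries.expand(batch, -1, -1), text_embeds], dim=1)
        for blk in self.blocks:
            h = blk["norm_sa"](x)
            x = x + blk["self_attn"](h, h, h, need_weights=False)[0]
            # Only the query tokens cross-attend to the visual features.
            q = x[:, :num_q]
            q = q + blk["cross_attn"](blk["norm_ca"](q), kv, kv, need_weights=False)[0]
            x = torch.cat([q, x[:, num_q:]], dim=1)
            x = x + blk["ffn"](blk["norm_ffn"](x))
        # The resampled queries serve as a fixed-length soft prompt for the LM.
        return self.to_lm(x[:, :num_q])
```

For example, `MEQFormerSketch([1024, 1408, 1536])` would resample the features of three encoders with those (illustrative) widths into 32 soft-prompt tokens, independently of how many patch features each encoder emits.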
We perform evaluations on a broad range of captioning and VQA tasks to show the effectiveness of BRAVE. The figure below provides an overview of the tasks.
We summarize the captioning evaluations in the table below (best in bold, second best in italics). BRAVE uses the fewest trainable parameters (116M) yet achieves strong results on both the COCO and NoCaps benchmarks. For NoCaps, BRAVE is the best-performing method, with significant gains over recent methods. This is especially the case for out-domain samples containing novel classes, demonstrating the usefulness of diverse visual features for robustness. For COCO, BRAVE stays competitive with the best-performing model, PaLI-17B. Notably, BRAVE achieves this while using 150x fewer trainable parameters (116M vs 16.9B), 16x less pre-training data (100M vs 1.6B), and 3x fewer image pixels (336x336 vs 588x588) than PaLI-17B; these ratios are spelled out in the short calculation after the table. This suggests that combining different visual biases is effective for generalization while keeping the sample complexity low.
| Method | Trainable params | Total params | COCO Karpathy test (fine-tuned) | NoCaps val out-domain (zero-shot) | NoCaps val overall (zero-shot) | NoCaps test out-domain (zero-shot) | NoCaps test overall (zero-shot) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Flamingo | 10.6B | 80B | 138.1 | - | - | - | - |
| SimVLM | 632M | 632M | 143.3 | 113.7 | 112.2 | - | 110.3 |
| Qwen-VL | 9.6B | 9.6B | - | - | 121.4 | - | - |
| BLIP-2 | 1.1B | 4.1B | 144.5 | 124.8 | 121.6 | - | - |
| InstructBLIP | 188M | 14.2B | - | - | 121.9 | - | - |
| CoCa | 2.1B | 2.1B | 143.6 | - | 122.4 | - | 120.6 |
| GiT2 | 5.1B | 5.1B | 145.0 | *130.6* | 126.9 | 122.3 | *124.8* |
| PaLI-17B | 16.9B | 16.9B | **149.1** | - | *127.0* | *126.7* | 124.4 |
| BRAVE | 116M | 10.3B | *148.0* | **133.3** | **127.6** | **127.1** | **125.6** |
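As a quick check of the efficiency figures quoted above, the snippet below reproduces the quoted ratios from the raw numbers (plain Python arithmetic; the values are taken from the text and table, not additional results):

```python
# Efficiency ratios behind the BRAVE vs. PaLI-17B comparison above.
trainable_params = {"BRAVE": 116e6, "PaLI-17B": 16.9e9}
pretrain_examples = {"BRAVE": 100e6, "PaLI-17B": 1.6e9}
input_resolution = {"BRAVE": 336, "PaLI-17B": 588}

print(trainable_params["PaLI-17B"] / trainable_params["BRAVE"])    # ~145.7 -> roughly 150x fewer trainable params
print(pretrain_examples["PaLI-17B"] / pretrain_examples["BRAVE"])  # 16.0  -> 16x less pre-training data
print(input_resolution["PaLI-17B"] ** 2
      / input_resolution["BRAVE"] ** 2)                            # ~3.06 -> about 3x fewer input pixels
```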
The figure below shows zero-shot captioning results on NoCaps validation set images. BRAVE outputs accurate descriptions for a diverse set of inputs with visual abstractions, novel classes, and fine-grained details.
We summarize the visual question answering evaluations in the table below (best in bold, second best in italics). Notably, BRAVE obtains the best performance on six out of the seven evaluations, and the second best on VQAv2. As with captioning, this is achieved with a lower sample complexity than PaLI-17B. Furthermore, by leveraging diverse visual features, BRAVE improves performance on benchmarks that probe robustness issues of VLMs, such as MMVP and POPE.
| Method | Trainable params | Total params | VQAv2 test-dev (fine-tuned) | OKVQA val (fine-tuned) | GQA test-dev (fine-tuned) | VizWiz-QA test-dev (zero-shot) | GQA test-dev (zero-shot) | MMVP test (zero-shot) | POPE test (zero-shot) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SimVLM | 632M | 632M | 80.0 | - | - | - | - | - | - |
| Flamingo | 10.2B | 80B | 82.0 | 57.8 | - | 31.6 | - | - | - |
| MiniGPT-v2 | 7B | 8B | - | 57.8 | 60.1 | *53.6* | - | - | - |
| GiT2 | 5.1B | 5.1B | 81.7 | - | - | - | - | - | - |
| Qwen-VL | 9.6B | 9.6B | 79.5 | 58.6 | 59.3 | 35.2 | - | - | - |
| SPHINX-2k | 13B | 16.5B | 80.7 | 62.6 | 63.1 | 44.9 | - | - | *87.2* |
| PaLI-17B | 16.9B | 16.9B | **84.3** | *64.5* | - | - | - | - | - |
| BLIP-2 | 1.2B | 12.1B | 81.6 | 54.7 | - | 29.4 | 44.7 | - | 85.3 |
| InstructBLIP | 188M | 14.2B | - | 55.5 | - | 33.4 | *49.5* | 16.7 | 78.9 |
| LLaVA-1.5 | 13B | 13.4B | 80.0 | - | *63.3* | *53.6* | - | 24.7 | 85.9 |
| LLaVA-1.5 (I-MoF) | 13B | 13.6B | 79.3 | - | - | - | - | *31.3* | 86.7 |
| BRAVE | 3B | 10.3B | *82.5* | **66.0** | **66.3** | **54.2** | **52.7** | **42.0** | **87.6** |
The figure below shows zero-shot VQA results on example pairs from the MMVP benchmark. BRAVE significantly improves performance on a broad set of challenging inputs compared to recent methods as well as single-vision-encoder VLM baselines. On the other hand, some examples remain challenging for all VLMs, e.g. those requiring fine-grained text or scene understanding; such cases could benefit from incorporating additional visual biases that target them.
We study additional aspects of BRAVE in the paper, including ablations of its key design choices.
We thank Diego Martin Arroyo, Ferjad Naeem, Xingyi Zhou, Yannick Strümpler and Yongqin Xian for their help with the project.
@article{kar2024brave,
title={{BRAVE}: Broadening the visual encoding of vision-language models},
author={Kar, O{\u{g}}uzhan Fatih and Tonioni, Alessio and Poklukar, Petra and Kulshrestha, Achin and Zamir, Amir and Tombari, Federico},
journal={arXiv preprint arXiv:2404.07204},
year={2024}
}