BRAVE: Broadening the visual encoding of vision-language models

Oğuzhan Fatih Kar, Alessio Tonioni, Petra Poklukar, Achin Kulshrestha, Amir Zamir, Federico Tombari
Google · Swiss Federal Institute of Technology in Lausanne (EPFL)
ECCV 2024 (Oral)

BRAVE pull figure.

We propose BRAVE to broaden the visual capabilities of vision-language models (VLMs). Left: In contrast to existing methods, e.g. InstructBLIP or LLaVA-1.5, that use a single vision encoder, BRAVE combines diverse features from multiple vision encoders into a more versatile and compact representation. The examples are taken from MMVP and assess the VLM's ability to differentiate visually distinct images. Right: BRAVE leads to state-of-the-art performance on a wide range of captioning and visual question answering tasks. Furthermore, it significantly improves the performance on benchmarks, e.g. MMVP, where commonly employed vision encoders, e.g. CLIP, fail.

Introductory ECCV video (5 min).

Summary

Vision-language models (VLMs) are typically composed of a vision encoder, e.g. CLIP, and a language model (LM) that interprets the encoded features to solve downstream tasks. Despite remarkable progress, VLMs are subject to several shortcomings due to the limited capabilities of vision encoders, e.g. "blindness" to certain image features, visual hallucination, etc. To address these issues, we study broadening the visual encoding capabilities of VLMs. We first comprehensively benchmark several vision encoders with different inductive biases for solving VLM tasks. We observe that there is no single encoding configuration that consistently achieves top performance across different tasks, and that encoders with different biases can perform surprisingly similarly. Motivated by this, we introduce a method, named BRAVE, that consolidates features from multiple frozen encoders into a more versatile representation that can be directly fed as input to a frozen LM. BRAVE achieves state-of-the-art performance on a broad range of captioning and VQA benchmarks and significantly reduces the aforementioned issues of VLMs, while requiring fewer trainable parameters than existing methods and using a more compressed representation. Our results highlight the potential of incorporating different visual biases for a broader and more contextualized visual understanding of VLMs.

Benchmarking vision encoders

To quantify the impact of visual biases on the performance of VLMs, we compare VLMs with different vision encoders on commonly evaluated VQA and captioning tasks. For this, we develop a pre-training, fine-tuning and evaluation setup, as explained in the paper. To the best of our knowledge, this is the first unified and comprehensive evaluation of different vision encoders for vision-language understanding and generation tasks.
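Schematically, the comparison boils down to the loop below. This is a hedged sketch, not the actual pipeline: `pretrain_bridge` and `evaluate` are hypothetical callables standing in for the pre-training, fine-tuning, and evaluation stages described in the paper; only the overall structure (one frozen encoder at a time, same trainable bridge and frozen LM, same benchmarks) reflects the setup.

```python
# Schematic sketch of the unified benchmark: each frozen vision encoder is paired with the
# same trainable bridge and frozen LM, then scored with the same protocol on every task.
# `pretrain_bridge` and `evaluate` are hypothetical placeholders, not the paper's code.
from typing import Callable, Dict

ENCODERS = ["CLIP-L/14", "OpenCLIP-G", "EVA-CLIP-g", "SigLIP-G/14",
            "SILC-G/16", "ViT-e", "ViT-G", "DINOv2-L/14"]
BENCHMARKS = ["COCO captioning", "VQAv2", "OKVQA", "GQA", "MMVP"]


def benchmark_encoders(pretrain_bridge: Callable[[str], object],
                       evaluate: Callable[[object, str], float]) -> Dict[str, Dict[str, float]]:
    results: Dict[str, Dict[str, float]] = {}
    for enc in ENCODERS:
        bridge = pretrain_bridge(enc)  # only the bridge is trained; encoder and LM stay frozen
        results[enc] = {task: evaluate(bridge, task) for task in BENCHMARKS}
    return results
```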

We consider eight recently introduced vision encoders: CLIP, OpenCLIP, EVA-CLIP, SigLIP, SILC, ViT-e, ViT-G, and DINOv2. While they all use ViT-based backbones, they differ in terms of 1) training data, e.g. LAION, WebLI, LVD-142M, etc., 2) training objective, e.g. image-text contrastive learning, masked image modelling, classification, etc., and 3) model size, ranging from roughly 300M to 4B parameters. Due to this diversity, they incorporate different biases and thus potentially capture different aspects of the depicted scene. The results are summarized in the table below.

| Vision encoder | COCO Cap. (Karpathy val) ↑ | VQAv2 (Karpathy val) ↑ | OKVQA (val) ↑ | GQA (test-dev) ↑ | MMVP (test) ↑ |
|---|---|---|---|---|---|
| CLIP-L/14 | 133.0 | 74.4 | 61.0 | 48.7 | 15.3 |
| OpenCLIP-G | 128.3 | 73.3 | 60.6 | 48.0 | 22.0 |
| EVA-CLIP-g | 140.9 | **77.0** | 63.0 | **50.1** | **27.3** |
| SigLIP-G/14 | 133.0 | 74.7 | 62.5 | 48.6 | 24.0 |
| SILC-G/16 | **141.1** | **77.0** | **63.4** | 49.7 | 24.0 |
| ViT-e | 137.8 | 75.6 | 61.9 | 49.1 | 25.3 |
| ViT-G | 133.8 | 74.2 | 61.2 | 48.3 | 20.7 |
| DINOv2-L/14 | 127.6 | 71.3 | 59.0 | 48.0 | 22.0 |

Benchmarking vision encoders on different vision-language tasks. For COCO captioning, we report the CIDEr score; for VQA tasks, top-1 accuracy; for MMVP, average pair accuracy. The best result in each column is shown in bold.

While we defer the detailed results and takeaways to the paper for brevity, we note the following two key observations:

  • There is no single vision encoder that can perform consistently well for different captioning and visual question answering tasks, suggesting that using a single encoder in a VLM is inherently limited.
  • Encoders with different biases can perform surprisingly similarly, suggesting that multiple visual cues can be leveraged to solve these tasks.
Our findings motivated us to ask the following question:

Can we broaden the visual capabilities of VLMs by combining vision encoders with different biases?

We answer positively and introduce BRAVE to achieve this efficiently, as explained next.

BRAVE method overview

Motivated by the benchmarking results, we propose using multiple vision encoders with different, and potentially complementary, visual biases to create more capable VLMs. To do this, we introduce BRAVE, a method that combines the strengths of different vision encoders while remaining efficient in terms of the number of trainable parameters. Below, we provide a summary of how BRAVE works. Please see the paper for details.

As shown in the animation below, we introduce the multi-encoder querying transformer, or MEQ-Former, which combines visual features from an arbitrary set of encoders and outputs a fixed-length visual representation that is given as a soft prompt to a frozen LM. It takes as input a sequence composed of a fixed number of learnable queries and embedded text tokens that describe the task. MEQ-Former interacts with visual features through cross-attention layers. The visual features from different encoders are first linearly projected to the same dimension and then concatenated sequence-wise. The resulting concatenated feature is passed as the key and value pair to the cross-attention layers in MEQ-Former, where it is cross-attended by the query sequence. This resampling enables efficient processing of a large number of visual features, since it bypasses the quadratic complexity of self-attention over the concatenated sequence. It also acts as a "bottleneck" that keeps the VLM's total parameter count low compared to naively ensembling VLMs. For the main results, we combine five vision encoders, namely EVA-CLIP-g, CLIP-L/14, SILC-G/16, ViT-e, and DINOv2-L/14, to cover all training datasets and objectives from the benchmarking study.

Overview of BRAVE. We keep all the vision encoders (VEs) and the language model (LM) frozen. Linear projection layers map features from K different VEs, e.g. K=5, to a common dimension so that they can be concatenated sequence-wise. These are then resampled by the multi-encoder querying transformer, i.e. MEQ-Former, which accepts a set of learnable queries and a text prompt describing the task as inputs. The output of MEQ-Former is projected to the input space of the LM using fully-connected (FC) layers.
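To make the data flow concrete, here is a minimal PyTorch-style, single-block sketch of this resampling step. The widths, number of queries, and LM embedding size are placeholder assumptions, and the real MEQ-Former is a deeper transformer whose details are given in the paper.

```python
# Minimal single-block sketch of MEQ-Former-style resampling (illustrative, not the paper's code).
import torch
import torch.nn as nn


class MEQFormerSketch(nn.Module):
    """Learnable queries + text tokens cross-attend to concatenated visual features from
    K frozen encoders; the query outputs are projected into the frozen LM's input space."""

    def __init__(self, enc_dims, d_model=768, n_queries=32, n_heads=12, lm_dim=2048):
        super().__init__()
        # One linear projection per vision encoder, mapping its features to a shared width.
        self.projections = nn.ModuleList([nn.Linear(d, d_model) for d in enc_dims])
        # Fixed number of learnable queries that yield a fixed-length visual prompt.
        self.queries = nn.Parameter(torch.randn(n_queries, d_model) * 0.02)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Cross-attention: queries (and text tokens) attend to the concatenated visual features.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        # Projection into the frozen LM's input embedding space (lm_dim is a placeholder).
        self.to_lm = nn.Linear(d_model, lm_dim)

    def forward(self, enc_feats, text_embeds):
        # enc_feats: list of [B, N_k, D_k] features from K frozen encoders.
        # text_embeds: [B, T, d_model] embedded text tokens describing the task.
        projected = [proj(f) for proj, f in zip(self.projections, enc_feats)]
        kv = torch.cat(projected, dim=1)                    # concatenate sequence-wise
        q = self.queries.unsqueeze(0).expand(kv.shape[0], -1, -1)
        x = torch.cat([q, text_embeds], dim=1)              # queries + task text
        x = x + self.self_attn(x, x, x)[0]
        x = x + self.cross_attn(x, kv, kv)[0]               # resample the visual features
        x = x + self.ffn(x)
        n_q = self.queries.shape[0]
        return self.to_lm(x[:, :n_q])                       # fixed-length soft prompt for the LM


if __name__ == "__main__":
    # Hypothetical feature shapes for K=3 frozen encoders with different widths and lengths.
    feats = [torch.randn(2, 257, 1408), torch.randn(2, 577, 1024), torch.randn(2, 257, 1536)]
    text = torch.randn(2, 16, 768)                          # placeholder embedded task prompt
    meq = MEQFormerSketch(enc_dims=[1408, 1024, 1536])
    print(meq(feats, text).shape)                           # torch.Size([2, 32, 2048])
```

The point the sketch illustrates is the bottleneck: regardless of how many encoders or visual tokens are concatenated into the key/value sequence, the LM always receives the same small number of soft-prompt tokens.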

Results

We perform evaluations on a broad range of captioning and VQA tasks to show the effectiveness of BRAVE. The figure below provides an overview of the tasks.

Task overview

Overview of the evaluation tasks. They probe different capabilities of VLMs, which is important for understanding their strengths and weaknesses.

Captioning

We summarize the captioning evaluations in the table below (best result in each column in bold). BRAVE uses the fewest trainable parameters (116M) yet achieves strong results on both the COCO and NoCaps benchmarks. On NoCaps, BRAVE is the best-performing method, with significant gains over recent methods; this is especially the case for out-domain samples with novel classes, demonstrating the usefulness of diverse visual features for robustness. On COCO, BRAVE stays competitive with the best-performing model, PaLI-17B. Notably, BRAVE achieves this while using roughly 150x fewer trainable parameters (116M vs 16.9B), 16x less pre-training data (100M vs 1.6B samples), and 3x fewer image pixels (336x336 vs 588x588) than PaLI-17B, suggesting that combining different visual biases is effective for generalization while keeping the sample complexity low.
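As a quick arithmetic check of the ratios quoted above (illustrative only, using the figures stated in this section):

```python
# Quick check of the ratios quoted above (values are rounded in the text).
trainable_ratio = 16.9e9 / 116e6          # ~146x fewer trainable parameters (quoted as ~150x)
data_ratio = 1.6e9 / 100e6                # 16x less pre-training data
pixel_ratio = (588 * 588) / (336 * 336)   # ~3.06x fewer input pixels
print(f"{trainable_ratio:.0f}x, {data_ratio:.0f}x, {pixel_ratio:.2f}x")  # 146x, 16x, 3.06x
```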



| Method | Trainable params | Total params | COCO Karpathy test (fine-tuned) | NoCaps val out-domain (zero-shot) | NoCaps val overall (zero-shot) | NoCaps test out-domain (zero-shot) | NoCaps test overall (zero-shot) |
|---|---|---|---|---|---|---|---|
| Flamingo | 10.6B | 80B | 138.1 | - | - | - | - |
| SimVLM | 632M | 632M | 143.3 | 113.7 | 112.2 | - | 110.3 |
| Qwen-VL | 9.6B | 9.6B | - | - | 121.4 | - | - |
| BLIP-2 | 1.1B | 4.1B | 144.5 | 124.8 | 121.6 | - | - |
| InstructBLIP | 188M | 14.2B | - | - | 121.9 | - | - |
| CoCa | 2.1B | 2.1B | 143.6 | - | 122.4 | - | 120.6 |
| GiT2 | 5.1B | 5.1B | 145.0 | 130.6 | 126.9 | 122.3 | 124.8 |
| PaLI-17B | 16.9B | 16.9B | **149.1** | - | 127.0 | 126.7 | 124.4 |
| BRAVE | 116M | 10.3B | 148.0 | **133.3** | **127.6** | **127.1** | **125.6** |
The figure below shows zero-shot captioning results on NoCaps validation set images. BRAVE outputs accurate descriptions for a diverse set of inputs with visual abstractions, novel classes, and fine-grained details.

Captioning results

Zero-shot captioning on samples from NoCaps validation set using BRAVE.

Visual Question Answering

We summarize the visual question answering evaluations in the table below (best result in each column in bold; FT = fine-tuned, ZS = zero-shot). Notably, BRAVE obtains the best performance on six out of the seven evaluations and the second best on VQAv2. As with captioning, this is achieved with a lower sample complexity than PaLI-17B. Furthermore, by leveraging diverse visual features, BRAVE improves the performance on benchmarks that measure robustness issues of VLMs, such as MMVP and POPE.



| Method | Trainable params | Total params | VQAv2 test-dev (FT) | OKVQA val (FT) | GQA test-dev (FT) | VizWiz-QA test-dev (ZS) | GQA test-dev (ZS) | MMVP test (ZS) | POPE test (ZS) |
|---|---|---|---|---|---|---|---|---|---|
| SimVLM | 632M | 632M | 80.0 | - | - | - | - | - | - |
| Flamingo | 10.2B | 80B | 82.0 | 57.8 | - | 31.6 | - | - | - |
| MiniGPT-v2 | 7B | 8B | - | 57.8 | 60.1 | 53.6 | - | - | - |
| GiT2 | 5.1B | 5.1B | 81.7 | - | - | - | - | - | - |
| Qwen-VL | 9.6B | 9.6B | 79.5 | 58.6 | 59.3 | 35.2 | - | - | - |
| SPHINX-2k | 13B | 16.5B | 80.7 | 62.6 | 63.1 | 44.9 | - | - | 87.2 |
| PaLI-17B | 16.9B | 16.9B | **84.3** | 64.5 | - | - | - | - | - |
| BLIP-2 | 1.2B | 12.1B | 81.6 | 54.7 | - | 29.4 | 44.7 | - | 85.3 |
| InstructBLIP | 188M | 14.2B | - | 55.5 | - | 33.4 | 49.5 | 16.7 | 78.9 |
| LLaVA-1.5 | 13B | 13.4B | 80.0 | - | 63.3 | 53.6 | - | 24.7 | 85.9 |
| LLaVA-1.5 (I-MoF) | 13B | 13.6B | 79.3 | - | - | - | - | 31.3 | 86.7 |
| BRAVE | 3B | 10.3B | 82.5 | **66.0** | **66.3** | **54.2** | **52.7** | **42.0** | **87.6** |

The figure below shows zero-shot VQA results on example pairs from the MMVP benchmark. BRAVE significantly improves performance on a broad set of challenging inputs compared to recent methods as well as single-vision-encoder VLM baselines. On the other hand, some examples remain challenging for all VLMs, e.g. those that require fine-grained text or scene understanding; these could benefit from incorporating additional visual biases that target such capabilities.

VQA results

Zero-shot VQA on samples from the MMVP benchmark. Following the evaluation protocol, a model is considered correct on a pair only if it answers the questions for both images correctly, i.e. only if it can successfully differentiate between the two semantically different images. Note that the images in a pair are seen independently, i.e. neither image is provided as context for the other.
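For reference, the pair-accuracy metric described above reduces to the following check. This is a minimal sketch; the per-image correctness flags and pair layout are hypothetical illustrations of the protocol, not the official evaluation code.

```python
# Pair accuracy for MMVP-style evaluation: a pair counts as correct only when the model's
# answers for BOTH images in the pair are correct.
from typing import Dict, List, Tuple


def pair_accuracy(correct: Dict[str, bool], pairs: List[Tuple[str, str]]) -> float:
    """correct: per-image flag indicating whether the model's answer matched the ground truth.
    pairs: list of (image_id_a, image_id_b) tuples forming the benchmark pairs."""
    n_correct = sum(1 for a, b in pairs if correct[a] and correct[b])
    return n_correct / len(pairs)


# Hypothetical example: two pairs, only the first answered correctly for both images.
print(pair_accuracy({"a1": True, "a2": True, "b1": True, "b2": False},
                    [("a1", "a2"), ("b1", "b2")]))   # 0.5
```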

Analysis

We study different aspects of BRAVE in the paper. This includes:

  • A comprehensive study that ablates different design choices, e.g. fine-tuning, language model, training resolution, text conditioning, etc.
  • Contribution of vision encoders to the final performance
  • Robustness to missing encoders
  • Role of pre-training data
We refer interested readers to the paper and the supplementary material for details.


Acknowledgements

We thank Diego Martin Arroyo, Ferjad Naeem, Xingyi Zhou, Yannick Strümpler and Yongqin Xian for their help with the project.

BibTeX

@article{kar2024brave,
        title={{BRAVE}: Broadening the visual encoding of vision-language models},
        author={Kar, O{\u{g}}uzhan Fatih and Tonioni, Alessio and Poklukar, Petra and Kulshrestha, Achin and Zamir, Amir and Tombari, Federico},
        journal={arXiv preprint arXiv:2404.07204},
        year={2024}
      }