Who Is the Strongest Multimodal Model? VLMEvalKit Reveals Multimodal Capabilities

With the emergence of pioneering multimodal understanding projects such as OpenFlamingo, LLaVA, and MiniGPT-4, we have witnessed the birth of over a hundred innovative multimodal models and numerous evaluation datasets. Faced with the rapid expansion of this field, we recognize a challenge:

Different multimodal models often provide testing results on different evaluation sets, but so far, there has been no unified open-source evaluation framework that comprehensively covers these diverse models and evaluation sets.

To address this, the OpenCompass team has developed VLMEvalKit, a brand new open-source multimodal evaluation framework designed to provide reliable and reproducible evaluation results, helping the community more accurately compare the performance of different multimodal models on various tasks.

GitHub:

https://github.com/open-compass/VLMEvalKit

(Feel free to try it out.)

Main Features

We summarize the main features of VLMEvalKit as follows:

1. Scope of Application:

The current version of VLMEvalKit focuses on evaluating image-text multimodal models. Depending on a model's capabilities, it supports either a single image-text pair as input or an arbitrary number of interleaved image-text inputs. The following code demonstrates both kinds of inference with VLMEvalKit:

from vlmeval.config import supported_VLM
model = supported_VLM['idefics_9b_instruct']()
# Inference for single pair of image-text based on VLMEvalKit
ret = model.generate('apple.jpg', 'What is in this image?')
# ret: "The image features a red apple with a leaf on it."
# Inference for any interleaved image-text based on VLMEvalKit
ret = model.interleave_generate([
    'apple.jpg', 'apple.jpg', 'How many apples are there in the provided images? '])
# ret: "There are two apples in the provided images."

2. Rich Model and Evaluation Set Support:

  1. Supports three mainstream multimodal API models: GPT-4V, GeminiPro, and QwenVLPlus

  2. Supports more than thirty open-source multimodal models, including LLaVA-v1.5, mPLUG-Owl2, XComposer, and CogVLM (a short snippet for listing the locally available model names follows this list)

  3. Supports more than ten open-source multimodal evaluation sets, including MME, MMBench, SEEDBench, and MMMU

  4. We have run detailed evaluations of the supported models on the supported evaluation sets and published the results on the OpenCompass multimodal leaderboard: https://opencompass.org.cn/leaderboard-multimodal
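
To see which model identifiers your local installation actually exposes (these are the names passed to run.py via --model in the examples later in this post), you can inspect the supported_VLM registry that also appears in the demo code; note that the exact key set depends on the installed version:

from vlmeval.config import supported_VLM
# List every model identifier registered in this installation of VLMEvalKit.
print(sorted(supported_VLM.keys()))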


3. Convenient One-Stop Evaluation:

  1. For all datasets supported by VLMEvalKit, no manual data preprocessing is required

  2. Evaluation of multiple multimodal models and datasets can be completed with one command

4. Easy to Extend: With the VLMEvalKit framework, new multimodal models and evaluation sets are easy to add. Moreover, once a new model (or evaluation set) is added, it can immediately be evaluated on all existing evaluation sets (or by all existing models).

  1. Add New Evaluation Sets: In most cases, you only need to convert your custom evaluation set into the TSV format supported by VLMEvalKit and provide a matching prompt-construction method (a minimal sketch follows this list). [AI2D](https://github.com/open-compass/VLMEvalKit/pull/51) provides a reference example.

  2. Add New Multimodal Models: To add a new multimodal model, you only need to implement a new class that exposes a simple generate(image, prompt) interface. This applies to both API models (QwenVLPlus, reference: https://github.com/open-compass/VLMEvalKit/pull/27/) and open-source models (Monkey, reference: https://github.com/open-compass/VLMEvalKit/pull/45).

  3. Select Custom Prompts for Different Evaluation Sets: We understand that developers may prefer different prompt templates for different evaluation sets to achieve the best results, so VLMEvalKit supports this functionality as well.
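
To make the extension points above more concrete, here is a minimal sketch. The TSV column layout (index, image, question, A-D, answer), the helper names, and the class name are illustrative assumptions on our part rather than the exact requirements; the AI2D and Monkey PRs linked above are the authoritative references.

import base64

import pandas as pd

def encode_image(path):
    # Return the base64-encoded contents of an image file; VLMEvalKit TSVs
    # typically store images as base64 strings rather than file paths.
    with open(path, 'rb') as f:
        return base64.b64encode(f.read()).decode('utf-8')

def build_custom_tsv(samples, out_path='MyBenchmark.tsv'):
    # `samples` is a list of dicts. The multiple-choice columns used in the
    # __main__ example below are assumptions; check the AI2D PR for the exact
    # columns a real dataset needs.
    pd.DataFrame(samples).to_csv(out_path, sep='\t', index=False)

class MyVLM:
    # Hypothetical model wrapper exposing the generate(image, prompt)
    # interface described in point 2 above.
    def __init__(self):
        self.client = None  # load your model weights or API client here

    def generate(self, image_path, prompt):
        # Run the underlying model on one image-text pair and return a string answer.
        raise NotImplementedError('plug in your own inference call here')

if __name__ == '__main__':
    samples = [{
        'index': 0,
        'image': encode_image('example_0.jpg'),  # path to your local image
        'question': 'What fruit is shown in the image?',
        'A': 'Apple', 'B': 'Banana', 'C': 'Cherry', 'D': 'Grape',
        'answer': 'A',
    }]
    build_custom_tsv(samples)

After registering such a wrapper class under a name in the supported_VLM registry, it should be usable with the same run.py commands shown in the next section; again, the Monkey PR shows the exact registration steps.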

How to Use

Installation

git clone https://github.com/open-compass/VLMEvalKit.git
cd VLMEvalKit
pip install -e .

Demo: Verify the Installation

from vlmeval.config import supported_VLM
model = supported_VLM['idefics_9b_instruct']()
ret = model.generate('apple.jpg', 'What is in this image?')
# ret: "The image features a red apple with a leaf on it."
ret = model.interleave_generate([
    'apple.jpg', 'apple.jpg', 'How many apples are there in the provided images? '])
# ret: "There are two apples in the provided images."

Conduct Evaluation

# Model: qwen_chat; evaluation set: MME; hardware: 2x A100 GPUs
torchrun --nproc-per-node=2 run.py --data MME --model qwen_chat --verbose
# Models: IDEFICS-80B-Instruct, Qwen-VL-Chat, mPLUG-Owl2
# Evaluation sets: MMBench_DEV_EN, MME, SEEDBench_IMG
# Hardware: 8x A100 GPUs
torchrun --nproc-per-node=8 run.py --data MMBench_DEV_EN MME SEEDBench_IMG \
         --model idefics_80b_instruct qwen_chat mPLUG-Owl2 --verbose

Evaluation Results

We publish the test results on the OpenCompass multimodal large model leaderboard: https://opencompass.org.cn/leaderboard-multimodal. The leaderboard currently covers the performance of all VLMEvalKit-supported multimodal models on 9 evaluation sets. A partial view of the results is shown below:

[Screenshot: partial view of the OpenCompass multimodal leaderboard]

Quantitative Results

Overall, we have the following findings:

1. Closed-source multimodal API models still lead in overall performance: averaging each model's rank across the evaluation sets, the three top-ranked models, GeminiPro, GPT-4V, and QwenVLPlus, are all closed-source API models.

2. Open-source multimodal models lag in reasoning: on evaluation sets that demand strong reasoning ability (such as MMMU, MMVet, and MathVista), open-source models (e.g., InternLM-XComposer) still show a clear gap compared to closed-source models.

To make it easier to compare models, we visualize the performance of 9 mainstream multimodal models below:

[Figure: performance comparison of 9 mainstream multimodal models]

Qualitative Results

To understand where current multimodal models fall short, we selected questions from the nine evaluation sets above that none of the multimodal models answered correctly. Some examples follow:

1. Questions Requiring External Knowledge

[Question image]

Source: MathVista

Question: What is the age gap between these two people in the image? (Unit: years)

Answer: 11

[Question image]

Source: MMMU

Question: In the Section of left leg, identify the 170 structure.

Options: A. Tibialis anterior B. Tibialis posterior C. Flexor hallucis longus D. Peroneus longus

Answer: D

[Question image]

Source: MME

Question: Is the person inside the red bounding box called Michael Keaton? Please answer yes or no.

Answer: Yes

2. Complex Multimodal Reasoning

[Question image]

Source: MathVista

Question: In the figure, AB=AC, ∠CAB=40°, then ∠D’s degree is ()

Options: (A) 40° (B) 50° (C) 60° (D) 70°

Answer: (D) 70°

[Question image]

Source: MathVista

Question: How many triangles do you see in the picture?

Answer: 12

3. Complex Chart Analysis

[Question image]

Source: MathVista

Question: What is the difference between genres of TV shows watched by the highest female and the lowest female?

Answer: 39

[Question image]

Source: MMMU

Question: Each of the following situations relates to a different company. For company B, find the missing amounts.

Options: A. $63,020 B. $58,410 C. $71,320 D. $77,490

Answer: D

Appendix

VLMEvalKit Project Address:

https://github.com/open-compass/VLMEvalKit

MMBench Performance Leaderboard:

https://mmbench.opencompass.org.cn/leaderboard

OpenCompass Multimodal Large Model Performance Overall Leaderboard:

https://opencompass.org.cn/leaderboard-multimodal

How to Join the MMBench Performance Leaderboard:

  • Send prediction results on the MMBench evaluation set or the official evaluation ID to [email protected];

  • The official evaluation ID can be obtained by submitting prediction results to the evaluation service (https://mmbench.opencompass.org.cn/mmbench-submission).

How to Join the Overall Leaderboard of Multimodal Large Model Performance:

Submit a PR to the VLMEvalKit project, and the leaderboard will be updated accordingly. Reference:

  • [Support New Models] Support Monkey (#45):

    https://github.com/open-compass/VLMEvalKit/pull/45/files

  • [Support New Datasets] Support AI2D (#51):

    https://github.com/open-compass/VLMEvalKit/pull/51/files
