We collect eighteen of the most popular and representative VLMs for assessment, as shown in Table 3. Because model scale may affect performance, we divide the models into ~7B and ~13B groups for a fair comparison. All VLMs are evaluated in a zero-shot setup on our EgoThink benchmark. Since evaluating open-ended model generations is a non-trivial problem, we propose using GPT-4 as an automatic evaluator to better measure the generated answers.
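For concreteness, the single-answer grading step can be sketched as follows. This is a minimal illustration assuming an OpenAI-style chat API and a 0 / 0.5 / 1 scoring scale (consistent with the half-point averages in Table 4); the grading prompt and parsing logic shown here are illustrative assumptions rather than our verbatim implementation.

```python
# Minimal sketch of GPT-4 single-answer grading (illustrative, not verbatim).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical grading prompt; the actual rubric may differ.
GRADER_PROMPT = """\
You are grading an answer to a first-person visual question.
Question: {question}
Reference answer: {reference}
Model answer: {candidate}
Reply with a single score: 1 if the model answer is correct,
0.5 if it is partially correct, and 0 if it is wrong."""

def grade_answer(question: str, reference: str, candidate: str) -> float:
    """Ask GPT-4 to grade one candidate answer against the reference."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic grading
        messages=[{
            "role": "user",
            "content": GRADER_PROMPT.format(
                question=question, reference=reference, candidate=candidate
            ),
        }],
    )
    text = response.choices[0].message.content.strip()
    return float(text.split()[0])  # assumes the grader replies score-first

def dimension_score(grades: list[float]) -> float:
    """Average per-question grades and scale to the 0-100 range of Table 4."""
    return 100 * sum(grades) / len(grades)
```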
| Models | Average | Exist | Attr | Afford | Activity | Loc | Spatial | Count | Compar | Situated | Forecasting | Nav | Assist |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4V(ision) | **65.5** | 62.0 | **82.0** | **58.0** | **59.5** | <u>86.0</u> | <u>62.0</u> | **42.0** | 48.0 | **83.0** | **55.0** | **64.0** | **84.0** |
| OpenFlamingo-7B | 27.2 | 16.0 | 55.0 | 37.0 | 15.0 | 34.0 | 34.0 | 21.0 | 40.0 | 21.0 | 31.0 | 11.0 | 11.0 |
| BLIP-2-6.7B | 28.1 | 49.0 | 29.0 | 39.0 | 33.5 | 60.0 | 31.0 | 3.0 | 21.0 | 33.0 | 25.0 | 8.0 | 6.0 |
| LLaVA-1.5-7B | 39.0 | 33.0 | 47.0 | <u>54.0</u> | 35.5 | 35.0 | 49.0 | 20.0 | 47.0 | 37.0 | 27.0 | 29.0 | 54.0 |
| MiniGPT-4-7B | 40.6 | 50.0 | 56.0 | 46.0 | 39.0 | 55.0 | 49.0 | 14.0 | 48.0 | 31.0 | 41.5 | 14.0 | 44.0 |
| InstructBLIP-7B | 42.4 | 50.0 | 33.0 | 45.0 | 47.5 | 77.0 | 38.0 | 18.0 | 43.0 | 67.0 | 40.5 | 19.0 | 31.0 |
| LLaMA-Adapter-7B | 42.5 | 37.0 | 60.0 | 46.0 | 34.5 | 48.0 | 51.0 | 29.0 | 39.0 | 25.0 | 41.5 | 42.0 | 57.0 |
| Otter-I-7B | 45.3 | 48.0 | 56.0 | 39.0 | 44.0 | 60.0 | 44.0 | <u>39.0</u> | 48.0 | 42.0 | 38.0 | 31.0 | 55.0 |
| PandaGPT-7B | 46.2 | 40.0 | 56.0 | 41.0 | 37.0 | 61.0 | 52.0 | 19.0 | <u>52.0</u> | 53.0 | 43.0 | 39.0 | 61.0 |
| mPLUG-owl-7B | 48.8 | 56.0 | 58.0 | 47.0 | 53.0 | 60.0 | 53.0 | 25.0 | 49.0 | 44.0 | 49.5 | 33.0 | 58.0 |
| LLaVA-7B | 49.6 | 63.0 | 58.0 | 50.0 | 47.0 | 81.0 | 45.0 | 24.0 | 36.0 | 47.0 | 49.5 | 35.0 | 60.0 |
| InstructBLIP-13B | 42.8 | 52.0 | 55.0 | 49.0 | 54.0 | 63.0 | 49.0 | 11.0 | 33.0 | 59.0 | 44.0 | 19.0 | 25.0 |
| PandaGPT-13B | 43.1 | 35.0 | 52.0 | 41.0 | 40.5 | 68.0 | 31.0 | 32.0 | 40.0 | 47.0 | 45.5 | 16.0 | 69.0 |
| LLaVA-13B-Vicuna | 46.4 | 54.0 | 62.0 | 52.0 | 46.0 | 53.0 | 46.0 | 26.0 | 44.0 | 29.0 | 44.0 | 35.0 | 66.0 |
| BLIP-2-11B | 49.6 | 52.0 | 62.0 | 41.0 | 49.5 | **90.0** | **66.0** | 25.0 | 50.0 | 70.0 | 48.0 | 18.0 | 24.0 |
| InstructBLIP-11B | 51.1 | **74.0** | <u>68.0</u> | 48.0 | 49.5 | <u>86.0</u> | 52.0 | 32.0 | 49.0 | <u>73.0</u> | <u>53.0</u> | 16.0 | 17.0 |
| LLaVA-13B-Llama2 | 55.1 | 65.0 | 61.0 | 45.0 | <u>56.0</u> | 77.0 | 53.0 | 34.0 | 34.0 | 66.0 | 50.5 | <u>49.0</u> | <u>71.0</u> |
| LLaVA-1.5-13B | <u>55.3</u> | <u>66.0</u> | 55.0 | 51.0 | 55.0 | 82.0 | 57.0 | 32.0 | **56.0** | 67.0 | 48.5 | 39.0 | 55.0 |
Table 4: Combined single-answer grading scores in the zero-shot setup across all dimensions. Bold indicates the best performance and underline the second-best. Exist, Attr, Afford, Loc, Spatial, Count, Compar, Situated, Nav, and Assist stand for existence, attribute, affordance, location, spatial relationship, counting, comparison, situated reasoning, navigation, and assistance, respectively.