EgoThink

Can Vision-Language Models Think from a First-Person Perspective?

Sijie Cheng*†,1,2,3, Zhicheng Guo*1,2,3, Jingwen Wu*4, Kechen Fang5,
Peng Li✉,2, Huaping Liu1,3, Yang Liu✉,1,2,3

1Department of Computer Science and Technology, Tsinghua University
2Institute for AI Industry Research (AIR), Tsinghua University
3Beijing National Research Center for Information Science and Technology
4Department of Electrical and Computer Engineering, University of Toronto
5Zhili College, Tsinghua University

*Equal contribution, ✉Corresponding author
†Project Lead: csj23@mails.tsinghua.edu.cn

Figure 1: The main categories of our EgoThink benchmark to comprehensively assess the capability of thinking from a first-person perspective.


[2023-11-27]: Our paper Can Vision-Language Models Think from a First-Person Perspective? has been released.


Vision-language models (VLMs) have recently shown promising results in traditional downstream tasks. Evaluation studies have emerged to assess their abilities, with the majority focusing on the third-person perspective, and only a few addressing specific tasks from the first-person perspective. However, the capability of VLMs to "think" from a first-person perspective, a crucial attribute for advancing autonomous agents and robotics, remains largely unexplored. To bridge this research gap, we introduce EgoThink, a novel visual question-answering benchmark that encompasses six core capabilities with twelve detailed dimensions. The benchmark is constructed using selected clips from egocentric videos, with manually annotated question-answer pairs containing first-person information. To comprehensively assess VLMs, we evaluate eighteen popular VLMs on EgoThink. Moreover, given the open-ended format of the answers, we use GPT-4 as the automatic judge to compute single-answer grading. Experimental results indicate that although GPT-4V leads in numerous dimensions, all evaluated VLMs still possess considerable potential for improvement in first-person perspective tasks. Meanwhile, enlarging the number of trainable parameters has the most significant impact on model performance on EgoThink. In conclusion, EgoThink serves as a valuable addition to existing evaluation benchmarks for VLMs, providing an indispensable resource for future research in the realm of embodied artificial intelligence and robotics.

EgoThink Benchmark


We specifically design six categories with twelve fine-grained dimensions from the first-person perspective for quantitative evaluation.


Figure 2: Categories with fine-grained dimensions and their corresponding examples of EgoThink benchmark.

  • Object: What is around me? Recognizing objects in the real world is a preliminary ability of the human visual system. Images from a first-person or egocentric perspective pay more attention to the objects surrounding the subject or held in the hands. We further divide the object category into three fine-grained dimensions: (1) Existence, predicting whether an object as described is present in the image; (2) Attribute, detecting properties or characteristics (e.g., color) of an object; (3) Affordance, predicting potential actions that a human can apply to an object.
  • Activity: What am I doing? Activity recognition aims to automatically identify specific human activities in video frames or still images. From the egocentric perspective, we mainly focus on actions or activities involving hand-object interaction.
  • Localization: Where am I? Localization is a critical capability for navigation and scene understanding in the real world. We investigate it from two aspects: Location and Spatial Relationship. Location refers to detecting the scene surrounding the subject. Spatial relationship reasoning can take allocentric or egocentric perspectives; we focus on the egocentric one, i.e., the position of an object with respect to the subject.
  • Reasoning: What about the situation around me? Reasoning is involved in nearly every complex decision-making process in our lives. Here we mainly focus on Counting, Comparison, and Situated Reasoning. From the first-person perspective, we generally count or compare objects in our hands or around ourselves. For situated reasoning, we employ cases that cannot be answered directly from the information in the images and require further reasoning.
  • Forecasting: What will happen to me? Forecasting is a critical skill in the real world. From an egocentric view, forecasting typically involves predicting future object-state transformations or hand-object interactions.
  • Planning: How will I do? In reality, planning is an important capability for dealing with complex problems, typically applied in Navigation and Assistance. Navigation involves reaching a goal location from a start position, while assistance offers instructions for solving daily problems.
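The six categories and twelve dimensions above can be captured in a small data structure. The sketch below is for illustration only; the dictionary name and the convention of listing single-dimension categories (Activity, Forecasting) under their own name are our assumptions, not part of the benchmark's code.

```python
# Six EgoThink categories mapped to their fine-grained dimensions
# (Activity and Forecasting each consist of a single dimension).
EGOTHINK_TAXONOMY = {
    "Object": ["Existence", "Attribute", "Affordance"],
    "Activity": ["Activity"],
    "Localization": ["Location", "Spatial Relationship"],
    "Reasoning": ["Counting", "Comparison", "Situated Reasoning"],
    "Forecasting": ["Forecasting"],
    "Planning": ["Navigation", "Assistance"],
}

# Sanity check: six categories, twelve dimensions in total.
assert len(EGOTHINK_TAXONOMY) == 6
assert sum(len(dims) for dims in EGOTHINK_TAXONOMY.values()) == 12
```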

Comparisons with Existing Benchmarks

The ability to think from a first-person perspective is not adequately addressed by current evaluation benchmarks for VLMs. On one hand, most of these benchmarks (six out of nine, as listed in Table 1) focus solely on the third-person perspective. On the other hand, those benchmarks that do consider the first-person perspective only encompass a limited range of capabilities. For instance, EgoTaskQA examines spatial, temporal, and causal aspects, whereas EgoVQA is limited to object, action, and person aspects. Therefore, there is a clear need to develop a comprehensive benchmark to evaluate the first-person capabilities of VLMs more effectively.


Table 1: Comparison of recent comprehensive evaluation benchmarks of VLMs and our proposed benchmark EgoThink. Third and first indicate third-person and first-person perspectives. Datasets/Handcraft/LLMs denote existing datasets, manual annotation, and automatic generation by LLMs. PS/MC/OE indicate pairwise scoring, multi-choice, and open-ended question-answering, respectively.



Table 2: Statistics of six categories with twelve dimensions in our EgoThink benchmark, where spatial* indicates spatial relationship and situated* indicates situated reasoning.


Figure 3: This chart illustrates the distribution of various scene categories within the EgoThink dataset. The "others" category encompasses 13 different scene types, each representing less than one percent of total scenes.

Experiment Results

Vision-Language Models


Table 3: Statistics of compared API-based and open-source VLMs, where TTP and ToP indicate Total Trainable Parameters and Total Parameters, respectively. Moreover, EgoData and Video indicate that there are egocentric visual data and video data for training, respectively.


We collect eighteen popular and representative VLMs to assess, as shown in Table 3. Because model scale may affect performance, we divide the models into ~7B and ~13B groups for a fair comparison. We adopt a zero-shot setup for all VLMs across our EgoThink benchmark. Since evaluating open-ended model generations is not a trivial problem, we use GPT-4 as an automatic evaluator to better measure the generated answers.
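A minimal sketch of the single-answer grading pipeline is given below. The prompt template, the 0/0.5/1 score scale, and the `parse_score` helper are our illustrative assumptions; the exact template and API call used for EgoThink may differ, so the GPT-4 call itself is left as a placeholder comment.

```python
import re

# Hypothetical single-answer grading prompt in the LLM-as-judge style;
# the exact template used by EgoThink may differ.
JUDGE_TEMPLATE = (
    "[Question]\n{question}\n\n"
    "[Reference Answer]\n{reference}\n\n"
    "[Model Answer]\n{answer}\n\n"
    "Rate the model answer against the reference for correctness with a "
    "score of 0 (wrong), 0.5 (partially correct), or 1 (correct). "
    "Reply in the form: Score: <score>"
)

def parse_score(judge_reply: str) -> float:
    """Extract the numeric score from the judge model's reply."""
    match = re.search(r"Score:\s*([01](?:\.\d+)?)", judge_reply)
    if match is None:
        raise ValueError(f"Unparsable judge reply: {judge_reply!r}")
    return float(match.group(1))

prompt = JUDGE_TEMPLATE.format(
    question="What am I holding?",
    reference="A red mug.",
    answer="You are holding a red cup.",
)
# reply = call_gpt4(prompt)  # hypothetical API call, omitted here
print(parse_score("Score: 0.5"))  # -> 0.5
```

Per-dimension results are then the parsed scores averaged over all questions in that dimension.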

Model Average Exist Attr Afford Activity Loc Spatial Count Compar Situated Forecasting Nav Assist
GPT-4V(ision) 65.5 62.0 82.0 58.0 59.5 86.0 62.0 42.0 48.0 83.0 55.0 64.0 84.0
OpenFlamingo-7B 27.2 16.0 55.0 37.0 15.0 34.0 34.0 21.0 40.0 21.0 31.0 11.0 11.0
BLIP-2-6.7B 28.1 49.0 29.0 39.0 33.5 60.0 31.0 3.0 21.0 33.0 25.0 8.0 6.0
LLaVA-1.5-7B 39.0 33.0 47.0 54.0 35.5 35.0 49.0 20.0 47.0 37.0 27.0 29.0 54.0
MiniGPT-4-7B 40.6 50.0 56.0 46.0 39.0 55.0 49.0 14.0 48.0 31.0 41.5 14.0 44.0
InstructBLIP-7B 42.4 50.0 33.0 45.0 47.5 77.0 38.0 18.0 43.0 67.0 40.5 19.0 31.0
LLaMA-Adapter-7B 42.5 37.0 60.0 46.0 34.5 48.0 51.0 29.0 39.0 25.0 41.5 42.0 57.0
Otter-I-7B 45.3 48.0 56.0 39.0 44.0 60.0 44.0 39.0 48.0 42.0 38.0 31.0 55.0
PandaGPT-7B 46.2 40.0 56.0 41.0 37.0 61.0 52.0 19.0 52.0 53.0 43.0 39.0 61.0
mPLUG-owl-7B 48.8 56.0 58.0 47.0 53.0 60.0 53.0 25.0 49.0 44.0 49.5 33.0 58.0
LLaVA-7B 49.6 63.0 58.0 50.0 47.0 81.0 45.0 24.0 36.0 47.0 49.5 35.0 60.0
InstructBLIP-13B 42.8 52.0 55.0 49.0 54.0 63.0 49.0 11.0 33.0 59.0 44.0 19.0 25.0
PandaGPT-13B 43.1 35.0 52.0 41.0 40.5 68.0 31.0 32.0 40.0 47.0 45.5 16.0 69.0
LLaVA-13B-Vicuna 46.4 54.0 62.0 52.0 46.0 53.0 46.0 26.0 44.0 29.0 44.0 35.0 66.0
BLIP-2-11B 49.6 52.0 62.0 41.0 49.5 90.0 66.0 25.0 50.0 70.0 48.0 18.0 24.0
InstructBLIP-11B 51.1 74.0 68.0 48.0 49.5 86.0 52.0 32.0 49.0 73.0 53.0 16.0 17.0
LLaVA-13B-Llama2 55.1 65.0 61.0 45.0 56.0 77.0 53.0 34.0 34.0 66.0 50.5 49.0 71.0
LLaVA-1.5-13B 55.3 66.0 55.0 51.0 55.0 82.0 57.0 32.0 56.0 67.0 48.5 39.0 55.0

Table 4: Combined single-answer grading scores on zero-shot setups for various dimensions. Bold indicates the best performance and underline the second-best performance. Exist, Attr, Afford, Loc, Spatial, Count, Compar, Situated, Nav, and Assist represent existence, attribute, affordance, location, spatial relationship, counting, comparison, situated reasoning, navigation, and assistance, respectively.
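The Average column in Table 4 is consistent with an unweighted mean of the twelve dimension scores. The sketch below checks this for the GPT-4V row; it is a sanity check on the reported numbers, not the authors' aggregation script.

```python
# Twelve per-dimension scores for the GPT-4V row of Table 4
# (Exist, Attr, Afford, Activity, Loc, Spatial, Count, Compar,
#  Situated, Forecasting, Nav, Assist).
gpt4v_scores = [62.0, 82.0, 58.0, 59.5, 86.0, 62.0,
                42.0, 48.0, 83.0, 55.0, 64.0, 84.0]

average = sum(gpt4v_scores) / len(gpt4v_scores)
print(round(average, 1))  # -> 65.5, matching the reported Average
```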



Citation

@article{cheng2023egothink,
    title={Can Vision-Language Models Think from a First-Person Perspective?},
    author={Cheng, Sijie and Guo, Zhicheng and Wu, Jingwen and Fang, Kechen and Li, Peng and Liu, Huaping and Liu, Yang},
    journal={arXiv preprint arXiv:2311.15596},
    year={2023}
}