EgoThink

Can Vision-Language Models Think from a First-Person Perspective?

Sijie Cheng*†,1,2,3, Zhicheng Guo*1,2,3, Jingwen Wu*4, Kechen Fang5,
Peng Li✉,2, Huaping Liu1,3, Yang Liu✉,1,2,3

1Department of Computer Science and Technology, Tsinghua University
2Institute for AI Industry Research (AIR), Tsinghua University
3Beijing National Research Center for Information Science and Technology
4Department of Electrical and Computer Engineering, University of Toronto
5Zhili College, Tsinghua University

*Equal contribution, ✉Corresponding author
†Project Lead: csj23@mails.tsinghua.edu.cn

Figure 1: The main categories of our EgoThink benchmark to comprehensively assess the capability of thinking from a first-person perspective.


[2023-11-27]: Our paper Can Vision-Language Models Think from a First-Person Perspective? has been released.


Vision-language models (VLMs) have recently shown promising results in traditional downstream tasks. Evaluation studies have emerged to assess their abilities, with the majority focusing on the third-person perspective, and only a few addressing specific tasks from the first-person perspective. However, the capability of VLMs to "think" from a first-person perspective, a crucial attribute for advancing autonomous agents and robotics, remains largely unexplored. To bridge this research gap, we introduce EgoThink, a novel visual question-answering benchmark that encompasses six core capabilities with twelve detailed dimensions. The benchmark is constructed using selected clips from egocentric videos, with manually annotated question-answer pairs containing first-person information. To comprehensively assess VLMs, we evaluate eighteen popular VLMs on EgoThink. Moreover, given the open-ended format of the answers, we use GPT-4 as the automatic judge to compute single-answer grading. Experimental results indicate that although GPT-4V leads in numerous dimensions, all evaluated VLMs still possess considerable potential for improvement in first-person perspective tasks. Meanwhile, enlarging the number of trainable parameters has the most significant impact on model performance on EgoThink. In conclusion, EgoThink serves as a valuable addition to existing evaluation benchmarks for VLMs, providing an indispensable resource for future research in the realm of embodied artificial intelligence and robotics.

EgoThink Benchmark


We specifically design six categories with twelve fine-grained dimensions from the first-person perspective for quantitative evaluation.


Figure 2: Categories with fine-grained dimensions and their corresponding examples of EgoThink benchmark.

  • Object: What is around me? Recognizing objects in the real world is a preliminary ability of the human visual system. Images from a first-person or egocentric perspective pay more attention to the objects surrounding the subject or held in the hands. We further divide the object category into three fine-grained dimensions: (1) Existence, predicting whether an object as described is present in the image; (2) Attribute, detecting properties or characteristics (e.g., color) of an object; (3) Affordance, predicting potential actions that a human can apply to an object.
  • Activity: What am I doing? Activity recognition aims to automatically identify specific human activities in video frames or still images. From the egocentric perspective, we mainly focus on actions or activities involving hand-object interaction.
  • Localization: Where am I? Localization is a critical capability for navigation and scene understanding in the real world. We investigate it from two aspects: Location and Spatial Relationship. Location refers to detecting the scene surrounding the subject. Spatial relationship reasoning can take allocentric or egocentric perspectives; we focus on the egocentric one, i.e., the position of an object with respect to the subject.
  • Reasoning: What about the situation around me? Reasoning is involved in nearly every complex decision-making process in our lives. Here we mainly focus on Counting, Comparison, and Situated Reasoning. From the first-person perspective, we generally count or compare objects in our hands or around ourselves. For situated reasoning, we employ cases that cannot be answered directly from the information in the images and require further reasoning.
  • Forecasting: What will happen to me? Forecasting is a critical skill in the real world. From an egocentric view, forecasting typically involves predicting future object-state transformations or hand-object interactions.
  • Planning: How will I do? In reality, planning is an important capability for dealing with complex problems, typically applied in Navigation and Assistance. Navigation involves reaching a goal location from a start position, while assistance offers instructions for solving daily problems.
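The six categories and twelve dimensions above can be captured in a small data structure. The sketch below is for illustration only; the dictionary name and the convention of listing single-dimension categories (Activity, Forecasting) under their own name are our assumptions, not part of the benchmark's code.

```python
# Six EgoThink categories mapped to their fine-grained dimensions
# (Activity and Forecasting each consist of a single dimension).
EGOTHINK_TAXONOMY = {
    "Object": ["Existence", "Attribute", "Affordance"],
    "Activity": ["Activity"],
    "Localization": ["Location", "Spatial Relationship"],
    "Reasoning": ["Counting", "Comparison", "Situated Reasoning"],
    "Forecasting": ["Forecasting"],
    "Planning": ["Navigation", "Assistance"],
}

# Sanity check: six categories, twelve dimensions in total.
assert len(EGOTHINK_TAXONOMY) == 6
assert sum(len(dims) for dims in EGOTHINK_TAXONOMY.values()) == 12
```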

Comparisons with Existing Benchmarks

The ability to think from a first-person perspective is not adequately addressed by current evaluation benchmarks for VLMs. On one hand, most of these benchmarks (six out of nine, as listed in Table 1) focus solely on the third-person perspective. On the other hand, those benchmarks that do consider the first-person perspective only encompass a limited range of capabilities. For instance, EgoTaskQA examines spatial, temporal, and causal aspects, whereas EgoVQA is limited to object, action, and person aspects. Therefore, there is a clear need to develop a comprehensive benchmark to evaluate the first-person capabilities of VLMs more effectively.


Table 1: Comparison of recent comprehensive evaluation benchmarks of VLMs and our proposed benchmark EgoThink. Third and first indicate third-person and first-person perspectives. Datasets/Handcraft/LLMs denote existing datasets, manual annotation, and automatic generation by LLMs. PS/MC/OE indicate pairwise scoring, multi-choice, and open-ended question-answering, respectively.



Table 2: Statistics of six categories with twelve dimensions in our EgoThink benchmark, where spatial* indicates spatial relationship and situated* indicates situated reasoning.


Figure 3: This chart illustrates the distribution of various scene categories within the EgoThink dataset. The "others" category encompasses 13 different scene types, each representing less than one percent of total scenes.

Experiment Results

Vision-Language Models


Table 3: Statistics of compared API-based and open-source VLMs, where TTP and ToP indicate Total Trainable Parameters and Total Parameters, respectively. Moreover, EgoData and Video indicate that there are egocentric visual data and video data for training, respectively.


We collect eighteen popular and representative VLMs to assess, as shown in Table 3. Because model scale may affect performance, we divide the models into ~7B and ~13B groups for a fair comparison. We adopt a zero-shot setup for all VLMs across our EgoThink benchmark. Since evaluating open-ended model generations is not a trivial problem, we use GPT-4 as an automatic evaluator to better measure the generated answers.
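A minimal sketch of the single-answer grading pipeline is given below. The prompt template, the 0/0.5/1 score scale, and the `parse_score` helper are our illustrative assumptions; the exact template and API call used for EgoThink may differ, so the GPT-4 call itself is left as a placeholder comment.

```python
import re

# Hypothetical single-answer grading prompt in the LLM-as-judge style;
# the exact template used by EgoThink may differ.
JUDGE_TEMPLATE = (
    "[Question]\n{question}\n\n"
    "[Reference Answer]\n{reference}\n\n"
    "[Model Answer]\n{answer}\n\n"
    "Rate the model answer against the reference for correctness with a "
    "score of 0 (wrong), 0.5 (partially correct), or 1 (correct). "
    "Reply in the form: Score: <score>"
)

def parse_score(judge_reply: str) -> float:
    """Extract the numeric score from the judge model's reply."""
    match = re.search(r"Score:\s*([01](?:\.\d+)?)", judge_reply)
    if match is None:
        raise ValueError(f"Unparsable judge reply: {judge_reply!r}")
    return float(match.group(1))

prompt = JUDGE_TEMPLATE.format(
    question="What am I holding?",
    reference="A red mug.",
    answer="You are holding a red cup.",
)
# reply = call_gpt4(prompt)  # hypothetical API call, omitted here
print(parse_score("Score: 0.5"))  # -> 0.5
```

Per-dimension results are then the parsed scores averaged over all questions in that dimension.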

Model Average Exist Attr Afford Activity Loc Spatial Count Compar Situated Forecasting Nav Assist
GPT-4V(ision) 65.5 62.0 82.0 58.0 59.5 86.0 62.0 42.0 48.0 83.0 55.0 64.0 84.0
OpenFlamingo-7B 27.2 16.0 55.0 37.0 15.0 34.0 34.0 21.0 40.0 21.0 31.0 11.0 11.0
BLIP-2-6.7B 28.1 49.0 29.0 39.0 33.5 60.0 31.0 3.0 21.0 33.0 25.0 8.0 6.0
LLaVA-1.5-7B 39.0 33.0 47.0 54.0 35.5 35.0 49.0 20.0 47.0 37.0 27.0 29.0 54.0
MiniGPT-4-7B 40.6 50.0 56.0 46.0 39.0 55.0 49.0 14.0 48.0 31.0 41.5 14.0 44.0
InstructBLIP-7B 42.4 50.0 33.0 45.0 47.5 77.0 38.0 18.0 43.0 67.0 40.5 19.0 31.0
LLaMA-Adapter-7B 42.5 37.0 60.0 46.0 34.5 48.0 51.0 29.0 39.0 25.0 41.5 42.0 57.0
Otter-I-7B 45.3 48.0 56.0 39.0 44.0 60.0 44.0 39.0 48.0 42.0 38.0 31.0 55.0
PandaGPT-7B 46.2 40.0 56.0 41.0 37.0 61.0 52.0 19.0 52.0 53.0 43.0 39.0 61.0
mPLUG-owl-7B 48.8 56.0 58.0 47.0 53.0 60.0 53.0 25.0 49.0 44.0 49.5 33.0 58.0
LLaVA-7B 49.6 63.0 58.0 50.0 47.0 81.0 45.0 24.0 36.0 47.0 49.5 35.0 60.0
InstructBLIP-13B 42.8 52.0 55.0 49.0 54.0 63.0 49.0 11.0 33.0 59.0 44.0 19.0 25.0
PandaGPT-13B 43.1 35.0 52.0 41.0 40.5 68.0 31.0 32.0 40.0 47.0 45.5 16.0 69.0
LLaVA-13B-Vicuna 46.4 54.0 62.0 52.0 46.0 53.0 46.0 26.0 44.0 29.0 44.0 35.0 66.0
BLIP-2-11B 49.6 52.0 62.0 41.0 49.5 90.0 66.0 25.0 50.0 70.0 48.0 18.0 24.0
InstructBLIP-11B 51.1 74.0 68.0 48.0 49.5 86.0 52.0 32.0 49.0 73.0 53.0 16.0 17.0
LLaVA-13B-Llama2 55.1 65.0 61.0 45.0 56.0 77.0 53.0 34.0 34.0 66.0 50.5 49.0 71.0
LLaVA-1.5-13B 55.3 66.0 55.0 51.0 55.0 82.0 57.0 32.0 56.0 67.0 48.5 39.0 55.0

Table 4: Combined single-answer grading scores on zero-shot setups for various dimensions. Bold indicates the best performance and underline the second-best performance. Exist, Attr, Afford, Loc, Spatial, Count, Compar, Situated, Nav, and Assist represent existence, attribute, affordance, location, spatial relationship, counting, comparison, situated reasoning, navigation, and assistance, respectively.
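The Average column in Table 4 is consistent with an unweighted mean of the twelve dimension scores. The sketch below checks this for the GPT-4V row; it is a sanity check on the reported numbers, not the authors' aggregation script.

```python
# Twelve per-dimension scores for the GPT-4V row of Table 4
# (Exist, Attr, Afford, Activity, Loc, Spatial, Count, Compar,
#  Situated, Forecasting, Nav, Assist).
gpt4v_scores = [62.0, 82.0, 58.0, 59.5, 86.0, 62.0,
                42.0, 48.0, 83.0, 55.0, 64.0, 84.0]

average = sum(gpt4v_scores) / len(gpt4v_scores)
print(round(average, 1))  # -> 65.5, matching the reported Average
```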



Citation

@article{cheng2023egothink,
    title={Can Vision-Language Models Think from a First-Person Perspective?},
    author={Cheng, Sijie and Guo, Zhicheng and Wu, Jingwen and Fang, Kechen and Li, Peng and Liu, Huaping and Liu, Yang},
    journal={arXiv preprint arXiv:2311.15596},
    year={2023}
}