VidEgoThink

Assessing Egocentric Video Understanding Capabilities for Embodied AI

Sijie Cheng†,1,2,6, Kechen Fang*2,5, Yangyang Yu*2,5, Sicheng Zhou*2,3,
Bohao Li4,6, Ye Tian6, Tingguang Li6, Lei Han✉,6, Yang Liu✉,1,2

1Department of Computer Science and Technology, Tsinghua University
2Institute for AI Industry Research (AIR), Tsinghua University
3Department of Mechanical and Industrial Engineering, University of Toronto
4School of Data Science, The Chinese University of Hong Kong
5Zhili College, Tsinghua University 6Tencent Robotics X

*Equal contribution, ✉Corresponding author
†Project Lead: csj23@mails.tsinghua.edu.cn

Figure 1: The main tasks of the VidEgoThink benchmark, designed to comprehensively assess egocentric video understanding capabilities for Embodied AI. There are four types of tasks: video question answering, hierarchy planning, visual grounding, and reward modeling. These four tasks complement each other to cover the complete goal of Embodied AI.

🔔News

[2024-10]: VidEgoThink was the #1 paper of October 17 on Hugging Face Daily Papers. 🔥
[2024-10]: Our paper VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI has been released.
[2024-09]: EgoThink and VidEgoThink were invited to be presented at ZhiDX.

Abstract

Recent advancements in Multi-modal Large Language Models (MLLMs) have opened new avenues for applications in Embodied AI. Building on our previous work EgoThink, we introduce VidEgoThink, a comprehensive benchmark for evaluating egocentric video understanding capabilities. To bridge the gap between MLLMs and low-level control in Embodied AI, we design four key interrelated tasks: video question answering, hierarchy planning, visual grounding, and reward modeling. To minimize manual annotation costs, we develop an automatic data generation pipeline based on the Ego4D dataset, leveraging the prior knowledge and multimodal capabilities of GPT-4o. Three human annotators then filter the generated data to ensure diversity and quality, resulting in the VidEgoThink benchmark. We conduct extensive experiments with three types of models: API-based MLLMs, open-source image-based MLLMs, and open-source video-based MLLMs. Experimental results indicate that all MLLMs, including GPT-4o, perform poorly across all tasks related to egocentric video understanding. These findings suggest that foundation models still require significant advancements before they can be effectively applied to first-person scenarios in Embodied AI. In conclusion, VidEgoThink reflects a research trend towards employing MLLMs for egocentric vision, akin to human capabilities, enabling active observation and interaction in complex real-world environments.

Background

Various egocentric benchmarks have emerged to evaluate the capabilities of MLLMs from a first-person perspective. However, the absence of a comprehensive video benchmark from the egocentric perspective presents a significant challenge to the development of general foundation models. Furthermore, current benchmarks, in both task design and textual output forms, focus on traditional video question-answering settings and neglect the potential to support downstream applications in Embodied AI, such as smart glasses or autonomous robots. Therefore, it is crucial to design suitable task formats that can be effectively applied to downstream applications in Embodied AI.


Table 1: Comparison of recent evaluation benchmarks of multimodal large language models and our proposed benchmark VidEgoThink. VQA/HP/VG/RM indicate visual question answering, hierarchy planning, visual grounding, and reward modeling. Existing/Handcraft/Automatic denote the way of collecting data: existing dataset, manual annotation, or automatic generation.

VidEgoThink Benchmark

Given that the utilization of foundation models in Embodied AI remains an open research question, we carefully design four types of interrelated tasks for comprehensive assessment: (i) video question answering, (ii) hierarchy planning, (iii) visual grounding, and (iv) reward modeling.

1. Video Question Answering

Previous evaluation studies on egocentric vision have predominantly focused on static images, constrained by the input format limitations of earlier MLLMs. However, recent advancements in API-based and video-based MLLMs have demonstrated significant progress. Since our real world is inherently dynamic and humans frequently process substantial amounts of video data, it is crucial to evaluate the video understanding capabilities of MLLMs. Considering the essential abilities for observing and interacting with the real world from a first-person perspective, we decompose the content of video modalities around “myself” into three main elements: object, action, and scene. Furthermore, we explore a series of fine-grained dimensions from these elements.


Figure 2: Case of video question answering.
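To make the task format concrete, below is a minimal sketch in Python of how a single video question-answering instance could be organized around the three elements and their fine-grained dimensions. The field names and example values are illustrative assumptions, not the released data schema.

# Illustrative sketch: the fields below are assumptions for exposition,
# not the official VidEgoThink annotation format.
from dataclasses import dataclass

@dataclass
class VideoQAInstance:
    video_path: str   # egocentric clip, e.g., sourced from Ego4D
    element: str      # "object" | "action" | "scene"
    dimension: str    # fine-grained dimension, e.g., "existence", "order", "count"
    question: str     # question about the first-person video
    answer: str       # reference answer used for evaluation

example = VideoQAInstance(
    video_path="clips/kitchen_0001.mp4",  # hypothetical path
    element="object",
    dimension="existence",
    question="What object do I pick up after opening the fridge?",
    answer="A carton of milk.",
)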

2. Hierarchy Planning

Recently, a hierarchy planning framework has been proposed to combine the advantages of foundation models and traditional methods in Embodied AI. In detail, foundation models are used as the planner to decompose high-level task instructions (e.g., "cook salmon") into either mid-level steps (e.g., "put salmon in the microwave") or low-level atomic actions (e.g., "find(microwave)"), which are much more convenient for low-level control. Therefore, we design two types of planning tasks: high-level goal to mid-level step (High-to-Mid), and mid-level step to low-level action (Mid-to-Low).


Figure 3: Case of hierarchy planning.
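As a concrete illustration of the two settings, the sketch below in Python decomposes one goal into mid-level steps and then one step into low-level atomic actions. The specific steps and the atomic-action vocabulary are illustrative assumptions rather than the benchmark's exact annotation format.

# Illustrative sketch of the two planning settings; contents are assumptions.

# High-to-Mid: decompose a high-level goal into mid-level steps.
high_level_goal = "cook salmon"
mid_level_steps = [
    "take the salmon out of the fridge",
    "put the salmon in the microwave",
    "set the timer and start the microwave",
]

# Mid-to-Low: decompose one mid-level step into low-level atomic actions,
# expressed as function-style primitives that a controller could execute.
mid_level_step = "put the salmon in the microwave"
low_level_actions = [
    "find(microwave)",
    "open(microwave)",
    "pick_up(salmon)",
    "place(salmon, microwave)",
    "close(microwave)",
]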

3. Visual Grounding

While natural language is effective for human communication, it cannot be directly translated into low-level actions or grounded in the real world. Consequently, visual grounding has garnered significant attention in both image- and video-based MLLMs. This task requires models to ground complex natural language descriptions or instructions in an image or video and output the corresponding pixel-level bounding boxes, masks, or frames. The bounding boxes and masks can directly identify actionable objects, while the frames can provide sufficient spatial or temporal information for downstream tasks. Therefore, we specifically design three tasks for different situations: object grounding, frame grounding, and temporal grounding.


Figure 4: Case of visual grounding.
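The sketch below, in Python, shows one possible shape for the model outputs expected by the three grounding tasks; the coordinate conventions and field names are assumptions for illustration only, not the official schema.

# Illustrative output formats for the three grounding tasks; not the official schema.

# Object grounding: locate the queried object in a frame, e.g., as a pixel-level
# bounding box (x_min, y_min, x_max, y_max); a segmentation mask could be used instead.
object_grounding_output = {
    "query": "the mug I am holding",
    "bbox": (412, 230, 508, 355),
}

# Frame grounding: return the single frame that best matches the instruction.
frame_grounding_output = {
    "query": "the frame in which the microwave door is closed",
    "frame_index": 137,
}

# Temporal grounding: return the start/end timestamps (in seconds) of the segment
# in which the described activity occurs.
temporal_grounding_output = {
    "query": "when do I wash the cutting board?",
    "start_sec": 42.0,
    "end_sec": 57.5,
}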

4. Reward Modeling

In Embodied AI, manually designing reward functions to supervise actions is challenging due to the need for accuracy and diversity, especially for human activities. Benefiting from large-scale Internet training corpora, foundation models can serve as reward models with built-in commonsense and reasoning capabilities. As a reward model, an MLLM should first observe the video to determine the completion status of the target action. If the action is not completed, the reward model should further provide fine-grained feedback to help achieve the goal. Hence, we specifically design two types of tasks: critique and feedback.


Figure 5: Case of reward modeling.
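To illustrate how an MLLM could serve as a reward model for the two tasks, here is a minimal sketch in Python; query_mllm is a hypothetical helper standing in for any API-based or open-source MLLM call, not an interface from the paper.

# Illustrative sketch only; query_mllm is a placeholder, not a real API.

def query_mllm(video_path: str, prompt: str) -> str:
    """Stand-in for a call to a multi-modal LLM; replace with an actual client."""
    raise NotImplementedError

def critique(video_path: str, goal: str) -> bool:
    """Critique task: judge from the egocentric video whether the target action
    (e.g., 'put the salmon in the microwave') has been completed."""
    answer = query_mllm(video_path,
                        f"Has the action '{goal}' been completed? Answer yes or no.")
    return answer.strip().lower().startswith("yes")

def feedback(video_path: str, goal: str) -> str:
    """Feedback task: when the action is not completed, return fine-grained,
    actionable feedback on what should be done next to achieve the goal."""
    return query_mllm(video_path,
                      f"The action '{goal}' is not completed. "
                      "Explain what went wrong and what should be done next.")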

Statistics


Table 2: The statistics of videos across different benchmarks. Duration denotes the average duration of all videos in seconds. LenQ and LenA indicate the average length of questions and answers in words. TypeQ denotes the type of questions.

Leaderboard

Video Question Answering


Table 3: Experimental results of video question answering. OE, OO, OI, OC, OS, OP denote object existence, object order, object interaction, object count, object state, and object prediction. AE, AS, AC indicate action existence, action sequence, and action count. SE, ST, SP denote scene existence, scene transition, and scene prediction. Bold font denotes the best performance and underlined font denotes the second-best performance.

All Tasks


Table 4: Experimental results of video question answering, hierarchy planning, visual grounding, and reward modeling tasks. Bold font denotes the best performance and underlined font denotes the second-best performance.

BibTeX


      @article{cheng2024videgothink,
        title={VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI},
        author={Cheng, Sijie and Fang, Kechen and Yu, Yangyang and Zhou, Sicheng and Li, Bohao and Tian, Ye and Li, Tingguang and Han, Lei and Liu, Yang},
        journal={arXiv preprint arXiv:2410.11623},
        year={2024}
      }