[2024-10]: VidEgoThink is the Top-1 paper of Oct-17 on Hugging Face. 🔥
[2024-10]: Our paper VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI has been released.
[2024-09]: EgoThink and VidEgoThink are invited to be presented at ZhiDX.
Recent advancements in Multi-modal Large Language Models (MLLMs) have opened new avenues for applications in Embodied AI. Building on our previous work, EgoThink, we introduce VidEgoThink, a comprehensive benchmark for evaluating egocentric video understanding capabilities. To bridge the gap between MLLMs and low-level control in Embodied AI, we design four key interrelated tasks: video question answering, hierarchy planning, visual grounding, and reward modeling. To minimize manual annotation costs, we develop an automatic data generation pipeline based on the Ego4D dataset, leveraging the prior knowledge and multimodal capabilities of GPT-4o. Three human annotators then filter the generated data to ensure diversity and quality, resulting in the VidEgoThink benchmark. We conduct extensive experiments with three types of models: API-based MLLMs, open-source image-based MLLMs, and open-source video-based MLLMs. Experimental results indicate that all MLLMs, including GPT-4o, perform poorly across all tasks related to egocentric video understanding. These findings suggest that foundation models still require significant advancements before they can be effectively applied to first-person scenarios in Embodied AI. In conclusion, VidEgoThink reflects a research trend toward employing MLLMs for egocentric vision, akin to human capabilities, enabling active observation and interaction in complex real-world environments.
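To make the data construction workflow above more concrete, here is a minimal sketch assuming Ego4D clip narrations are already available as text; the prompt wording, the output schema, and the `human_filter` stage are hypothetical placeholders, not the released pipeline.

```python
# Hedged sketch: draft candidate QA pairs from Ego4D clip narrations with
# GPT-4o, then keep only the pairs approved by human annotators.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "You are given first-person narrations of a video clip:\n{narrations}\n"
    "Write one question-answer pair about the camera wearer's activity. "
    'Return JSON: {{"question": "...", "answer": "..."}}'
)

def generate_qa(narrations: list[str]) -> dict:
    """Ask GPT-4o to draft one candidate QA pair from clip narrations."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": PROMPT.format(narrations="\n".join(narrations))}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

def human_filter(candidates: list[dict]) -> list[dict]:
    """Placeholder for the manual review stage: annotators accept or reject each pair."""
    return [c for c in candidates if input(f"Keep {c}? [y/n] ").strip().lower() == "y"]
```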
Various egocentric benchmarks have emerged to evaluate the capabilities of MLLMs from a first-person perspective. However, the absence of a comprehensive video benchmark from the egocentric perspective poses a significant challenge to the development of general foundation models. Furthermore, current benchmarks, in both task design and textual output format, focus on traditional video question answering settings and neglect the potential to support downstream applications in Embodied AI, such as smart glasses or autonomous robots. Therefore, it is crucial to design task formats that can be effectively applied to these downstream applications.
Table 1: Comparison of recent evaluation benchmarks of multimodal large language models and our proposed benchmark VidEgoThink. VQA/HP/VG/RM indicate visual question answering, hierarchy planning, visual grounding, and reward modeling. Existing/Handcraft/Automatic denote the data collection method: reuse of an existing dataset, manual annotation, or automatic generation.
Given that the utilization of foundation models in Embodied AI remains an open research question, we carefully design four types of interrelated tasks for comprehensive assessment: (i) video question answering, (ii) hierarchy planning, (iii) visual grounding, and (iv) reward modeling.
Previous evaluation studies on egocentric vision have predominantly focused on static images, constrained by the input format limitations of earlier MLLMs. However, recent advancements in API-based and video-based MLLMs have demonstrated significant progress. Since the real world is inherently dynamic and humans frequently process substantial amounts of video data, it is crucial to evaluate the video understanding capabilities of MLLMs. Considering the essential abilities for observing and interacting with the real world from a first-person perspective, we decompose the content of video modalities around “myself” into three main elements: object, action, and scene. Furthermore, we explore a series of fine-grained dimensions for each of these elements.
Figure 2: Case of video question answering.
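For illustration, the fine-grained dimensions (abbreviations follow Table 3) can be organized as below; the sample entry is a hypothetical format, not an actual benchmark item.

```python
# Fine-grained dimensions of the video QA task, grouped by the three elements
# around "myself" (object, action, scene); abbreviations follow Table 3.
VIDEO_QA_DIMENSIONS = {
    "object": ["existence", "order", "interaction", "count", "state", "prediction"],
    "action": ["existence", "sequence", "count"],
    "scene":  ["existence", "transition", "prediction"],
}

# Hypothetical example of how a single benchmark entry could be structured.
sample_entry = {
    "video_id": "ego4d_clip_0001",      # placeholder id, not from the dataset
    "element": "action",
    "dimension": "sequence",
    "question": "What did I do right after opening the fridge?",
    "answer": "Took out the salmon.",
}
```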
Recently, a hierarchy planning framework has been proposed to combine the advantages of foundation models and traditional methods in Embodied AI. In detail, foundation models are used as the planner to decompose high-level task instructions (e.g., “cook salmon”) into either mid-level steps (e.g., “put salmon in the microwave”) or low-level atomic actions (e.g., “find(microwave)”), which are much easier to execute for low-level control. Therefore, we design two types of planning tasks: high-level goal to mid-level step (High-to-Mid), and mid-level step to low-level action (Mid-to-Low).
Figure 3: Case of hierarchy planning.
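The following sketch illustrates the two planning levels for the “cook salmon” example mentioned above; apart from the quoted goal, step, and action, the remaining steps and the atomic-action vocabulary are illustrative, not benchmark ground truth.

```python
# Hypothetical decomposition of one high-level goal into mid-level steps
# (High-to-Mid) and of one step into low-level atomic actions (Mid-to-Low).
plan = {
    "high_level_goal": "cook salmon",
    "mid_level_steps": [
        "take the salmon out of the fridge",   # illustrative step
        "put salmon in the microwave",
        "start the microwave",                 # illustrative step
    ],
    "low_level_actions": {
        "put salmon in the microwave": [
            "find(microwave)",
            "open(microwave)",                 # illustrative action
            "put(salmon, microwave)",          # illustrative action
            "close(microwave)",                # illustrative action
        ],
    },
}
```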
While natural language is effective for human communication, it cannot be directly translated into low-level actions or grounded in the real world. Consequently, visual grounding has garnered significant attention in both image- and video-based MLLMs. This task requires models to ground complex natural language descriptions or instructions in an image or video and output the corresponding pixel-level bounding boxes, masks, or frames. The bounding boxes and masks can directly identify actionable objects, while the frames can provide sufficient spatial or temporal information for downstream tasks. Therefore, we specifically design three tasks for different situations: object grounding, frame grounding, and temporal grounding.
Figure 4: Case of visual grounding.
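One possible way to represent the outputs of the three grounding sub-tasks is sketched below; the field names and types are illustrative assumptions, not the benchmark's official schema.

```python
# Hedged sketch of possible output formats for the three grounding sub-tasks.
from dataclasses import dataclass

@dataclass
class ObjectGrounding:
    """Object grounding: locate the referred object in a frame."""
    frame_index: int
    bbox_xyxy: tuple[float, float, float, float]  # pixel-level bounding box

@dataclass
class FrameGrounding:
    """Frame grounding: return the single frame that answers the query."""
    frame_index: int

@dataclass
class TemporalGrounding:
    """Temporal grounding: return the time span (in seconds) of the described event."""
    start_sec: float
    end_sec: float
```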
In Embodied AI, manually designing reward functions to supervise actions is challenging due to the need for accuracy and diversity, especially for human activities. Benefiting from large-scale Internet training corpora, foundation models can serve as reward models with built-in commonsense and reasoning capabilities. As a reward model, an MLLM should first observe the video to determine the completion status of the target action. If the action is not completed, the reward model should further provide fine-grained feedback to help achieve the goal. Hence, we specifically design two types of tasks: critique and feedback.
Figure 5: Case of reward modeling.
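The two-stage protocol (critique, then feedback) could be wired up as in the sketch below; `ask_mllm` stands in for any MLLM call, and the prompt wording is illustrative rather than the benchmark's protocol.

```python
# Hedged sketch of querying an MLLM as a reward model: first a yes/no critique,
# then fine-grained feedback only if the goal is not completed.
from dataclasses import dataclass
from typing import Callable

@dataclass
class RewardOutput:
    completed: bool   # critique: has the target action been completed?
    feedback: str     # feedback: fine-grained advice when it has not

def reward_model(goal: str, video_summary: str,
                 ask_mllm: Callable[[str], str]) -> RewardOutput:
    """Two-stage reward query against an arbitrary MLLM backend."""
    critique = ask_mllm(
        f"The goal is '{goal}'. Based on this egocentric video: {video_summary}\n"
        f"Has the goal been completed? Answer 'yes' or 'no'."
    )
    if critique.strip().lower().startswith("yes"):
        return RewardOutput(completed=True, feedback="")
    feedback = ask_mllm(
        f"The goal '{goal}' is not completed in the video: {video_summary}\n"
        f"Give one concrete suggestion for what to do next."
    )
    return RewardOutput(completed=False, feedback=feedback)
```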
Table 2: The statistics of videos across different benchmarks. Duration denotes the average duration of all videos in seconds. LenQ and LenA indicate the average length of questions and answers in words. TypeQ denotes the type of questions.
Table 3: Experimental results of video question answering. OE, OO, OI, OC, OS, OP denote object existence, object order, object interaction, object count, object state, and object prediction. AE, AS, AC indicate action existence, action sequence, and action count. SE, ST, SP denote scene existence, scene transition, and scene prediction. Bold denotes the best performance and underline denotes the second-best performance.
Table 4: Experimental results of video question answering, hierarchy planning, visual grounding, and reward modeling tasks. Bold denotes the best performance and underline denotes the second-best performance.
@article{cheng2024videgothink,
title={VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI},
author={Cheng, Sijie and Fang, Kechen and Yu, Yangyang and Zhou, Sicheng and Li, Bohao and Tian, Ye and Li, Tingguang and Han, Lei and Liu, Yang},
journal={arXiv preprint arXiv:2410.11623},
year={2024}
}