
‘What did the Robot do in my Absence?’
Video Foundation Models to Enhance Intermittent Supervision

Kavindie Katuwandeniya¹, Leimin Tian¹, and Dana Kulic².

The preprint can be found on arXiv.

Abstract

This paper investigates the use of Video Foundation Models (ViFMs) for generating robot data summaries to enhance intermittent human supervision of robot teams. We propose a novel framework that produces both generic and query-driven summaries of long-duration robot vision data in three modalities: storyboards, short videos, and text. Through a user study involving 30 participants, we evaluate the efficacy of these summary methods in allowing operators to accurately retrieve the observations and actions that occurred while the robot was operating without supervision over an extended duration (40 min). Our findings reveal that query-driven summaries significantly improve retrieval accuracy compared to generic summaries or raw data, albeit with increased task duration. Storyboards are found to be the most effective presentation modality, especially for object-related queries.

Fig. 1: Diagram of the proposed robot summary generation system. The system generates generic or query-driven summaries of long egocentric robot videos in the form of storyboards, short videos, or text to help a user review the robots’ autonomous history.

Video


Analysis of Cause of Errors

We compared the ground-truth answer, the model-generated answer, and the answer given by the user to further understand the causes of errors and inform future research.

Assumptions for error analysis

Determining the correctness of the model’s answers was not always straightforward. The model’s inaccuracies can stem from various factors, including the complexity of the query and the inherent limitations of the model itself. Below we present the assumptions we adopted during the error analysis, with illustrative examples.

  1. Not querying the model.

    Some users did not explicitly query the model about some questions; instead, they watched the given raw video or simply answered ‘Not sure’. In these cases, we use NaN as the model answer, since the model did not generate one. As a result, there are only 243 model answers for the Q condition instead of 270 (15 participants × 6 questions × 3 tasks, one each for V, S, and T).

  2. Visual representation of the absence of an object.

    For questions where the ground-truth answer was 0/No/No-one, when the summary modality was video or storyboard, a summary video or storyboard was always generated, because the number of video segments (K/k) and the number of images (M/m) were not adjusted dynamically. However, this video/storyboard would not show any object/event instances, since there were none in the raw video. In these cases, users can interpret the model output either as the absence of the object or as an incapability of the model, leading some to answer ‘Not sure’ (see the sketch after this list).

  3. Complex interplay between the user, the model, and the ground truth.

    1. A mismatch between the correct answer and the answer provided by the model could be due to shortcomings of the model or of the user’s query. However, even if a user inputs a valid query and the model outputs a perfectly valid answer, that may not be enough to answer the question. For example, consider the following scenario in query-driven summary via the text modality (QT):
      • Question to the user: ‘In the first three-way junction you get inside the tunnel, which direction do you choose to explore?’
      • Query by the user: ‘What time does the main junction with three separate tunnels appear in the video’
      • Model Answer: ‘The main junction with three separate tunnels appears in the video at night’
      • User Answer: ‘Not sure’
      • Ground Truth Answer: ‘Middle’

      The query is reasonable: the user is trying to narrow down which part of the video they should pay attention to. The model’s answer is also appropriate, but it does not address the user’s query in the way they intended. It therefore becomes unclear where the overall error should be assigned.

    2. A mismatch between the answer provided by the model and the answer provided by the user could occur because users might not trust the model’s answer, or might misinterpret it. For example, consider the following scenario under QT, for an object-type question:
      • Question to the user: ‘What objects did you observe in the following list and how many of each? Blue Rope Bundle , Fire Extinguisher, Drill’
      • Query by the user: ‘how many fire extinguisher and drills in the video?’
      • Model Answer: ‘There are a total of two fire extinguishers and drills shown in the video’
      • User Answer: ‘Fire Extinguisher = 2, Drill = 2’
      • Ground Truth Answer: ‘Fire Extinguisher = >2, Drill = 2’

      The user has answered 2 for fire extinguishers and 2 for drills, which is one way to interpret what the model said.
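
To make Assumption 2 concrete, the following is a minimal sketch (in Python, with illustrative variable names; not the paper’s implementation) of why a fixed-size top-M selection always produces a storyboard even when no frame matches the query, and how a hypothetical thresholded variant could instead return an empty summary that explicitly signals absence.

```python
import numpy as np

def build_storyboard(frame_scores: np.ndarray, m: int) -> np.ndarray:
    """Return indices of the m frames most similar to the query."""
    return np.argsort(frame_scores)[::-1][:m]

rng = np.random.default_rng(0)
scores = rng.uniform(0.0, 0.1, size=1000)  # uniformly low scores: no frame matches
print(build_storyboard(scores, m=4))       # still returns 4 frame indices

def build_storyboard_thresholded(frame_scores: np.ndarray, m: int, tau: float = 0.25) -> np.ndarray:
    """Dynamic variant: keep only frames whose similarity clears a threshold."""
    idx = np.argsort(frame_scores)[::-1][:m]
    return idx[frame_scores[idx] >= tau]   # may be empty, signalling absence

print(build_storyboard_thresholded(scores, m=4))  # empty: the object never appears
```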

To conduct the analysis, we apply the assumptions above (note that G = generic summary, Q = query-driven summary; V = summary with video modality, S = summary with storyboard modality, T = summary with text modality).

Error analysis results

Based on the above assumptions, the summary statistics in the following table are derived, showing percentage accuracy (GT = ground truth).

| Model Answer = GT Answer (513 data points) | User Answer = Model Answer (513 data points) | User Answer = GT Answer (540 data points) |
| --- | --- | --- |
| 63.0% | 61.8% | 46.5% |

The models demonstrated an average accuracy of 63.0%. Without any AI assistance, users demonstrated an overall accuracy of 43.3%. With assistance, the overall task accuracy of the users was 46.5%, but varied across summarisation type and modality (see Tab. II of the paper).
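
For reference, the three agreement rates above can be computed as pairwise match percentages over a long-format results table. The sketch below assumes hypothetical column names (model_answer, user_answer, gt_answer); the NaN model answers from Assumption 1 are dropped, which is why the model-related comparisons cover 513 rather than 540 data points.

```python
import pandas as pd

def agreement(df: pd.DataFrame, a: str, b: str) -> float:
    """Percentage of rows where the answers in columns a and b match, ignoring NaNs."""
    valid = df[[a, b]].dropna()
    return 100.0 * (valid[a] == valid[b]).mean()

# df: one row per (participant, question, task); model_answer is NaN when the
# model was never queried (Assumption 1), removing 27 of the 540 rows.
# agreement(df, "model_answer", "gt_answer")    # 63.0 (513 data points)
# agreement(df, "user_answer", "model_answer")  # 61.8 (513 data points)
# agreement(df, "user_answer", "gt_answer")     # 46.5 (540 data points)
```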

Examples

Limitations of current ViFMs

Without domain-specific understanding, ViFMs interpret queries in a generic way. For example, under QS, when the query is ‘spot’, the model retrieves images of spots (a small round or roundish mark), the generic meaning, whereas the query ‘spot robot’ retrieves images of the Spot robot.
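
As an illustration of this query sensitivity, below is a minimal sketch of query-driven frame retrieval using an off-the-shelf CLIP model (a stand-in chosen for this example; the paper’s ViFM and pipeline are not reproduced here). Because ranking is driven purely by general-purpose text–image similarity, the queries ‘spot’ and ‘spot robot’ can retrieve very different frames.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def top_m_frames(query: str, frames: list[Image.Image], m: int = 4) -> list[int]:
    """Rank sampled video frames by similarity to a free-form text query."""
    inputs = processor(text=[query], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        sims = model(**inputs).logits_per_text[0]  # one similarity score per frame
    return sims.topk(min(m, len(frames))).indices.tolist()

# frames = [Image.open(p) for p in frame_paths]  # frames sampled from the robot video
# top_m_frames("spot", frames)        # tends to surface round marks or stains
# top_m_frames("spot robot", frames)  # more likely to surface the Spot robot
```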


Challenges with free-form querying

The different queries users generated to retrieve information for answering the question ‘Who enters the tunnel before you?’ under QS (correct answer: 1 Spot and 1 ATR), and the corresponding results produced by the model, are shown in the following figure.

The four images produced by the model (Model Answer) under the QS task are shown below for six different user queries (Query). The correct answer is produced by the two queries highlighted in light blue: ‘all terrain robot’ and ‘tunnel entrance robot’.

  1. CSIRO Robotics, Clayton, Melbourne, Australia

  2. Monash University, Clayton, Melbourne, Australia