arxiv:2306.03802

Learning to Ground Instructional Articles in Videos through Narrations

Published on Jun 6, 2023
· Submitted by akhaliq on Jun 7, 2023
Abstract

In this paper we present an approach for localizing steps of procedural activities in narrated how-to videos. To deal with the scarcity of labeled data at scale, we source the step descriptions from a language knowledge base (wikiHow) containing instructional articles for a large variety of procedural tasks. Without any form of manual supervision, our model learns to temporally ground the steps of procedural articles in how-to videos by matching three modalities: frames, narrations, and step descriptions. Specifically, our method aligns steps to video by fusing information from two distinct pathways: i) direct alignment of step descriptions to frames, and ii) indirect alignment obtained by composing steps-to-narrations with narrations-to-video correspondences. Notably, our approach performs global temporal grounding of all steps in an article at once by exploiting order information, and is trained with step pseudo-labels which are iteratively refined and aggressively filtered. To validate our model we introduce a new evaluation benchmark, HT-Step, obtained by manually annotating a 124-hour subset of HowTo100M with steps sourced from wikiHow articles (a test server is accessible at https://eval.ai/web/challenges/challenge-page/2082). Experiments on this benchmark, as well as zero-shot evaluations on CrossTask, demonstrate that our multi-modality alignment yields dramatic gains over several baselines and prior works. Finally, we show that our inner module for matching narrations to video outperforms the state of the art on the HTM-Align narration-video alignment benchmark by a large margin.
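To make the two-pathway fusion described in the abstract concrete, here is a minimal NumPy sketch: a direct step-to-frame similarity matrix is combined with an indirect one obtained by composing step-to-narration attention with narration-to-frame correspondences. The function names, the softmax attention over narrations, and the mixing weight `alpha` are illustrative assumptions for this sketch, not the paper's actual architecture or training procedure.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_step_to_frame_alignment(steps_to_frames, steps_to_narrations,
                                 narrations_to_frames, alpha=0.5):
    """Fuse direct and indirect step-to-frame alignment scores (illustrative).

    steps_to_frames:      (S, T) similarities between step descriptions and frames
    steps_to_narrations:  (S, N) similarities between step descriptions and narrations
    narrations_to_frames: (N, T) similarities between narrations and frames
    alpha:                mixing weight for the direct pathway (hypothetical)
    """
    # Indirect pathway: let each step attend to narrations, then inherit the
    # attended narrations' correspondences to video frames.
    step_to_narration_attn = softmax(steps_to_narrations, axis=1)   # (S, N)
    indirect = step_to_narration_attn @ narrations_to_frames        # (S, T)

    # Fuse the two pathways with a simple convex combination.
    return alpha * steps_to_frames + (1.0 - alpha) * indirect

# Toy usage: 3 steps, 5 narrations, 20 frames of random similarities.
rng = np.random.default_rng(0)
fused = fuse_step_to_frame_alignment(
    rng.standard_normal((3, 20)),
    rng.standard_normal((3, 5)),
    rng.standard_normal((5, 20)),
)
print(fused.shape)  # (3, 20): one grounding score per (step, frame) pair
```

In this simplified view, each step could then be grounded by picking its highest-scoring frame span; the paper's method additionally exploits step order and iteratively refined, filtered pseudo-labels, which are not modeled here.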
