Encoding and Controlling Global Semantics for Long-form Video Question Answering

National University of Singapore (NUS), Nanyang Technological University (NTU)


TLDR: Ego-QA and MAD-QA, the first two authentically long-form videoQA datasets (average video lengths of 17.5 minutes and 1.9 hours, respectively), with impressive performance from a gated state space layer supported by cross-modal compositional congruence.

Abstract

Seeking answers effectively for long videos is essential to building video question answering (videoQA) systems. Previous methods adaptively select frames and regions from long videos to save computation. However, this prevents reasoning over the whole video sequence, leading to sub-optimal performance. To address this problem, we introduce a state space layer (SSL) into the multi-modal Transformer to efficiently integrate the global semantics of the video, which mitigates the information loss caused by the frame and region selection modules. Our SSL includes a gating unit that enables control over the flow of global semantics into the visual representations. To further enhance this controllability, we introduce a cross-modal compositional congruence (C^3) objective to encourage the global semantics to align with the question. To rigorously evaluate long-form videoQA capacity, we construct two new benchmarks, Ego-QA and MAD-QA, featuring videos of considerable length, i.e., 17.5 minutes and 1.9 hours on average, respectively. Extensive experiments demonstrate the superiority of our framework on these new benchmarks as well as on existing datasets.

Ego-QA and MAD-QA Datasets



The figure illustrates one example each from our Ego-QA and MAD-QA datasets. The question for the first video requires the model to reason over the relation chain of replacing the palladium ingot to activate the RT unit that powers the armored suit and protects the person's health. The question for the second video necessitates an understanding of the video's overall theme.


Gated State Space Layer and Compositional Congruence


We use a state space layer to encode the global semantics of long videos. The flow of global semantics is controlled by a gating unit and a compositional congruence objective, which together boost performance on 7 datasets, including our 2 new benchmarks.
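
To make the design concrete, here is a minimal PyTorch sketch of a gated state space layer together with an illustrative congruence loss. Everything below is an assumption for illustration: the names (GatedSSL, congruence_loss, d_state), the diagonal SSM parameterization, and the KL-based matching are ours, not necessarily the paper's exact formulation of the SSL or the C^3 objective.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedSSL(nn.Module):
    # Minimal gated state space layer (sketch, not the paper's exact code).
    # A diagonal SSM scans the frame tokens to accumulate global semantics;
    # a sigmoid gate controls how much of that global signal flows back
    # into each token's representation.
    def __init__(self, d_model, d_state=64):
        super().__init__()
        self.log_a = nn.Parameter(torch.randn(d_model, d_state))    # transition
        self.b = nn.Parameter(0.1 * torch.randn(d_model, d_state))  # input proj
        self.c = nn.Parameter(0.1 * torch.randn(d_model, d_state))  # readout
        self.gate = nn.Linear(d_model, d_model)                     # gating unit

    def forward(self, x):
        # x: (batch, seq_len, d_model) visual token sequence
        decay = torch.exp(-F.softplus(self.log_a))   # in (0, 1): stable scan
        h = x.new_zeros(x.shape[0], *self.b.shape)   # hidden state (B, D, N)
        outs = []
        for t in range(x.shape[1]):                  # linear-time recurrent scan
            h = decay * h + self.b * x[:, t].unsqueeze(-1)
            outs.append((h * self.c).sum(-1))        # read out global state
        y = torch.stack(outs, dim=1)                 # (B, L, D) global semantics
        g = torch.sigmoid(self.gate(x))              # per-token gate
        return x + g * y                             # gated residual injection

def congruence_loss(vis, txt):
    # Illustrative cross-modal congruence objective: push the video's
    # affinity structure, composed through question-to-video attention,
    # to agree with the question's own affinity structure.
    # vis: (B, Lv, D) video tokens; txt: (B, Lt, D) question tokens.
    scale = vis.shape[-1] ** 0.5
    attn = torch.softmax(txt @ vis.transpose(-1, -2) / scale, dim=-1)
    vis_aff = torch.softmax(vis @ vis.transpose(-1, -2) / scale, dim=-1)
    txt_aff = torch.softmax(txt @ txt.transpose(-1, -2) / scale, dim=-1)
    composed = attn @ vis_aff @ attn.transpose(-1, -2)   # (B, Lt, Lt)
    composed = composed / composed.sum(-1, keepdim=True).clamp_min(1e-8)
    return F.kl_div(composed.clamp_min(1e-8).log(), txt_aff,
                    reduction='batchmean')

In a full model, such a layer would sit inside the multi-modal Transformer blocks, and the congruence term would be added to the QA loss with a weighting coefficient.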

Experimental Results

  • The results demonstrate that our approach outperforms existing methods.
  • As shown in the example, our model chooses the correct answer for questions that require long-narration understanding, whereas the previous state-of-the-art MIST-CLIP struggles with such questions.
  • Our gated SSL is more efficient than the attention operation, and is as efficient as convolution while being much more effective; a kernel-view sketch of why it matches convolution's cost follows this list.
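
On that efficiency point: a diagonal SSM like the one in the GatedSSL sketch above can be unrolled into an explicit causal convolution kernel, so its sequence mixing carries convolution-level cost (and can be computed in O(L log L) via FFT) instead of attention's O(L^2). The helper below is a hypothetical illustration of this standard SSM identity, under the same parameterization as the sketch:

import torch
import torch.nn.functional as F

def ssm_kernel(log_a, b, c, L):
    # K[t, d] = sum_n c[d, n] * decay[d, n]**t * b[d, n]; convolving the
    # input with K per channel reproduces the recurrent scan in
    # GatedSSL.forward, which is why the layer matches convolution's cost.
    decay = torch.exp(-F.softplus(log_a))                 # (D, N)
    t = torch.arange(L, dtype=decay.dtype).view(L, 1, 1)  # time steps
    return (c * b * decay.unsqueeze(0) ** t).sum(-1)      # (L, D)

# e.g., K = ssm_kernel(ssl.log_a, ssl.b, ssl.c, L) for a GatedSSL `ssl`.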

BibTeX


      @inproceedings{nguyen2024encoding,
        author    = {Nguyen, Thong Thanh and Hu, Zhiyuan and Wu, Xiaobao and Nguyen, Cong-Duy T and Ng, See-Kiong and Luu, Anh Tuan},
        title     = {Encoding and Controlling Global Semantics for Long-form Video Question Answering},
        booktitle = {EMNLP},
        year      = {2024},
      }