READ: Recurrent Adapter with Partial Video-Language Alignment for Parameter-Efficient Transfer Learning in Low-Resource Video-Language Modeling

National University of Singapore (NUS), Nanyang Technological University (NTU), Carnegie Mellon University (CMU), VinAI Research


TLDR: Recurrent low-rank adaptation layers, supported by a partial-optimal-transport alignment objective between video and language representations, for parameter-efficient transfer learning with limited training data.

Abstract

Fully fine-tuning pretrained large-scale transformer models has become a popular paradigm for video-language modeling tasks, such as temporal language grounding and video-language summarization. With a growing number of tasks and limited training data, such a full fine-tuning approach leads to costly model storage and unstable training. To overcome these shortcomings, we introduce lightweight adapters into the pretrained model and update only them at fine-tuning time. However, existing adapters fail to capture intrinsic temporal relations among video frames or textual words. Moreover, they neglect the preservation of critical task-related information that flows from the raw video-language input into the adapter's low-dimensional space. To address these issues, we first propose a novel REcurrent ADapter (READ) that employs recurrent computation to enable temporal modeling. Second, we propose a Partial Video-Language Alignment (PVLA) objective, based on partial optimal transport, to maintain the task-related information flowing into our READ modules. We validate our READ framework through extensive experiments, where READ significantly outperforms all existing fine-tuning strategies on multiple low-resource temporal language grounding and video-language summarization benchmarks.

Introduction


  • Previous works fine-tune lightweight adaptation layers (adapters) to mitigate the high computational cost of full fine-tuning and to reduce overfitting, since video-language data are expensive to obtain.
  • However, existing adapters pay little attention to temporal modeling across video frames.
  • Moreover, the relationship between the video and language modalities is neglected, even though such a relationship is necessary to ground entities and understand the language context.

Methodology

  • We propose the REcurrent ADapter (READ), which equips low-rank bottleneck adapters with recurrent computation so they can capture temporal information; a minimal sketch follows this list.
  • We integrate a Partial Video-Language Alignment (PVLA) objective, based on partial optimal transport, to encourage alignment between the video and language modalities; a second sketch appears below.
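Below is a minimal PyTorch sketch of a READ-style module: a standard bottleneck adapter whose low-rank space is processed by a recurrent cell before the up-projection and residual connection. The GRU cell, dimensions, and ReLU activation are illustrative assumptions, not necessarily the paper's exact configuration.

    import torch
    import torch.nn as nn

    class RecurrentAdapter(nn.Module):
        """Bottleneck adapter with a recurrent cell in the low-rank space.

        Sketch only: the GRU, dimensions, and ReLU are assumptions for
        illustration, not the authors' exact READ configuration.
        """

        def __init__(self, hidden_dim: int = 768, bottleneck_dim: int = 64):
            super().__init__()
            self.down = nn.Linear(hidden_dim, bottleneck_dim)  # low-rank down-projection
            self.rnn = nn.GRU(bottleneck_dim, bottleneck_dim, batch_first=True)  # temporal recurrence
            self.up = nn.Linear(bottleneck_dim, hidden_dim)  # low-rank up-projection

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, seq_len, hidden_dim) features from a frozen backbone layer
            z = torch.relu(self.down(x))  # project into the bottleneck
            z, _ = self.rnn(z)  # recur over the temporal axis (frames or words)
            return x + self.up(z)  # residual keeps the frozen backbone signal

    if __name__ == "__main__":
        adapter = RecurrentAdapter()
        frames = torch.randn(2, 16, 768)  # 2 clips, 16 frame features each
        print(adapter(frames).shape)  # torch.Size([2, 16, 768])

Because only the two projections and the small recurrent cell are trained, such an adapter adds far fewer trainable parameters than full fine-tuning while still modeling frame-to-frame dependencies.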
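The second sketch below illustrates a partial-optimal-transport alignment loss in the spirit of PVLA. It uses the standard dummy-point construction for partial OT (a dummy row and column absorb the unmatched fraction of mass), solved with entropy-regularized Sinkhorn iterations; the cosine cost, transported mass, and regularization strength are illustrative assumptions, not necessarily the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def partial_ot_alignment(video_feats: torch.Tensor,
                             text_feats: torch.Tensor,
                             mass: float = 0.8,
                             eps: float = 0.1,
                             iters: int = 50) -> torch.Tensor:
        """Partial-OT alignment loss (sketch; all constants are illustrative)."""
        # Cost = 1 - cosine similarity between L2-normalized token features.
        v = F.normalize(video_feats, dim=-1)  # (n, d) video tokens
        t = F.normalize(text_feats, dim=-1)   # (m, d) text tokens
        C = 1.0 - v @ t.T                     # (n, m), values in [0, 2]

        n, m = C.shape
        # Pad with a zero-cost dummy row/column to allow partial matching.
        C_pad = torch.zeros(n + 1, m + 1, device=C.device)
        C_pad[:n, :m] = C
        C_pad[n, m] = 2.0 * C.detach().max() + 1.0  # forbid dummy-to-dummy transport

        a = torch.full((n + 1,), 1.0 / n, device=C.device)
        a[-1] = 1.0 - mass  # unmatched video mass parks on the dummy point
        b = torch.full((m + 1,), 1.0 / m, device=C.device)
        b[-1] = 1.0 - mass  # unmatched text mass parks on the dummy point

        # Entropy-regularized Sinkhorn iterations for the scaling vectors.
        K = torch.exp(-C_pad / eps)
        u = torch.ones_like(a)
        for _ in range(iters):
            u = a / (K @ (b / (K.T @ u)))
        w = b / (K.T @ u)
        P = u[:, None] * K * w[None, :]  # transport plan, shape (n + 1, m + 1)

        # Loss: transport cost restricted to real (non-dummy) pairs.
        return (P[:n, :m] * C).sum()

In a full pipeline, such an alignment loss would be added to the task loss so that task-related video-language correspondence is preserved while only the adapter parameters are updated and the backbone stays frozen.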

Experimental Results

  • We evaluate our method on two prominent video-language tasks: temporal language grounding and video-language summarization.
  • Our framework outperforms other fine-tuning approaches while training significantly fewer parameters.

BibTeX


      @inproceedings{nguyen2024read,
        author    = {Nguyen, Thong and Wu, Xiaobao and Dong, Xinshuai and Le, Khoi M. and Hu, Zhiyuan and Nguyen, Cong-Duy and Ng, See-Kiong and Luu, Anh Tuan},
        title     = {READ: Recurrent Adapter with Partial Video-Language Alignment for Parameter-Efficient Transfer Learning in Low-Resource Video-Language Modeling},
        booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
        year      = {2024},
      }