READ: Recurrent Adapter with Partial Video-Language Alignment for Parameter-Efficient Transfer Learning in Low-Resource Video-Language Modeling

National University of Singapore (NUS), Nanyang Technological University (NTU), Carnegie Mellon University (CMU), VinAI Research


TLDR: Recurrent low-rank adaptation layers, supported by a partial-optimal-transport alignment objective between video and language representations, for parameter-efficient transfer learning with limited training data.

Abstract

Fully fine-tuning pretrained large-scale transformer models has become a popular paradigm for video-language modeling tasks, such as temporal language grounding and video-language summarization. With a growing number of tasks and limited training data, such a full fine-tuning approach leads to costly model storage and unstable training. To overcome these shortcomings, we introduce lightweight adapters into the pretrained model and update only them at fine-tuning time. However, existing adapters fail to capture intrinsic temporal relations among video frames or textual words. Moreover, they neglect the preservation of critical task-related information that flows from the raw video-language input into the adapter's low-dimensional space. To address these issues, we first propose a novel REcurrent ADapter (READ) that employs recurrent computation to enable temporal modeling. Second, we propose a Partial Video-Language Alignment (PVLA) objective, based on partial optimal transport, to maintain the task-related information flowing into our READ modules. We validate our READ framework through extensive experiments, where READ significantly outperforms all existing fine-tuning strategies on multiple low-resource temporal language grounding and video-language summarization benchmarks.

Introduction


  • Previous works fine-tune lightweight adaptation layers (adapters) to mitigate the high computational cost of full fine-tuning and to reduce overfitting, since video-language data are expensive to obtain.
  • However, existing adapters pay little attention to temporal modeling across video frames.
  • Moreover, the relationship between the video and language modalities is neglected, even though such a relationship is necessary to ground entities and understand the language context.

Methodology

  • We propose the REcurrent ADapter (READ), which equips low-rank bottleneck adapters with recurrent computation so they can capture temporal information; a minimal sketch follows this list.
  • We integrate a Partial Video-Language Alignment (PVLA) objective, based on partial optimal transport, to encourage alignment between the video and language modalities; a second sketch appears below.
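Below is a minimal PyTorch sketch of a READ-style module: a standard bottleneck adapter whose low-rank space is processed by a recurrent cell before the up-projection and residual connection. The GRU cell, dimensions, and ReLU activation are illustrative assumptions, not necessarily the paper's exact configuration.

    import torch
    import torch.nn as nn

    class RecurrentAdapter(nn.Module):
        """Bottleneck adapter with a recurrent cell in the low-rank space.

        Sketch only: the GRU, dimensions, and ReLU are assumptions for
        illustration, not the authors' exact READ configuration.
        """

        def __init__(self, hidden_dim: int = 768, bottleneck_dim: int = 64):
            super().__init__()
            self.down = nn.Linear(hidden_dim, bottleneck_dim)  # low-rank down-projection
            self.rnn = nn.GRU(bottleneck_dim, bottleneck_dim, batch_first=True)  # temporal recurrence
            self.up = nn.Linear(bottleneck_dim, hidden_dim)  # low-rank up-projection

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, seq_len, hidden_dim) features from a frozen backbone layer
            z = torch.relu(self.down(x))  # project into the bottleneck
            z, _ = self.rnn(z)  # recur over the temporal axis (frames or words)
            return x + self.up(z)  # residual keeps the frozen backbone signal

    if __name__ == "__main__":
        adapter = RecurrentAdapter()
        frames = torch.randn(2, 16, 768)  # 2 clips, 16 frame features each
        print(adapter(frames).shape)  # torch.Size([2, 16, 768])

Because only the two projections and the small recurrent cell are trained, such an adapter adds far fewer trainable parameters than full fine-tuning while still modeling frame-to-frame dependencies.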
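The second sketch below illustrates a partial-optimal-transport alignment loss in the spirit of PVLA. It uses the standard dummy-point construction for partial OT (a dummy row and column absorb the unmatched fraction of mass), solved with entropy-regularized Sinkhorn iterations; the cosine cost, transported mass, and regularization strength are illustrative assumptions, not necessarily the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def partial_ot_alignment(video_feats: torch.Tensor,
                             text_feats: torch.Tensor,
                             mass: float = 0.8,
                             eps: float = 0.1,
                             iters: int = 50) -> torch.Tensor:
        """Partial-OT alignment loss (sketch; all constants are illustrative)."""
        # Cost = 1 - cosine similarity between L2-normalized token features.
        v = F.normalize(video_feats, dim=-1)  # (n, d) video tokens
        t = F.normalize(text_feats, dim=-1)   # (m, d) text tokens
        C = 1.0 - v @ t.T                     # (n, m), values in [0, 2]

        n, m = C.shape
        # Pad with a zero-cost dummy row/column to allow partial matching.
        C_pad = torch.zeros(n + 1, m + 1, device=C.device)
        C_pad[:n, :m] = C
        C_pad[n, m] = 2.0 * C.detach().max() + 1.0  # forbid dummy-to-dummy transport

        a = torch.full((n + 1,), 1.0 / n, device=C.device)
        a[-1] = 1.0 - mass  # unmatched video mass parks on the dummy point
        b = torch.full((m + 1,), 1.0 / m, device=C.device)
        b[-1] = 1.0 - mass  # unmatched text mass parks on the dummy point

        # Entropy-regularized Sinkhorn iterations for the scaling vectors.
        K = torch.exp(-C_pad / eps)
        u = torch.ones_like(a)
        for _ in range(iters):
            u = a / (K @ (b / (K.T @ u)))
        w = b / (K.T @ u)
        P = u[:, None] * K * w[None, :]  # transport plan, shape (n + 1, m + 1)

        # Loss: transport cost restricted to real (non-dummy) pairs.
        return (P[:n, :m] * C).sum()

In a full pipeline, such an alignment loss would be added to the task loss so that task-related video-language correspondence is preserved while only the adapter parameters are updated and the backbone stays frozen.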

Experimental Results

  • We evaluate our method on two prominent video-language tasks: temporal language grounding and video-language summarization.
  • Our framework outperforms other fine-tuning approaches while training significantly fewer parameters.

BibTeX


      @inproceedings{nguyen2024read,
        author    = {Nguyen, Thong and Wu, Xiaobao and Dong, Xinshuai and Le, Khoi M. and Hu, Zhiyuan and Nguyen, Cong-Duy and Ng, See-Kiong and Luu, Anh Tuan},
        title     = {READ: Recurrent Adapter with Partial Video-Language Alignment for Parameter-Efficient Transfer Learning in Low-Resource Video-Language Modeling},
        booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
        year      = {2024},
      }