MAMA: A Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning

1National University of Singapore (NUS), 2Nanyang Technological University (NTU), 3Carnegie Mellon University (CMU), 4VinAI Research

Overview


Data quality is central to the effectiveness of video-language representation learning. However, video-text pairs in previous datasets are typically not perfectly aligned with each other, which can lead to video-language representations that do not accurately reflect cross-modal semantics. Moreover, previous datasets also possess an uneven distribution of concepts, thereby hampering downstream performance on less popular subjects. To address these problems, we propose a contrastive objective with a subtractive angular margin that regularizes cross-modal representations against being pushed toward perfect similarity. Furthermore, to adapt to the non-uniform concept distribution, we propose a multi-layer perceptron (MLP)-parameterized weighting function that maps loss values to sample weights, enabling dynamic adjustment of the model's focus throughout training. With training guided by a small amount of unbiased meta-data and augmented by video-text data generated by a large vision-language model, we improve video-language representations and achieve superior performance on commonly used video question answering and text-video retrieval datasets.
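As a rough illustration of the first component, the sketch below implements a symmetric InfoNCE-style contrastive loss in which a margin is subtracted from the angle of each matched video-text pair, so positives saturate before reaching perfect similarity. This is a minimal numpy sketch under our own assumptions (batch-wise negatives, a fixed temperature, margin applied only on the diagonal); it is not the paper's exact formulation.

```python
import numpy as np

def angular_margin_contrastive(v, t, margin=0.1, temperature=0.05):
    """InfoNCE-style loss with a subtractive angular margin on positives.

    Sketch assumption: the margin m is subtracted from the angle theta of
    each matched (diagonal) pair, replacing cos(theta) with cos(theta - m),
    so positive pairs need not reach perfect similarity to saturate the loss.
    """
    # L2-normalize both modalities so dot products are cosine similarities
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    sim = v @ t.T                                   # (B, B) cosine similarities
    theta = np.arccos(np.clip(sim, -1.0, 1.0))      # angles between pairs
    # relaxed positive similarities: cos(theta - m), angle clipped to [0, pi]
    pos = np.cos(np.clip(np.diag(theta) - margin, 0.0, np.pi))
    logits = sim / temperature
    np.fill_diagonal(logits, pos / temperature)

    def ce(lgs):
        # numerically stable cross-entropy with the diagonal as targets
        lgs = lgs - lgs.max(axis=1, keepdims=True)
        logp = lgs - np.log(np.exp(lgs).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # symmetric over video->text and text->video directions
    return 0.5 * (ce(logits) + ce(logits.T))
```

Because the relaxed positive term cos(theta - m) is larger than cos(theta), the loss with a positive margin is smaller than the plain contrastive loss on the same batch, which is the intended relaxation.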

MAMA's Examples



Video

MAMA
The video captures the driver's actions and the car's movement in a dynamic and engaging manner, providing a comprehensive view of the driving experience.
The video shows a person preparing a meal in multiple steps, with various ingredients and utensils being used.
The video shows a step-by-step process of a car being built, starting with the initial design and ending with the final product.

Video

MAMA
The video captures a soccer game in progress, showing multiple players on the field, with the focus on the goalie and the soccer ball.
The video captures a basketball game in progress, with multiple players on the court and a crowd of spectators watching the game.
The video shows a person painting a room, with multiple shots of the process, including the initial preparation, the painting itself, and the final result.

Main Results



(a) State-of-the-art results on popular VideoQA and text-to-video retrieval (TVR) tasks.

(b) Our MAMA framework can adaptively assign weights to the loss values of training samples.

(c) The less popular the topic of a training sample is, the more improvement MAMA obtains.
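The adaptive weighting in (b) can be pictured with a small sketch: a tiny MLP maps each sample's loss value to a weight in (0, 1), and the weights modulate the per-sample losses. In the paper this network is meta-optimized on a small unbiased meta set; here its parameters are just random, and the class name and sizes are our own hypothetical choices.

```python
import numpy as np

class LossWeightMLP:
    """A tiny MLP mapping a per-sample loss value to a sample weight.

    Hypothetical sketch: one hidden ReLU layer and a sigmoid output so
    weights stay in (0, 1). In the actual framework the parameters are
    meta-learned on unbiased meta-data; here they are randomly initialized.
    """
    def __init__(self, hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(scale=0.5, size=(1, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(scale=0.5, size=(hidden, 1))
        self.b2 = np.zeros(1)

    def __call__(self, losses):
        # losses: shape (B,) per-sample loss values
        h = np.maximum(losses[:, None] @ self.w1 + self.b1, 0.0)  # ReLU
        z = h @ self.w2 + self.b2
        return 1.0 / (1.0 + np.exp(-z[:, 0]))                     # sigmoid

weighter = LossWeightMLP()
losses = np.array([0.1, 1.0, 5.0])
weights = weighter(losses)                  # one weight in (0, 1) per sample
weighted_loss = np.mean(weights * losses)   # weights rescale each sample's loss
```

Training the weighting network against held-out meta-data (rather than hand-designing a curriculum) is what lets the model up-weight underrepresented concepts, matching the trend shown in (c).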

BibTeX


      @inproceedings{nguyen2024meta,
        author    = {Nguyen, Thong and Bin, Yi and Wu, Xiaobao and Dong, Xinshuai and Hu, Zhiyuan and Le, Khoi and Nguyen, Cong-Duy and Ng, See-Kiong and Tuan, Luu Anh},
        title     = {Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning},
        booktitle = {ECCV},
        year      = {2024},
      }