FRAMES-VQA: Benchmarking Fine-Tuning Robustness across Multi-Modal Shifts in Visual Question Answering
Published in CVPR, 2025
We introduce FRAMES-VQA, a benchmark designed to evaluate robust fine-tuning strategies for visual question answering (VQA) under diverse multi-modal distribution shifts. By leveraging ten existing VQA datasets categorized into in-distribution (ID), near out-of-distribution (OOD), and far-OOD scenarios, we systematically analyze the impact of uni-modal, multi-modal, and adversarial shifts. Our study compares existing robust fine-tuning methods, quantifies distribution shifts using the Mahalanobis distance, and explores the interactions between uni- and multi-modal shifts, providing valuable insights for developing more robust VQA models.
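The Mahalanobis-based shift quantification mentioned above can be sketched as follows. This is a minimal illustration on synthetic features, not the benchmark's actual pipeline: the function name, feature dimensions, and random data are assumptions, whereas the benchmark would compute such distances on real model embeddings.

```python
import numpy as np

def mahalanobis_distance(x, ref):
    """Mahalanobis distance of feature vector x from a reference set
    ref of shape (n_samples, d), e.g. ID embeddings."""
    mu = ref.mean(axis=0)
    cov = np.cov(ref, rowvar=False)
    cov_inv = np.linalg.pinv(cov)  # pseudo-inverse for numerical stability
    diff = x - mu
    return float(np.sqrt(diff @ cov_inv @ diff))

# Toy example: compare a mildly vs. strongly shifted sample
# against a synthetic "in-distribution" feature set.
rng = np.random.default_rng(0)
id_feats = rng.normal(0.0, 1.0, size=(500, 8))        # stand-in for ID embeddings
near = mahalanobis_distance(rng.normal(0.0, 1.0, size=8), id_feats)
far = mahalanobis_distance(rng.normal(5.0, 1.0, size=8), id_feats)
print(near, far)  # a larger distribution shift yields a larger distance
```

Under this toy setup, the far-shifted sample receives a markedly larger distance than the near one, mirroring how ID, near-OOD, and far-OOD datasets can be ordered by their distance from the ID feature distribution.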
Download here