Publications

Mimicking or Reasoning: Rethinking Multi-Modal In-Context Learning in Vision-Language Models

Published in ICCV-W, 2025

Vision-language models (VLMs) are widely assumed to exhibit in-context learning (ICL), a property similar to that of their language-only counterparts. While recent work suggests VLMs can perform multimodal ICL (MM-ICL), studies show that they often rely on shallow heuristics, such as copying or majority voting, rather than true task understanding. We revisit this assumption by evaluating VLMs under distribution shifts, where support examples are drawn from a different dataset than the query. Surprisingly, performance often degrades as more demonstrations are added, and models tend to copy answers rather than learn from them. To investigate further, we propose a new MM-ICL with Reasoning pipeline that augments each demonstration with a generated rationale alongside the answer. We conduct extensive experiments on both perception- and reasoning-intensive datasets, using open-source VLMs from 3B to 72B parameters and proprietary models such as Gemini 2.0, with controlled studies varying shot count, retrieval method, rationale quality, and distribution shift. Our results show limited performance sensitivity across these factors, suggesting that current VLMs do not effectively utilize demonstration-level information as intended in MM-ICL.
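
As a concrete illustration of the rationale-augmented demonstration format described above, here is a minimal sketch of how such a prompt could be assembled. The `build_mm_icl_prompt` helper and the `<image_k>` placeholder tokens are hypothetical, and the paper's actual templates may differ.

```python
# Hedged sketch: assembling a rationale-augmented MM-ICL prompt. Each
# demonstration carries an image placeholder, a question, a generated
# rationale, and the answer; the query ends with "Rationale:" so the
# model is prompted to reason before answering.

def build_mm_icl_prompt(demos, query):
    """demos: list of dicts with keys image_token, question, rationale, answer.
    query: dict with keys image_token and question."""
    parts = []
    for d in demos:
        parts.append(
            f"{d['image_token']}\n"
            f"Question: {d['question']}\n"
            f"Rationale: {d['rationale']}\n"
            f"Answer: {d['answer']}\n"
        )
    parts.append(
        f"{query['image_token']}\n"
        f"Question: {query['question']}\n"
        f"Rationale:"
    )
    return "\n".join(parts)

demos = [{
    "image_token": "<image_1>",
    "question": "What color is the bus?",
    "rationale": "The bus fills most of the frame and its body panels are red.",
    "answer": "red",
}]
query = {"image_token": "<image_2>", "question": "How many dogs are visible?"}
print(build_mm_icl_prompt(demos, query))
```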

Download here

FRAMES-VQA: Benchmarking Fine-Tuning Robustness across Multi-Modal Shifts in Visual Question Answering

Published in CVPR, 2025

We introduce FRAMES-VQA, a benchmark for evaluating robust fine-tuning strategies for visual question answering (VQA) under diverse multi-modal distribution shifts. Leveraging ten existing VQA datasets categorized into in-distribution (ID), near out-of-distribution (OOD), and far-OOD scenarios, we systematically analyze the impact of uni-modal, multi-modal, and adversarial shifts. Our study compares existing robust fine-tuning methods, quantifies distribution shifts using the Mahalanobis distance, and explores the interactions between uni- and multi-modal shifts, providing insights for developing more robust VQA models.
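
To make the distance-based shift measurement concrete, the sketch below scores candidate datasets against ID embedding statistics with the Mahalanobis distance. The feature matrices are synthetic placeholders, and the benchmark's actual feature extractor and protocol are not reproduced here.

```python
import numpy as np

def mean_mahalanobis(id_feats, ood_feats, eps=1e-6):
    """id_feats: (n, d) ID embeddings; ood_feats: (m, d) embeddings to score."""
    mu = id_feats.mean(axis=0)
    # Regularize the covariance so it stays invertible for small samples.
    cov = np.cov(id_feats, rowvar=False) + eps * np.eye(id_feats.shape[1])
    cov_inv = np.linalg.inv(cov)
    diffs = ood_feats - mu                                 # (m, d)
    d2 = np.einsum("md,de,me->m", diffs, cov_inv, diffs)   # squared distances
    return float(np.sqrt(np.maximum(d2, 0)).mean())

rng = np.random.default_rng(0)
id_feats = rng.normal(size=(1000, 16))
near_ood = rng.normal(loc=0.5, size=(200, 16))
far_ood = rng.normal(loc=3.0, size=(200, 16))
print(mean_mahalanobis(id_feats, near_ood))  # smaller: close to ID
print(mean_mahalanobis(id_feats, far_ood))   # larger: far from ID
```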

Download here

Directional Gradient Projection for Robust Fine-tuning of Foundation Models

Published in ICLR, 2025

This paper introduces Directional Gradient Projection (DiGraP), a novel layer-wise trainable method that incorporates directional information from gradients to bridge regularization and multi-objective optimization. Beyond demonstrating the method on image classification, we also generalize robust fine-tuning evaluation to multi-modal settings.
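
As a minimal sketch of one plausible reading of the directional idea (not the official DiGraP update rule): per layer, the task gradient is compared with the proximal direction w - w0, and the component that conflicts with it is projected out. The `project_gradient_` helper below is an assumption for illustration.

```python
import torch
import torch.nn as nn

def project_gradient_(param, pretrained, eps=1e-12):
    """Hypothetical layer-wise projection: r = w - w0 is the gradient of the
    proximal regularizer 0.5 * ||w - w0||^2; if the task gradient g conflicts
    with it, remove the conflicting component so the update does not increase
    drift from the pre-trained weights."""
    g, r = param.grad, param.data - pretrained
    inner = torch.sum(g * r)
    if inner < 0:  # the task update would push the weights further from w0
        param.grad = g - inner / (r.norm() ** 2 + eps) * r

model = nn.Linear(4, 2)
w0 = {n: p.detach().clone() for n, p in model.named_parameters()}
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(8, 4), torch.randn(8, 2)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
for n, p in model.named_parameters():
    if p.grad is not None:
        project_gradient_(p, w0[n])
opt.step()
```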

Download here

Rethinking Weight Decay for Robust Fine-Tuning of Foundation Models

Published in NeurIPS, 2024

This paper introduces Selective Projection Decay (SPD), a weight decay technique that selectively regularizes certain layers to balance fitting the downstream task with retaining pre-trained knowledge, improving generalization and robustness when fine-tuning foundation models.
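
Below is a minimal sketch of selective decay toward the pre-trained weights. The layer-selection rule shown, decaying a layer only when its current update would push it further from the pre-trained values, is an illustrative assumption; SPD's actual criterion may differ.

```python
import torch
import torch.nn as nn

def selective_decay_(model, pretrained, lam=0.1):
    """Apply decay toward w0 only on layers selected by a per-layer condition
    (here: the current gradient step would increase drift from w0)."""
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        drift = p.data - pretrained[name]
        if torch.sum(p.grad * drift) < 0:       # update would increase drift
            p.grad = p.grad + lam * drift       # decay this layer toward w0

model = nn.Linear(4, 2)
w0 = {n: p.detach().clone() for n, p in model.named_parameters()}
opt = torch.optim.SGD(model.parameters(), lr=0.05)
x, y = torch.randn(16, 4), torch.randn(16, 2)
for _ in range(3):
    opt.zero_grad()
    nn.functional.mse_loss(model(x), y).backward()
    selective_decay_(model, w0)
    opt.step()
```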

Download here

From Local to Global: Spectral-Inspired Graph Neural Networks

Published in NeurIPS GLFrontiers Workshop, 2022

This paper mitigates over-smoothing and over-squashing in deep GNNs by proposing PowerEmbed, a normalization technique for message passing that encodes global spectral information, inspired by spectral embeddings.
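
A minimal sketch of the idea under a normalized power-iteration reading of the summary above: repeated propagation interleaved with normalization drives deep layers toward the leading eigenvectors of the propagation operator (global spectral information) instead of over-smoothing toward a constant vector. The `power_embed` function and its column-wise normalization are assumptions, not the paper's exact formulation.

```python
import numpy as np

def power_embed(adj, x, num_layers=8, eps=1e-12):
    """adj: (n, n) adjacency matrix; x: (n, d) input features.
    Returns the list of embeddings from local (k=0) to global (k=K)."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, eps))
    a_norm = d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]  # D^-1/2 A D^-1/2
    embeddings = [x]
    for _ in range(num_layers):
        x = a_norm @ x
        # Column-wise normalization, as in simultaneous power iteration,
        # keeps the signal from collapsing as depth grows.
        x = x / (np.linalg.norm(x, axis=0, keepdims=True) + eps)
        embeddings.append(x)
    return embeddings

rng = np.random.default_rng(0)
adj = (rng.random((20, 20)) < 0.2).astype(float)
adj = np.maximum(adj, adj.T)          # make the random graph undirected
embs = power_embed(adj, rng.normal(size=(20, 4)))
print(len(embs), embs[-1].shape)      # 9 (20, 4)
```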

Download here