Sara Ghazanfari

I’m a Ph.D. candidate at New York University in the EnSuRe Research Group, where I’ve been a research assistant since January 2023. I’m pleased to be co-advised by Siddharth Garg and Farshad Khorrami. My research focuses on advancing the visual capabilities of multimodal large language models (MLLMs). In summer 2025, I joined Adobe as a research intern to continue this line of work on enhancing multimodal LLMs.

Email / Google Scholar / GitHub / Twitter / LinkedIn / CV

Research Overview

My research focuses on advancing the visual capabilities of multimodal large language models (MLLMs), with a particular emphasis on enhancing their spatial reasoning and perceptual abilities. In my latest work, Chain-of-Frames (CoF), we propose a novel framework for video LLMs that grounds reasoning in explicit frame references, improving interpretability and performance on complex video question-answering tasks. Before that, UniSim introduced a unified benchmark and models for multimodal perceptual similarity tasks, uncovering key insights into the generalization challenges faced by current SOTA perceptual metrics. Earlier, EMMA presented an efficient modality adaptation module that aligns visual and textual representations, boosting cross-modal performance and robustness in MLLMs with minimal computational overhead.

News

July-2025: Served as a reviewer for NeurIPS 2025.

June-2025: Chain-of-Frames paper now on arXiv, with code and models on GitHub.

May-2025: Joined Adobe as a Research Intern to advance research on unified multimodal models.

May-2025: One paper accepted to TMLR 2025.

April-2025: One paper accepted to CVPR Workshop 2025.

Jan-2025: Served as a reviewer for CVPR 2025.

Jan-2024: One paper accepted to ICLR 2024.

June-2023: One paper accepted to ICML Workshop 2023.

Publications

	Chain-of-Frames: Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning S. Ghazanfari, F. Croce, N. Flammarion, P. Krishnamurthy, F. Khorrami, and S. Garg *Submitted to NeurIPS 2025* PDF / arXiv / code We propose chain-of-frames (CoF) to obtain video LLMs whose reasoning steps are grounded in, and explicitly refer to, the relevant frames. We first create a large dataset of diverse questions, answers, and reasoning traces with references to frame IDs from both natural and synthetic videos. Then, we fine-tune existing video LLMs on this chain-of-frames data (CoF-Data). Our approach is simple and self-contained, and, unlike existing approaches for video CoT, does not require auxiliary networks or complex inference pipelines. Our CoF-InternVL2.5-4B and CoF-InternVL3-8B models, based on CoF, outperform the baselines across several benchmarks (right figure above). Moreover, they generate interpretable reasoning traces that accurately refer to the key frames to answer the given question.
	Towards Unified Benchmark and Models for Multi-Modal Perceptual Metrics S. Ghazanfari, S. Garg, N. Flammarion, P. Krishnamurthy, F. Khorrami, and F. Croce *CVPRW 2025* PDF / arXiv / code In this work, we propose UniSim-Bench, the first benchmark to track the progress of perceptual similarity metrics across uni- and multimodal tasks. We identify the limitations of current specialized perceptual metrics in generalizing to unseen datasets and perceptual tasks. We propose UniSim, a set of multi-task perceptual models that are a first step towards general-purpose perceptual metrics. Together, UniSim-Bench and UniSim lay the groundwork for understanding the challenges of learning automated metrics that broadly mimic human perceptual similarity, beyond narrow, task-specific applications.
	EMMA: Efficient Visual Alignment in Multi-Modal LLMs S. Ghazanfari, A. Araujo, P. Krishnamurthy, S. Garg and F. Khorrami *TMLR 2025* PDF / arXiv / code In this paper, we propose EMMA (Efficient Multi-Modal Adaptation), a lightweight cross-modality module designed to efficiently fuse visual and textual encodings, generating instruction-aware visual representations for the language model. Our key contributions include: (1) an efficient early fusion mechanism that integrates vision and language representations with minimal added parameters (less than 0.2% increase in model size); (2) an in-depth interpretability analysis that sheds light on the internal mechanisms of the proposed method; and (3) comprehensive experiments that demonstrate notable improvements on both specialized and general benchmarks for MLLMs. Empirical results show that EMMA boosts performance across multiple tasks by up to 9.3% while significantly improving robustness against hallucinations.
	LipSim: A Provably Robust Perceptual Similarity Metric S. Ghazanfari, A. Araujo, P. Krishnamurthy, F. Khorrami and S. Garg *ICLR 2024* PDF / arXiv / code In this work, we demonstrate the vulnerability of the SOTA perceptual similarity metric, which is based on an ensemble of ViT-based feature extractors, to adversarial attacks. We then propose a framework to train a robust perceptual similarity metric called LipSim (Lipschitz Similarity Metric) with provable guarantees, leveraging 1-Lipschitz neural networks as the backbone and a knowledge distillation approach to distill the knowledge of the SOTA models. Finally, a comprehensive set of experiments shows the performance of LipSim in terms of natural and certified scores and on the image retrieval application.
	R-LPIPS: An Adversarially Robust Perceptual Similarity Metric S. Ghazanfari, S. Garg, P. Krishnamurthy, F. Khorrami and A. Araujo *ICML Workshop 2023* PDF / arXiv / code In this work, we show that the LPIPS metric is sensitive to adversarial perturbations and propose the use of Adversarial Training to build a new Robust Learned Perceptual Image Patch Similarity (R-LPIPS) that leverages adversarially trained deep features. Based on an adversarial evaluation, we demonstrate the robustness of R-LPIPS to adversarial examples compared to the LPIPS metric. Finally, we show that the perceptual defense achieved over LPIPS metrics could easily be broken by stronger attacks developed based on R-LPIPS.