Sara Ghazanfari
I am currently a Ph.D. candidate at New York University in the
EnSuRe Research Group.
I'm pleased to be co-advised by
Siddharth Garg and
Farshad Khorrami. My current research is focused on building robust and scalable multimodal
perception systems that can operate in the real world. In my latest work,
UniSim, we propose
a unified benchmark and models for multimodal perceptual similarity tasks,
revealing key insights into generalization challenges of current SOTA perceptual metrics.
My earlier work, EMMA,
introduces a highly efficient modality adaptation module that
aligns visual and textual representations, improving cross-modal performance and robustness
in Multi-Modal Large Language Models (MLLMs) while adding minimal computational overhead. An earlier work,
LipSim,
introduces a provably robust perceptual metric
by leveraging a 1-Lipschitz network as the backbone and distilling knowledge from the SOTA perceptual metric.
Email / Google Scholar / GitHub / Twitter / LinkedIn / CV
News
- January 2024: One paper accepted to ICLR 2024.
- June 2023: One paper accepted to ICML Workshop 2023.
Towards Unified Benchmark and Models for Multi-Modal Perceptual Metrics
S. Ghazanfari, S. Garg, N. Flammarion, P. Krishnamurthy, F. Khorrami, and F. Croce
Submitted to CVPR, 2025
PDF / arXiv / code
In this work, we propose UniSim-Bench,
the first benchmark to track the progress of perceptual
similarity metrics across uni- and multimodal tasks.
We identify the limitations of current specialized perceptual metrics
in generalizing to unseen datasets and perceptual tasks.
We propose UniSim, a set of multi-task perceptual models which
are a first step towards general-purpose perceptual metrics.
Together, UniSim-Bench and UniSim lay the groundwork for
understanding the challenges of learning automated metrics
that broadly mimic human perceptual similarity, beyond narrow,
task-specific applications.
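As a rough illustration of the kind of task UniSim-Bench covers, the sketch below scores a generic embedding-based metric on a 2AFC (two-alternative forced choice) similarity task; the metric, data loader, and field names are hypothetical placeholders, not the actual UniSim-Bench API.

```python
# Hedged sketch of a 2AFC-style perceptual-similarity evaluation, one of the
# task formats a unified benchmark can track. All interfaces are placeholders.
import torch
import torch.nn.functional as F

def two_afc_accuracy(embed, loader, device="cpu"):
    """Fraction of triplets where the metric agrees with human choices.

    `embed` maps a batch of images to feature vectors; each batch from
    `loader` holds a reference, two candidates, and a human label (0 or 1)
    indicating which candidate people judged closer to the reference.
    """
    correct, total = 0, 0
    with torch.no_grad():
        for ref, cand0, cand1, human_choice in loader:
            e_ref = F.normalize(embed(ref.to(device)), dim=-1)
            e0 = F.normalize(embed(cand0.to(device)), dim=-1)
            e1 = F.normalize(embed(cand1.to(device)), dim=-1)
            # Cosine distance: smaller means "more similar" to the reference.
            d0 = 1 - (e_ref * e0).sum(dim=-1)
            d1 = 1 - (e_ref * e1).sum(dim=-1)
            pred = (d1 < d0).long()  # 1 if candidate 1 is judged closer
            correct += (pred == human_choice.to(device)).sum().item()
            total += human_choice.numel()
    return correct / total
```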
EMMA: Efficient Visual Alignment in Multi-Modal LLMs
S. Ghazanfari, A. Araujo, P. Krishnamurthy, S. Garg and F. Khorrami
Submitted to ICLR, 2025
PDF / arXiv / code
In this paper, we propose EMMA (Efficient Multi-Modal Adaptation), a lightweight cross-modality
module designed to efficiently fuse visual and textual encodings, generating instruction-aware
visual representations for the language model. Our key contributions include: (1) an efficient
early fusion mechanism that integrates vision and language representations with minimal added
parameters (less than 0.2% increase in model size); (2) an in-depth interpretability analysis
that sheds light on the internal mechanisms of the proposed method; and (3) comprehensive experiments
that demonstrate notable improvements on both specialized and general benchmarks for MLLMs.
Empirical results show that EMMA boosts performance across multiple tasks by up to 9.3% while
significantly improving robustness against hallucinations.
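For intuition only, here is a minimal sketch of a gated early-fusion adapter in the spirit of EMMA, where pooled instruction embeddings modulate the visual tokens before they reach the language model; the module, layer names, and sizes are illustrative assumptions, not EMMA's actual architecture.

```python
# Hedged sketch of a lightweight early-fusion adapter: text (instruction)
# embeddings modulate visual tokens with very few added parameters.
# Names and shapes are illustrative, not the published EMMA design.
import torch
import torch.nn as nn

class EarlyFusionAdapter(nn.Module):
    def __init__(self, vis_dim: int, txt_dim: int):
        super().__init__()
        # A single projection of the pooled instruction into the visual space
        # keeps the parameter count tiny relative to the full MLLM.
        self.txt_proj = nn.Linear(txt_dim, vis_dim, bias=False)
        self.gate = nn.Parameter(torch.zeros(1))  # starts as an identity map

    def forward(self, vis_tokens, txt_tokens):
        # vis_tokens: (B, N_v, vis_dim); txt_tokens: (B, N_t, txt_dim)
        instr = self.txt_proj(txt_tokens.mean(dim=1, keepdim=True))  # (B, 1, vis_dim)
        # Instruction-aware visual tokens via additive, gated modulation.
        return vis_tokens + torch.tanh(self.gate) * instr
```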
LipSim: A Provably Robust Perceptual Similarity Metric
S. Ghazanfari, A. Araujo, P. Krishnamurthy, F. Khorrami and S. Garg
ICLR, 2024
PDF / arXiv / code
In this work, we demonstrate that the SOTA perceptual similarity metric, which relies on an
ensemble of ViT-based feature extractors, is vulnerable to adversarial attacks.
We then propose a framework to train a robust perceptual similarity metric called LipSim
(Lipschitz Similarity Metric) with provable guarantees, by leveraging 1-Lipschitz neural
networks as the backbone and a knowledge distillation approach to transfer the knowledge of the
SOTA models. Finally, a comprehensive set of experiments demonstrates the performance of LipSim
in terms of natural and certified scores, as well as on an image retrieval application.
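The snippet below sketches the distillation step described above under simplifying assumptions: a 1-Lipschitz student backbone is trained to match the embeddings of a SOTA teacher metric. It is not the actual LipSim training code; models, loss, and optimizer settings are placeholders.

```python
# Hedged sketch of embedding distillation from a teacher perceptual metric
# into a 1-Lipschitz student backbone. Certified robustness would follow from
# the student's Lipschitz bound, not from this loss itself.
import torch
import torch.nn.functional as F

def distill_step(student, teacher, images, optimizer):
    """One knowledge-distillation step on a batch of images."""
    teacher.eval()
    with torch.no_grad():
        target = F.normalize(teacher(images), dim=-1)  # teacher embeddings
    pred = F.normalize(student(images), dim=-1)        # 1-Lipschitz student
    # Cosine-style matching of normalized embeddings.
    loss = (1 - (pred * target).sum(dim=-1)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```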
R-LPIPS: An Adversarially Robust Perceptual Similarity Metric
S. Ghazanfari, S. Garg, P. Krishnamurthy, F. Khorrami and A. Araujo
ICML Workshop, 2023
PDF / arXiv / code
In this work, we show that the LPIPS metric is sensitive to adversarial perturbation and propose the
use of Adversarial Training to build a new Robust Learned Perceptual Image Patch Similarity (R-LPIPS)
that leverages adversarially trained deep features. Based on an adversarial evaluation, we demonstrate
the robustness of R-LPIPS to adversarial examples compared to the LPIPS metric.
Finally, we show that perceptual defenses built on the LPIPS metric can easily
be broken by stronger attacks developed based on R-LPIPS.
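As a hedged illustration of the adversarial-training ingredient, the sketch below crafts a PGD perturbation that maximizes a differentiable perceptual distance to a reference image; the metric interface and hyperparameters are assumptions, not the exact R-LPIPS setup.

```python
# Hedged sketch: PGD against a differentiable perceptual metric. The perturbed
# images could then be fed back into training, as in adversarial training.
import torch

def pgd_on_metric(metric, x, x_ref, eps=8/255, alpha=2/255, steps=10):
    """Perturb x to maximize metric(x_adv, x_ref) within an L-inf ball of radius eps."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        dist = metric(x_adv, x_ref).sum()
        grad = torch.autograd.grad(dist, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()
            x_adv = x + (x_adv - x).clamp(-eps, eps)  # project back to the eps-ball
            x_adv = x_adv.clamp(0, 1)                 # keep a valid image
    return x_adv.detach()
```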