Title: VALSE: A New Foiling Benchmark for Vision and Language Models
Speaker: Letitia Parcalabescu (ICL)
Abstract
We will present preliminary insights and results on VALSE (Vision And Language Structured Evaluation), a novel benchmark designed to investigate general-purpose pretrained vision and language (V&L) models for specific linguistic grounding capabilities. Currently, such V&L models are evaluated on tasks such as image-text alignment, visual question answering, or image retrieval. While performance on these tasks is important, they do not disentangle the fine-grained linguistic capabilities of V&L models. By contrast, VALSE comprises a suite of six specific linguistic phenomena grounded in the visual modality. With this benchmark, our goal is to evaluate V&L models based on fine-grained and targeted probes for grounding language in vision.
We first describe the construction of VALSE, focusing specifically on methods that support the construction of good foils. Moreover, we present results on VALSE for five widely used pretrained V&L models (CLIP, LXMERT, VisualBERT, ViLBERT, and ViLBERT 12-in-1) in zero-shot and fine-tuned settings.
Our zero-shot experiments suggest that pretrained models show little, if any, sensitivity to the linguistic phenomena grounded in images. Fine-tuned models are able to capture different phenomena with different degrees of accuracy, but still leave room for improvement. We expect VALSE to be an important step towards better understanding pretrained V&L models.
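To make the foiling setup concrete, here is a minimal, self-contained sketch of how a foil-based evaluation can be scored. All names and the toy scorer below are illustrative assumptions, not the actual VALSE code: the idea is that a model "solves" an instance when it assigns a higher image-text alignment score to the correct caption than to its minimally altered foil.

```python
# Illustrative sketch of foil-based evaluation (not the authors' code).
# A foil is a caption altered in one targeted linguistic aspect so that
# it no longer matches the image; a grounded model should prefer the
# original caption over the foil.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    image: str    # stand-in for an image (e.g. a textual description or path)
    caption: str  # caption that correctly describes the image
    foil: str     # minimally edited caption that no longer matches the image


def pairwise_accuracy(examples: List[Example],
                      score: Callable[[str, str], float]) -> float:
    """Fraction of instances where the true caption outscores its foil."""
    solved = sum(score(ex.image, ex.caption) > score(ex.image, ex.foil)
                 for ex in examples)
    return solved / len(examples)


# Toy alignment scorer (hypothetical): word overlap between the "image"
# description and the text. A real evaluation would use a V&L model's
# image-text alignment score instead.
def toy_score(image_desc: str, text: str) -> float:
    return len(set(image_desc.lower().split()) & set(text.lower().split()))


examples = [
    Example("two dogs on the grass",
            "two dogs on the grass", "three dogs on the grass"),
    Example("a cat under a table",
            "a cat under a table", "a cat on a table"),
]
print(pairwise_accuracy(examples, toy_score))  # prints 1.0 for this toy data
```

In practice the scorer would be replaced by a pretrained model's image-text alignment head (e.g. CLIP's cosine similarity between image and text embeddings), while the pairwise comparison between caption and foil stays the same.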