Title: Seeing past words: Testing the cross-modal capabilities of pretrained V&L models
Speaker: Letitia Parcalabescu (ICL)
Abstract
In this talk, we will first (a) explain what "pretrained vision and
language (V&L) models" are, (b) give a short overview of the
architecture of these models, and (c) state their contributions to
vision and language integration.
Second, we will present our recent work investigating the reasoning
abilities of pretrained V&L models on two tasks that require multimodal
integration: (1) discriminating a correct image-sentence pair from an
incorrect one, and (2) counting entities in an image. We evaluate three
pretrained V&L models on these tasks, ViLBERT, ViLBERT 12-in-1 and
LXMERT, in both zero-shot and finetuned settings.
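
To make the probing setup concrete, below is a minimal sketch of how a zero-shot image-sentence alignment probe of this kind can be scored. The function alignment_score is a hypothetical stand-in for a pretrained V&L model's image-text matching head (e.g. ViLBERT or LXMERT), not the speaker's actual evaluation code, and the details of the real setup may differ.

```python
# Minimal sketch of a zero-shot image-sentence alignment probe.
# `alignment_score(image, text)` is assumed to return a higher value
# for matching image-sentence pairs; plug in a pretrained V&L model here.

from typing import Callable, List, Tuple


def alignment_probe_accuracy(
    pairs: List[Tuple[str, str, str]],            # (image_path, correct_caption, foil_caption)
    alignment_score: Callable[[str, str], float],
) -> float:
    """Fraction of examples where the model scores the correct caption
    higher than the mismatched (foil) caption for the same image."""
    hits = 0
    for image, caption, foil in pairs:
        if alignment_score(image, caption) > alignment_score(image, foil):
            hits += 1
    return hits / len(pairs) if pairs else 0.0


if __name__ == "__main__":
    # Dummy scorer for illustration only; replace with a real V&L model.
    dummy_scorer = lambda image, text: float(len(text))
    data = [("img_001.jpg", "two dogs playing on a beach", "a cat on a sofa")]
    print(alignment_probe_accuracy(data, dummy_scorer))
```

A finetuned variant of the probe would use the same comparison, only with a model whose matching head has been further trained on the probing data.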
Our results show that the models solve task (1) very well, as expected,
since all of them are pretrained on it. However, none of the pretrained
V&L models adequately solves task (2), our counting probe, and they
cannot generalise to out-of-distribution quantities. We propose several
explanations for these findings: LXMERT (and, to some extent,
ViLBERT 12-in-1) shows some evidence of catastrophic forgetting on
task (1).
Concerning our results on the counting probe, we find evidence that all
models are affected by dataset bias and fail to individuate entities in
the visual input. While a selling point of pretrained V&L models is
their ability to solve complex tasks, our findings suggest that
understanding their reasoning and grounding capabilities requires more
targeted investigations of specific phenomena.