How difficult is matching images and text? – An investigation
Siting Liang
Ordinarily, we call the channels of communication and sensation modalities. We experience the world
through multiple modalities, such as vision, hearing, or touch. A system that handles a dataset comprising multiple
modalities is characterized as a multimodal model. For example, the MSCOCO dataset contains not only images but
also language captions for each image. Using such data, a model can learn to bridge the language and vision
modalities.