A Guide to State-of-the-Art Object Detection for Multimodal NLP
Tai Mai
If you’re reading this, chances are you’re a computational linguist, and chances are you haven’t had much contact with computer vision. You might even think to yourself: “Well yeah, why would I? It has nothing to do with language, does it?” But what if a language model could also rely on visual signals and ground language in the visual world? This would help in many situations: take, for example, ambiguous phrases where textual context alone cannot decide whether “Help me into the car!” refers to an automobile or a train car. As it turns out, people are working very hard on exactly that: combining computer vision with natural language processing (NLP). This is a case of so-called Multimodal Learning.