Archives


How difficult is matching images and text? – An investigation

Siting Liang

Ordinarily, we call the channels of communication and sensation modalities. We experience the world through multiple modalities, such as vision, hearing or touch. A system that handles data comprising multiple modalities is characterized as a multimodal model. For example, the MSCOCO dataset contains not only images, but also language captions for each image. Utilizing such data, a model can learn to bridge the language and vision modalities.
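To make the image-caption pairing concrete, here is a minimal sketch of how such pairs could be read from a COCO-style captions file; the file name and the simple pairing logic are assumptions for illustration, not taken from the post.

```python
# Sketch: building (image, caption) pairs from a COCO-style annotations file.
# The path and field layout are assumptions for illustration.
import json

def load_image_caption_pairs(annotation_path="captions_train2017.json"):
    with open(annotation_path) as f:
        coco = json.load(f)
    # Map each image id to its file name.
    id_to_file = {img["id"]: img["file_name"] for img in coco["images"]}
    # Each annotation links one caption to one image id, so a single image
    # usually yields several (image, caption) pairs.
    return [(id_to_file[ann["image_id"]], ann["caption"])
            for ann in coco["annotations"]]
```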

Adversarial Training

Sebastian Burst

In the past two years, machine learning, particularly neural computer vision and NLP, has seen a tremendous rise in the popularity of all things adversarial. In this blog post I will give an overview of the two most popular training methods that are commonly referred to as adversarial: injecting adversarial examples (1) and min-max optimization (2). After showcasing how they are applied in NLP, I will compare them and examine ways to combine them (3). A small sketch of method (1) follows below.
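As a rough illustration of method (1), the sketch below perturbs each input along the sign of the loss gradient (FGSM-style) and trains on clean and perturbed inputs together; `model`, `loss_fn`, `optimizer`, and `epsilon` are placeholders, not details from the post.

```python
# Minimal PyTorch sketch of injecting adversarial examples during training.
# All names here are illustrative placeholders.
import torch

def adversarial_step(model, loss_fn, optimizer, x, y, epsilon=0.01):
    # Craft an adversarial copy of the input via the gradient sign (FGSM).
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)
    loss.backward()
    x_adv = (x_adv + epsilon * x_adv.grad.sign()).detach()

    # Train on the clean and the adversarial inputs together.
    optimizer.zero_grad()
    total = loss_fn(model(x), y) + loss_fn(model(x_adv), y)
    total.backward()
    optimizer.step()
    return total.item()
```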

(Multimodal) Commonsense Reasoning: Where are we and where could we go?

Ines Pisetta

This blog post aims to give a broad overview of the development of Commonsense Reasoning (CR) in the field of Natural Language Processing (NLP) and its multimodal intersection with Computer Vision (CV). What makes CR so important in the age of Deep Learning, how have the approaches to it and the datasets for it changed over time, what are the main challenges, how can modalities other than language contribute to CR in Machine Learning, and why is Commonsense Reasoning still a hot topic?

Let’s evaluate with Macro F1: what can go wrong?

Juri Opitz, Sebastian Burst

Earlier this year we got slightly puzzled about how best to calculate the “macro F1” score for measuring the performance of a classifier.
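One source of such puzzlement is that “macro F1” can plausibly be computed in more than one way. The sketch below contrasts two candidate definitions, averaging per-class F1 scores versus taking the F1 of macro-averaged precision and recall; the class counts in the example are made up purely for illustration.

```python
# Sketch: two plausible ways to compute "macro F1", which need not agree.
def prf(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def macro_f1_variants(per_class_counts):
    ps, rs, f1s = zip(*(prf(*c) for c in per_class_counts))
    avg_f1 = sum(f1s) / len(f1s)                      # mean of per-class F1
    p_bar, r_bar = sum(ps) / len(ps), sum(rs) / len(rs)
    f1_of_avgs = (2 * p_bar * r_bar / (p_bar + r_bar)
                  if p_bar + r_bar else 0.0)          # F1 of averaged P and R
    return avg_f1, f1_of_avgs

# Two classes with made-up (tp, fp, fn) counts: the two variants differ.
print(macro_f1_variants([(10, 1, 5), (2, 8, 1)]))  # ~(0.54, 0.61)
```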

A Guide to State of the Art Object Detection for Multimodal NLP

Tai Mai

If you’re reading this, chances are you’re a computational linguist and chances are you have not had a lot of contact with computer vision. You might even think to yourself, “Well yeah, why would I? It has nothing to do with language, does it?” But what if a language model could also rely on visual signals and ground language? This would definitely help in many situations: take, for example, ambiguous formulations where textual context alone cannot decide whether “Help me into the car!” refers to an automobile or a train car. As it turns out, people are working very hard on exactly that: combining computer vision with natural language processing (NLP). This is a case of so-called Multimodal Learning.