Why multimodal NLP models? Humans use language in highly multimodal contexts, where extra-linguistic information grounds, disambiguates, complements, or clarifies what we hear and utter.

Visual and other sensorimotor information plays a key role in forming and shaping our concepts.

The symbol grounding problem [Harnad 1990].

Text-only distributional models follow a “solipsistic” route to semantics, in which the meaning of a word is entirely accounted for by patterns of co-occurrence with other words [Baroni 2016].

The octopus in Bender & Koller's thought experiment has never observed the objects being talked about, and thus would not be able to pick out the referent of a word when presented with a set of physical alternatives [Bender & Koller 2020].

Text-only models may not account for the full range of features of the language we use.

We also want AI models to perform useful tasks in the world, e.g., tell us what a sign means or create advertising copy to sell our bike. To do that, they must be able to understand multimodal input and generate valid, plausible, and aligned language.

Two historical approaches:

  1. Visually grounded semantics: combining textual and visual features to obtain human-like representations that are task-agnostic.
  2. Multimodal machine learning: joint multimodal processing to align/integrate information from language and vision (task-oriented).

The typical design takes inputs in multiple modalities and encodes each with a separate, pre-trained and frozen encoder; the LLM then decides what kind of output to produce, and for non-textual outputs the result is passed through a decoder (e.g., a diffusion model) to obtain the final output.
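A minimal PyTorch sketch of the frozen-encoder + LLM pattern, assuming placeholder `vision_encoder` and `llm` modules and illustrative dimensions (this is not any specific published architecture):

```python
import torch
import torch.nn as nn

class MultimodalLM(nn.Module):
    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vis_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        self.vision_encoder = vision_encoder
        for p in self.vision_encoder.parameters():
            p.requires_grad = False          # the modality encoder stays frozen
        # Map image features into the LLM's embedding space.
        self.projection = nn.Linear(vis_dim, llm_dim)
        self.llm = llm

    def forward(self, image, text_embeddings):
        with torch.no_grad():
            vis_feats = self.vision_encoder(image)      # (B, N_patches, vis_dim)
        vis_tokens = self.projection(vis_feats)         # (B, N_patches, llm_dim)
        # Prepend the projected visual tokens to the text embeddings;
        # the LLM then decides what output to generate from the joint sequence.
        joint = torch.cat([vis_tokens, text_embeddings], dim=1)
        return self.llm(joint)
```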


Foundations

During the pre-transformer era, multimodal research was powered by the deep learning revolution, and a few key tasks - with corresponding models to solve them - were proposed:

  • Image Captioning;
    • Given an image, generate its description in natural language;
    • The pipeline took an input image, passed it through a CNN that produced a feature vector representing the image, and this feature vector was then fed to an RNN (LSTM) that generated the caption.
      Later work added an attention mechanism between the CNN and the RNN, so that the RNN could attend to specific parts of the image (see the sketch after this list).
  • Visual Question Answering;
    • Given an image, you can ask questions about it and expect the model to respond.
  • Visual dialogue;
    • At the time, it was treated as a discriminative task (e.g., ranking candidate responses rather than generating them).
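
A hedged sketch of the classic CNN → LSTM captioning pipeline from the first bullet above (attention omitted for brevity; the ResNet-18 backbone, dimensions, and training-time teacher forcing are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CNNEncoder(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        resnet = models.resnet18()                                     # randomly initialised here
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])   # drop the classifier head
        self.fc = nn.Linear(resnet.fc.in_features, feat_dim)

    def forward(self, images):                     # (B, 3, H, W)
        feats = self.backbone(images).flatten(1)   # (B, 512)
        return self.fc(feats)                      # (B, feat_dim)

class LSTMDecoder(nn.Module):
    def __init__(self, vocab_size, feat_dim=256, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, feat_dim)
        self.lstm = nn.LSTM(feat_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, image_feat, captions):
        # Condition the LSTM on the image by prepending its feature vector
        # to the embedded caption tokens (teacher forcing during training).
        tokens = self.embed(captions)                               # (B, T, feat_dim)
        inputs = torch.cat([image_feat.unsqueeze(1), tokens], 1)    # (B, T+1, feat_dim)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                                     # logits over the vocabulary
```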

Transformers work better.
Transformer VLMs work by turning any task into a generative one.

CLIP maximizes the similarity of paired image/text embeddings (and minimizes it for mismatched pairs).
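
A sketch of the contrastive objective behind this idea, assuming a batch of already-computed image and text embeddings (this reproduces the loss idea only, not OpenAI's CLIP implementation):

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    # L2-normalize both embedding sets so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)          # (B, D)
    text_emb = F.normalize(text_emb, dim=-1)            # (B, D)
    logits = image_emb @ text_emb.t() / temperature     # (B, B) similarity matrix
    targets = torch.arange(len(logits), device=logits.device)  # matched pairs lie on the diagonal
    # Symmetric cross-entropy over the image->text and text->image directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```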

The Vision Transformer (ViT) tokenizes the image into patches, adds positional embeddings, and trains a Transformer encoder on the resulting token sequence.
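
A minimal sketch of the patch-tokenization step, with illustrative sizes (the class token and the full encoder stack are omitted):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_ch=3, dim=768):
        super().__init__()
        n_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to splitting the image into
        # non-overlapping patches and linearly projecting each one.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_emb = nn.Parameter(torch.zeros(1, n_patches, dim))  # learned positional embeddings

    def forward(self, images):                        # (B, 3, 224, 224)
        patches = self.proj(images)                   # (B, dim, 14, 14)
        tokens = patches.flatten(2).transpose(1, 2)   # (B, 196, dim)
        return tokens + self.pos_emb                  # add positional information

# The resulting token sequence is then fed to a standard Transformer encoder, e.g.:
# encoder = nn.TransformerEncoder(
#     nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True), num_layers=12)
```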


Visual Storytelling

The input is a sequence of images, and the task is to generate a textual story consistent with it.

How good are these models at generating narrations, and how can we evaluate them?

Reference-based: METEOR, BERTScore, BLEURT, etc.
Reference-free: RoViST produces scores for visual grounding, coherence, and non-redundancy.
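
For instance, a reference-based score such as BERTScore can be computed with the `bert-score` package; the candidate and reference strings below are made-up examples:

```python
from bert_score import score

candidates = ["A dog runs across the beach towards the water."]
references = ["A dog is running on the beach near the sea."]

# P, R, F1 are tensors with one value per candidate/reference pair.
P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")
```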