What is a multimodal prompt?

Multimodal prompting is a technique in which you guide a large language model with more than one input format rather than text alone. The inputs can combine text, images, audio, code, or other formats, depending on the model's capabilities and the task at hand. In short, a multimodal prompt is one that may include media such as images alongside text.
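To make the idea concrete, here is a minimal sketch of a multimodal prompt represented as an ordered list of parts, one image and one text instruction. The names `TextPart`, `ImagePart`, and `build_multimodal_prompt` are hypothetical and chosen for illustration only; they do not correspond to any particular model provider's API. Real services each define their own request format, so treat this as a picture of the concept rather than something to send to a model as-is.

```python
import base64
from dataclasses import dataclass
from typing import List, Union


@dataclass
class TextPart:
    """A plain-text piece of the prompt (hypothetical structure)."""
    text: str


@dataclass
class ImagePart:
    """An image piece of the prompt, stored as base64 text (hypothetical structure)."""
    data_b64: str    # base64-encoded image bytes
    media_type: str  # for example "image/png"


Part = Union[TextPart, ImagePart]


def build_multimodal_prompt(
    instruction: str,
    image_bytes: bytes,
    media_type: str = "image/png",
) -> List[Part]:
    """Combine one image and one text instruction into a single prompt."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    # Assumption: the image is placed before the instruction that refers to it.
    return [
        ImagePart(data_b64=encoded, media_type=media_type),
        TextPart(text=instruction),
    ]


def modalities(prompt: List[Part]) -> List[str]:
    """Return the sorted names of the part types present in the prompt."""
    return sorted({type(part).__name__ for part in prompt})
```

A call such as `build_multimodal_prompt("Describe this image.", image_bytes)` yields a two-part prompt. Adding audio or video would mean defining further part types in the same way.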

As generative AI models evolve beyond text-based domains, multimodal prompting techniques are emerging. These techniques are often not simple applications of text-based methods; some are entirely novel ideas made possible by the additional modalities.

The sources discuss several specific areas of multimodal prompting, including:

  • Image Prompting
  • Audio Prompting
  • Video Prompting
  • Segmentation Prompting
  • 3D Prompting
