Automatic Prompt Optimization for Multimodal Vision Agents: A Self-Driving Car Example
Multimodal AI agents, those that can process text and images (or other media), are rapidly entering real-world domains like autonomous driving, healthcare, and robotics. In these settings, we have traditionally used vision models like CNNs; in the post-GPT era, we can use vision and multimodal language models that leverage human instructions in the form of prompts, rather than task-oriented, highly specific vision models.

