Multimodal LLM

Cambrian-1 is a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. While stronger language models can boost multimodal capabilities, the design choices for vision components are often insufficiently explored and disconnected from visual representation learning research.

Cambrian-1 is built on five key pillars, each providing important insights into the design of multimodal LLMs (MLLMs):

  1. Visual Representations: They explore various vision encoders and their combinations.
  2. Connector Design: They design a new dynamic, spatially-aware connector that integrates visual features from several models with LLMs while reducing the number of tokens.
  3. Instruction Tuning Data: They curate high-quality visual instruction-tuning data from public sources, emphasizing distribution balancing.
  4. Instruction Tuning Recipes: They discuss strategies and best practices for instruction tuning.
  5. Benchmarking: They examine existing MLLM benchmarks and introduce a new vision-centric benchmark called "CV-Bench".
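The connector idea in pillar 2 can be illustrated with a minimal sketch. Note this is not Cambrian-1's actual connector implementation, only an illustration of the underlying idea: feature maps from several vision encoders are spatially pooled to a shared grid and concatenated, so far fewer visual tokens reach the LLM. All shapes and encoder names here are hypothetical.

```python
import numpy as np

def pool_to_grid(feats, grid=4):
    """Average-pool an (H, W, C) feature map down to (grid, grid, C)."""
    h, w, c = feats.shape
    return feats.reshape(grid, h // grid, grid, w // grid, c).mean(axis=(1, 3))

# Two hypothetical encoders with different resolutions and channel widths.
clip_feats = np.random.rand(16, 16, 8)   # e.g. a CLIP-style encoder
dino_feats = np.random.rand(32, 32, 6)   # e.g. a DINO-style encoder

# Pool both to a shared 4x4 grid, then concatenate along the channel axis.
pooled = [pool_to_grid(f, grid=4) for f in (clip_feats, dino_feats)]
tokens = np.concatenate(pooled, axis=-1).reshape(16, -1)

print(tokens.shape)  # 16 visual tokens instead of 16*16 + 32*32
```

The key design point is that pooling happens per encoder before fusion, so encoders with different native resolutions can still be combined on one spatial grid.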

We'll learn how Cambrian-1 works with an example of vision-centric exploration on images found through vector search. This involves two steps:

  1. Perform a vector search to retrieve related images.
  2. Use the retrieved images for vision-centric exploration.
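Step 1 can be sketched with a minimal cosine-similarity search over image embeddings. This is an illustration, not the pipeline from the Cambrian-1 paper: the function name, toy embeddings, and filenames are all hypothetical, and a real setup would use a vision encoder to produce the embeddings and a vector database to index them.

```python
import numpy as np

def top_k_images(query_emb, image_embs, image_ids, k=3):
    """Return the ids of the k images whose embeddings are most
    similar (by cosine similarity) to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    m = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = m @ q                      # cosine similarity per image
    order = np.argsort(-sims)[:k]    # indices of the k best matches
    return [image_ids[i] for i in order]

# Toy 4-dimensional embeddings standing in for real encoder outputs.
embs = np.array([
    [1.0, 0.0, 0.0, 0.0],   # cat.jpg
    [0.9, 0.1, 0.0, 0.0],   # kitten.jpg (close to cat.jpg)
    [0.0, 1.0, 0.0, 0.0],   # car.jpg (unrelated)
])
ids = ["cat.jpg", "kitten.jpg", "car.jpg"]
query = np.array([1.0, 0.05, 0.0, 0.0])

print(top_k_images(query, embs, ids, k=2))  # → ['cat.jpg', 'kitten.jpg']
```

The retrieved images would then be passed to the MLLM together with a prompt for step 2, the vision-centric exploration.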