Multimodal LLM

Cambrian-1 is a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. While stronger language models can boost multimodal capabilities, the design choices for vision components are often insufficiently explored and disconnected from visual representation learning research.

Cambrian-1 is built on five key pillars, each providing important insights into the design of multimodal LLMs (MLLMs):

  1. Visual Representations: They explore various vision encoders and their combinations.
  2. Connector Design: They design a new dynamic, spatially-aware connector that integrates visual features from several models with LLMs while reducing the number of tokens.
  3. Instruction Tuning Data: They curate high-quality visual instruction-tuning data from public sources, emphasizing distribution balancing.
  4. Instruction Tuning Recipes: They discuss strategies and best practices for instruction tuning.
  5. Benchmarking: They examine existing MLLM benchmarks and introduce a new vision-centric benchmark called "CV-Bench".
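The connector idea in pillar 2 can be illustrated with a minimal sketch. Note this is not Cambrian-1's actual connector implementation, only an illustration of the underlying idea: feature maps from several vision encoders are spatially pooled to a shared grid and concatenated, so far fewer visual tokens reach the LLM. All shapes and encoder names here are hypothetical.

```python
import numpy as np

def pool_to_grid(feats, grid=4):
    """Average-pool an (H, W, C) feature map down to (grid, grid, C)."""
    h, w, c = feats.shape
    return feats.reshape(grid, h // grid, grid, w // grid, c).mean(axis=(1, 3))

# Two hypothetical encoders with different resolutions and channel widths.
clip_feats = np.random.rand(16, 16, 8)   # e.g. a CLIP-style encoder
dino_feats = np.random.rand(32, 32, 6)   # e.g. a DINO-style encoder

# Pool both to a shared 4x4 grid, then concatenate along the channel axis.
pooled = [pool_to_grid(f, grid=4) for f in (clip_feats, dino_feats)]
tokens = np.concatenate(pooled, axis=-1).reshape(16, -1)

print(tokens.shape)  # 16 visual tokens instead of 16*16 + 32*32
```

The key design point is that pooling happens per encoder before fusion, so encoders with different native resolutions can still be combined on one spatial grid.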

We'll learn how Cambrian-1 works with an example of vision-centric exploration on images found through vector search. This involves two steps:

  1. Perform a vector search to retrieve related images.
  2. Use the retrieved images for vision-centric exploration.
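Step 1 can be sketched with a minimal cosine-similarity search over image embeddings. This is an illustration, not the pipeline from the Cambrian-1 paper: the function name, toy embeddings, and filenames are all hypothetical, and a real setup would use a vision encoder to produce the embeddings and a vector database to index them.

```python
import numpy as np

def top_k_images(query_emb, image_embs, image_ids, k=3):
    """Return the ids of the k images whose embeddings are most
    similar (by cosine similarity) to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    m = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = m @ q                      # cosine similarity per image
    order = np.argsort(-sims)[:k]    # indices of the k best matches
    return [image_ids[i] for i in order]

# Toy 4-dimensional embeddings standing in for real encoder outputs.
embs = np.array([
    [1.0, 0.0, 0.0, 0.0],   # cat.jpg
    [0.9, 0.1, 0.0, 0.0],   # kitten.jpg (close to cat.jpg)
    [0.0, 1.0, 0.0, 0.0],   # car.jpg (unrelated)
])
ids = ["cat.jpg", "kitten.jpg", "car.jpg"]
query = np.array([1.0, 0.05, 0.0, 0.0])

print(top_k_images(query, embs, ids, k=2))  # → ['cat.jpg', 'kitten.jpg']
```

The retrieved images would then be passed to the MLLM together with a prompt for step 2, the vision-centric exploration.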