Vision Language Models


Lecture

What was a VLM (~1.5 yrs ago)? ▷ An LLM that incorporates visual information. The figure (and text) is fed in, turned into image tokens and text tokens, and passed through the model. ▷ There are three main approaches: the LLaVA, Fuyu, and BLIP styles. LLaVA: preprocess, patchify, CLIP ViT (Vision Transformer), projection. Fuyu: preprocess, patchify, projection. BLIP: preprocess, patchify, CLIP ViT, Q-Former & sampler (more token-efficient), but the Q-Former is hard to train.
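As a concrete picture of the LLaVA-style path (patchify → CLIP ViT → projection → concatenate with text tokens), here is a minimal PyTorch sketch. The class name, dimensions, and the one-layer stand-in for a frozen CLIP ViT are illustrative assumptions, not the actual LLaVA code.

```python
# Minimal sketch of the LLaVA-style path: patchify -> vision encoder (a one-layer
# stand-in for a frozen CLIP ViT) -> projection -> concatenation with text tokens.
# Dimensions and module names are illustrative, not the real LLaVA implementation.
import torch
import torch.nn as nn

class VisionToLLMAdapter(nn.Module):
    def __init__(self, patch=14, vit_dim=1024, llm_dim=4096):
        super().__init__()
        self.patch = patch
        self.vit = nn.Linear(3 * patch * patch, vit_dim)   # stand-in for CLIP ViT
        self.proj = nn.Linear(vit_dim, llm_dim)            # LLaVA-style projector

    def forward(self, image, text_embeds):
        # image: (B, 3, H, W); text_embeds: (B, T, llm_dim)
        B, p = image.shape[0], self.patch
        # Patchify: cut the image into non-overlapping p x p squares.
        patches = image.unfold(2, p, p).unfold(3, p, p)            # (B, 3, H/p, W/p, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, 3 * p * p)
        image_tokens = self.proj(self.vit(patches))                # (B, N_patches, llm_dim)
        # Prepend image tokens to the text tokens; the LLM attends over both.
        return torch.cat([image_tokens, text_embeds], dim=1)

adapter = VisionToLLMAdapter()
fused = adapter(torch.randn(1, 3, 224, 224), torch.randn(1, 8, 4096))
print(fused.shape)  # torch.Size([1, 264, 4096]) -- 256 image tokens + 8 text tokens
```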

What has changed since then? ▷ A wider variety of modalities in both input and output. Test-time compute speed (modality-agnostic). Multi-image inputs. Multi-turn conversation. Agents (data production, model inference, evaluation via a DAG of the predicted task sequence).
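To make the "evaluation via a DAG of the predicted task sequence" point concrete, here is a hypothetical sketch: the task names, dependencies, and per-task scoring hook are invented for illustration, not taken from any specific agent framework.

```python
# Hypothetical agent evaluation over a DAG of predicted sub-tasks: nodes are
# sub-tasks, edges are dependencies, and tasks are scored in topological order
# so each one is evaluated only after its prerequisites.
from graphlib import TopologicalSorter

# Invented task sequence for a request like "describe the chart, then edit it".
task_dag = {
    "caption_image": set(),
    "extract_chart_values": {"caption_image"},
    "edit_image": {"caption_image"},
    "final_answer": {"extract_chart_values", "edit_image"},
}

for task in TopologicalSorter(task_dag).static_order():
    print("evaluate:", task)   # plug a per-task metric in here
```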

Emerging Properties in Unified Multimodal Pretraining ▷ Summary: introduces BAGEL, a 14B open-source multimodal foundation model, and highlights emerging properties in the order in which capabilities are learned (image understanding, image generation, image inpainting, intelligent image editing). ▷ Reasoning capability is trained as well and shows up in intelligent image editing. ▷ The CLIP encoder does not capture the logic; it is more about the tokens. For example, given "a car made of small cars", it detects "car" and generates a car; with thinking, however, the model can actually generate a car made of small cars. ▷ World model: given an image and an action item (in text, e.g. "move forward", "turn left"), the model can generate a follow-up image based on that action.
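The world-model usage above amounts to a simple rollout loop: condition on the current image plus a text action and generate the next frame. Below is a tiny sketch where `model.predict_next_image` is a hypothetical stand-in for the conditional generation call, not a real API.

```python
# Sketch of world-model rollout: image + text action -> follow-up image, repeated.
# `model.predict_next_image` is a hypothetical placeholder, not a real API.
def rollout(model, image, actions):
    frames = [image]
    for action in actions:                 # e.g. ["move forward", "turn left"]
        image = model.predict_next_image(image, action)
        frames.append(image)
    return frames
```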

BAGEL: scalaBle generAtive coGnitive modEL ▷ Uses a Mixture of Transformers (MoT): an 'understanding' expert (which also generates text) and a 'generation' expert (used only for image and video). ▷ Different tokenizers for different experts: the understanding expert uses a SigLIP2-style ViT (an extension of CLIP), while the generation expert uses the FLUX VAE, whose latents are geared toward pixel-level generation rather than semantics. ▷ How is visual information represented? For multi-image inputs, three sets of tokens are created: ViT tokens, VAE tokens, and noisy VAE tokens. ▷ Ablation study: MoT works better than MoE.
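A minimal sketch of the MoT idea as described here: all tokens share self-attention, while each token's feed-forward pass is routed to the understanding or generation expert by a modality mask. The dimensions, the hard mask routing, and the shared attention projections are simplifying assumptions, not the BAGEL implementation (where the experts are larger than a single FFN).

```python
# Mixture-of-Transformers (MoT) block sketch: shared self-attention over every
# token, plus two expert FFNs ('understanding' vs. 'generation') selected per
# token by a modality mask. Illustrative only, not the BAGEL code.
import torch
import torch.nn as nn

class MoTBlock(nn.Module):
    def __init__(self, dim=1024, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        def ffn():
            return nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.und_ffn, self.gen_ffn = ffn(), ffn()   # understanding / generation experts

    def forward(self, x, is_gen_token):
        # x: (B, T, dim); is_gen_token: (B, T) bool -- True for generation tokens
        # (e.g. noisy VAE tokens), False for text and ViT understanding tokens.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)            # all tokens attend to each other
        x = x + attn_out
        h = self.norm2(x)
        out = torch.where(is_gen_token.unsqueeze(-1), self.gen_ffn(h), self.und_ffn(h))
        return x + out

block = MoTBlock()
mask = torch.tensor([[False] * 6 + [True] * 4])     # 6 understanding + 4 generation tokens
print(block(torch.randn(1, 10, 1024), mask).shape)  # torch.Size([1, 10, 1024])
```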

Dataset Curation ▷ Includes video-originated data (cut a short clip from the original video and caption it with a large VLM) and reasoning-augmented data for text-to-image generation, free-form image editing (built from existing datasets), and conceptual editing. ▷ One of the big contributions of this paper: for text, it is reasonable to get clean training data from the internet; for images, the internet alone is not enough for clean data, and this paper shares its synthetic data generation recipe.
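As a sketch of the video-originated data recipe (short clip plus a caption from a large VLM), something like the following; `cut_clip` and `caption_with_large_vlm` are hypothetical placeholder helpers, not functions from the paper.

```python
# Video-originated data sketch: take a short clip from a longer video, caption it
# with a large VLM, and store the pair. Helper names are hypothetical placeholders.
def make_video_caption_record(video_path, start_s, end_s):
    clip_frames = cut_clip(video_path, start_s, end_s)     # extract a short clip
    caption = caption_with_large_vlm(clip_frames)          # caption via a strong VLM
    return {"frames": clip_frames, "text": caption, "source": video_path}
```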

▷ Training: alignment (SigLIP2 encoder), pre-training (everything except the VAE generation encoder), mid-training (continued training with increased visual input resolution and a higher sampling ratio of synthetic data), and supervised fine-tuning. ▷ Emerging property: for the four capabilities mentioned earlier there is a clear trend, in terms of the number of training tokens, in the order in which each capability's 'elbow' (sudden improvement) happens.
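Restating the four stages as a small config sketch; the wording follows the notes above, and the module lists and any hyperparameters are paraphrased rather than taken from the paper.

```python
# The four training stages from the notes, as a config list. Descriptions are
# paraphrased; resolutions, ratios, and module lists are not the paper's values.
TRAINING_STAGES = [
    {"stage": "alignment",    "note": "align the SigLIP2 ViT encoder with the language model"},
    {"stage": "pretraining",  "note": "train everything except the VAE generation encoder"},
    {"stage": "mid-training", "note": "continued training; higher visual resolution, more synthetic data"},
    {"stage": "sft",          "note": "supervised fine-tuning"},
]

for s in TRAINING_STAGES:
    print(f"{s['stage']:>12}: {s['note']}")
```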

Holistic Evaluation for Interleaved Text-and-Image Generation (before the BAGEL paper) ▷ Motivation: interleaved text-and-image generation still has gaps: limited outputs and outdated metrics. The paper therefore proposes InterleavedBench (a dataset) and InterleavedEval (an evaluation protocol). ▷ InterleavedBench: two subsets: context-based (continuation of a given image + text context with new image + text) and context-free (image + text generation without context).

Dataset Curation ▷ Context-based: sourced from WikiHow and VIST, followed by human selection, then humans write the instructions. ▷ Context-free: GPT-4o-generated synthetic examples, followed by human selection.
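A hypothetical record layout covering both subsets; the field names are assumptions about how such an example could be stored, not the released InterleavedBench schema.

```python
# Hypothetical example record for an interleaved text-and-image benchmark,
# covering both the context-based and context-free subsets described above.
from dataclasses import dataclass, field

@dataclass
class InterleavedExample:
    instruction: str                                          # human-written instruction
    subset: str                                               # "context-based" or "context-free"
    context_text: list[str] = field(default_factory=list)     # empty for context-free
    context_images: list[str] = field(default_factory=list)   # image paths; empty for context-free
    # target: the model should produce an interleaved sequence of text and images
```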

InterleavedEval ▷ Reference-free metrics, using GPT-4o as the judge.
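A sketch of a reference-free, GPT-4o-as-judge call. The aspect and prompt wording are assumptions for illustration, not InterleavedEval's actual rubric; image inputs (which GPT-4o also accepts) are omitted for brevity.

```python
# Reference-free judging sketch: ask GPT-4o to score one aspect of a generated
# output. Prompt wording and aspects are illustrative, not the paper's rubric.
from openai import OpenAI

client = OpenAI()

def judge(instruction: str, generated_text: str, aspect: str) -> str:
    prompt = (
        f"Instruction: {instruction}\n"
        f"Model output: {generated_text}\n"
        f"Rate the output's {aspect} on a 1-5 scale and briefly justify the score."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# e.g. judge("Write an illustrated recipe for pancakes", output_text, "text quality")
```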

Experiments ▷ Reference-based baseline metrics: BERTScore, CLIPScore, DreamSim; the authors claim their reference-free evaluation is better. Image style coherence was the most challenging aspect, where models performed worst.
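For reference, two of those baselines can be computed with off-the-shelf libraries; the snippet below is a sketch with placeholder inputs (assuming the bert-score and torchmetrics packages are installed), and DreamSim is omitted.

```python
# Reference-based baseline sketch: BERTScore for text and CLIPScore for
# image-text alignment, using common open-source implementations.
import torch
from bert_score import score as bert_score
from torchmetrics.multimodal.clip_score import CLIPScore

# BERTScore: candidate text vs. reference text.
P, R, F1 = bert_score(["a red car made of small cars"], ["a car built from tiny cars"], lang="en")
print("BERTScore F1:", F1.item())

# CLIPScore: how well a (placeholder) generated image matches the prompt text.
clip_metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
image = torch.randint(0, 255, (3, 224, 224), dtype=torch.uint8)   # placeholder image
print("CLIPScore:", clip_metric(image, "a car made of small cars").item())
```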