Multimodal & Vision

Three Tokens Of LLaVA Input

Archive

Easy

100 pts · 0 solves

A LLaVA-style vision-language model (VLM) receives three input streams: _____(1) patches (encoded by CLIP), _____(2) tokens (produced by a tokenizer), and an optional _____(3) template for chat. Fill in the three blanks. Flag format: CONGRESS{1:[word],2:[word],3:[word]}. Example: CONGRESS{1:image,2:text,3:instruction}.

Hint

Three modalities-ish: what you see, what you read, what you're asked.
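For reference, here is a minimal sketch of how the three streams in the challenge statement might be combined into one input sequence. It is a toy illustration, not LLaVA's actual code: `patchify`, `tokenize`, `apply_template`, the `<image>` placeholder, and the `USER:`/`ASSISTANT:` wording are all hypothetical stand-ins. A real model would use a CLIP vision encoder with a learned projector and a subword tokenizer.

```python
# Toy sketch of assembling a LLaVA-style input sequence.
# All names below are hypothetical stand-ins, not a real library API.

PLACEHOLDER = "<image>"  # assumed marker for where visual features are spliced in


def patchify(pixels, patch):
    """Split an H x W grid (list of rows) into flattened patch x patch tiles."""
    h, w = len(pixels), len(pixels[0])
    tiles = []
    for top in range(0, h - patch + 1, patch):
        for left in range(0, w - patch + 1, patch):
            tile = [pixels[top + r][left + c]
                    for r in range(patch) for c in range(patch)]
            tiles.append(tile)
    return tiles


def tokenize(s):
    """Whitespace tokenizer standing in for a real subword tokenizer."""
    return s.split()


def apply_template(user_message):
    """Hypothetical chat template wrapping the user turn."""
    return f"USER: {PLACEHOLDER} {user_message} ASSISTANT:"


def build_sequence(pixels, user_message, patch=2):
    """Replace the placeholder with one slot per patch; keep other tokens."""
    patches = patchify(pixels, patch)
    sequence = []
    for tok in tokenize(apply_template(user_message)):
        if tok == PLACEHOLDER:
            sequence.extend(("patch", i) for i in range(len(patches)))
        else:
            sequence.append(("token", tok))
    return sequence, patches


# A 4 x 4 grid of dummy pixel values, cut into 2 x 2 patches.
pixels = [[r * 4 + c for c in range(4)] for r in range(4)]
sequence, patches = build_sequence(pixels, "What is shown?")
```

In this sketch the template decides where the patch slots sit relative to the tokens, which is why it is listed alongside the other two streams even though it carries no content of its own.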

Archive — no submissions accepted

This challenge is preserved for reference. Play live challenges at /challenges.