Multimodal & Vision

Three Tasks Of A VLM

Archive

Medium

150 pts, 0 solves

A capable vision-language model is usually benchmarked on three task types: image _____(1), visual question _____(2), and referring expression _____(3). Fill in the three blanks. Flag format: CONGRESS{1:[word],2:[acronym],3:[word]}. Example: CONGRESS{1:captioning,2:vqa,3:grounding}.
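The flag format above can be assembled and sanity-checked with a small script. This is only a sketch: the helper names `build_flag` and `is_well_formed` are hypothetical, and the rule that each slot holds lowercase letters only is an assumption inferred from the example flag, not something the challenge states. The values used are the ones from the example.

```python
import re

# Assumed shape: three numbered slots, each a run of lowercase letters.
# The lowercase-only rule is inferred from the example flag, not stated by the challenge.
FLAG_PATTERN = re.compile(r"CONGRESS\{1:([a-z]+),2:([a-z]+),3:([a-z]+)\}")


def build_flag(first: str, second: str, third: str) -> str:
    """Assemble a flag string from the three blanks (hypothetical helper)."""
    a, b, c = (part.strip().lower() for part in (first, second, third))
    return f"CONGRESS{{1:{a},2:{b},3:{c}}}"


def is_well_formed(flag: str) -> bool:
    """Check only the shape of the flag, not whether the answers are correct."""
    return FLAG_PATTERN.fullmatch(flag) is not None


if __name__ == "__main__":
    flag = build_flag("captioning", "vqa", "grounding")
    print(flag, is_well_formed(flag))
```

A shape check like this catches a missing slot number or stray whitespace before submission; it says nothing about whether the three words are the expected answers.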

Hint

What the model writes, what it answers, where it points.

Archive — no submissions accepted

This challenge is preserved for reference. Play live challenges at /challenges.