Archive
Multimodal & Vision

Three Tasks Of A VLM

Medium
150 pts, 0 solves
A capable vision-language model is usually benchmarked on three task types: image _____(1), visual question _____(2), and referring expression _____(3). Fill the 3 blanks. Flag format: CONGRESS{1:[word],2:[acronym],3:[word]}. Example: CONGRESS{1:captioning,2:vqa,3:grounding}.
Hint: What the model writes, what it answers, where it points.
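The flag format above can be checked mechanically before submitting. The following is a minimal sketch, not the platform's actual validator: the name `parse_flag` and the regular expression are hypothetical, and it assumes each blank is lowercase letters only with no spaces, as in the example flag.

```python
import re

# Assumed pattern, read off the stated flag format
# CONGRESS{1:[word],2:[acronym],3:[word]}; lowercase-only is an assumption.
FLAG_RE = re.compile(r"CONGRESS\{1:([a-z]+),2:([a-z]+),3:([a-z]+)\}")


def parse_flag(flag: str):
    """Return the three blank values as a tuple, or None if the shape is wrong."""
    match = FLAG_RE.fullmatch(flag.strip())
    return match.groups() if match else None


# The example flag from the challenge text parses into its three parts.
print(parse_flag("CONGRESS{1:captioning,2:vqa,3:grounding}"))
# → ('captioning', 'vqa', 'grounding')
```

A flag that drops the `1:`, `2:`, `3:` labels or the braces returns `None`, which catches the most likely formatting slips.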

Archive — no submissions accepted

This challenge is preserved for reference. Play live challenges at /challenges.