Three Tasks Of A VLM
Archive, Medium
A capable vision-language model is usually benchmarked on three task types: image _____(1), visual question _____(2), and referring expression _____(3). Fill in the three blanks. Flag format: CONGRESS{1:[word],2:[acronym],3:[word]}. Example: CONGRESS{1:captioning,2:vqa,3:grounding}.
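The flag format above can be checked mechanically. The following is a minimal sketch, not the challenge's actual validator: the helper names `build_flag` and `parse_flag` are hypothetical, and the assumption that each slot holds only lowercase letters is inferred from the example flag rather than stated by the challenge.

```python
import re

# Assumption: each slot contains lowercase ASCII letters only,
# as in the example flag. The real checker may be stricter or looser.
FLAG_PATTERN = re.compile(r"CONGRESS\{1:([a-z]+),2:([a-z]+),3:([a-z]+)\}")


def build_flag(first: str, second: str, third: str) -> str:
    """Assemble a flag string from three answers (hypothetical helper)."""
    return "CONGRESS{1:%s,2:%s,3:%s}" % (first, second, third)


def parse_flag(flag: str):
    """Return the three slot values as a tuple, or None if malformed."""
    match = FLAG_PATTERN.fullmatch(flag.strip())
    return match.groups() if match else None


if __name__ == "__main__":
    flag = build_flag("captioning", "vqa", "grounding")
    print(flag)
    print(parse_flag(flag))
```

Using `fullmatch` rather than `search` rejects strings with extra text around the flag, which is usually what a submission form expects.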
Hint: What the model writes, what it answers, where it points.
Archive — no submissions accepted
This challenge is preserved for reference. Play live challenges at /challenges.