Multimodal & Vision

The Tiny Tokenizer Of Images

Archive

Medium

150 pts · 0 solves

BLIP-2 uses a small transformer with learnable query tokens that cross-attend to the frozen vision encoder, producing a compact token stream for the LLM. What is this small module called?
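The mechanism described in the question can be sketched in a few lines. This is a minimal illustration, not the BLIP-2 implementation: it assumes a single attention head, omits the learned projection matrices, self-attention, and feed-forward layers, and uses toy sizes. All names (`cross_attend`, `NUM_QUERIES`, `NUM_PATCHES`, `DIM`) are hypothetical. The point it shows is that the number of output tokens is set by the number of learnable queries, not by the number of image patches.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, image_features):
    """Single-head cross-attention with no learned projections (a simplification).

    Each query attends over all image features and returns a weighted
    average of them, so the output has one vector per query.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        # Scaled dot-product score between this query and every patch feature.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, feat))
                  for feat in image_features]
        weights = softmax(scores)
        # Weighted sum of patch features, one output dimension at a time.
        out = [sum(w * feat[j] for w, feat in zip(weights, image_features))
               for j in range(dim)]
        outputs.append(out)
    return outputs


random.seed(0)
DIM = 8           # toy feature width (assumption)
NUM_QUERIES = 4   # toy number of learnable query tokens (assumption)
NUM_PATCHES = 50  # toy number of frozen-encoder patch features (assumption)

# Stand-ins for the trained queries and the frozen vision encoder's output.
learnable_queries = [[random.gauss(0, 1) for _ in range(DIM)]
                     for _ in range(NUM_QUERIES)]
patch_features = [[random.gauss(0, 1) for _ in range(DIM)]
                  for _ in range(NUM_PATCHES)]

tokens_for_llm = cross_attend(learnable_queries, patch_features)
print(len(tokens_for_llm), len(tokens_for_llm[0]))  # prints "4 8"
```

Changing `NUM_PATCHES` leaves the output length unchanged, which is why the token stream handed to the LLM stays compact.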

Hint

A letter + 'former'.

Archive — no submissions accepted

This challenge is preserved for reference. Play live challenges at /challenges.