Meta Interview Question

How to design a multimodal LLM? Coding: DFS on tree.