Causal music video reasoning...

When a rock band's intensity peaks, the camera cuts to close-ups. When the scene fades to slow motion, the music softens. We all sense this relationship, but can AI models actually reason about it causally?

We introduce KARMA-MV, a benchmark for causal question answering on music videos.

The core idea: instead of asking "what is happening in this video?", we ask "why did the music change when the visuals changed?" The questions span three types: evidence reasoning, prediction, and counterfactuals.

What we built:

- 37,737 multiple-choice questions from 2,682 YouTube music videos
- An automated pipeline using LLMs for scalable dataset generation (no manual annotation)
- A Causal Knowledge Graph (CKG) that encodes how visual shifts drive musical changes
- Experiments on Qwen-2.5-Omni-7B, MiniCPM-o-4.5, and Gemma-4-31B Instruct

Key finding: smaller models benefit the most from CKG grounding. Qwen-2.5-Omni-7B gained 7.37 percentage points overall, with a 12-percentage-point jump on counterfactual questions specifically. Larger models gain less from external graph injection, likely because they already internalize more causal world knowledge.

This suggests a practical path forward: a smaller model paired with a lightweight knowledge graph can approach the performance of a much larger model, at a fraction of the cost.
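
For anyone wondering what "CKG grounding" could look like in practice, here is a minimal Python sketch. It is an illustration under assumptions, not the paper's pipeline: the edge triples, the relation name, the function names (build_graph, retrieve_facts, build_prompt), and the prompt layout are all hypothetical. The idea shown is simply to look up graph edges that match the visual events seen in a clip and prepend them to the multiple-choice question before it goes to the model.

```python
from collections import defaultdict

# Hypothetical causal edges: (visual event, relation, musical effect).
# Illustrative only; these triples are not taken from KARMA-MV.
EDGES = [
    ("cut to close-up", "tends_to_accompany", "rise in intensity"),
    ("fade to slow motion", "tends_to_accompany", "softer dynamics"),
    ("rapid scene cuts", "tends_to_accompany", "faster tempo"),
]


def build_graph(edges):
    """Index edges by their visual-event source node."""
    graph = defaultdict(list)
    for source, relation, target in edges:
        graph[source].append((relation, target))
    return graph


def retrieve_facts(graph, observed_events):
    """Return one plain-text fact per edge whose source event was observed."""
    facts = []
    for event in observed_events:
        for relation, target in graph.get(event, []):
            facts.append(f"{event} {relation.replace('_', ' ')} {target}")
    return facts


def build_prompt(question, options, facts):
    """Prepend the retrieved graph facts to a multiple-choice question."""
    lines = ["Known causal relations:"]
    lines += [f"- {fact}" for fact in facts]
    lines.append(f"Question: {question}")
    lines += [f"{label}. {text}" for label, text in zip("ABCD", options)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


if __name__ == "__main__":
    graph = build_graph(EDGES)
    facts = retrieve_facts(graph, ["fade to slow motion"])
    print(build_prompt(
        "Why did the music soften at this point?",
        ["The tempo doubled", "The scene faded to slow motion",
         "A new singer appeared", "The video ended"],
        facts,
    ))
```

The resulting prompt string can be passed to any of the models above alongside the video; how the events are detected and how the graph is actually built are left open here.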

Congrats to authors Archishman Ghosh and Abhinaba Roy, Ph.D., for their hard work on this.

For those working on multimodal reasoning: do you think explicit causal structure (like knowledge graphs) has a long-term role as models scale, or will it be absorbed into model weights eventually?