What if music similarity wasn't a single number?
Most retrieval and recommendation systems collapse melody, rhythm, and timbre into one monolithic score, making it hard to understand why two pieces are considered similar, let alone perform more nuanced searches.
In our latest work from the AMAAI Lab at SUTD, we introduce MERIT (Music rEpresentations for dIsentangled similaRiTy) -- a framework that learns separate, interpretable representations for each of these three dimensions, built on top of a frozen foundation model.
Rather than training a new encoder from scratch, MERIT adds three lightweight projection heads to an existing audio backbone, turning a general-purpose foundation model into a structured similarity engine with distinct, queryable dimensions. This opens the door to applications beyond retrieval -- including attribute-based music attribution and rights analysis.
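To make the idea concrete, here is a minimal sketch of what "three projection heads on a frozen backbone" could look like at query time. It is an illustration only, not the MERIT implementation: the backbone output is replaced by random vectors, the heads are assumed to be plain linear maps with random (untrained) weights, and every name and dimension in it (`make_head`, `factor_similarity`, `BACKBONE_DIM`, `HEAD_DIM`) is hypothetical.

```python
import math
import random

FACTORS = ("melody", "rhythm", "timbre")

def make_head(in_dim, out_dim, rng):
    """Build an out_dim x in_dim weight matrix.

    Assumption: random weights stand in for a trained projection head.
    """
    scale = 1.0 / math.sqrt(in_dim)
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]

def project(head, embedding):
    """Apply one linear head, then L2-normalise the result."""
    out = [sum(w * x for w, x in zip(row, embedding)) for row in head]
    norm = math.sqrt(sum(v * v for v in out))
    return [v / norm for v in out]

def factor_embeddings(heads, backbone_embedding):
    """Map one frozen-backbone embedding to one vector per factor."""
    return {name: project(head, backbone_embedding) for name, head in heads.items()}

def factor_similarity(a, b):
    """Cosine similarity per factor (inputs are already unit length)."""
    return {name: sum(x * y for x, y in zip(a[name], b[name])) for name in FACTORS}

rng = random.Random(0)
BACKBONE_DIM, HEAD_DIM = 32, 8  # hypothetical sizes, chosen only for the demo

# One head per factor; the backbone itself is never updated.
heads = {name: make_head(BACKBONE_DIM, HEAD_DIM, rng) for name in FACTORS}

# Assumption: random vectors stand in for the frozen backbone's output on two tracks.
track_a = [rng.gauss(0.0, 1.0) for _ in range(BACKBONE_DIM)]
track_b = [rng.gauss(0.0, 1.0) for _ in range(BACKBONE_DIM)]

emb_a = factor_embeddings(heads, track_a)
emb_b = factor_embeddings(heads, track_b)
scores = factor_similarity(emb_a, emb_b)
print(scores)  # three separate scores instead of one number
```

The point of the sketch is the shape of the output: a query returns one score per dimension, so a search can rank by melody alone, or by timbre alone, instead of a single blended value.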
Some highlights:
Factor-specific similarity representations rather than a single entangled embedding
Controlled training examples built using conditional audio generation and source-separated stems
Strong disentanglement: each head responds to its intended dimension while staying near chance on the others
Holds up on real-world audio, not just the synthetic training domain
Authors: Abhinaba Roy, Junyi Liang, Dorien Herremans
As disentangled representations like these mature, could they reshape how we think about music attribution and similarity at scale?
