New MOE Tier 2 grant on music generation with emotion that matches video

Having done my PhD and postdoc on music generation (see the MorpheuS project), I am happy to announce that I am the PI of a new MOE Tier 2 grant, 'aiMuVi: AI Music generated from Videos', with SGD 648,216 in funding. My co-PIs on this project are Prof. Gemma Roig from SUTD and Prof. Eran Egozy from MIT (creator of Guitar Hero), with Prof. Kat Agres from A*STAR/NUS as collaborator.

Abstract of the proposal:

The idea of music generation is as old as computing itself: Ada Lovelace herself referred to machines that would one day create "elaborate and scientific pieces of music of any degree of complexity and extent" (1843). State-of-the-art music generation systems, however, have not yet found a place in our society, largely because they cannot address critical challenges such as long-term structure (e.g. repeated themes) and control over the emotional content of music. This proposal directly addresses these remaining challenges through research questions such as: how can we use AI to recognise objects in video? How can we measure perceived emotion in music and video? And how can we tie this together to generate music that matches a video?

Deep learning has recently transformed the field of image classification. In this project, we aim to bring about a similar step-change in both digital music generation and video processing. More specifically, we will develop a cutting-edge intelligent system that generates music using deep learning memory structures and hierarchical models; recognises features in video through an AI model; and combines these two components into a system that generates music to match a video. The final result will be implemented as a smartphone app, so that it can reach the widest possible audience and become an integral part of Singapore's smart society.
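As a very rough sketch of what such a pipeline could look like (purely hypothetical: the module names, sizes and objective below are my own assumptions, not the project's actual design), a video encoder could summarise frames into a conditioning vector that steers an event-level music generator:

```python
# Minimal PyTorch sketch of a video-conditioned music generator -- illustrative only.
import torch
import torch.nn as nn

class VideoEncoder(nn.Module):
    """Summarises a sequence of video frames into one conditioning vector."""
    def __init__(self, feat_dim=128):
        super().__init__()
        # Tiny per-frame CNN standing in for a full object/emotion recogniser.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        # Temporal pooling over frames with a GRU (captures temporal information).
        self.rnn = nn.GRU(feat_dim, feat_dim, batch_first=True)

    def forward(self, frames):                   # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        f = self.cnn(frames.flatten(0, 1)).view(b, t, -1)
        _, h = self.rnn(f)                       # h: (1, B, feat_dim)
        return h.squeeze(0)                      # video conditioning vector

class ConditionedMusicGenerator(nn.Module):
    """LSTM model over musical events, conditioned on the video vector."""
    def __init__(self, vocab=130, emb=64, hidden=256, cond=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb + cond, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab)

    def forward(self, events, cond):             # events: (B, L) token ids
        c = cond.unsqueeze(1).expand(-1, events.size(1), -1)
        x = torch.cat([self.embed(events), c], dim=-1)
        out, _ = self.lstm(x)
        return self.head(out)                    # next-event logits

# Toy usage with random data, just to show that the shapes fit together.
frames = torch.randn(2, 8, 3, 64, 64)            # 2 clips, 8 frames each
events = torch.randint(0, 130, (2, 32))          # 2 toy note-event sequences
logits = ConditionedMusicGenerator()(events, VideoEncoder()(frames))
print(logits.shape)                              # torch.Size([2, 32, 130])
```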

The main scientific goals of the project can be summarised as follows:

  1. Advancing state-of-the-art deep learning models by integrating long-term musical structure through memory networks (see the sketch after this list);
  2. Modelling perceived emotion/tension in music to enable the generation of music that matches film/game content;
  3. Extracting content from video with deep learning methods, by representing videos with global features that capture temporal information as well as climax, emotion and tension, all crucial for generating music that matches video content;
  4. Using AI methods to generate music that fits the detected video content;
  5. Evaluating the final system through user testing that assesses the qualitative perception of the results, and building a prototype app that integrates the music-from-video AI system.
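To make goal 1 a bit more concrete, here is a minimal, hypothetical illustration of how a generator could attend over a memory of earlier bars so that themes can recur later in a piece; everything here (class name, sizes, the bar-level summary) is an assumption for illustration, not the project's actual model:

```python
# Sketch of a bar-level "memory" mechanism for long-term structure -- illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BarMemoryDecoder(nn.Module):
    def __init__(self, vocab=130, emb=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.query = nn.Linear(hidden, hidden)    # builds the attention query
        self.head = nn.Linear(2 * hidden, vocab)  # mixes state + recalled memory

    def forward(self, bar_tokens, memory):
        # bar_tokens: (B, L) events of the current bar; memory: (B, M, hidden)
        out, (h, _) = self.lstm(self.embed(bar_tokens))
        q = self.query(out)                                   # (B, L, hidden)
        attn = F.softmax(q @ memory.transpose(1, 2), dim=-1)  # (B, L, M)
        recalled = attn @ memory                               # (B, L, hidden)
        logits = self.head(torch.cat([out, recalled], dim=-1))
        return logits, h.squeeze(0)     # h becomes this bar's memory entry

# Toy run: 4 bars of 16 events; the memory grows by one summary vector per bar.
dec, mem = BarMemoryDecoder(), torch.zeros(1, 1, 128)
for bar in torch.randint(0, 130, (4, 1, 16)):
    logits, summary = dec(bar, mem)
    mem = torch.cat([mem, summary.unsqueeze(1)], dim=1)
print(mem.shape)  # torch.Size([1, 5, 128])
```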

The investigators on this project have extensive experience in deep learning for both digital music generation and image processing. Preliminary models have already been built for music generation, capturing tension in music, and image object recognition. These demonstrate the potential of the proposed approach and testify to the investigators' extensive experience in this niche field. This work is situated in the area of digital audio entertainment, with global consumer expenditure of 96 billion USD in 2014, projected to increase. Recent industry efforts such as the Google Brain team's Magenta, a TensorFlow library for deep learning for music, further testify to the timeliness and importance of the topic. The proposed AI music generation system has direct applications in game music, interactive arts, personal entertainment videos (e.g. YouTube) and stock music for advertising videos.