Grant from MIT-SUTD IDC on "An intelligent system for understanding and matching perceived emotion from video with music"

A few months ago, Prof. Gemma Roig (PI, SUTD), Prof. Dorien Herremans (co-PI, SUTD), Dr. Kat Agres (co-PI, A*STAR) and Dr. Eran Egozy (co-PI, MIT, creator of Guitar Hero) were awarded a new grant from the International Design Center (the joint research institute of MIT and SUTD) for 'An intelligent system for understanding and matching perceived emotion from video with music'. This is an exciting opportunity and marks the birth of our new Affective Computing Lab at SUTD, which links the computer vision lab and the AMAAI lab. Together with a newly awarded infrastructure grant from IDC, which provides the lab with a dedicated DGX server, we have been building an affective computing lab for audio and video over the last few months. I'm happy to report that our weekly meetings are attended by 14 people: postdocs, PhD students, and even some dedicated undergraduate students.

Abstract and project goals
In this project we aim to use AI for video and music processing to match music recommendations, in terms of perceived emotional content, to detected video content. We focus on perceived emotions, since measuring actual emotional responses physiologically or in the brain is beyond the scope of this project and would require specialised techniques such as electroencephalography (EEG). While computer vision has advanced greatly in recent years thanks to progress in convolutional neural networks, these networks have mostly been applied to still images, for tasks such as object recognition; processing video remains challenging due to the more complex structure of the data, with its added time dimension and changes of scene. We aim to tackle these challenges by proposing novel neural network architectures for video feature processing and for perceived emotion detection in audio and video signals. Matching perceived emotions across audio and video has not been explored in a multi-modal system before. Specifically, we will study how to match the perceived emotions of video and audio, as well as the effect on human viewers of pairing a video with audio whose emotion matches or mismatches it. In this project, we will focus on five aims:

  1. Build a system that predicts the perceived emotions that a video evokes in human viewers, based on the video's content and features.
  2. Develop emotion recognition models for digital music and audio, based on preliminary work by the co-PI (Herremans and Chew, 2017).
  3. Combine the knowledge from 1) and 2) to train emotion recognition models based on video with music. These models will be able to detect emotional content in video while taking the music into account.
  4. Develop a novel affective computing system that can recommend music that matches the emotional content of videos.
  5. Test in humans the effect on and change in perceived emotion when A) viewing the video without music, B) viewing the video with music that matches the video's perceived emotion (congruent condition), and C) viewing the video with music that does not match the perceived emotion evoked by the video alone (incongruent condition).

This project will have a broad impact in both academia and industry. In 2015, revenue from digital music channels accounted for more than 6.5 billion USD, or 45% of the global music industry (McKinsey, 2016). Asia accounts for 14% of these revenues, with the number of digital music users expected to grow 15% a year until 2020. The proposed research targets this fast-growing market by developing leading-edge artificial intelligence technologies that facilitate emotion detection in both video and audio, resulting in smart affective recommendation systems. There is enormous potential to use digital music platforms to acquire emotion data from audio and music, and then to build predictive models of emotion from audio that can automatically match the emotion of videos, or even manipulate the emotion perceived when video and audio are combined.

The affective computing application developed in this research will be useful to a large group of people, including filmmakers looking for appropriate music, YouTube creators who need music to match their videos, and advertisers wanting the right music for a video showcasing a product, so that it evokes the desired perceived emotion in potential consumers. In a technology-oriented country such as Singapore, which is moving toward becoming a smart nation by incorporating intelligent systems for automated prediction, public assistance, and personalised content recommendation, such a framework could be integrated into the local advertising industry, strengthen the digital music industry, and support local producers creating content for audio-visual platforms.

Furthermore, the machine learning technologies developed will benefit researchers in both computer vision and audio. Similar deep learning models will be used for emotion detection in video and in audio. The main difference between the two is the input: for video it is a sequence of image frames, which can be interpreted as a 2-dimensional input plus a time dimension, while for audio and music it is a 1-dimensional signal plus the time component. Since an audio signal can also be interpreted and represented as an image, by taking frequency and time into consideration, as presented in previous work by Prof. Dorien Herremans (co-PI), this transformation may allow us to treat audio and video signals in an equivalent manner. In both cases, taking the implicit structure of the signal into account is essential: video and audio both have structure and repetitions over time that must be captured to obtain a good representation of the input signal from which emotions can be detected. Analysing these equivalences and similarities, and investigating how to transfer the representation and structure of a model for one modality (for instance video) to the other (audio, in the framework of this project), or vice versa, will also lead to new architectures and input representations. The automatic framework to detect and predict emotions in music and videos developed in this project will open a new research line that brings together the scientific communities of computer vision and computational music.
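As a small illustration of the idea above, the sketch below turns a 1-dimensional audio signal into a 2-dimensional time-frequency "image" via a short-time Fourier transform. The function name, window length, and hop size are illustrative choices for this sketch, not the project's actual pipeline:

```python
import numpy as np

def spectrogram(signal, win_len=1024, hop=256):
    """Magnitude short-time Fourier transform: converts a 1-D audio
    signal into a 2-D time-frequency representation (an 'image')."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(signal) - win_len) // hop
    frames = np.stack([signal[i * hop : i * hop + win_len] * window
                       for i in range(n_frames)])
    # FFT of each windowed frame -> rows are time steps, columns are frequency bins
    return np.abs(np.fft.rfft(frames, axis=1))

sr = 16000
t = np.arange(sr) / sr                    # 1 second of audio
signal = np.sin(2 * np.pi * 440 * t)      # a pure 440 Hz tone
S = spectrogram(signal)
print(S.shape)                            # (59, 513): 59 time frames x 513 frequency bins
```

Once audio is represented this way, convolutional architectures of the kind used on video frames can, in principle, be applied to the spectrogram as well.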
While we are testing new models and architectures for concrete applications in modelling temporal sequences and video feature extraction for emotion prediction, these will equally have an impact beyond those fields, as they can be ported to other applications in which capturing temporal structure is key, including risk assessment, investment, and planning. In all these applications, the temporal structure of the signal is key to a successful prediction, and knowing how to represent input signals so that models can capture that structure is essential.
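To illustrate why such temporal models port across modalities, here is a minimal (untrained, numpy-only) recurrent pass: the same update rule summarises an audio frame sequence and a video feature sequence into fixed-size embeddings that can be compared. All dimensions and variable names here are hypothetical, not the project's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_forward(frames, W_in, W_rec, b):
    """Minimal Elman-style recurrent pass: the same temporal update rule
    works for any frame sequence, regardless of modality."""
    h = np.zeros(W_rec.shape[0])
    for x in frames:                          # iterate over time steps
        h = np.tanh(W_in @ x + W_rec @ h + b)
    return h                                  # final hidden state summarises the sequence

hidden = 32

# "Audio": 59 spectrogram frames of 513 frequency bins each.
audio_frames = rng.normal(size=(59, 513))
W_in_audio = rng.normal(scale=0.01, size=(hidden, 513))

# "Video": 120 frames of 2048-dim per-frame CNN features.
video_frames = rng.normal(size=(120, 2048))
W_in_video = rng.normal(scale=0.01, size=(hidden, 2048))

# Only the input projection differs; the recurrent core is shared.
W_rec = rng.normal(scale=0.01, size=(hidden, hidden))
b = np.zeros(hidden)

h_audio = rnn_forward(audio_frames, W_in_audio, W_rec, b)
h_video = rnn_forward(video_frames, W_in_video, W_rec, b)
print(h_audio.shape, h_video.shape)           # both (32,): comparable embeddings
```

Because both modalities end up as same-size embeddings, downstream components (an emotion classifier, or a matcher between video and candidate music tracks) need not know which modality produced them.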