AttendAffectNet – Emotion Prediction of Movie Viewers Using Multimodal Fusion with Self-attention

Title: AttendAffectNet – Emotion Prediction of Movie Viewers Using Multimodal Fusion with Self-attention
Publication Type: Journal Article
Year of Publication: 2021
Authors: T. Phuong HThi, BT B, Roig G., Herremans D.
Journal: Sensors. Special issue on Intelligent Sensors: Sensor Based Multi-Modal Emotion Recognition

In this paper, we tackle the problem of predicting the affective responses of movie viewers based on the content of the movies. Current studies on this topic focus on video representation learning and fusion techniques to combine the extracted features for predicting affect. Yet these studies typically ignore both the correlation between multiple modality inputs and the correlation between temporal inputs (i.e., sequential features). To explore these correlations, we propose a neural network architecture, namely AttendAffectNet (AAN), which uses the self-attention mechanism to predict the emotions of movie viewers from different input modalities. In particular, visual, audio, and text features are taken into account for predicting emotions (expressed in terms of valence and arousal). We analyze three variants of our proposed AAN: Feature AAN, Temporal AAN, and Mixed AAN. The Feature AAN applies the self-attention mechanism in an innovative way to the features extracted from the different modalities (video, audio, and movie subtitles) of a whole movie excerpt, so as to capture the relationships between them. The Temporal AAN takes the time domain of the movies and the sequential dependency of affective responses into account: self-attention is applied to the concatenated (multimodal) feature vectors representing subsequent movie segments. The Mixed AAN combines the strong points of the Feature AAN and the Temporal AAN by applying self-attention first to the feature vectors obtained from the different modalities in each movie segment, and then to the feature representations of all subsequent (temporal) movie segments. We extensively train and validate our proposed AAN on both the MediaEval 2016 dataset for the Emotional Impact of Movies Task and the extended COGNIMUSE dataset.
Our experiments demonstrate that audio features play a more influential role than those extracted from video and movie subtitles when predicting emotions of movie viewers on these datasets. The models that use all visual, audio, and text features simultaneously as their inputs perform better than those using features extracted from each modality separately. In addition, the Feature AAN outperforms other AAN variants on the above-mentioned datasets, highlighting the importance of taking different features as context to one another when fusing them. The Feature AAN also performs better than the baseline models when predicting the valence dimension.
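The difference between feature-level and temporal self-attention described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes plain scaled dot-product attention with identity query/key/value projections (no learned weights), mean pooling for the fused vector, and toy feature vectors. The function names `self_attention`, `feature_attention`, and `temporal_attention` are hypothetical and chosen only for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention over equal-length vectors.

    Assumption: identity Q/K/V projections, so no parameters are learned.
    Returns (attended_tokens, attention_weights).
    """
    d = len(tokens[0])
    outputs, weights = [], []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        w = softmax(scores)
        out = [sum(w[j] * tokens[j][i] for j in range(len(tokens))) for i in range(d)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


def mean_pool(vectors):
    """Average a list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def feature_attention(visual, audio, text):
    """Feature-level fusion: each modality vector is one attention token.

    Assumption: all three vectors already share the same dimension.
    """
    attended, weights = self_attention([visual, audio, text])
    return mean_pool(attended), weights


def temporal_attention(segments):
    """Temporal fusion: one token per movie segment.

    Each segment is a (visual, audio, text) tuple whose vectors are
    concatenated before attention is applied across segments.
    """
    tokens = [v + a + t for (v, a, t) in segments]
    return self_attention(tokens)


if __name__ == "__main__":
    # Toy feature vectors (made up for illustration only).
    fused, weights = feature_attention([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])
    print("fused excerpt vector:", fused)
    print("modality attention weights:", weights)
```

In the feature-level case the attention weights form a 3×3 matrix over modalities, whereas in the temporal case they form an N×N matrix over the N movie segments. A mixed variant in this style would call `feature_attention` per segment and then run `self_attention` over the resulting per-segment vectors.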