Emotional Databases
Recent advances in human-computer interaction (HCI) technology go beyond the successful transfer of data between human and machine by seeking to improve the naturalness and friendliness of user interactions. The user's expressed emotion plays an important role in this regard. Speech is the primary means of communication between human beings, but the explicit verbal content of what is said does not by itself carry all the information conveyed. Additional information includes vocalized emotion, facial expressions, hand gestures and body language [1], as well as biometric indicators. In the design of an emotion recognition system, one important factor on which its performance depends is the emotional database used to build its representation of human emotions.
Emotional behavior databases have been recorded for the investigation of emotion; some are natural, while others are acted or elicited. These databases have been recorded in audio, visual or audio-visual modalities. For the analysis of vocal expressions of emotion, audio databases such as AIBO, the Berlin Database of Emotional Speech, and the Danish Emotional Speech Database have been recorded [2, 3, 4]. The AIBO database [2] is a natural database consisting of recordings of children interacting with a robot. It contains 110 dialogues and 29,200 words in 11 emotion categories: anger, bored, emphatic, helpless, ironic, joyful, motherese, reprimanding, rest, surprise and touchy. The data labeling is based on listeners' judgment. The Berlin Database of Emotional Speech [3] is a German acted database consisting of recordings from 10 actors (5 male, 5 female). The data comprise 10 German sentences recorded in anger, boredom, disgust, fear, happiness, sadness and neutral; the final database contains 493 utterances after listeners' judgment. The Danish Emotional Speech Database [4] is another audio database, recorded from 4 actors (2 male, 2 female). The recorded data consist of 2 words, 9 sentences and 2 passages, resulting in 10 minutes of audio data. The recorded emotions are anger, happiness, sadness, surprise and neutral.
Several facial expression databases have been recorded for the analysis of facial emotional behavior. The Cohn-Kanade facial expression database [5] is a popular acted database of facial expressions, with recordings from 210 adults in the 6 basic emotions [6] and Action Units (AUs). The data are labeled using the Facial Action Coding System (FACS). The MMI database [7] is a very comprehensive data set of facial behavior, containing both acted and spontaneous expressions. The recorded data comprise both static images and videos, with a large part of the data captured in both frontal and profile views of the face. For the natural data, children interacted with a comedian, while adults responded to emotive videos. The database consists of 1250 videos and 600 static images covering the 6 basic emotions and single and multiple AUs. The data labeling is done by FACS and observers' judgment. The UT Dallas database [8] is a natural visual database recorded by asking subjects to watch emotion-inducing videos. It consists of data from 229 adults in the 6 basic emotions, along with puzzlement, laugh, boredom and disbelief. The data labeling is based on observers' judgment.
In recent years, audio-visual databases have been recorded to investigate the importance of each modality and of different fusion techniques for improving emotion recognition performance. The Adult Attachment Interview (AAI) database [9] is a natural audio-visual database consisting of subjects' interviews about their childhood experiences. The data comprise recordings from 60 adults, with each interview lasting 30-60 minutes. The database covers the 6 basic emotions along with embarrassment, contempt, shame, and general positive and negative emotion. The data labeling uses FACS. The Belfast database [10] is another natural audio-visual database, consisting of clips taken from television and realistic interviews conducted by a research team. It holds data from 125 subjects, comprising 209 sequences from TV and 30 from interviews. The data are labeled with both categorical and dimensional emotion using the Feeltrace system. The Facial Motion Capture database [11] consists of recordings from an actress who was asked to read a phoneme-balanced corpus four times, expressing anger, happiness, sadness and neutral. The actress' facial expression and rigid head motion were acquired by a VICON motion capture system, which captured the 3D positions of 102 markers on her face. The total data consist of 612 sentences.
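As a purely illustrative aside, the sketch below shows one way such motion-capture output might be organized in memory: a per-recording array of frame-by-frame 3D marker positions. The marker count follows [11]; the frame count, array layout and use of NumPy are assumptions for illustration and do not describe the original capture format.

```python
import numpy as np

# Hypothetical layout for one motion-capture recording: each frame stores
# the 3D position (x, y, z) of every facial marker.
N_MARKERS = 102   # marker count reported for the database in [11]
N_FRAMES = 300    # arbitrary example length (assumed)

# Shape: (frames, markers, coordinates); random placeholder values.
recording = np.random.rand(N_FRAMES, N_MARKERS, 3)

# Flattening each frame gives a fixed-length vector that a frame-level
# classifier could consume as a visual feature.
frame_features = recording.reshape(N_FRAMES, N_MARKERS * 3)
print(frame_features.shape)  # (300, 306)
```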
We wanted to design an audio-visual British English database suitable for the development of a multimodal emotion recognition system. We adopted a more controlled approach than [9] and [10], including phonetically balanced sentences and 60 facial markers, to obtain phone-level annotations and the coordinates of points on the actors' faces. In comparison with [11], we aimed to increase the number of actors and affect classes to cover all the basic emotions with an even distribution. For benchmarking purposes, we performed a subjective evaluation of our database under audio-only, video-only and audio-visual conditions, and report classification results for both speaker-dependent and speaker-independent baseline systems.
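To make the distinction between the two baseline settings concrete, the sketch below shows how speaker-dependent and speaker-independent cross-validation splits can be constructed from utterance-level data. The array sizes, label set and use of scikit-learn are assumptions for illustration; they do not describe our actual baseline systems.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, StratifiedKFold

# Hypothetical utterance-level data: a feature vector, an emotion label and
# a speaker identifier per utterance (all values are placeholders).
rng = np.random.default_rng(0)
n_utterances, n_features = 480, 30               # assumed sizes
X = rng.normal(size=(n_utterances, n_features))
y = rng.integers(0, 7, size=n_utterances)        # e.g. 6 basic emotions + neutral
speakers = rng.integers(0, 4, size=n_utterances) # 4 hypothetical actors

# Speaker-dependent evaluation: every speaker appears in both the training
# and test folds (folds stratified by emotion label).
sd_splits = StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y)

# Speaker-independent evaluation: each fold holds out all utterances of one
# speaker, so the test speaker is never seen during training.
si_splits = LeaveOneGroupOut().split(X, y, groups=speakers)
print(sum(1 for _ in si_splits))  # 4 folds, one per held-out speaker
```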