Subjective evaluation
The quality of the recorded data was checked by performing a subjective evaluation of the database. These tests provide a benchmark for evaluating emotion recognition systems on this database. Each actor's data were evaluated by 10 subjects, of whom 5 were native English speakers and the rest had lived in the UK for more than a year. All of them were students at the University of Surrey. It has been suggested that females experience emotion more intensely than males [16], so to avoid gender bias half of the evaluators were female. The subjects were aged 21 to 29 years, with an average of 25 years for males, 23 years for females, and 24 years overall. The subjective evaluation was performed at the utterance level in three ways: audio, visual, and audio-visual. The 120 clips from each actor were divided into 10 groups, resulting in 12 clips per group. To remove systematic bias from the evaluators' responses, the group order was randomized: for each evaluator, a different set was created for each of the audio, visual and audio-visual data per actor using a Balanced Latin Square [17], resulting in 10 different sets for each modality per actor. The subjects were trained using slides containing three facial expression pictures, two audio files, and a short movie clip for each emotion. They were not given any additional speaker-dependent training, although some of the actors were known to some of them. The subjects were asked to play the audio, visual and audio-visual clips and select one of the seven emotions on a paper sheet. The responses were averaged over the 10 subjects for each actor.
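Since the counterbalancing scheme matters for reproducing the evaluation protocol, the following is a minimal sketch of the standard construction of a balanced Latin square of even order, which for n = 10 yields one presentation order per evaluator. The paper's exact construction is not specified, so this is an assumption based on the common textbook design.

```python
# Minimal sketch (assumption: standard even-order construction) of a
# balanced Latin square: each group appears once per row and column, and
# each group immediately precedes every other group equally often.
def balanced_latin_square(n):
    assert n % 2 == 0, "a single balanced square requires even n"
    square = []
    for r in range(n):                       # one presentation order per evaluator
        row = []
        for c in range(n):
            if c % 2 == 0:
                row.append((r + c // 2) % n)
            else:
                row.append((r + n - (c + 1) // 2) % n)
        square.append(row)
    return square

orders = balanced_latin_square(10)           # 10 evaluators x 10 clip-group positions
```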
Classification accuracies for the audio, visual and audio-visual data for the 7 emotion classes, over the 4 actors and 10 evaluators, are given in Table 1. The results show higher classification accuracy for the visual data than for the audio data, and the overall performance improved further when the two modalities were combined. For the audio data, disgust was highly confused with neutral, fear with sadness and surprise, and happiness and surprise with each other. For the visual data, fear was highly confused with surprise, and for the audio-visual data, fear was confused with sadness. The results show greater clarity of the visual data compared to the audio, indicating the importance of facial expression. Overall, the expressed emotions for 441 of the 480 sentences were correctly classified by at least 8 of the 10 subjects under the audio-visual condition, indicating good agreement with the actors' intended affect across the database.
Table 1. Subjective classification accuracy (%) for the 7 emotion classes, per actor (KL, JE, JK, DC) and as the mean over actors (± confidence interval).

Modality | KL | JE | JK | DC | Mean (± CI)
---|---|---|---|---|---
Audio | 53.2 | 67.7 | 71.2 | 73.7 | 66.5 ± 2.5
Visual | 89.0 | 89.8 | 88.6 | 84.7 | 88.0 ± 0.6
Audio-visual | 92.1 | 92.1 | 91.3 | 91.7 | 91.8 ± 0.1
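To make the agreement figure above concrete, the short sketch below counts, for a hypothetical response matrix, the sentences on which at least 8 of the 10 evaluators chose the actor's intended emotion; the array names and shapes are illustrative assumptions, not part of the published protocol.

```python
import numpy as np

# Hypothetical agreement check: `responses` holds each evaluator's label per
# sentence (shape: 480 sentences x 10 evaluators), `intended` the actor's
# intended emotion per sentence (shape: 480).
def high_agreement_count(responses, intended, threshold=8):
    correct_votes = (responses == intended[:, None]).sum(axis=1)  # correct labels per sentence
    return int((correct_votes >= threshold).sum())                # 441 of 480 in the study
```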
Baseline system
Speaker-dependent classification
Speaker-dependent emotion classification was performed using 106 audio features related to pitch, energy, MFCCs and duration, and 240 visual features related to marker positions [20]. Features were selected with the Plus l-Take Away r algorithm using the Bhattacharyya distance criterion to remove irrelevant data. The top 40 features chosen for each modality were then reduced by Principal Component Analysis (PCA) or Linear Discriminant Analysis (LDA). Classification of the audio, visual and combined audio-visual data was performed with a Gaussian classifier. In the audio-visual experiments, the posterior probabilities from the audio and visual modalities were multiplied with equal weighting to obtain the final result. Each actor's data were divided into 4 sets in a jack-knife procedure, where in each round three sets were used for training and one for testing, and the results were averaged. The best classification accuracies for the 7 emotion classes were achieved with 6 LDA (or 10 PCA) features. The mean classification accuracy for audio was 56% with LDA (50% with PCA), and for visual it was 95% with LDA (92% with PCA). Fusion of the two modalities improved the result to 98% with LDA (93% with PCA). In summary, the visual features performed better than the audio features, and the combined modalities worked best.
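As an illustration of this pipeline, the sketch below uses scikit-learn, with QuadraticDiscriminantAnalysis standing in for the per-class Gaussian classifier and a 4-fold jack-knife split; the feature matrices, the fold assignment and the omission of the Plus l-Take Away r selection step are simplifying assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline

def jackknife_av_accuracy(X_audio, X_visual, y, n_components=6, n_splits=4):
    """LDA reduction + Gaussian (QDA) classifier per modality,
    fused by multiplying the class posteriors with equal weight."""
    accs = []
    for tr, te in StratifiedKFold(n_splits, shuffle=True, random_state=0).split(X_audio, y):
        models = []
        for X in (X_audio, X_visual):
            clf = make_pipeline(LinearDiscriminantAnalysis(n_components=n_components),
                                QuadraticDiscriminantAnalysis())
            models.append(clf.fit(X[tr], y[tr]))
        # Equal-weight product rule over the two modalities' posteriors.
        posterior = (models[0].predict_proba(X_audio[te])
                     * models[1].predict_proba(X_visual[te]))
        pred = models[0].classes_[np.argmax(posterior, axis=1)]
        accs.append(np.mean(pred == y[te]))
    return float(np.mean(accs))
```

The product rule treats the two modalities as conditionally independent given the emotion class, so equal weighting reduces to an element-wise multiplication of the posterior matrices before taking the arg-max.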
Speaker-independent classification
Speaker-independent experiments were performed on all extracted audio and visual features using a leave-one-speaker-out procedure, and the results were averaged over the 4 tests. Feature selection, feature reduction and Gaussian classification were performed as in the speaker-dependent case. In addition, a Support Vector Machine (SVM) classifier with RBF and polynomial kernels was tested (using features after selection, since feature reduction did not improve its performance). For the audio data, speaker normalization by mean and standard deviation increased the classification accuracy [21]. For the visual data, translating, rotating and mapping the markers onto a reference speaker [19] improved the result. The SVM outperformed the Gaussian classifier for both the audio and visual modalities. The best classification accuracy for the 7 emotions was achieved with the SVM and polynomial kernel: 61% with 126 audio features, 65% with 85 visual features, and 84% with the audio and visual modalities combined at the decision level. The results show that both modalities contributed to the emotion classification and that a substantial gain was achieved when they were combined.
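A comparable sketch for the speaker-independent setup, again with scikit-learn, combines leave-one-speaker-out folds, per-speaker z-score normalisation of the audio features and a polynomial-kernel SVM per modality fused at the decision level. The marker mapping onto a reference speaker is not shown, and the array names, kernel degree and group vector are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import SVC

def zscore_per_speaker(X, speaker):
    """Normalise each speaker's features by that speaker's own mean and std."""
    Xn = np.empty_like(X, dtype=float)
    for s in np.unique(speaker):
        m = speaker == s
        Xn[m] = (X[m] - X[m].mean(axis=0)) / (X[m].std(axis=0) + 1e-8)
    return Xn

def loso_av_accuracy(X_audio, X_visual, y, speaker):
    Xa = zscore_per_speaker(X_audio, speaker)        # speaker normalisation (audio only)
    classes = np.unique(y)
    accs = []
    for tr, te in LeaveOneGroupOut().split(Xa, y, groups=speaker):
        posterior = np.ones((len(te), len(classes)))
        for X in (Xa, X_visual):
            svm = SVC(kernel='poly', degree=3, probability=True).fit(X[tr], y[tr])
            posterior *= svm.predict_proba(X[te])    # equal-weight decision-level fusion
        pred = classes[np.argmax(posterior, axis=1)]
        accs.append(np.mean(pred == y[te]))
    return float(np.mean(accs))                      # mean over the held-out speakers
```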