For many artificial intelligence (AI) researchers, the ultimate goal is a system that can identify human emotion from voice and facial expressions. While some facial-scanning technology already exists, reliably identifying emotional states remains a distant target, given the subtle nuances of both speech and facial muscle movement.
Researchers at the University of Science and Technology of China, in Hefei, believe they have made a breakthrough. Their paper, “Deep Fusion: An Attention Guided Factorized Bilinear Pooling for Audio-video Emotion Recognition,” describes an AI system that recognizes human emotion with state-of-the-art accuracy on a popular benchmark.
In their published paper, the researchers say, “Automatic emotion recognition (AER) is a challenging task due to the abstract concept and multiple expressions of emotion. Inspired by this cognitive process in human beings, it’s natural to simultaneously utilize audio and visual information in AER … The whole pipeline can be completed in a neural network.”
One main component of the university’s AI system is a set of audio-processing algorithms that, working from speech spectrograms, help the AI focus on the regions of the signal most relevant to emotion detection. A second component runs the video frame by frame through two computational stages: a face-detection algorithm, followed by a trio of cutting-edge facial recognition networks focused solely on emotionally relevant features.
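In concrete terms, that two-branch design might look something like the following PyTorch sketch. This is a minimal illustration under stated assumptions, not the authors’ implementation: the layer sizes, the GRU spectrogram encoder, the simple soft-attention layer, and the toy CNN standing in for the face-detection and facial recognition stages are all placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioAttentionBranch(nn.Module):
    """Encodes a speech spectrogram and soft-attends over its time steps,
    emphasizing the regions most relevant to emotion (an assumption about
    how the attention guidance might work)."""
    def __init__(self, n_mels=64, hidden=128):
        super().__init__()
        self.encoder = nn.GRU(n_mels, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)               # one relevance score per time step

    def forward(self, spec):                           # spec: (batch, time, n_mels)
        feats, _ = self.encoder(spec)                  # (batch, time, hidden)
        weights = F.softmax(self.attn(feats), dim=1)   # attention over time
        return (weights * feats).sum(dim=1)            # weighted summary: (batch, hidden)

class VideoBranch(nn.Module):
    """Toy stand-in for the face-detection stage plus the trio of facial
    recognition networks: a small CNN applied to pre-cropped face frames."""
    def __init__(self, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                   # (batch*frames, 32, 1, 1)
        )
        self.proj = nn.Linear(32, hidden)

    def forward(self, faces):                          # faces: (batch, frames, 3, H, W)
        b, f = faces.shape[:2]
        x = self.cnn(faces.flatten(0, 1)).flatten(1)   # per-frame features
        x = self.proj(x).view(b, f, -1)
        return x.mean(dim=1)                           # average over frames: (batch, hidden)
```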
Once weighted, these visual features are fused with the speech features, via the factorized bilinear pooling of the paper’s title, letting the system associate facial cues with speech patterns for a final emotion prediction.
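Factorized bilinear pooling projects each modality into a shared factor space and multiplies the projections element-wise, capturing cross-modal interactions far more cheaply than a full bilinear product. The sketch below follows the common multimodal factorized bilinear recipe; the dimensions, pooling factor k, and classifier head are assumptions, not the published configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactorizedBilinearFusion(nn.Module):
    """Fuses an audio vector and a video vector with factorized bilinear
    pooling, then classifies into seven emotions."""
    def __init__(self, audio_dim=128, video_dim=128, factor_dim=256, k=4, n_classes=7):
        super().__init__()
        self.factor_dim, self.k = factor_dim, k
        self.U = nn.Linear(audio_dim, factor_dim * k)  # audio projection
        self.V = nn.Linear(video_dim, factor_dim * k)  # video projection
        self.classifier = nn.Linear(factor_dim, n_classes)

    def forward(self, a, v):                           # a: (batch, audio_dim), v: (batch, video_dim)
        joint = self.U(a) * self.V(v)                  # element-wise cross-modal interaction
        joint = joint.view(-1, self.factor_dim, self.k).sum(dim=2)  # sum-pool over k factors
        joint = torch.sign(joint) * torch.sqrt(joint.abs() + 1e-8)  # signed square-root
        joint = F.normalize(joint)                     # L2 normalization
        return self.classifier(joint)                  # logits over the seven emotions
```

The appeal of this design is that the two learned projections approximate a full bilinear interaction between the modalities at a fraction of the parameter count.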
The researchers put their work to the test at the ACM International Conference on Multimodal Interaction. Fed 653 videos and accompanying audio clips, the system sorted samples among seven emotions (angry, disgust, fear, happy, neutral, sad, and surprise) with a 62.48 percent accuracy rating on 383 samples. The AI was able to discern the relationship between speech and facial patterns when predicting emotion, but it still had trouble decoding certain categories, such as disgust and surprise.
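For reference, that evaluation boils down to mapping the classifier’s seven-way output to a label and counting matches against the annotations, as in this small sketch (the label ordering and helper names are hypothetical):

```python
import torch

# Label order is hypothetical; it need only match the training annotations.
EMOTIONS = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]

def predict_label(logits):
    """Map seven-way classifier logits to an emotion name."""
    return EMOTIONS[logits.argmax().item()]

def accuracy(predicted, actual):
    """Fraction of clips whose predicted emotion matches its annotation."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)
```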
“Compared with the state-of-the-art approach,” the researchers say, “the proposed approach can achieve a comparable result with a single model, and make a new milestone with multi-models.”