Emotional intelligence was long believed to be a uniquely human domain, a quality that machines, and AI in particular, were considered unable to handle. AI has mostly automated repetitive tasks, the kind that require memorization and logic.
Even humans often fail to identify other people's emotions. Most of us are unaware of the emotional triggers we set off in others, so the actions that follow from those triggers leave us clueless. We also tend to judge people based on stereotypes.
However, the National Engineering Laboratory for Speech and Language Information Processing at the University of Science and Technology of China has made a major breakthrough in AI-based identification of human emotions. In their paper, "Deep Fusion: An Attention Guided Factorized Bilinear Pooling for Audio-video Emotion Recognition," the researchers describe how AI systems can recognize human emotions against defined benchmarks.
Identifying emotions has always been a challenge because the concept itself is abstract. Automatic Emotion Recognition (AER) is a difficult task that mirrors cognitive processes in humans. The system draws on both audio and visual information, which a neural network can combine to perform AER.
One component is an audio-processing algorithm that works on speech spectrograms, helping the AI focus on the regions most relevant to emotional behavior.
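To make the spectrogram front-end concrete, here is a minimal sketch of how a speech waveform is turned into a spectrogram before any neural network sees it. This is illustrative only, not the authors' implementation; the frame length, hop size, and 16 kHz sample rate are assumptions, since the article does not give the paper's exact settings.

```python
import numpy as np

def spectrogram(waveform, frame_len=400, hop=160):
    """Magnitude spectrogram via a short-time Fourier transform.

    Splits the waveform into overlapping Hann-windowed frames and takes
    the magnitude of each frame's FFT. The defaults assume 16 kHz audio
    (25 ms windows with a 10 ms hop), a common speech front-end choice.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(waveform) - frame_len) // hop
    frames = np.stack([waveform[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft keeps only the non-negative frequencies: frame_len // 2 + 1 bins
    return np.abs(np.fft.rfft(frames, axis=1))

# toy input: one second of a 440 Hz tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (n_frames, frequency_bins)
```

A 2-D array like this can then be fed to a convolutional network, with an attention layer learning which time-frequency regions carry the emotional cues.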
Another component runs the video frame by frame through two computing layers: a face-detection algorithm followed by cutting-edge facial recognition networks focused solely on emotional relevance. This is trickier than it sounds, because not all frames contribute equally to a given emotional state. The challenge the developers faced was to introduce a mechanism that extracts the important frames.
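A common way to extract the important frames is attention pooling: score every frame, convert the scores into a softmax distribution, and take the weighted average of the per-frame features so informative frames dominate. The sketch below shows that idea in its simplest form; the scoring vector `w` and the feature sizes are hypothetical, and the paper's actual attention layers are more elaborate.

```python
import numpy as np

def attention_pool(frame_features, w):
    """Attention pooling over per-frame features.

    Scores each frame with a vector `w` (learned, in a real model),
    softmaxes the scores into weights that sum to 1, and returns the
    weighted average feature plus the weights themselves.
    """
    scores = frame_features @ w            # one scalar score per frame
    scores -= scores.max()                 # for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    pooled = weights @ frame_features      # weighted average over frames
    return pooled, weights

rng = np.random.default_rng(0)
frames = rng.normal(size=(30, 8))  # 30 frames, 8-dim features (toy sizes)
w = rng.normal(size=8)
pooled, weights = attention_pool(frames, w)
print(pooled.shape, weights.sum())
```

The weights make the mechanism inspectable: frames with high weight are the ones the model treated as emotionally salient.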
After the visual characteristics are extracted, they are combined with the speech features, so that the final prediction rests on a joint analysis of facial expressions and speech patterns.
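The paper's title names the fusion technique: factorized bilinear pooling. A full bilinear interaction between the audio and video feature vectors would need one large matrix per output unit; factorizing each matrix into low-rank parts gives a cheap equivalent: project both inputs, multiply elementwise, and sum-pool over groups. The sketch below illustrates that idea with made-up dimensions; it is not the authors' exact layer, and the projection matrices would be learned in practice.

```python
import numpy as np

def factorized_bilinear_pool(x, y, U, V, k):
    """Low-rank (factorized) bilinear fusion of two feature vectors.

    Projects x and y into a shared space of size d * k, multiplies the
    projections elementwise, then sum-pools each group of k values,
    yielding a fused vector of size d.
    """
    joint = (x @ U) * (y @ V)                  # elementwise interaction
    return joint.reshape(-1, k).sum(axis=1)    # sum-pool groups of k

rng = np.random.default_rng(1)
audio_feat = rng.normal(size=16)   # assumed audio feature size
video_feat = rng.normal(size=24)   # assumed video feature size
d, k = 8, 4                        # assumed output size and factor rank
U = rng.normal(size=(16, d * k))   # learned in a real model
V = rng.normal(size=(24, d * k))
fused = factorized_bilinear_pool(audio_feat, video_feat, U, V, k)
print(fused.shape)
```

The fused vector then feeds the final emotion classifier, so the prediction depends on the interaction between speech and facial cues rather than on either modality alone.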
The researchers demonstrated the approach using 653 videos and related audio clips drawn from a database of film and television shows, as part of the audio-video sub-challenge, a grand challenge at the ACM International Conference on Multimodal Interaction.
On a validation set of 383 samples, the system categorized emotions into seven classes ("angry," "disgust," "fear," "happy," "neutral," "sad," and "surprise") with an accuracy of 62.48%.
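That accuracy figure is simply the fraction of validation samples whose highest-scoring class matches the true label. A minimal sketch, with hand-built toy scores rather than the paper's data:

```python
import numpy as np

EMOTIONS = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]

def accuracy(scores, labels):
    """Fraction of samples where the top-scoring class equals the label."""
    return float((scores.argmax(axis=1) == labels).mean())

# three toy samples with per-class scores (columns follow EMOTIONS order)
scores = np.array([[0.9, 0.0, 0.0, 0.1, 0.0, 0.0, 0.0],   # predicts "angry"
                   [0.0, 0.0, 0.0, 0.8, 0.1, 0.1, 0.0],   # predicts "happy"
                   [0.0, 0.2, 0.0, 0.0, 0.0, 0.7, 0.1]])  # predicts "sad"
labels = np.array([0, 3, 6])  # true: angry, happy, surprise
acc = accuracy(scores, labels)
print(acc)  # 2 of 3 correct
```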
The researchers also showed how the video frames related to the audio signals, that is, how the AI system exploited the relationship between speech and facial expressions to make its predictions.
The model readily recognized distinctive emotions such as "angry," "happy," and "neutral," while it struggled to distinguish "disgust," "surprise," and other emotions whose expressions are easily confused.
Yet it performed on par with an earlier approach that used five visual models and two audio models. The researchers concluded that their proposed approach achieves comparable results with a single model, and could perform even better when multiple models are combined.