Emotion Detection And Analysis | EDAA

Aamir Ahmad Ansari
Dec 4, 2022
Image By Author, EDAA Home Page

Emotion Detection and Analysis is a web application developed by the team The Mystic Forces as their final project for the AI5: productionizing AI course at Univ.ai, under the guidance of Pavlos Protopapas (Scientific Program Director at the Institute for Applied Computational Science (IACS) at Harvard University) & Shivas Jayaram (Research @Harvard IACS | Deep Learning Researcher, Educator, and Practitioner). The web application is a fully implemented, end-to-end deep learning project. During the project, we developed the:

  1. Frontend Web Application
  2. Server
  3. Backend Application
  4. Deployment Strategy

Problem Statement

Public speaking is not just a skill but an art, and one that is not easily mastered. It has become essential for every individual. In this digital world, where your office is your computer screen and online meeting platforms are the places to connect, you are regularly asked to deliver presentations, briefings, and meetings. Sometimes they go well and sometimes they don't, but what if we told you that you could look back at your digital meetings, analyse your speech, and get to know the public reaction? Yes, EDAA does it for you: using your digital meeting recording, it provides you with a detailed report of your speech and its impact on your audience.

Proposed Solution

Our proposed solution is to provide a detailed analysis of the speech and the public reaction using speech sentiment analysis coupled with face emotion detection. We chop the user's video into chunks, each 10 seconds long. The audio and video components are then extracted from these chunks. The audio chunks first go to an automatic speech recognition model, which converts them into chunks of text and passes these to the sentiment classifier model. The sentiment classifier classifies the text into one of three classes (positive, neutral, or negative) and stores the result in a GCS bucket. Similarly, the video chunks go to the GCS bucket. The face emotion detection API fetches these chunks; for each chunk we take 1 frame per second, resulting in 10 frames per chunk. We then pass these frames to the face emotion detection model, which detects and classifies all the faces in these frames into one of the three classes mentioned above. The results are stored in the GCS bucket, and our frontend is then able to display a detailed report based on these results. A detailed workflow of EDAA is shown in Fig 1.

Image by Author, EDAA workflow
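
To make the chunking step concrete, here is a minimal sketch of how 10-second chunks and 1-frame-per-second samples could be produced, assuming moviepy and OpenCV; the file names and helper functions are illustrative placeholders rather than our exact preprocessing code.

```python
# Minimal sketch of the chunking step, assuming moviepy and OpenCV.
# Paths and names are illustrative placeholders, not the exact EDAA code.
from moviepy.editor import VideoFileClip
import cv2

def split_into_chunks(video_path, chunk_seconds=10):
    """Split a recording (assumed to have an audio track) into 10-second chunks."""
    clip = VideoFileClip(video_path)
    for i, start in enumerate(range(0, int(clip.duration), chunk_seconds)):
        end = min(start + chunk_seconds, clip.duration)
        sub = clip.subclip(start, end)
        sub.write_videofile(f"chunk_{i:03d}.mp4", audio=False, logger=None)
        sub.audio.write_audiofile(f"chunk_{i:03d}.wav", logger=None)

def sample_frames(chunk_path, fps=1):
    """Take 1 frame per second from a chunk (10 frames for a 10-second chunk)."""
    cap = cv2.VideoCapture(chunk_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % int(native_fps // fps) == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames
```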

Project Components

We divide EDAA into four basic building blocks: the frontend, the server, the backend, and deployment. The tech stack includes:

  1. Frontend: ReactJS, JavaScript, Plotly.js
  2. Modelling: TensorFlow, PyTorch
  3. Backend: Python, FastAPI, GitHub (source control)
  4. Deployment: Docker, Nginx, Google Cloud Platform (Compute Engine, Google Artifact Registry, Cloud Build)

In this section we will talk more about our project components.

Frontend

The frontend is a web application provided to the end user so that users can upload their digital meetings and get a detailed analysis. The web application is built entirely on the ReactJS framework. ReactJS is an open-source, JavaScript-based framework that enables its users to create beautiful and smooth user interfaces.

Server

An important building block of any web application is the server, which exposes the UI/frontend to the rest of the world, balances the load, and routes requests to their appropriate destinations. EDAA makes use of the well-known Nginx server; it is lightweight and serves our purpose.

Backend

The backend is written purely in Python with the help of the FastAPI library. This is where all the computation and processing take place. The following components make up our backend:

  1. Preprocessor API
  2. ASR API
  3. Sentence BERT API
  4. YOLO API
  5. Storage

Preprocessor API: The preprocessor API is, as we say in football, the first line of defence. This API is exposed directly to the frontend. It is responsible for serving the requests made by users via the frontend, creating and uploading chunks to storage, and sending information to the other APIs.
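
To make this flow concrete, here is a minimal, hypothetical sketch of such an endpoint built with FastAPI; the route name, bucket name, and downstream URL are assumptions for illustration, not our exact implementation.

```python
# Hypothetical sketch of the Preprocessor API entry point (FastAPI).
# The route, bucket name, and downstream URL are illustrative assumptions.
from fastapi import FastAPI, File, UploadFile
from google.cloud import storage
import httpx

app = FastAPI()
BUCKET = "edaa-chunks"                        # assumed bucket name
ASR_API = "http://asr-api/transcribe"         # assumed service URL on the Docker network

def make_chunks(path: str) -> list[str]:
    """Placeholder for the chunking step (see the earlier chunking sketch)."""
    return [path]  # pretend the whole file is a single chunk for this sketch

@app.post("/analyze")
async def analyze(video: UploadFile = File(...)):
    # Save the upload locally, then split it into 10-second chunks.
    local_path = f"/tmp/{video.filename}"
    with open(local_path, "wb") as f:
        f.write(await video.read())
    chunk_paths = make_chunks(local_path)

    # Upload each chunk to GCS and notify the downstream API.
    bucket = storage.Client().bucket(BUCKET)
    async with httpx.AsyncClient() as client:
        for path in chunk_paths:
            name = path.split("/")[-1]
            bucket.blob(name).upload_from_filename(path)
            await client.post(ASR_API, json={"filename": name})
    return {"status": "processing", "chunks": len(chunk_paths)}
```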

ASR API: This API hosts our ASR model, OpenAI's Whisper. It receives the audio chunks from the Preprocessor API, converts them to sentences, and forwards the request to the Sentence BERT API.

Sentence BERT API: It hosts the Sentence BERT model, which classifies text as positive, neutral, or negative, creates the results, and pushes them to the GCS bucket.

YOLO API: It receives a filename, fetches all of the corresponding video chunks, runs inference on them, curates the results, and stores them in the GCS bucket.

Storage: We use Google Cloud Storage (GCS) to store all our results and data. To get started, you create a bucket on GCS and can then add folders and data inside it.
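
For illustration, and assuming the official google-cloud-storage client with application-default credentials, writing and reading a per-chunk result could look like this (bucket and object names are placeholders):

```python
# Minimal sketch of reading/writing results in GCS with the official client.
# Bucket and object names are placeholders.
from google.cloud import storage
import json

client = storage.Client()                      # uses application-default credentials
bucket = client.bucket("edaa-results")         # assumed bucket name

# Write a per-chunk result as a JSON blob.
result = {"chunk": "chunk_000", "sentiment": "positive"}
bucket.blob("results/chunk_000.json").upload_from_string(json.dumps(result))

# Read it back later, e.g. from the frontend-facing API.
data = json.loads(bucket.blob("results/chunk_000.json").download_as_text())
```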

Deployment

We containerized our Preprocessor, React frontend, ASR, SBERT, and YOLO components using Docker, pushed all our images to Google Artifact Registry, and connected it to our GitHub repository using triggers: whenever we push code to our repository, Google Cloud Build automatically rebuilds the images and updates them in Artifact Registry.

Finally, after connecting the preprocessor to the ASR, SBERT, and YOLO containers over a Docker network, we connected the preprocessor and the frontend to the Nginx container. Below is the technical architecture of our app:

Image by Author, Technical Architecture

Models Considered

YOLO-V7

Prerequisites:

  1. A very good GPU — we used an NVIDIA A100 (40 GB)
  2. A dataset — we modified the AffectNet dataset from Kaggle to work with YOLO. A link to the modified version is here.

Training YOLOv7 only requires setting up the training data correctly; the rest of the steps involve editing some configuration files (which let YOLO know about the training data).

Pre-processing the Dataset: We start off by downloading the dataset from Kaggle, but, as can be seen, the dataset needs some pre-processing before it can be passed as input to YOLO. Since we only need to focus on the face for emotion recognition, we passed the entire dataset through another YOLO model with weights pre-trained to recognize faces. But we don't just need to “recognize” the faces, we “need” the faces themselves, and thankfully the data science community comes to the rescue again. For context, YOLOv5 has a save_crop feature which, as the name indicates, saves the detected area separately (i.e. crops it out). The last part of pre-processing involves cleaning up some of the output faces that may have resulted from misclassification by the model or from additional faces in a single image.
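
A rough sketch of this cropping pass is shown below. It assumes YOLOv5 loaded via torch.hub with hypothetical face-detection weights (face_yolov5.pt here) and uses the results.crop() helper that backs the save-crop behaviour; the dataset path is a placeholder.

```python
# Sketch of the face-cropping pass, assuming YOLOv5 via torch.hub and
# hypothetical face-detection weights ("face_yolov5.pt").
import glob
import torch

face_model = torch.hub.load("ultralytics/yolov5", "custom", path="face_yolov5.pt")

for image_path in glob.glob("affectnet/**/*.jpg", recursive=True):
    results = face_model(image_path)
    # Save each detected face as a separate cropped image (mirrors save_crop).
    results.crop(save=True, save_dir="cropped_faces")
```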

Setting up the data in YOLO format: YOLO requires every image to have a corresponding text file containing the class to which the object (here, the emotion) belongs, along with the dimensions of the bounding box. Since we now only have faces in each image, we can set the entire image as the bounding box, reducing our work. In order for YOLO to match each text file to its image, they must have the same name. Lastly, they need to follow a certain directory structure: there must be a “train” and a “val” folder, each containing two more folders — “images” and “labels”. The data needs to be split between training and validation sets, with the images and labels put in the respective folders (we can also add a “test” folder).
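
As a small sketch of this step, assuming a hypothetical naming scheme where each cropped file name starts with its emotion class (e.g. happy_0001.jpg), the label files can be generated like this:

```python
# Sketch: generate YOLO label files that mark the whole image as the box.
# The class list and the filename-based class lookup are assumptions.
from pathlib import Path

CLASSES = ["anger", "happy", "sad"]   # must match the order used in the YOLO config

for split in ("train", "val"):
    images = Path(f"dataset/{split}/images")
    labels = Path(f"dataset/{split}/labels")
    labels.mkdir(parents=True, exist_ok=True)
    for img in images.glob("*.jpg"):
        class_id = CLASSES.index(img.stem.split("_")[0])   # e.g. "happy_0001" -> "happy"
        # YOLO format: class x_center y_center width height (normalized).
        # The whole image is the face, so the box covers the full frame.
        (labels / f"{img.stem}.txt").write_text(f"{class_id} 0.5 0.5 1.0 1.0\n")
```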

Changing the YOLO configuration files: We just need to add the number of classes we have to the model architecture file and to a data configuration file. Inside the data configuration file we additionally need to specify the names of the classes our model will classify images into (like anger, happy, sad, etc.). To finish up, we need to tell the model the exact locations of our training and validation datasets.
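
For reference, a minimal dataset configuration of the kind YOLO expects could be generated as below; the paths and class names are placeholders for our setup.

```python
# Sketch: write the dataset config that tells YOLOv7 where the data lives
# and which classes exist. Paths and class names are placeholders.
from pathlib import Path

data_cfg = """\
train: dataset/train/images
val: dataset/val/images
nc: 3
names: ["anger", "happy", "sad"]
"""
Path("emotions.yaml").write_text(data_cfg)
```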

Training: Now that we are done fiddling with the data, we just have to run a single command, and there we have a model that can detect emotions!

Detection: After a few hours of training on a powerful GPU, we get some files in the output directory. From the weights folder there, we just need “best.pt” to perform the emotion detection task. (It is preferable to have the data in a format similar to what the model was trained on; for example, we again crop the faces out of any image or video we need to classify.) For videos, we first split them into individual images, generally at 1 frame per second, as any particular emotion lasts much longer than a millisecond-scale time frame.
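
As a hedged sketch, and assuming the YOLOv7 repository's torch.hub "custom" entry point plus the 1-frame-per-second sampling described above, inference with the trained best.pt might look like this:

```python
# Sketch: run emotion detection on frames sampled at 1 fps from a video chunk.
# Assumes the YOLOv7 repo's torch.hub "custom" entry point; adjust to your setup.
import torch

model = torch.hub.load("WongKinYiu/yolov7", "custom", "best.pt")

def detect_emotions(frames):
    """frames: list of images (e.g. cropped faces) sampled at 1 frame per second."""
    results = model(frames)                  # batched inference
    # Each detection row is [x1, y1, x2, y2, confidence, class_id].
    return [
        [results.names[int(row[-1])] for row in per_image]
        for per_image in results.xyxy
    ]
```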

OpenAI Whisper Model

Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. OpenAI show that the use of such a large and diverse dataset leads to improved robustness to accents, background noise, and technical language, and that it enables transcription in multiple languages as well as translation from those languages into English [1]. We harnessed the power of the Whisper model to extract accurate text sentences from the audio chunks.
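
With the openai-whisper package this is essentially a one-liner; the model size and file name below are illustrative choices.

```python
# Sketch: transcribe one 10-second audio chunk with openai-whisper.
import whisper

model = whisper.load_model("base")       # model size is a choice, not necessarily ours
result = model.transcribe("chunk_000.wav")
print(result["text"])                    # the transcribed text for this chunk
```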

Sentence-BERT

Sentence-BERT (SBERT) is a natural language processing model that has been trained to recognize emotions in text. This model is based on the BERT (Bidirectional Encoder Representations from Transformers) model, which is a state-of-the-art language model that has been trained on a large dataset of text and has been shown to perform well on a wide range of natural language processing tasks.

We specifically trained SBERT to recognize emotions in sentences, using a dataset of labeled sentences annotated with the appropriate emotional expression (the Daily Dialogue dataset). This allows the model to take a sentence as input and output its most likely emotional state (positive, negative, or neutral).
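
One simple way to realize this setup, shown purely as an illustration and not necessarily our exact training code, is to encode sentences with a pre-trained SBERT model and fit a small classifier on top of the embeddings:

```python
# Sketch: SBERT sentence embeddings + a small classifier for 3-way sentiment.
# The base model and the toy training data are illustrative assumptions.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("all-MiniLM-L6-v2")

train_sentences = [
    "That was a great presentation!",
    "The meeting starts at ten.",
    "I really disliked that part.",
]
train_labels = ["positive", "neutral", "negative"]   # toy labels, not the Daily Dialogue dataset

clf = LogisticRegression(max_iter=1000)
clf.fit(encoder.encode(train_sentences), train_labels)

print(clf.predict(encoder.encode(["Thanks everyone, this went really well."])))
```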

Results & Conclusion

We developed a tool that takes multiple modes of data (audio and video) into consideration for emotion detection and analysis. The analysis this tool creates for each video is truly fascinating and useful for experienced professionals and corporate teams, as well as for people just starting their careers in public speaking.

App Demo

Future Works

  1. Creating an audio sentiment analysis component.
  2. Training our YOLOv7 model on the entire AffectNet dataset.
  3. Developing an email service which sends the link to the results of the user's video to their email.
  4. Adding functionality to dynamically create chunks so that each chunk contains only one sentence.
  5. Developing a Zoom plug-in so that users can record and analyse their meetings on the go.

