Facial Expression Analysis Using Deep Learning and Computer Vision

Author: Gordon Cooper, Marketing Manager for Embedded Vision Products at Synopsys


Recognizing facial expressions and emotions is a fundamental skill learned early in human social interaction. Humans can glance at a person's face and quickly identify common emotions: anger, happiness, surprise, disgust, sadness, and fear. Transferring this skill to machines is a complex task. Researchers have spent decades of engineering effort trying to write computer programs that accurately recognize a given feature, only to have to start over whenever the features to be recognized differ even slightly.

What if instead of programming the machine, we taught it to accurately recognize emotions?

Deep learning techniques have shown tremendous advantages in reducing the error rates of computer vision recognition and classification. Implementing deep neural networks in embedded systems (see Figure 1) lets machines interpret facial expressions visually with near-human accuracy.

Figure 1. A Simple Example of a Deep Neural Network

Neural networks recognize patterns through training; a network with input and output layers plus at least one hidden intermediate layer is considered "deep." Each node's value is computed from the weighted outputs of multiple nodes in the previous layer, and these weights can be adjusted to perform a specific image recognition task. Adjusting the weights is what the neural network training process does.
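
To make that node computation concrete, here is a minimal Python sketch (not from the original article) of one layer's forward pass; the sizes, random weights, and ReLU activation are illustrative placeholders:

```python
import numpy as np

def layer_forward(x, W, b):
    """One layer: each output node is a weighted sum of the previous
    layer's outputs plus a bias, passed through a nonlinearity."""
    z = W @ x + b            # weighted input values from the previous layer
    return np.maximum(z, 0)  # ReLU activation (one common choice)

# Toy example: 4 input nodes feeding 3 hidden nodes.
rng = np.random.default_rng(0)
x = rng.random(4)        # previous-layer outputs
W = rng.random((3, 4))   # one adjustable weight per connection
b = np.zeros(3)
print(layer_forward(x, W, b))
```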

For example, to train a deep neural network to recognize photos of people with happy expressions, we present it with happy images as raw data (image pixels) at its input layer. Knowing that the desired outcome is "happy," the network finds patterns in the images and adjusts the node weights to minimize its error on that category. Each new annotated image of a happy expression further refines the weights. Given enough training data, the network can ingest untagged images and accurately recognize the patterns that correspond to happy expressions.
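
As an illustration of that weight-adjustment idea, the toy sketch below trains a single output node by gradient descent to separate "happy" from "not happy." The random stand-in data, 48x48 image size, learning rate, and logistic loss are all assumptions made for the example, not the article's actual network:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((200, 48 * 48))   # 200 flattened 48x48 stand-in "images"
y = rng.integers(0, 2, 200)      # 1 = annotated "happy", 0 = anything else

w = np.zeros(48 * 48)            # one weight per input pixel
b = 0.0
lr = 0.1

for _ in range(100):
    p = 1 / (1 + np.exp(-(X @ w + b)))  # predicted probability of "happy"
    grad_w = X.T @ (p - y) / len(y)     # gradient of the cross-entropy loss
    grad_b = np.mean(p - y)
    w -= lr * grad_w                    # adjust weights to reduce the error
    b -= lr * grad_b

print("training accuracy:", np.mean((p > 0.5) == y))
```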

Deep neural networks require a lot of computational power to calculate the weighted connections between all these nodes; ample data memory and efficient data movement matter as well. Convolutional neural networks (CNNs) (see Figure 2) are currently the most efficient deep neural networks for vision. CNNs get their efficiency from reusing the same small set of weights repeatedly across the image: they exploit the two-dimensional structure of the input data to eliminate redundant computation.
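
The sketch below shows where that weight reuse comes from: a 3x3 kernel's nine weights are applied at every position of the image, whereas a fully connected node would need one weight per pixel. The kernel values and image size are arbitrary examples:

```python
import numpy as np

def conv2d(image, kernel):
    """Slide one small kernel over the whole image: the same few weights
    are reused at every position, which is the source of CNN efficiency."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.default_rng(2).random((48, 48))  # toy grayscale face crop
kernel = np.array([[1, 0, -1],
                   [2, 0, -2],
                   [1, 0, -1]], dtype=float)       # just 9 shared weights
print(conv2d(image, kernel).shape)                 # (46, 46) feature map
```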

Figure 2. Example Architecture of a CNN for Facial Analysis

Implementing a CNN for facial analysis involves two distinct, independent phases: the training phase and the deployment phase.

The training phase (see Figure 3) requires a deep learning framework, such as Caffe or TensorFlow, which uses CPUs and GPUs for the training computations and provides the tooling for working with the network. These frameworks typically include example CNN graphs (network topologies) that can serve as starting points, and they allow the graph to be fine-tuned: layers can be added, removed, or modified to achieve optimal accuracy.
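
For instance, a starting-point CNN graph might be defined in TensorFlow/Keras roughly as follows; the layer counts, sizes, and seven-class emotion output are illustrative guesses, not the article's actual topology:

```python
import tensorflow as tf
from tensorflow.keras import layers

# A small CNN graph in the spirit of Figure 2. Tuning for accuracy means
# adding, removing, or widening layers in this list and retraining.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(48, 48, 1)),          # grayscale face crops
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(7, activation="softmax"),      # e.g., 7 emotion classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```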

Figure 3. CNN Training Phase

One of the biggest challenges in the training phase is finding a correctly labeled dataset to train the network; the accuracy of a deep network depends heavily on the distribution and quality of its training data. Options to consider for facial analysis include the emotion-labeled dataset from the Facial Expression Recognition Challenge (FERC) and the multi-labeled proprietary datasets from VicarVision (VV).
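
As a sketch of how such a dataset might be ingested, the snippet below assumes the commonly distributed FER CSV layout (an integer emotion label plus a string of space-separated 48x48 grayscale pixel values per row); verify the layout against your copy of the dataset before relying on it:

```python
import numpy as np
import pandas as pd

# Assumed layout: one row per image, "emotion" = integer class label,
# "pixels" = space-separated 48x48 grayscale values.
df = pd.read_csv("fer2013.csv")
labels = df["emotion"].to_numpy()
images = np.stack([
    np.array(row.split(), dtype=np.uint8).reshape(48, 48)
    for row in df["pixels"]
])
print(images.shape, labels.shape)
```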

The deployment phase (see Figure 4) targets real-time embedded designs and can run on embedded vision processors, such as the Synopsys DesignWare® EV6x embedded vision processors with a programmable CNN engine. Embedded vision processors are the best choice for balancing performance against small area and low power consumption.

Figure 4. CNN Deployment Phase

The scalar and vector units are programmed in C and OpenCL C (for vectorization), while the CNN engine does not need to be programmed by hand. The final graph and weights (coefficients) from the training phase are passed to a CNN mapping tool, which configures the embedded vision processor's CNN engine so that it is ready to perform facial analysis.
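
The hand-off from training might look like the following sketch. The actual input format of Synopsys's CNN mapping tool is not described in this article, so saving a standard TensorFlow artifact and extracting the coefficient arrays is shown purely to illustrate exporting the trained graph and weights:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Stand-in for the trained network from the training phase.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(48, 48, 1)),
    layers.Conv2D(8, 3, activation="relu"),
    layers.Flatten(),
    layers.Dense(7, activation="softmax"),
])
model.save("facial_cnn.keras")  # serialized graph + weights in one artifact
coeffs = model.get_weights()    # raw per-layer coefficient arrays
print([c.shape for c in coeffs])
```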

Images or video frames captured by cameras and image sensors are fed to the embedded vision processor. CNNs can struggle when lighting conditions or facial poses vary significantly, so preprocessing the images to make the faces more uniform pays off. A heterogeneous architecture lets an advanced embedded vision processor's CNN engine classify one image while the vector units preprocess the next (correcting lighting, scaling the image, rotating it in the plane, and so on) and the scalar units handle decision-making (i.e., what to do with the CNN's detection results).
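
A preprocessing stage of that kind might be sketched with OpenCV as follows; the histogram equalization, rotation angle, and 48x48 target size are assumed choices for illustration, not the article's actual pipeline:

```python
import cv2
import numpy as np

def preprocess_face(frame, angle_deg=0.0):
    """Make a face crop more uniform before CNN classification:
    correct lighting, undo in-plane rotation, scale to the input size."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.equalizeHist(gray)                    # lighting correction
    h, w = gray.shape
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle_deg, 1.0)
    gray = cv2.warpAffine(gray, M, (w, h))           # planar rotation
    return cv2.resize(gray, (48, 48))                # scale to CNN input

frame = np.random.randint(0, 256, (240, 320, 3), dtype=np.uint8)  # fake frame
print(preprocess_face(frame, angle_deg=5.0).shape)
```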

Image resolution, frame rate, number of layers, and target accuracy must all be weighed against the required number of parallel multiply-accumulate (MAC) operations and the overall performance requirements. The Synopsys EV6x embedded vision processor with CNN can run at 800 MHz in a 28 nm process technology while delivering up to 880 MACs of performance.
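
A back-of-the-envelope check shows how such figures translate into a compute budget. The 880-MAC and 800 MHz numbers come from the article (interpreting 880 as MAC operations per clock cycle, which is an assumption here); the layer dimensions and frame rate are made-up examples:

```python
# Peak throughput, assuming 880 MAC operations per clock cycle.
macs_per_cycle = 880
clock_hz = 800e6
peak_macs_per_s = macs_per_cycle * clock_hz          # 7.04e11 MACs/s

# MACs for one conv layer: out_h * out_w * out_ch * (k_h * k_w * in_ch).
layer_macs = 46 * 46 * 32 * (3 * 3 * 1)              # ~0.6 million MACs
fps = 30
utilization = layer_macs * fps / peak_macs_per_s
print(f"One such layer at {fps} fps uses {utilization:.4%} of peak")
```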

Once the CNN is configured and trained to detect emotions, it can be reconfigured with relative ease for other facial analysis tasks, such as estimating age range, recognizing gender or ethnicity, and identifying hairstyles or whether someone is wearing glasses.
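
One common way to do such reconfiguration (an assumption here, not a detail from the article) is to keep the trained convolutional feature extractor and attach a new task-specific output layer:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Illustrative feature extractor standing in for the emotion-trained CNN.
base = tf.keras.Sequential([
    tf.keras.Input(shape=(48, 48, 1)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
])
# Swap in a new head for a different attribute, e.g. glasses / no glasses.
model = tf.keras.Sequential([base, layers.Dense(2, activation="softmax")])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```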

Summary

CNNs running on embedded vision processors open up new realms of visual processing. Soon, electronic devices that can interpret emotions will be commonplace, such as toys that detect happiness and electronic teachers that gauge student comprehension from facial expressions. The combination of deep learning, embedded vision processing, and high-performance CNNs will soon turn this vision into reality.
