Face and Gesture Tracking Applications Signal Positive Things for Computer Vision
| By: Dan McCarthy, Contributing Editor
While computers have become radically more intelligent over the past 30 years, they haven’t become more aware. They still largely rely on a human to take the first step toward engagement. Siri, Alexa, and other computerized voice assistants notwithstanding, most human-computer interaction (HCI) still involves decades-old technology: a mouse, a keyboard, a touchscreen. But while voice assistants offer us a new way to engage with computers, face and gesture recognition technology promises to expand how computers engage with us.
While several technologies enable computers to “see” human features and gestures, computer vision will likely be a driving force for this sector. The research firm MarketsandMarkets forecasts that the global market for facial recognition technology will more than double, from $3.2 billion in 2019 to $7 billion by 2024, at a compound annual growth rate (CAGR) of 16.6%.
The firm tracks gesture recognition separately and projects that it will grow at a CAGR of 29.63% from 2017 to 2022, when, the company predicts, the market will reach nearly $19 billion. Such numbers spell significant growth opportunities for vision-based depth-sensing technologies, especially those that minimize system cost and footprint.
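As a back-of-the-envelope check on the first forecast, compounding the 2019 base at the stated CAGR for five years should land near the 2024 projection:

```python
# Sanity-check the forecast arithmetic: a $3.2B market growing at a
# 16.6% CAGR for five years (2019 -> 2024) should approach $7B.

def project(base_billions, cagr, years):
    """Compound a starting market size at a constant annual growth rate."""
    return base_billions * (1 + cagr) ** years

print(round(project(3.2, 0.166, 5), 1))  # ≈ 6.9, consistent with the ~$7B figure
```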
Face Reality With 3D Cameras
Facial recognition is typically associated with security applications designed to distinguish individuals from vast datasets of faces that are either structured (e.g. a law enforcement database) or unstructured (e.g. a crowded airport). But in the context of HCI, such as enabling a smartphone to correctly identify its owner, facial recognition can employ moderately simpler embedded imaging technology.
The keyword is “moderately.” The Samsung Galaxy Note 8 was the first smartphone model to integrate facial recognition as a useful security feature, leveraging the device’s embedded image sensor to build a two-dimensional (2D) image map of a user’s face. It then combined those details with data from an embedded infrared iris-scanning sensor. Despite the dual-sensor design, after the Note 8’s launch, a group of hackers quickly demonstrated that its 2D facial recognition system could be fooled by a photograph captured from five meters away with a digital camera fitted with a 200-millimeter lens.
Apple’s Face ID system — deployed with the company’s iPhone X series — employs a more secure, albeit more expensive, 3D approach. It utilizes an infrared camera, depth sensor, and dot projector to map 30,000 points on a user’s face. Embedded software then creates a 3D facial map that is much harder to fool with a photograph. Smartphone models from Xiaomi, OPPO, and Huawei apply a similar 3D scan approach, using infrared emitters to create a point cloud of a face.
It should be no surprise to see so many Chinese phone makers favoring more secure 3D technology, as citizens in China increasingly rely on their phones to make point-of-sale purchases. Indeed, facial recognition has moved beyond the phone in China, where more and more citizens can purchase goods, buy subway tickets, or check into hotels simply by displaying their faces. Such applications have not yet taken root in the West. But as we highlighted here in December, retailers and marketers on this side of the Pacific are leveraging vision-based facial recognition to gather business intelligence and enable unique customer experiences.
While facial recognition (generally) compares a static captured pattern to a static stored pattern, gesture recognition systems must process complicated dynamic human movements. Such systems range from tracking fixed hand gestures communicated through a controller glove to full-body skeletal tracking in the vein of Microsoft’s Kinect system for the Xbox. Now defunct, the Kinect remains emblematic of vision-based gesture recognition systems in its basic architecture. It captured 3D motion using a VGA camera, a depth sensor based on a near-infrared emitter, and a monochromatic CMOS sensor.
Today, most vision-based skeletal tracking systems in development continue to build on infrared light and depth sensors to capture the articulation points of human limbs, as well as their positions relative to one another. Using depth cameras of any kind enables a skeletal tracking system to distinguish between overlapping or occluded objects or limbs. It also reduces the influence of varying lighting conditions. Image analysis software can then draw lines between all identified joints to form a dynamically moving whole. Skeletal tracking needn’t apply to the whole body; it may focus on the movement of fingers on a single hand.
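The joint-linking step described above can be sketched in a few lines. This is a minimal illustration, not any vendor’s SDK: the joint names, parent map, and coordinate values are all hypothetical, standing in for the per-frame joint estimates a depth camera would supply.

```python
# Minimal sketch of skeletal linking: given estimated 3D joint positions
# (hypothetical values), connect each joint to its parent to form the
# "dynamically moving whole" the article describes.

SKELETON = {  # child -> parent articulation pairs (illustrative topology)
    "elbow_r": "shoulder_r",
    "wrist_r": "elbow_r",
    "shoulder_r": "spine",
    "spine": "pelvis",
}

def skeleton_bones(joints):
    """Return (start, end) 3D segments for every child->parent link
    whose endpoints were both detected this frame."""
    return [
        (joints[child], joints[parent])
        for child, parent in SKELETON.items()
        if child in joints and parent in joints
    ]

# Example frame: depth-camera joint estimates in meters (made up).
frame = {
    "pelvis": (0.0, 0.9, 2.0),
    "spine": (0.0, 1.2, 2.0),
    "shoulder_r": (0.2, 1.4, 2.0),
    "elbow_r": (0.4, 1.2, 2.1),
    # wrist_r occluded this frame -> its bone is simply skipped
}

bones = skeleton_bones(frame)
print(len(bones))  # 3 segments: elbow-shoulder, shoulder-spine, spine-pelvis
```

Skipping links whose endpoints are missing is one simple way to handle the occlusion problem the depth data helps mitigate; real systems instead infer or interpolate occluded joints.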
Obviously, system complexity, computation, and power consumption all increase in proportion to what skeletal systems must track, which poses a challenge when embedding gesture tracking in compact consumer electronics. In response, semiconductor suppliers are designing high-speed ASIC or DSP chips that integrate tracking software at the chip level.
But even the 2D sensors in today’s smartphones have been shown to be capable of basic, but useful, gesture tracking applications.
Tracking in Three Dimensions
Samsung’s SelfieType project made headlines at this year’s Consumer Electronics Show (CES) with the revelation that the S10 smartphone’s 10-MP front-facing camera and native computer chip were enough to enable an invisible projector keyboard. Essentially a gesture recognition app, SelfieType allows you to prop your phone up like a display and then “type” on any flat surface immediately in front of it, as though there were a QWERTY keyboard at your fingertips. The phone camera and a proprietary AI engine convert your finger motions into text.
As attention-grabbing as SelfieType was, most developers of vision-based gesture recognition continue to bank long term on 3D depth sensing based on structured light, stereoscopic imaging, or time-of-flight technology. All three leverage near-infrared light sources to support varying light conditions, and most incorporate bandpass filters to enhance the image by allowing only the IR emitter’s specific wavelengths to reach the detector.
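Of the three approaches, direct time-of-flight is the simplest to express: the sensor times the round trip of an emitted IR pulse, and depth is half the distance light travels in that interval. A minimal sketch, with illustrative numbers:

```python
# Direct time-of-flight ranging: depth = c * round_trip_time / 2.
C = 299_792_458.0  # speed of light in m/s

def tof_depth_m(round_trip_s):
    """Depth in meters from a measured round-trip pulse time in seconds."""
    return C * round_trip_s / 2

# A surface ~1.5 m away returns the pulse in roughly 10 nanoseconds:
print(round(tof_depth_m(10e-9), 2))  # ≈ 1.5
```

The nanosecond-scale timing this implies is why ToF sensors need specialized high-speed detector circuitry; in practice many devices measure phase shift of a modulated signal rather than timing a single pulse directly.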
Depth sensors designed for gesture tracking and other applications have begun to appear on smartphones. Most often these rely on stereoscopic technology that resolves image depth on a pixel-by-pixel basis by comparing the disparities in image data captured by two embedded sensors. But more sophisticated time-of-flight technology from Sony, LUCID/Helios, and others is beginning to appear on high-end smartphone models from Samsung, OPPO, Honor, and LG.
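The pixel-by-pixel disparity comparison reduces to a simple relationship for a rectified stereo pair: depth is inversely proportional to disparity, scaled by the focal length and the baseline between the two sensors. A sketch with assumed (not vendor-specific) camera parameters:

```python
# Stereo depth from disparity, assuming a rectified camera pair:
# depth = (focal_length_px * baseline_m) / disparity_px.
# The focal length and baseline below are illustrative values.

def depth_from_disparity(disparity_px, focal_px=1000.0, baseline_m=0.04):
    """Depth in meters for one pixel of a rectified stereo pair."""
    if disparity_px <= 0:
        return float("inf")  # no measurable shift -> effectively at infinity
    return focal_px * baseline_m / disparity_px

# With this geometry, a hand 0.5 m away shows an 80-pixel disparity:
print(depth_from_disparity(80))  # 0.5
```

The inverse relationship also explains why stereo depth resolution degrades with distance: far objects produce sub-pixel disparities, which is one reason ToF sensors are attractive at longer ranges.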
Gesture recognition is also finding traction in automotive applications. Sony’s DepthSense time-of-flight (ToF) sensors, for example, now power the gesture recognition features inside the BMW 7 Series, allowing drivers to raise or lower radio volume, accept or reject phone calls, set the navigation for home, and exercise other controls.
In another development at CES, Cerence displayed its Drive platform, a vision-based system aimed at upgrading the driving experience. In addition to powering facial recognition to, say, recognize a particular user when they climb into the driver’s seat and dial up their favorite playlist, the system also tracks eye movement, hand gestures, and voice commands to streamline controls. Drivers can glance to one side, for example, and say, “Close that window,” or point at a landmark outside the car and request more information about it from the vehicle’s voice assistant.
Thumbs-Up for Vision
Enabling computers to identify and track human features still offers more promise than profit to date. In addition to the applications listed here, researchers are exploring the potential of touchless controls in augmented and virtual reality systems, surgical theaters, industrial automation, and aerospace and defense.
In terms of hardware, vision components already provide a solid foundation for the success of many applications, especially as suppliers continue to develop more cost-effective, compact sensors with improved resolution and depth of field. The limiting factor — particularly for gesture tracking — may come down to the computing power required to perform the complex image analysis, motion modeling, and pattern recognition necessary to accurately interpret human gestures. However, as vision engineers continue to evolve their understanding of neural networks and deep learning, they will be well-positioned to help resolve these challenges as well.