Anyone Can Make Vladimir Putin Smile

Will this latest version of facial motion capture mean we can't trust online video anymore? Not necessarily. But it is pretty fast. Screenshot from Face2Face YouTube Video

Apparently, all it takes to make Vladimir Putin smile is someone else smiling, a webcam and good graphics hardware.

Computer scientists from the University of Erlangen-Nuremberg, the Max Planck Institute for Informatics and Stanford University have developed an approach to facial motion capture that achieves what the team calls "photorealistic" results using equipment most of us have at home. And it does it in real time.


Some are calling it the death knell of online video as evidence. Others are just impressed: The demo video for the system, called Face2Face, is fairly remarkable.

In the demo, posted in March 2016, the researchers show what their software can do with a live "actor," a webcam and YouTube footage of the usually stoic Russian president. As the actor goes through a range of facial expressions in front of the webcam, each expression appears at the same moment on Putin's face in the video. When the actor makes fish-like mouth movements, Putin makes fish-like mouth movements. When the actor suddenly smiles, Putin does, too. And it all looks real.


Motion Capture and Re-enactment

Motion capture, or mocap, is essentially the process of turning a live person's movements into computer data and then applying that data to a different, digitized form. The technology has various applications, including in sports training and medicine, but it's probably best known for its movie work.

Like the motion capture systems that animated Gollum in "The Lord of the Rings" and gave Benjamin Button the facial expressions of Brad Pitt, the Face2Face software captures movements from a live source and re-enacts them on a digital target. But in this case, the target isn't computer-generated (though the system can do that, too). It's an actual person's face in a pre-recorded, RGB video.


In this screenshot, you can see the source actor (who's live) and the target actor, Putin (pulled from a clip), and how the two mesh in the real-time re-enactment.
Screenshot of Face2Face YouTube Video

Standard cameras are RGB — they record data from red, green and blue color sensors. Motion-capture systems typically use RGB-D cameras (like the Microsoft Kinect), which add a sensor for depth.
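To make the RGB versus RGB-D distinction concrete, here is a minimal sketch of what one pixel holds in each case. The type names, the millimeter depth unit and the sample values are illustrative assumptions, not details from the Face2Face project.

```python
from typing import NamedTuple


class RGBPixel(NamedTuple):
    """One pixel from a standard camera: three color channels."""
    red: int
    green: int
    blue: int


class RGBDPixel(NamedTuple):
    """One pixel from a depth camera: color plus distance from the sensor."""
    red: int
    green: int
    blue: int
    depth_mm: int  # assumed unit; real devices differ


def drop_depth(pixel: RGBDPixel) -> RGBPixel:
    """Discard the depth channel, leaving what an ordinary webcam would record."""
    return RGBPixel(pixel.red, pixel.green, pixel.blue)


# Hypothetical sample: a skin-toned pixel about 80 cm from the camera.
kinect_style = RGBDPixel(red=210, green=170, blue=150, depth_mm=800)
webcam_style = drop_depth(kinect_style)
```

A system limited to RGB input has to infer the face's 3-D shape from color alone, which is part of why working from ordinary video is harder.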

According to the project authors, there are other facial motion capture systems that work with RGB footage, but not in real time.


In (Less Than) the Blink of an Eye

As described by co-author Justus Thies, computer science instructor at the University of Erlangen-Nuremberg in Germany, the capture and re-enactment process starts with modeling. The software analyzes webcam images of the source (the live performer) and video footage of the target (Putin), gathering data on facial traits and movements. It only needs about six frames, according to Thies. The software then uses this data to make adjustments to the closest matching synthetic face models in the software's database, producing accurate 3-D models of both faces.

Then, Thies writes in an email, "Knowing the geometry of two persons, we are able to transfer the expressions from one person to the other person based on a new deformation transfer technique." That transfer technique is unique to Face2Face. It tracks the way the source's face model "deforms" to achieve an expression and applies those same deformations to the target's face model.
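The idea of moving an expression from one face model to another can be illustrated with a toy example. The sketch below assumes a simple linear "blendshape" model, in which a face is a neutral shape plus weighted expression offsets. That model, the function name and all the numbers are assumptions chosen for illustration; this is not the researchers' deformation transfer technique, which the article does not describe in detail.

```python
from typing import List

Vec = List[float]


def apply_expression(neutral: Vec, deltas: List[Vec], weights: Vec) -> Vec:
    """Return neutral + sum(weights[i] * deltas[i]) for a linear face model."""
    out = list(neutral)
    for weight, delta in zip(weights, deltas):
        for k, offset in enumerate(delta):
            out[k] += weight * offset
    return out


# Hypothetical toy faces: two numbers standing in for mouth-corner heights.
source_neutral = [0.0, 0.0]
target_neutral = [1.0, 1.0]

# Each face has its own "smile" offset, because the two faces differ in shape.
source_smile = [[0.5, 0.5]]
target_smile = [[0.2, 0.2]]

# Weights estimated from the live source (here: a full smile).
smile_weights = [1.0]

# Re-using the source's weights on the target's own offsets transfers the
# expression while keeping the target's identity.
source_smiling = apply_expression(source_neutral, source_smile, smile_weights)
target_smiling = apply_expression(target_neutral, target_smile, smile_weights)
```

The point of the sketch is the separation of identity (the neutral shape and offsets) from expression (the weights): only the latter crosses from source to target.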


Ultimately, the software re-renders the target video using the new, "deformed" face model. Thies says the system runs at about 28 frames per second. This means the entire modeling, capture and re-enactment process takes about 0.04 seconds per video frame. Achieving enhanced realism at that speed is a feat.
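The per-frame figure follows directly from the frame rate. The short calculation below shows the arithmetic; the function name is just a label for this example.

```python
def seconds_per_frame(frames_per_second: float) -> float:
    """Time available for one frame at a given frame rate."""
    if frames_per_second <= 0:
        raise ValueError("frame rate must be positive")
    return 1.0 / frames_per_second


# 28 frames per second is the rate quoted by Thies.
budget = seconds_per_frame(28.0)
print(f"{budget:.4f} seconds per frame")  # prints "0.0357 seconds per frame"
```

Rounded to two decimal places, that is the 0.04 seconds cited above.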

"Previous methods that run in real-time use a sparse measurement, e.g., a few feature points around the mouth, eyes and silhouette," Thies writes. Face2Face, on the other hand, looks at every pixel comprising the face.
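The difference between a sparse and a dense measurement can be sketched in a few lines. The two error functions below are common textbook formulations chosen for illustration, and the tiny grayscale "images" are made up; the article says only that Face2Face considers every face pixel, not how it scores the match.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Image = List[List[float]]  # grayscale intensities between 0 and 1


def sparse_landmark_error(model_pts: List[Point], detected_pts: List[Point]) -> float:
    """Sum of squared distances between a few model and detected feature points."""
    return sum(
        (mx - dx) ** 2 + (my - dy) ** 2
        for (mx, my), (dx, dy) in zip(model_pts, detected_pts)
    )


def dense_photometric_error(rendered: Image, observed: Image,
                            face_mask: List[List[bool]]) -> float:
    """Sum of squared intensity differences over every pixel inside the face."""
    total = 0.0
    for r_row, o_row, m_row in zip(rendered, observed, face_mask):
        for r, o, inside in zip(r_row, o_row, m_row):
            if inside:
                total += (r - o) ** 2
    return total


# Hypothetical case: the feature points line up perfectly...
model_pts = [(0.0, 0.0), (2.0, 0.0)]
detected_pts = [(0.0, 0.0), (2.0, 0.0)]

# ...but one pixel between them is rendered too dark.
rendered = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]
observed = [[0.5, 0.5, 0.5], [0.5, 1.0, 0.5]]
face_mask = [[True, True, True], [True, True, True]]

sparse = sparse_landmark_error(model_pts, detected_pts)
dense = dense_photometric_error(rendered, observed, face_mask)
```

Here the sparse measure reports a perfect fit while the dense one still detects a mismatch, which is why a per-pixel comparison can capture detail that a handful of feature points misses. It also means far more work per frame.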

Thies attributes the ability to do all of this in real time to "efficient implementation on modern GPUs."


Beyond Movies

The authors believe Face2Face may eventually help Hollywood generate more realistic expressions in CG faces, as well as make adjustments to live actors' faces already on film. It's also well-suited to foreign-film dubbing, which could be more widely palatable if the actors' mouth movements match the translated dialogue. (A poster wonders how long we've got before James Dean starts appearing in new roles.)

Yet the more noteworthy applications may be the day-to-day ones. The ability to manipulate faces realistically, in real time and in ordinary online video raises some interesting possibilities. Video game avatars could more accurately reflect gamers' facial expressions as they play. In international teleconferences and live TV broadcasts, speakers' mouths could be re-rendered on the fly to match their translators' words. The authors also see applications in fraud detection, where the software could locate facial inconsistencies "by analyzing the tracked expressions in a video sequence and comparing them to a reference video sequence."


Some of this is a ways off. But the project is still underway.

For now, it can put a big old grin on Vladimir Putin's face in a fraction of a second, and make it look real. It's a score for video doctoring, no doubt — but as Martin Anderson points out on The Stack, until voice simulation achieves the same level of realism, video proof remains relatively safe.