These New Tricks Can Outsmart Deepfake Videos—for Now
For weeks, computer scientist Siwei Lyu had watched his team’s deepfake videos with a gnawing sense of unease. Created by a machine learning algorithm, these falsified films showed celebrities doing things they’d never done. They felt eerie to him, and not just because he knew they’d been ginned up. “They don’t look right,” he recalls thinking, “but it’s very hard to pinpoint where that feeling comes from.”
Finally, one day, a childhood memory bubbled up into his brain. He, like many kids, had held staring contests with his open-eyed peers. “I always lost those games,” he says, “because when I watch their faces and they don’t blink, it makes me very uncomfortable.”
These lab-spun deepfakes, he realized, were needling him with the same discomfort: He was losing the staring contest with these film stars, who didn’t open and close their eyes at the rates typical of actual humans.
To find out why, Lyu, a professor at the University at Albany, and his team dug into every step in the software, called DeepFake, that had created them.
Deepfake programs pull in lots of images of a particular person—you, your ex-girlfriend, Kim Jong-un—to catch them at different angles, with different expressions, saying different words. The algorithms learn what this character looks like, and then synthesize that knowledge into a video showing that person doing something he or she never did. Make porn. Make Stephen Colbert spout words actually uttered by John Oliver. Provide a presidential meta-warning about fake videos.
These fakes, while convincing if you watch a few seconds on a phone screen, aren’t perfect (yet). They contain tells, like creepily ever-open eyes, from flaws in their creation process. In looking into DeepFake’s guts, Lyu realized that the images that the program learned from didn’t include many with closed eyes (after all, you wouldn’t keep a selfie where you were blinking, would you?). “This becomes a bias,” he says. The neural network doesn’t get blinking. Programs also might miss other “physiological signals intrinsic to human beings,” says Lyu’s paper on the phenomenon, such as breathing at a normal rate, or having a pulse. (Autonomic signs of constant existential distress are not listed.) While this research focused specifically on videos created with this particular software, it is a truth universally acknowledged that even a large set of snapshots might not adequately capture the physical human experience, and so any software trained on those images may be found lacking.
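To make the blinking tell concrete, here is a minimal sketch of how a blink rate could be measured. It assumes some upstream tool has already produced a per-frame "eye openness" number between 0 and 1. That input, the function names, and the 0.2 threshold are all hypothetical choices for illustration, not the method in Lyu's paper.

```python
def count_blinks(openness, closed_below=0.2):
    """Count open-to-closed transitions in a per-frame eye-openness series.

    `openness` and the `closed_below` threshold are illustrative assumptions.
    """
    blinks = 0
    was_closed = False
    for value in openness:
        is_closed = value < closed_below
        # A blink starts on the first closed frame after an open one.
        if is_closed and not was_closed:
            blinks += 1
        was_closed = is_closed
    return blinks


def blinks_per_minute(openness, fps, closed_below=0.2):
    """Convert a blink count into a rate, given the video's frame rate."""
    if fps <= 0 or len(openness) == 0:
        raise ValueError("need a positive frame rate and at least one frame")
    minutes = len(openness) / fps / 60.0
    return count_blinks(openness, closed_below) / minutes


# Toy clip: one minute at 30 frames per second with three short blinks.
frames = [1.0] * 1800
for start in (100, 500, 900):
    for i in range(start, start + 4):
        frames[i] = 0.05

rate = blinks_per_minute(frames, fps=30)
```

A detector built this way would flag clips whose rate falls far below what is typical for people, which is exactly the kind of check that stops working once fake-makers train on closed-eye images.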
Lyu’s blinking insight exposed a lot of fakes. But a few weeks after his team put a draft of their paper online, they got anonymous emails with links to deeply faked YouTube videos whose stars opened and closed their eyes more normally. The fake content creators had evolved.
Of course they had. As Lyu noted in a piece for The Conversation, “blinking can be added to deepfake videos by including face images with closed eyes or using video sequences for training.” Once you know what your tell is, avoiding it is “just” a technological problem. Which means deepfakes will likely become (or stay) an arms race between the creators and the detectors. But research like Lyu’s can at least make life harder for the fake-makers. “We are trying to raise the bar,” he says. “We want to make the process more difficult, more time-consuming.”
Because right now? It’s pretty easy. You download the software. You Google “Hillary Clinton.” You get tens of thousands of images. You funnel them into the deepfake pipeline. It metabolizes them, learns from them. And while it’s not totally self-sufficient, with a little help, it gestates and gives birth to something new, something sufficiently real.
“It is really blurry,” says Lyu. He doesn’t mean the images. “The line between what is true and what is false,” he clarifies.
That’s as concerning as it is unsurprising to anyone who’s been alive and on the internet lately. But it’s of particular concern to the military and intelligence communities. And that’s part of why Lyu’s research is funded, along with others’ work, by a Darpa program called MediFor—Media Forensics.
MediFor started in 2016 when the agency saw the fakery game leveling up. The project aims to create an automated system that looks at three levels of tells, fuses them, and comes up with an “integrity score” for an image or video. The first level involves searching for dirty digital fingerprints, like noise that’s characteristic of a particular camera model, or compression artifacts. The second level is physical: Maybe the lighting on someone’s face is wrong, or a reflection isn’t the way it should be given where the lamp is. Lastly, they get down to the “semantic level”: comparing the media to things they know are true. So if, say, a video of a soccer game claims to come from Central Park at 2 pm on Tuesday, October 9, 2018, does the state of the sky match the archival weather report? Stack all those levels, and voila: integrity score. By the end of MediFor, Darpa hopes to have prototype systems it can test at scale.
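The article does not say how MediFor actually fuses its three levels, so the following is only a sketch of the general idea. It assumes each level reports a score between 0 (almost certainly manipulated) and 1 (no sign of tampering) and combines them with a weighted average. The class name, the score scale, and the weights are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LevelScores:
    """Hypothetical per-level scores, each in the range 0.0 to 1.0."""
    digital: float   # camera noise, compression artifacts
    physical: float  # lighting, shadows, reflections
    semantic: float  # agreement with known facts, such as weather records


def integrity_score(scores, weights=(1.0, 1.0, 1.0)):
    """Fuse the three levels into one number with a weighted average.

    The averaging rule is an assumption for illustration, not Darpa's method.
    """
    values = (scores.digital, scores.physical, scores.semantic)
    if any(not 0.0 <= v <= 1.0 for v in values):
        raise ValueError("each level score must be between 0 and 1")
    if any(w < 0 for w in weights) or sum(weights) == 0:
        raise ValueError("weights must be non-negative and not all zero")
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# A clip with clean pixels but the wrong sky for its claimed date.
soccer_clip = LevelScores(digital=0.9, physical=0.2, semantic=0.1)
equal_weight = integrity_score(soccer_clip)
semantic_heavy = integrity_score(soccer_clip, weights=(1.0, 1.0, 2.0))
```

Weighting the semantic level more heavily drags the score down further in this toy case, which illustrates why the choice of fusion rule matters as much as the individual detectors.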
But the clock is ticking (or is that just a repetitive sound generated by an AI trained on timekeeping data?). “What you might see in a few years’ time is things like fabrication of events,” says Darpa program manager Matt Turek. “Not just a single image or video that’s manipulated but a set of images or videos that are trying to convey a consistent message.”
Over at Los Alamos National Lab, cyber scientist Juston Moore’s visions of potential futures are a little more vivid. Like this one: Tell an algorithm you want a picture of Moore robbing a drugstore; implant it in that establishment’s security footage; send him to jail. In other words, he’s worried that if evidentiary standards don’t (or can’t) evolve with the fabricated times, people could easily be framed. And if courts don’t think they can rely on visual data, they might also throw out legitimate evidence.
Taken to its logical conclusion, that could mean our pictures end up worth zero words. “It could be that you don’t trust any photographic evidence anymore,” he says, “which is not a world I want to live in.”
That world isn’t totally implausible. And the problem, says Moore, goes far beyond swapping one visage for another. “The algorithms can create images of faces that don’t belong to real people, and they can translate images in strange ways, such as turning a horse into a zebra,” says Moore. They can “imagine away” parts of pictures, and delete foreground objects from videos.
Maybe we can’t combat fakes as fast as people can make better ones. But maybe we can, and that possibility motivates Moore’s team’s digital forensics research. Los Alamos’s program (which combines expertise from its cyber systems, information systems, and theoretical biology and biophysics departments) is younger than Darpa’s, just about a year old. One approach focuses on “compressibility,” or times when there’s not as much information in an image as there seems to be. “Basically we start with the idea that all of these AI generators of images have a limited set of things they can generate,” Moore says. “So even if an image looks really complex to you or me just looking at it, there’s some pretty repeatable structure.” When pixels are recycled, it means there’s not as much there there.
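A rough way to see what "compressibility" means is to hand two byte buffers to an off-the-shelf compressor and compare how much each shrinks. The sketch below uses Python's standard zlib module as a stand-in. The buffers are made-up stand-ins for pixel data, and nothing here is claimed to be the Los Alamos team's actual measure.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by raw size; lower means more repeated structure."""
    if not data:
        raise ValueError("data must be non-empty")
    return len(zlib.compress(data, 9)) / len(data)


# Hypothetical stand-ins for pixel buffers of equal size.
tile = bytes(range(64))
recycled = tile * 256                 # the same 64-byte tile, repeated
noisy = os.urandom(len(recycled))     # no repeatable structure at all

recycled_ratio = compression_ratio(recycled)
noisy_ratio = compression_ratio(noisy)
```

The recycled buffer collapses to a small fraction of its size while the random one barely shrinks. Real photographs and real generators sit somewhere in between, so a practical test would need a far more careful model than a general-purpose compressor.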
They’re also using sparse coding algorithms to play a kind of matching game. Say you have two collections: a bunch of real pictures, and a bunch of made-up representations from a particular AI. The algorithm pores over them, building up what Moore calls “a dictionary of visual elements,” namely what the fictional pics have in common with each other and what the nonfictional shots uniquely share. If Moore’s friend retweets a picture of Obama, and Moore thinks maybe it’s from that AI, he can run it through the program to see which of the two dictionaries—the real or the fake—best defines it.
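The matching game can be sketched in a few lines, with heavy simplifications. Here each "dictionary" is just a set of normalized example patches rather than learned visual elements, the patches are synthetic vectors instead of images, and a greedy routine (orthogonal matching pursuit) explains a new patch using only a few dictionary entries. Whichever dictionary leaves less unexplained wins. All names and data below are hypothetical, and this is not presented as Moore's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_patches(basis, count):
    """Draw toy 'patches' as random mixes of a few basis vectors, unit length."""
    mixes = rng.normal(size=(basis.shape[1], count))
    patches = basis @ mixes
    return patches / np.linalg.norm(patches, axis=0, keepdims=True)


def omp_residual(dictionary, signal, n_atoms):
    """Orthogonal matching pursuit: norm of what n_atoms entries cannot explain."""
    residual = signal.astype(float).copy()
    chosen = []
    for _ in range(n_atoms):
        scores = np.abs(dictionary.T @ residual)
        if chosen:
            scores[chosen] = -1.0  # never pick the same entry twice
        chosen.append(int(np.argmax(scores)))
        sub = dictionary[:, chosen]
        coeffs, *_ = np.linalg.lstsq(sub, signal, rcond=None)
        residual = signal - sub @ coeffs
    return float(np.linalg.norm(residual))


def classify(signal, real_dict, fake_dict, n_atoms=3):
    """Label a patch by the dictionary that reconstructs it with less error."""
    real_err = omp_residual(real_dict, signal, n_atoms)
    fake_err = omp_residual(fake_dict, signal, n_atoms)
    return "real" if real_err <= fake_err else "fake"


# Two toy sources, each with its own limited set of building blocks.
real_basis = rng.normal(size=(16, 3))
fake_basis = rng.normal(size=(16, 3))
real_dict = make_patches(real_basis, 20)
fake_dict = make_patches(fake_basis, 20)

suspect = make_patches(fake_basis, 1)[:, 0]
verdict = classify(suspect, real_dict, fake_dict)
```

In this toy setup the two sources are cleanly separable by construction. With actual photos and actual generators the dictionaries overlap, so the reconstruction errors would be closer together and the verdict less certain.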
Los Alamos, which has one of the world’s most powerful supercomputers, isn’t pouring resources into this program just because someone might want to frame Moore for a robbery. The lab’s mission is “to solve national security challenges through scientific excellence.” And its core focus is nuclear security—making sure bombs don’t explode when they’re not supposed to, and do when they are (please no), and aiding in nonproliferation. That all requires general expertise in machine learning, because it helps with, as Moore says, “making powerful inferences from small datasets.”
But beyond that, places like Los Alamos need to be able to believe—or, to be more realistic, to know when not to believe—their eyes. Because what if you see satellite images of a country mobilizing or testing nuclear weapons? What if someone synthesized sensor measurements?
That’s a scary future, one that work like Moore’s and Lyu’s will ideally circumvent. But in that lost-cause world, seeing is not believing, and seemingly concrete measurements are mere creations. Anything digital is in doubt.
But maybe “in doubt” is the wrong phrase. Many people will take fakes at face value (remember that picture of a shark in Houston?), especially if the content meshes with what they already think. “People will believe whatever they’re inclined to believe,” says Moore.
That’s likely more true in the casual news-consuming public than in the national security sphere. And to help halt the spread of misinformation among us dopes, Darpa is open to future partnerships with social media platforms, to help users determine that that video of Kim Jong-un doing the Macarena has low integrity. Social media can also, Turek points out, spread a story debunking a given video as quickly as it spreads the video itself.
But even if no one could change the masses’ minds about a video’s veracity, it’s important that the people making political and legal decisions—about who’s moving missiles or murdering someone—try to machine a way to tell the difference between waking reality and an AI dream.