Our inability to read other people has led to some epic high five fails and missed kisses. Even after a lifetime of experience, human interactions are hard to predict. But researchers at MIT’s Computer Science and Artificial Intelligence Laboratory think they can help: With a new deep-learning algorithm that can predict when two people will hug, kiss, shake hands, or high five, they’ve taken a big step toward a future blessedly devoid of those awkward moments.
They’re hoping their new algorithm — trained on 600 hours of YouTube videos and TV shows like The Office, Scrubs, Big Bang Theory, and Desperate Housewives — can be used to program less socially awkward robots and develop Google Glass-style headsets to suggest actions for us before we even have the chance to miss. In the future they’re imagining, you’ll never again mess up a chance to air high-five with your co-worker.
Realizing that robots learn to be social in the same ways we do was key to the algorithm’s success. “Humans automatically learn to anticipate actions through experience, which is what made us interested in trying to imbue computers with the same sort of common sense,” says CSAIL Ph.D. student Carl Vondrick, the first author on a related paper being presented this week at the International Conference on Computer Vision and Pattern Recognition. “We wanted to show that just by watching large amounts of video, computers can gain enough knowledge to consistently make predictions about their surroundings.”
Vondrick and his team taught the algorithm’s multiple “neural networks” to analyze huge amounts of data on their own: in this case, hours of Jim and Pam’s high fives and Mike and Susan’s surreptitious kisses. Taking into account factors like outstretched arms, a raised hand, or a prolonged gaze, each of the neural networks guessed what was going to happen in the next second, and the general consensus of the networks was taken as the final “prediction” in the study.
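For the curious, here is one way the "general consensus" step could look in code. This is only a minimal sketch, not the researchers' implementation: it assumes each network casts a single vote for one of the four actions and that the most common vote wins. The names ACTIONS and consensus_prediction are made up for illustration, and the real system may combine its networks differently.

```python
from collections import Counter

# The four interactions named in the article. The label strings are illustrative.
ACTIONS = ("hug", "kiss", "handshake", "high five")


def consensus_prediction(votes):
    """Return the action most networks voted for.

    Assumption: a simple majority vote. Ties go to whichever of the
    tied actions appears first in the list of votes.
    """
    if not votes:
        raise ValueError("need at least one vote")
    for vote in votes:
        if vote not in ACTIONS:
            raise ValueError(f"unknown action: {vote!r}")

    counts = Counter(votes)
    top_count = max(counts.values())
    # Walk the votes in order so ties are broken deterministically.
    for vote in votes:
        if counts[vote] == top_count:
            return vote


# Hypothetical example: three networks look at the same clip.
final = consensus_prediction(["high five", "handshake", "high five"])
```

In this toy example two of the three networks expect a high five, so that becomes the final guess. Averaging each network's confidence scores instead of counting votes would be an equally plausible reading of "consensus."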
The algorithm got it right over 43 percent of the time. While that might not seem high enough to guarantee that our day-to-day interactions will be any less weird, it’s a big improvement on existing algorithms, which have a precision of only 36 percent.
Besides, humans can only predict actions 71 percent of the time. We need all the help we can get.
In the second part of the study, the algorithm was taught to predict what object — domestic sitcom staples like remotes, dishes, and trash cans — would appear in the scene five seconds later. For example, if a microwave door is opened, there’s a relatively high chance a mug will appear next.
Their algorithm isn’t accurate enough for Google Glass just yet, but with co-author Antonio Torralba, Ph.D., funded by a Google faculty research award and Vondrick working under a Google Ph.D. fellowship, we can bet it gets there. Future versions of the algorithm, Vondrick predicts, could be used to program robots to interact with humans or even teach security cameras to register when a person falls or gets injured.
“A video isn’t like a ‘Choose Your Own Adventure’ book where you can see all of the potential paths,” says Vondrick. “The future is inherently ambiguous, so it’s exciting to challenge ourselves to develop a system that uses these representations to anticipate all of the possibilities.”