The secret to super smart A.I. may be hidden inside an Atari cartridge

Artificial intelligence has flexed its computational prowess in recent years by besting humans at games of smarts like Go and even on the game show Jeopardy!

Now scientists are exploring yet another final frontier of human achievement: a backlog of Atari platformers.

When it comes to navigating complex environments, video games are a perfect playground for testing out new-and-improved algorithmic approaches. In a study published Wednesday in the journal Nature, a team of researchers tested a family of algorithms they call Go-Explore on notoriously tricky Atari games, including Montezuma’s Revenge and Pitfall.

The researchers report that Go-Explore not only performed with "super-human" ability but also bested existing algorithms that had attempted to defeat these games.

Why it matters — Far beyond simply smashing dusty Atari cartridges, the authors suggest algorithms like Go-Explore, which are especially good for maneuvering complex environments, could be the future of "generally intelligent agents" that can advance drug development, robotics, and more.

Here's the background — When it comes to explaining the problems plaguing algorithms that explore twisting and turning worlds — like the algorithms inside your Roomba — a refrigerator is a great example of what can go wrong, the study team explains.

The key friction is between giving an A.I. enough reward to complete a task (like moving toward a fridge) and supplying rewards that may be "deceptive."

"[T]o guide a robot to a refrigerator, one might provide a reward only when the refrigerator is reached, but doing so makes the reward ‘sparse’ if many actions are required to reach the refrigerator," write the authors. As a result, an algorithm might not be properly motivated to reach its goal.

They write:

"Unfortunately, a denser reward (for example, the Euclidean distance to the refrigerator) can be ‘deceptive’; naively following the reward function may lead the robot into a dead end and can also produce unintended (and potentially unsafe) behavior (for example, the robot not detouring around obstacles like pets)."

Essentially, if an algorithm's reward system is not perfectly outlined, it may fail its task altogether.
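
To make the fridge example concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than anything from the study: a hypothetical 5x5 room, a wall with a single gap, and a "robot" that naively steps to whichever neighboring square scores highest. With the sparse reward the robot gets no guidance at all, and with the Euclidean-distance reward it walks straight into the wall and stops.

```python
import math

# Hypothetical 5x5 room (an assumption for illustration): the robot starts
# at (0, 2), the fridge is at (4, 2), and a wall fills column 2 except for
# a gap at (2, 0).
START, FRIDGE = (0, 2), (4, 2)
WALL = {(2, 1), (2, 2), (2, 3), (2, 4)}
MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]


def sparse_reward(pos):
    """Reward only when the fridge is reached."""
    return 1.0 if pos == FRIDGE else 0.0


def dense_reward(pos):
    """Negative Euclidean distance to the fridge (higher is better)."""
    return -math.dist(pos, FRIDGE)


def neighbors(pos):
    """Squares reachable in one step: inside the room and not a wall."""
    for dx, dy in MOVES:
        nxt = (pos[0] + dx, pos[1] + dy)
        if 0 <= nxt[0] < 5 and 0 <= nxt[1] < 5 and nxt not in WALL:
            yield nxt


def greedy_walk(reward, max_steps=20):
    """Naively follow the reward: move only if a neighbor scores strictly higher."""
    pos = START
    for _ in range(max_steps):
        best = max(neighbors(pos), key=reward)
        if reward(best) <= reward(pos):
            break  # no neighbor improves the reward, so the robot stops
        pos = best
    return pos


print(greedy_walk(sparse_reward))  # never leaves the start square
print(greedy_walk(dense_reward))   # stops against the wall at (1, 2)
```

A route to the fridge does exist through the gap at (2, 0), but taking it means first moving away from the fridge, which the distance-based reward punishes at every step.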

Yet even within algorithms that better account for sparse or dense rewards, two problems remain:

  • An algorithm can become detached (meaning it prematurely stops returning to certain areas)
  • An algorithm can become derailed (meaning it fails to first return to a state before exploring from it)

Turns out acing an old-school Atari game is good for science

To solve these problems, the team developed a memory trick to help Go-Explore remember where it's been and explore a new environment without missing any hidden corners, kind of like embedding save points in its memory.

What they did — Go-Explore's algorithm works by systematically cataloging every part of the environment it has visited in an easily accessible archive.

The algorithm also makes use of a 'go and return' approach, meaning it first returns to a previously explored location (similar to a save point) and only then explores somewhere new. Returning directly, instead of having the A.I. also explore on the way back, means it's less likely to get lost along the way.
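
The remember, return, explore loop described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not the researchers' implementation: the corridor world, the `step` function, the choice to treat each position as its own archive entry, and the rule of favoring rarely picked entries are all hypothetical stand-ins chosen to keep the example small.

```python
import random

# Hypothetical toy world (an assumption for illustration): a corridor of
# positions 0..GOAL. The only thing worth finding sits at the far end.
GOAL = 30
ACTIONS = (-1, +1)


def step(pos, action):
    """Deterministic move; positions are clamped to the corridor."""
    return min(max(pos + action, 0), GOAL)


def go_explore(iterations=5000, explore_steps=5, seed=0):
    rng = random.Random(seed)
    # Archive: position -> shortest action sequence found so far that
    # reaches it from the start. These are the "save points".
    archive = {0: []}
    picks = {0: 0}
    for _ in range(iterations):
        # 1. Select a save point, favoring ones picked less often.
        cells = list(archive)
        weights = [1.0 / (1 + picks[c]) for c in cells]
        cell = rng.choices(cells, weights=weights)[0]
        picks[cell] += 1
        # 2. Return: replay the stored actions, with no exploring on the way.
        pos = 0
        for action in archive[cell]:
            pos = step(pos, action)
        path = list(archive[cell])
        # 3. Explore: take a few random actions from the save point.
        for _ in range(explore_steps):
            action = rng.choice(ACTIONS)
            pos = step(pos, action)
            path.append(action)
            # 4. Remember: store new positions, or shorter routes to old ones.
            if pos not in archive or len(path) < len(archive[pos]):
                archive[pos] = list(path)
                picks.setdefault(pos, 0)
        if GOAL in archive:
            break
    return archive


archive = go_explore()
print(GOAL in archive, len(archive))
```

For comparison, a single run of 30 random moves from the start reaches the end of this corridor only if every move happens to go right, a chance of one in 2^30. Because the archive keeps every position reached so far, each new burst of exploration starts from the frontier instead of from scratch.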

With this algorithm design in place, the A.I. was let loose in a group of 11 Atari games to put its abilities to the test. Ultimately, the A.I. processed 30 billion frames of information while playing these classic games.

In addition to exploring the game environments, the researchers also designed a simulated robot arm tasked with moving cups into locked cabinets to see how this algorithm might fare in the real world.

What they discovered — When it comes to scoring big on the Atari games, the authors report that their A.I. blew the competition out of the water.

"[T]he mean performance of Go-Explore is both superhuman and surpasses the state of the art in all 11 games," write the authors. For Montezuma's Revenge, in particular, the A.I. had a mean score of 1.7 million, far above the human high score of 1.2 million.

As for the simulated robot and cups, the authors report that the A.I. was able to quickly discover a successful trajectory for putting away the objects. However, while the A.I. did a good job exploring this realistic environment, it still failed to achieve certain tasks within it reliably, such as grasping the cups.

What's next — Hitting new high scores on decades-old video games isn't likely to save the world, but the authors write that this algorithm could take on new importance in pretty life-saving ways — including in advanced drug development. It's possible the algorithm could use similar exploration skills to explore a chemical landscape.

To reach these goals, future iterations of this research will need to improve the generality of the Go-Explore algorithms as well as their efficiency.

Abstract: Reinforcement learning promises to solve complex sequential-decision problems autonomously by specifying a high-level reward function only. However, reinforcement learning algorithms struggle when, as is often the case, simple and intuitive rewards provide sparse and deceptive feedback. Avoiding these pitfalls requires a thorough exploration of the environment, but creating algorithms that can do so remains one of the central challenges of the field. Here we hypothesize that the main impediment to effective exploration originates from algorithms forgetting how to reach previously visited states (detachment) and failing to first return to a state before exploring from it (derailment). We introduce Go-Explore, a family of algorithms that addresses these two challenges directly through the simple principles of explicitly ‘remembering’ promising states and returning to such states before intentionally exploring. Go-Explore solves all previously unsolved Atari games and surpasses the state of the art on all hard-exploration games, with orders-of-magnitude improvements on the grand challenges of Montezuma’s Revenge and Pitfall. We also demonstrate the practical potential of Go-Explore on a sparse-reward pick-and-place robotics task. Additionally, we show that adding a goal-conditioned policy can further improve Go-Explore’s exploration efficiency and enable it to handle stochasticity throughout training. The substantial performance gains from Go-Explore suggest that the simple principles of remembering states, returning to them, and exploring from them are a powerful and general approach to exploration—an insight that may prove critical to the creation of truly intelligent learning agents.