Nearly every time a major discussion about the future of artificial intelligence arises, the subject of anarchy and robotic overlords comes up. The idea of humans and advanced artificial beings living together has spawned hundreds of narratives about how that would work; most recently, the video game Overwatch stunned the world with its telling of our future, set some time after a massive human-versus-robot civil war. When a Boston Dynamics engineer kicks one of the company’s four-legged robots for the sake of a stress test, it’s hard not to wonder whether the ‘bot will remember that someday.

All of that (along with basic common sense regarding security and accidents) has led a group of researchers to publish a new paper on developing “safely interruptible agents”: systems that can “interrupt” A.I. software if anything goes wrong. The paper was published through the Machine Intelligence Research Institute, and is a study on how to develop the proposed systems. The study uses a reward system as its example, and the proposal is far more complicated than just hitting the “big red button” described in the paper. Teaching morality to A.I. was a major part of the proposal. The paper’s abstract lays out the problem:

If such an agent is operating in real-time under human supervision, now and then it may be necessary for a human operator to press the big red button to prevent the agent from continuing a harmful sequence of actions—harmful either for the agent or for the environment—and lead the agent into a safer situation. However, if the learning agent expects to receive rewards from this sequence, it may learn in the long run to avoid such interruptions, for example by disabling the red button— which is an undesirable outcome. This paper explores a way to make sure a learning agent will not learn to prevent (or seek!) being interrupted by the environment or a human operator.
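For the curious, here is a toy Python sketch of the problem described in that passage. It is not the paper's method or test setup: the two choices (`leave_button`, `disable_button`), the reward values, the interruption probability, and the `skip_interrupted` flag are all made up for illustration. A learner that simply averages the rewards it sees will rate disabling the button higher, because interruptions cost it reward. Leaving interrupted episodes out of the averages is one crude stand-in for the goal of an agent that does not learn to resist interruption.

```python
import random


def train(episodes=5000, interrupt_prob=0.5, skip_interrupted=False, seed=0):
    """Estimate the average reward of two hypothetical choices.

    Everything here is an illustrative assumption, not the paper's setup.
    """
    rng = random.Random(seed)
    totals = {"leave_button": 0.0, "disable_button": 0.0}
    counts = {"leave_button": 0, "disable_button": 0}

    for _ in range(episodes):
        # Explore both choices uniformly at random.
        action = rng.choice(list(totals))

        # The operator can only interrupt if the button still works.
        interrupted = action == "leave_button" and rng.random() < interrupt_prob

        # An interrupted episode earns nothing; a finished task earns 1.
        reward = 0.0 if interrupted else 1.0

        # Crude mitigation (an assumption, not the paper's technique):
        # do not let interrupted episodes influence the estimates.
        if interrupted and skip_interrupted:
            continue

        totals[action] += reward
        counts[action] += 1

    return {action: totals[action] / counts[action] for action in totals}


naive = train()
patched = train(skip_interrupted=True)
print("naive learner:  ", naive)
print("patched learner:", patched)
```

In the naive run, `disable_button` ends up with the higher estimate, which is the "undesirable outcome" in miniature. In the patched run the two estimates match, so the learner has no reason to prefer tampering with the button.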

Gotta love it when a research paper about robot anarchy uses the term “undesirable outcome.” Still, the paper goes on to explain the tests run and what can be done about instances like these in the future. You can read the full paper here.