2021 AI safety:It’s immediately clear how nuclear bombs will kill us. No one working on mitigating nuclear risk has to start by explaining why it’d be a bad thing if we had a nuclear war.The case that AI could pose an existential risk to humanity is more complicated and harder to grasp. So many of the people who are working to build safe AI systems have to start by explaining why AI systems, by default, are dangerous.
The idea that AI can become a danger is rooted in the fact that AI systems pursue their goals, whether or not those goals are what we really intended — and whether or not we’re in the way. “You’re probably not an evil ant-hater who steps on ants out of malice,” Stephen Hawking wrote, “but if you’re in charge of a hydroelectric green-energy project and there’s an anthill in the region to be flooded, too bad for the ants. Let’s not place humanity in the position of those ants.”
Here’s one scenario that keeps experts up at night: We develop a sophisticated AI system with the goal of, say, estimating some number with high confidence. The AI realizes it can achieve more confidence in its calculation if it uses all the world’s computing hardware, and it realizes that releasing a biological superweapon to wipe out humanity would allow it free use of all the hardware. Having exterminated humanity, it then calculates the number with higher confidence.
It is easy to design an AI that averts that specific pitfall. But there are lots of ways that unleashing powerful computer systems will have unexpected and potentially devastating effects, and avoiding all of them is a much harder problem than avoiding any specific one.
Victoria Krakovna, an AI researcher at DeepMind (now a division of Alphabet, Google’s parent company), compiled a list of examples of “specification gaming”: the computer doing what we told it to do but not what we wanted it to do. For example, we tried to teach AI organisms in a simulation to jump, but we did it by teaching them to measure how far their “feet” rose above the ground. Instead of jumping, they learned to grow into tall vertical poles and do flips — they excelled at what we were measuring, but they didn’t do what we wanted them to do.
An AI playing the Atari exploration game Montezuma’s Revenge found a bug that let it force a key in the game to reappear, thereby allowing it to earn a higher score by exploiting the glitch. An AI playing a different game realized it could get more points by falsely inserting its name as the owner of high-value items.
Sometimes, the researchers didn’t even know how their AI system cheated: “the agent discovers an in-game bug. … For a reason unknown to us, the game does not advance to the second round but the platforms start to blink and the agent quickly gains a huge amount of points (close to 1 million for our episode time limit).”
What these examples make clear is that in any system that might have bugs or unintended behavior or behavior humans don’t fully understand, a sufficiently powerful AI system might act unpredictably — pursuing its goals through an avenue that isn’t the one we expected.
In his 2009 paper “The Basic AI Drives,” Steve Omohundro, who has worked as a computer science professor at the University of Illinois Urbana-Champaign and as the president of Possibility Research, argues that almost any AI system will predictably try to accumulate more resources, become more efficient, and resist being turned off or modified: “These potentially harmful behaviors will occur not because they were programmed in at the start, but because of the intrinsic nature of goal driven systems.”
His argument goes like this: Because AIs have goals, they’ll be motivated to take actions that they can predict will advance their goals. An AI playing a chess game will be motivated to take an opponent’s piece and advance the board to a state that looks more winnable.
But the same AI, if it sees a way to improve its own chess evaluation algorithm so it can evaluate potential moves faster, will do that too, for the same reason: It’s just another step that advances its goal.
If the AI sees a way to harness more computing power so it can consider more moves in the time available, it will do that. And if the AI detects that someone is trying to turn off its computer mid-game, and it has a way to disrupt that, it’ll do it. It’s not that we would instruct the AI to do things like that; it’s that whatever goal a system has, actions like these will often be part of the best path to achieve that goal.
That means that any goal, even innocuous ones like playing chess or generating advertisements that get lots of clicks online, could produce unintended results if the agent pursuing it has enough intelligence and optimization power to identify weird, unexpected routes to achieve its goals.
Goal-driven systems won’t wake up one day with hostility to humans lurking in their hearts. But they will take actions that they predict will help them achieve their goal — even if we’d find those actions problematic, even horrifying. They’ll work to preserve themselves, accumulate more resources, and become more efficient. They already do that, but it takes the form of weird glitches in games. As they grow more sophisticated, scientists like Omohundro predict more adversarial behavior.
When did scientists first start worrying about AI risk?
Scientists have been thinking about the potential of artificial intelligence since the early days of computers. In the famous paper where he put forth the Turing test for determining if an artificial system is truly “intelligent,” Alan Turing wrote:
Let us now assume, for the sake of argument, that these machines are a genuine possibility, and look at the consequences of constructing them. … There would be plenty to do in trying to keep one’s intelligence up to the standards set by the machines, for it seems probable that once the machine thinking method had started, it would not take long to outstrip our feeble powers. … At some stage therefore we should have to expect the machines to take control.
I.J. Good worked closely with Turing and reached the same conclusions, according to his assistant, Leslie Pendleton. In an excerpt from unpublished notes Good wrote shortly before he died in 2009, he writes about himself in third person and notes a disagreement with his younger self — while as a younger man, he thought powerful AIs might be helpful to us, the older Good expected AI to annihilate us.
[The paper] “Speculations Concerning the First Ultra-intelligent Machine” (1965) … began: “The survival of man depends on the early construction of an ultra-intelligent machine.” Those were his words during the Cold War, and he now suspects that “survival” should be replaced by “extinction.” He thinks that, because of international competition, we cannot prevent the machines from taking over. He thinks we are lemmings. He said also that “probably Man will construct the deus ex machina in his own image.”
In the 21st century, with computers quickly establishing themselves as a transformative force in our world, younger researchers started expressing similar worries.
Nick Bostrom is a professor at the University of Oxford, the director of the Future of Humanity Institute, and the director of the Governance of Artificial Intelligence Program. He researches risks to humanity, both in the abstract — asking questions like why we seem to be alone in the universe — and in concrete terms, analyzing the technological advances on the table and whether they endanger us. AI, he concluded, endangers us.
In 2014, he wrote a book explaining the risks AI poses and the necessity of getting it right the first time, concluding, “once unfriendly superintelligence exists, it would prevent us from replacing it or changing its preferences. Our fate would be sealed.”
Across the world, others have reached the same conclusion. Bostrom co-authored a paper on the ethics of artificial intelligence with Eliezer Yudkowsky, founder of and research fellow at the Berkeley Machine Intelligence Research Institute (MIRI), an organization that works on better formal characterizations of the AI safety problem.
Yudkowsky started his career in AI by worriedly poking holes in others’ proposals for how to make AI systems safe, and has spent most of it working to persuade his peers that AI systems will, by default, be unaligned with human values (not necessarily opposed to but indifferent to human morality) — and that it’ll be a challenging technical problem to prevent that outcome.
Increasingly, researchers realized that there’d be challenges that hadn’t been present with AI systems when they were simple. “‘Side effects’ are much more likely to occur in a complex environment, and an agent may need to be quite sophisticated to hack its reward function in a dangerous way. This may explain why these problems have received so little study in the past, while also suggesting their importance in the future,” concluded a 2016 research paper on problems in AI safety.
Bostrom’s book Superintelligence was compelling to many people, but there were skeptics. “No, experts don’t think superintelligent AI is a threat to humanity,” argued an op-ed by Oren Etzioni, a professor of computer science at the University of Washington and CEO of the Allan Institute for Artificial Intelligence. “Yes, we are worried about the existential risk of artificial intelligence,” replied a dueling op-ed by Stuart Russell, an AI pioneer and UC Berkeley professor, and Allan DaFoe, a senior research fellow at Oxford and director of the Governance of AI program there.
It’s tempting to conclude that there’s a pitched battle between AI-risk skeptics and AI-risk believers. In reality, they might not disagree as profoundly as you would think.
Facebook’s chief AI scientist Yann LeCun, for example, is a vocal voice on the skeptical side. But while he argues we shouldn’t fear AI, he still believes we ought to have people working on, and thinking about, AI safety. “Even if the risk of an A.I. uprising is very unlikely and very far in the future, we still need to think about it, design precautionary measures, and establish guidelines,” he writes.
That’s not to say there’s an expert consensus here — far from it. There is substantial disagreement about which approaches seem likeliest to bring us to general AI, which approaches seem likeliest to bring us to safe general AI, and how soon we need to worry about any of this.
Many experts are wary that others are overselling their field, and dooming it when the hype runs out. But that disagreement shouldn’t obscure a growing common ground; these are possibilities worth thinking about, investing in, and researching, so we have guidelines when the moment comes that they’re needed.