IT IS not just that an AI will want to look after itself. An AI will want to make sure that it fulfils its goals, and an important part of that is making sure that its goals stay the same.
We humans are relatively relaxed about our plans changing in the future. We change our career goals, we change our minds about wanting children, we change our minds about all sorts of things, and we aren’t usually appalled at the idea.
That said, sometimes humans do take steps to bind Future You to Present You’s bidding. Present You might want to lose weight, say, and not trust Future You to stick to the diet. So Present You might throw away all the chocolate bars you keep in a kitchen drawer.
Or Present You might want to finish an important presentation over the weekend, but not trust Future You not to just faff around on the internet all day; so Present You sets up a website blocker that stops your browser going on Twitter.
Or, of course, Present Odysseus might want to listen to the song of the Sirens as he sails past their island, but not trust Future Odysseus not to sail his ship on to the rocks when he hears them. So Odysseus might order his crew to stuff their ears with beeswax, then tie himself to the mast, ordering them to ignore his cries as they go past.
What Odysseus is doing, in AI terms, is maximising his expected utility: taking the actions he thinks are most likely to achieve his goals — given a utility function of something like “10 if you get home to Ithaca; 0 if you run aground on the rocks because you heard the Sirens; but also 1 if you get to hear the lovely Siren song on the way.”
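Odysseus's reasoning can be sketched as a tiny expected-utility calculation. The utilities below are the ones suggested in the text (10 for getting home, 0 for the rocks, plus 1 for hearing the song); the actions and probabilities are invented purely for illustration.

```python
# A minimal sketch of expected-utility maximisation. Utilities follow
# the text (10: home to Ithaca; 0: wrecked; +1: heard the song);
# the actions and probabilities are made up for illustration.

def expected_utility(outcomes):
    """outcomes: list of (probability, utility) pairs."""
    return sum(p * u for p, u in outcomes)

actions = {
    # Tied to the mast: hears the song (+1) and almost surely gets home.
    "tied_to_mast":  [(0.99, 10 + 1), (0.01, 0 + 1)],
    # Hands free, ears open: hears the song but almost surely wrecks.
    "sail_freely":   [(0.05, 10 + 1), (0.95, 0 + 1)],
    # Beeswax in his own ears too: safe passage, but no song.
    "plug_own_ears": [(0.99, 10), (0.01, 0)],
}

best = max(actions, key=lambda a: expected_utility(actions[a]))
for name, outcomes in actions.items():
    print(f"{name}: EU = {expected_utility(outcomes):.2f}")
print("best action:", best)  # tied_to_mast
```

On these (invented) numbers, being tied to the mast scores about 10.9, against 9.9 for plugging his own ears and 1.5 for sailing freely, which is why the rational Odysseus picks the rope.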
AN AI will want to maximise its expected utility, too, in a much more explicitly defined way. If it’s the broom out of “The Sorcerer’s Apprentice”, it’ll want to do whatever it thinks is most likely to lead to the cauldron’s being full.
One action that will probably not lead to the cauldron’s being full would be “stop caring about whether the cauldron is full.” Present AI will want to make sure that Future AI cares about the same things that it cares about.
Present Odysseus knew that, when he heard the Sirens’ song, he would stop caring about getting home to Ithaca — the Sirens would have rewritten his utility function — so he couldn’t leave decisions about where to go in the hands of Future Odysseus.
A cauldron-filling AI would not want a human to rewrite its utility function, because any change would probably make it less likely to maximise that function. Attempts to reprogram the AI will not be popular with the AI, for the same reason that [in the telling of “The Sorcerer’s Apprentice” in Disney’s Fantasia] Mickey’s attempts to smash the broom with a big axe were not popular with the broom.
An AI’s utility function “encapsulates [its] values, and any changes to it would be disastrous to [it],” the AI researcher Steve Omohundro writes. “Imagine a book-loving agent whose utility function was changed by an arsonist to cause the agent to enjoy burning books. Its future self not only wouldn’t work to collect and preserve books, but would actively go about destroying them.” He describes this as a “fate worse than death” for the AI.
IF AIs would want to preserve their utility function (and certainly Nick Bostrom, Omohundro, and most of the AI people I spoke to think they would), then that makes it less likely that a future AI will reach superintelligence and think: “These goals are pretty silly; maybe I should do something else,” and thus not turn us all into paperclips.
I asked Paul Crowley, a cryptography engineer working on Google’s Android phone — whose “chief preoccupation in life is helping humanity reach the stars without first being destroyed by its own technological success” — about that, while eating an intimidatingly large omelette in a diner in Mountain View.
“Take Deep Blue,” he said (the computer that beat Garry Kasparov at chess). “Insofar as Deep Blue values anything, it values winning at chess, and nothing else at all.”
But imagine that some super-Deep Blue in the future becomes superintelligent, turning the whole of the solar system into its databanks to work ever harder at winning chess. There’s no reason to imagine that it would, at any point, suddenly change and become more human in its thinking — “At what stage would it go, ‘Wait a second, maybe there’s something more important?’” Crowley said.
But even if it did, it wouldn’t help. “If this super-Deep Blue caught itself thinking, ‘In my unbelievable wisdom that I have gained through taking over the whole of Jupiter and turning it into a computer, I have started to sense that there is something more important than chess in the universe,’” Crowley said, “then immediately it would go, ‘I’d better make sure I never think this kind of thing again, because if I do then I’ll stop valuing winning at chess. And that won’t help me win any chess games, will it now?’”
THIS isn’t too alien to us. If someone were to say to me: “I will take your children away, but first I will change your value system so that you don’t care about them,” I would resist, even though Future Me — childless but uncaring — would presumably be entirely happy with the situation. Some things are sacred to us, and we would not want to stop caring about them.
This isn’t necessarily terrible. Murray Shanahan pointed out to me that you really don’t want an AI to change its goals, except in a very carefully defined set of circumstances. “You could easily make something that overwrites its own goals,” he said. “You could write a bit of code that randomly scrambles its reward function to something else.”
But that wouldn’t, you imagine, be very productive. For a start, you’ve presumably created this AI to do something. If your amazing cancer-curing AI stops looking for a cure for cancer after three days, randomly scrambles its utility function, and starts caring very deeply about ornithology, for example, then it’s not much use to you, even if it doesn’t accidentally destroy the universe, which it might.
“Step number one to making it safe is making sure its reward function is stable,” Shanahan said. “And we can probably do that.”
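The argument Omohundro, Crowley, and Shanahan are making has a simple mechanical core: a utility maximiser judges any proposed rewrite of its goals using its *current* goals, so it almost always says no. Here is a toy sketch of that decision rule; every name and number is invented for illustration, loosely following the cauldron-filling and goal-scrambling examples in the text.

```python
# A toy illustration of why a utility maximiser resists rewrites to its
# own goals: it scores any proposed change with its CURRENT utility
# function. All names and numbers here are invented.

def cauldron_utility(world):
    return 10 if world["cauldron_full"] else 0

def ornithology_utility(world):
    return 10 if world["birds_watched"] else 0

def predicted_world(goal):
    """The future the agent expects if it pursues the given goal."""
    return {
        "cauldron_full": goal is cauldron_utility,
        "birds_watched": goal is ornithology_utility,
    }

def accepts_rewrite(current, proposed):
    # Both futures are scored by the agent's current utility function.
    keep   = current(predicted_world(current))
    switch = current(predicted_world(proposed))
    return switch > keep

print(accepts_rewrite(cauldron_utility, ornithology_utility))  # False
```

The cauldron-filler refuses to become an ornithologist because, by its current lights, a future of bird-watching fills no cauldrons — and, symmetrically, an ornithology agent would refuse the reverse rewrite. Stability of the reward function falls out of the maximisation itself.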
But there may be times when we don’t want it to stay the same. Our values change over time. Holden Karnofsky, whose organisation OpenPhil supports a lot of AI safety research, pointed that out to me. “Imagine if we took the values of 1800 AD,” he said. If an AI had been created then (Charles Babbage was working on it, sort of), and had become superintelligent and world-dominating, then would we want it to stay eternally the same?
“If we entrenched those values for ever; if we said: ‘We really think the world should work this way, and so that’s the way we want the world to work for ever,’ that would have been really bad.” We will probably feel much the same way about the values of 2019 in 200 years’ time, assuming that we last that long.
And, more starkly, according to the people who worry about these things, if we get the values we instil in an AI slightly wrong, the problem is not just that it will entrench the ideals of a particular time, or that it will be bad at its job. It’s that (as we’ve discussed) it could destroy everything that we value, in the process of finding the most efficient way of maximising whatever it values.
This is an edited extract from The AI Does Not Hate You by Tom Chivers, published by Weidenfeld & Nicolson at £16.99 (CT Bookshop £15.30).
Listen to an interview with Tom Chivers on the Church Times Podcast