In May 2020, Professor Stuart Russell delivered our most highly attended Turing Lecture on provably beneficial AI to a virtual audience of over 700 people from around the world. As we received so many excellent questions that we weren’t able to get through in the live Q&A, Professor Russell was kind enough to provide his personal perspective on the questions below.

Stuart Russell is Professor of Computer Science and Smith-Zadeh Professor in Engineering at the University of California, Berkeley. His book Artificial Intelligence: A Modern Approach (4th edition, with Peter Norvig) is the standard text in AI; it has been translated into 14 languages and is used in over 1400 universities in 128 countries.

His new book, Human Compatible (Penguin, 2019), has sparked debate across the field and garnered support from industry leaders such as Elon Musk. The following piece refers to this book and to his Turing Lecture “Provably beneficial AI”, which is available on YouTube and below.

The following Q&A has been slightly edited for clarity by the Turing’s Senior Events & Engagement Coordinator Jessie Wand and Press & Communications Manager Beth Wood. Throughout, Professor Russell uses the word “robot” to refer to any AI system, whether embodied in hardware or existing only in software.

(Stuart Russell headshot credit Noah Berger.)

How should an AI system make decisions in a scenario where immediate choices are required?

In a sense, immediate choices are always required. A robot (or a human) is always doing something, even if that something is to do nothing. I take it the question is getting at the issue of a robot acting without the opportunity to ask permission or gain more information about human preferences. In such cases, there is a trade-off between the potential cost of inaction and the potential cost of doing the wrong thing.

Let’s suppose the robot is at home alone, the house is about to be engulfed in flames from a raging forest fire, and the robot can save the family photo albums or the pet gerbil, but not both. It has no idea which has greater value to the family. Well, what would anyone do?

How can we ensure the preferences are updated as society changes over time? Will AI change the preferences of people if, e.g., the AI solution is seen as the norm and people are afraid to voice their own preferences?

Robots will always be learning from humans and always aiming to predict what people currently want the future to be like, and even what they will want the future to be like, up until it happens. There is no single “AI solution” as to what human preferences are; there should be eight billion predictive models.

I think people will be positively encouraged to voice their preferences, as that allows AI systems to be more helpful and to avoid accidental infringements. Democracy is supposed to work this way, but we only get to voice about 0.05 bits of information every few years.

How do you feel AI can be controlled to minimise the risk of bias?

There has been a great deal of work recently on this problem, within the subfield of AI concerned with Fairness, Accountability, and Transparency. It’s discussed in Human Compatible (pp.128-130). I think we have a fairly good handle on possible definitions of fairness and on algorithms that conform to those definitions. Organisations are beginning to develop internal practices to use these ideas and it’s quite likely consensus will emerge around industry standards and possibly legislation.

We’ve made less progress on understanding the larger context: how the entire sociotechnical system, of which the AI system is just a part, can produce biased outcomes. Brian Christian’s forthcoming book, The Alignment Problem: Machine Learning and Human Values, has several good examples of this. Another good example is the work of Obermeyer et al., “Dissecting racial bias in an algorithm used to manage the health of populations.”

Read the full list of Q&As below:

Core aspects of provably beneficial AI

Won’t a level of indecision mean that these AIs would spend their whole time just asking clarifying questions of the human? I’m thinking of an AI assistant that asks so many questions that the human might as well do it themselves.

That’s exactly right: if the robot believes it can ask questions at no cost to humans, it will ask lots of questions. This point is covered in Human Compatible (p.199). Fortunately, the robot can model the cost, whether it arises from annoyance or delay in acting. There is a trade-off between the cost of asking and the value of knowing the answer. (In AI and economics, this is the theory of the value of information; see Artificial Intelligence: A Modern Approach, 4th edition, Section 16.6.) Important questions, such as “Is it OK if I turn the oceans into sulphuric acid?”, are worth asking. When the downside risk is small and the robot is fairly certain it can act safely, it will go ahead. Grumpy humans who hate being interrupted must accept that their robots are more likely to do things that they won’t like. There is no way around this, but it is important to note that the robot does not start out from scratch with each human, knowing nothing about human preferences.

First, it is reasonable to put in some fairly strong prior beliefs saying that humans mostly prefer to be alive, to be healthy, to be safe, to have enough to eat and drink, to know things, to have freedom of action, etc. Second, the robot can access the vast record of human choices evident in the written record. Third, humans have many resemblances to each other, and the preferences of other humans can be helpful in predicting the preferences of a given person.
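The trade-off between the cost of asking and the value of knowing the answer can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the payoff numbers, and the simplification that "doing nothing" is worth 0 are all hypothetical, not taken from the lecture or the book.

```python
def expected_value_of_acting(p_ok, gain_if_ok, loss_if_not_ok):
    """Expected payoff of going ahead without asking the human."""
    return p_ok * gain_if_ok + (1 - p_ok) * loss_if_not_ok


def should_ask(p_ok, gain_if_ok, loss_if_not_ok, cost_of_asking):
    """Return (ask?, value_of_information).

    Assumptions: doing nothing is worth 0, and the human's answer
    fully reveals whether the action is OK.
    """
    # Without asking, the robot takes the better of acting blindly or doing nothing.
    best_without_asking = max(
        0.0, expected_value_of_acting(p_ok, gain_if_ok, loss_if_not_ok)
    )
    # With the answer, the robot acts only when the action is OK.
    value_with_answer = p_ok * gain_if_ok
    value_of_information = value_with_answer - best_without_asking
    return value_of_information > cost_of_asking, value_of_information


# Small downside: the answer is worth less than the interruption.
print(should_ask(p_ok=0.99, gain_if_ok=1.0, loss_if_not_ok=-1.0, cost_of_asking=0.1))
# Catastrophic downside: the same 1% doubt now makes the question worth asking.
print(should_ask(p_ok=0.99, gain_if_ok=1.0, loss_if_not_ok=-1e9, cost_of_asking=0.1))
```

Raising `cost_of_asking` models a human who hates being interrupted: the robot asks less often and acts on its own judgment more.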

You point out that the "standard model" of AI is general over reinforcement learning, planning, statistics, and so on. But how general do you think your work on provably beneficial AI is so far? How reliant is it on the agent model of AI and/or reinforcement learning?

The new model proposed in Human Compatible is, in important ways, strictly more general than the standard model. Where the standard model requires a fixed, known objective, the new model allows for uncertainty about the objective. The new model encompasses all the kinds of AI systems developed within the standard model as special cases, but also includes new kinds of interactive, collaborative systems that would be very hard to realise within the standard model. Of course, we need to do a great deal of research to make these systems real and practical.

Three similar questions about the influence of robots on human behaviour:
  • In such an assistance game [with human and machine players], how would you reduce the risks of manipulation of the human operator?
  • What prevents the robot in that scheme to turn persuasive and later on adversarial?
  • How can we manage the influence of the robot on the human behaviour in the assistance game?

The robot’s fundamental objective is to benefit the human; the robot does not know what, specifically, would be beneficial to the human, but the notion is grounded through the connection between human preferences and human choices. The initial version of the theory assumes that the human’s preferences are fixed. The weak point, then, is plasticity: the robot may be motivated to modify human preferences to make them easier to satisfy. Human Compatible (pp. 241-245) discusses some possible solutions.

Does it matter that the agent never gets the negative reward if it's switched off?

Not really. The dichotomy between “go ahead” and “switch off” is an extreme case. In practice there would be many intermediate possibilities such as “no, don’t do that.” The robot would learn from that kind of feedback. Even in the case of switching off, it could “learn” by having the human off-switch decision added to its base of experience before it is restored to operation.

We cannot write secure programs of any scale; how can we hope to build provably beneficial AI correctly?

There are two aspects of “provably” here: (1) whether the high-level AI design, if correctly implemented, produces benefit to humans, and (2) whether the AI design is correctly implemented.

Issue (1) is a mathematical question about optimality, convergence, etc., and we prove these kinds of theorems all the time. I’m not saying it’s easy, but it’s within the scope of what we know how to do. The main difficulties lie in the idealisations of “human,” “preference,” etc. We are also beginning to think about adversarial humans trying to mislead the system.

Issue (2) is, in my view, not as bad as the question suggests. For example, the CompCert optimising C compiler and the seL4 microkernel are formally verified. AI systems can often have relatively simple software architectures composed from domain-independent capabilities for learning, reasoning, planning, etc. And because of the potentially very high economic value of such systems, and the high cost of failure, there should be ample resources to ensure correctness. (I agree that this argument has not yet succeeded within the US software industry, but eventually it will!) Bugs would typically degrade performance rather than increasing it or causing problems with respect to issue (1).

Security holes are another matter—they could enable an adversary to manipulate the AI system’s behaviour by modifying learned preferences and so on. Rule 1 for such AI systems (and for all systems IMHO) is “never run someone else’s code.” The only input to the AI system is what it perceives through its normal input channels.


How do you deal with the humans’ preferences not being in the AI's priors?

This is discussed in Human Compatible (p. 201ff). In short, it’s a good idea to have a prior with broad support, meaning that there is a broad range of possible hypotheses about human preferences that start out with non-zero probability. In particular, one should always allow for the possibility of more things that people might care about. It’s possible to define a so-called universal prior that covers, in a technical sense, every imaginable possibility, but we don’t yet have good ideas about how to make this practical in an exact sense.

What happens if the AI has an ontological crisis/realises that the human's preferences are not grounded in rational concepts?

This is a theme in the plot of Ian McEwan’s Machines Like Me, but not, so far, a problem for AI.

First, we [data scientists and AI experts] are not in the business of saying what humans should prefer. As long as the preferences are internally consistent, the robot can aim to satisfy them (subject to conflicts with the preferences of others, of course). If they really are inconsistent, there’s nothing to be done about satisfying the inconsistent part, but the robot can help with the consistent part.

It’s worth noting that a great deal of apparent inconsistency comes from our cognitive limitations in turning underlying preferences into short-term actions and objectives, the difficulty of making choices between partially specified futures, such as choosing one career or another, and the fact that our own preferences are only partially known—we often cannot tell how much we will like or dislike something until we experience it, so we make poorly informed guesses.


How do we give an AI the capability of informing the agent about probable upcoming danger when operating in a non-friendly environment? Humans normally train the robot towards achieving a goal, giving it rewards with respect to the action taken, but in order to prevent the dangerous situation from emerging, should the robot act within its negative reward?

First, it’s quite normal to allow “negative” rewards in ordinary (standard-model) decision problems. I put “negative” in quotes because the only thing that matters is the relative reward—i.e., some experiences are worse than others. The robot will avoid the worse ones, whether the rewards are –1 and +1 or +1,000,000 and +2,000,000.

Second, as I think you are pointing out, it’s hard to teach a robot (or a human) that falling off a cliff is bad by having it fall off a cliff lots of times. That costs too many robots. There are at least three solutions: (1) we provide the robot with prior knowledge of various bad outcomes, so it knows in advance to avoid them; (2) we teach the robot in simulation, and hope the results of learning transfer to the real world; (3) we allow the robot to experience “minor accidents” in the real world and we hope the learning process figures out that bigger accidents would be much worse. All of these have analogues in how we teach children.
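The first point above, that only relative reward matters, can be checked on a toy one-shot decision. The actions and probabilities below are invented for illustration; the two reward scales are the ones quoted in the answer, and the second is a positive rescaling and shift of the first (r' = 500,000 r + 1,500,000).

```python
def best_action(actions, reward):
    """Return the action with the highest expected reward.

    `actions` maps an action name to a list of (probability, outcome) pairs;
    `reward` maps each outcome to a number.
    """
    def expected(lottery):
        return sum(p * reward[outcome] for p, outcome in lottery)

    return max(actions, key=lambda name: expected(actions[name]))


# Hypothetical choice between a risky shortcut and a safer detour.
actions = {
    "shortcut": [(0.70, "good"), (0.30, "bad")],
    "detour": [(0.95, "good"), (0.05, "bad")],
}

small_scale = {"bad": -1, "good": +1}
large_scale = {"bad": 1_000_000, "good": 2_000_000}  # every reward is "positive"

print(best_action(actions, small_scale), best_action(actions, large_scale))
```

Both reward scales rank the actions identically, so the robot avoids the worse outcome either way.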

All of the really important things in the world of human experience defy measurement (love, fear, curiosity, happiness, contentment). AI (and other algorithmic approaches) assume the existence of a measure of the thing of interest, and approximate this as one or more concrete data objects: “benefit” measured using “faces measured as smiling”, “profit” or “clicks per eyeball”. How can AI help us with issues of human experience and existence which defy measurement?

I think I have to disagree with the claim that AI per se assumes the existence of a data object that measures the thing of interest. The fundamental notion in AI (and in economics and various branches of philosophy) that is relevant is the idea that people have preferences about how the future unfolds. (Technically, we have preferences not just between specific futures, but also between what economists call “lotteries over futures,” meaning choices leading to uncertain outcomes with probabilities for different possible futures.)

These preferences take into account love, fear, curiosity, happiness, contentment, beauty, freedom, and so on. In particular decisions that affect mainly monetary outcomes, such as choosing between two different mortgage loans that differ only in points and interest rate, the relevant measure is money; but in other decisions, such as which house to buy, lots of other things come into play, many of which are not directly measurable. The decisions a human makes then provide evidence about what things come into play.

Multiple humans, social preferences and collective decisions

If we have provably beneficial AI according to the creator, couldn't this still have detrimental effects for a larger audience?

The proposal is to create AI systems that are provably beneficial for the larger audience (everyone). It’s possible to make AI beneficial for one individual (“loyal AI”), but not recommended. See the section on “Many humans” in Human Compatible (p.213ff).

Thinking beyond individual tasks, human preferences vary hugely and can even be inconsistent within a single individual. So, whose objective should be optimised? Where does collective action and decision-making feature? E.g., negotiating social trust through complex political discourse.

Heterogeneity in human preferences isn’t a particularly difficult problem. Facebook already has more than two billion individual preference profiles, so scaling up to eight billion is not hard. The robot(s) can predict the preferences of each individual separately and act on behalf of everyone. (At least this is the case in principle; in practice the robot’s actions are usually local and affect only a small number of people.)

The difficult bit is combining or “aggregating” preferences when trade-offs are required. Therein lie several centuries of moral philosophy, sociology, and political science. See the section on “Many humans” in Human Compatible (p.213ff).

If the preferences being modelled are of a single human, do we not run into a convergent instrumental subgoal of "achieve world domination for said human?"

See the answers to the two preceding questions. There is a separate problem, the Dr. Evil problem, having to do with people who want world-domination AI rather than beneficial AI. No easy solution.

What happens when the machine tries to learn preferences that are in some way contradictory, either because the human behaves in a way that is not consistent, or because different people’s priorities and preferences are different and not mutually consistent? Would such a machine ever converge on a consistent behaviour in its “half of the game?” Can we solve this simply by encoding the relative importance of each set of preferences (at different times), or is there more to it?...

Inconsistent preferences can certainly exist within a single individual—we have multiple internal decision processes that, in some sense, compete for control of our activities, and may operate with different driving objectives.

If you prefer plain pizza to pineapple pizza, pineapple pizza to sausage pizza, and sausage pizza to plain pizza, you are inconsistent. No robot can satisfy your preferences, because whatever pizza it gives you, you prefer a different one. Fortunately, very few people are internally inconsistent (under normal circumstances) when it comes to preferring life to death, health to sickness, etc.
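The pizza cycle can be written down directly. The sketch below is illustrative only (the function names are made up here): it stores each strict preference as a pair and checks whether any option exists that nothing else is preferred to.

```python
# Each pair (a, b) means "a is strictly preferred to b".
cyclic_prefs = {
    ("plain", "pineapple"),
    ("pineapple", "sausage"),
    ("sausage", "plain"),
}
options = {"plain", "pineapple", "sausage"}


def better_alternatives(choice, prefs):
    """Options the person strictly prefers to `choice`."""
    return sorted(a for a, b in prefs if b == choice)


def satisfiable(options, prefs):
    """True if at least one option has nothing preferred to it."""
    return any(not better_alternatives(option, prefs) for option in options)


for pizza in sorted(options):
    print(pizza, "is beaten by", better_alternatives(pizza, cyclic_prefs))
print("satisfiable:", satisfiable(options, cyclic_prefs))

# Dropping the last link of the cycle leaves a consistent ranking with a best option.
consistent_prefs = cyclic_prefs - {("sausage", "plain")}
print("satisfiable:", satisfiable(options, consistent_prefs))
```

With the full cycle, every pizza is beaten by another; once one link is removed, plain pizza has nothing preferred to it and becomes a choice the robot can safely make.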

Satisfying the preferences of multiple people may be difficult even if all the preferences are the same (e.g., if everyone wants to be Ruler of the Universe). Utilitarians propose basically “adding up” the preferences and maximising the total. Relevant authors on this include Bentham, Mill, Edgeworth, Sidgwick, Harsanyi, Rawls, Arrow, Sen, and Parfit.

What does the "envy" and "pride" problem tell us about inequality and AI? If preferences are relative, and if humans have a significantly stronger preference for avoiding loss than for seeking gains (loss aversion, Kahneman and Tversky), how will AI ever make decisions to reduce global inequality in a finite resource world?

These are complicated questions! As far as inequality is concerned, there is a common assumption that standard utilitarianism is unconcerned with inequality, as it cares only about total utility and not its distribution. This view fails to distinguish between utility and resources. Many, many studies show that there are diminishing returns, in terms of utility, for adding resources. Very roughly, we might say that utility is logarithmic in the amount of resources (say, wealth). This means that if we increase Alice’s wealth from $100 to $1,000 and Bob’s from $1,000,000 to $10,000,000—both tenfold increases—then they experience the same gain in utility.

Since we use only $900 in increasing Alice’s wealth from $100 to $1,000, versus $9,000,000 in increasing Bob’s, a utilitarian public policy is necessarily going to be highly egalitarian in how it allocates resources. There are much greater gains in total utility from allocating resources to the least well off. Any sort of utilitarian AI would have this effect, given that it has only a certain amount of effort it can expend.
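The arithmetic behind the Alice and Bob example can be checked with a short script. It takes the "very roughly logarithmic" utility literally as a base-10 logarithm of wealth in dollars, which is an assumption made here purely for illustration.

```python
import math


def utility(wealth):
    """Assumed utility: base-10 logarithm of wealth in dollars."""
    return math.log10(wealth)


def gain_per_dollar(before, after):
    """Utility gained per dollar spent on raising wealth from `before` to `after`."""
    return (utility(after) - utility(before)) / (after - before)


alice_gain = utility(1_000) - utility(100)
bob_gain = utility(10_000_000) - utility(1_000_000)

# Same utility gain for both, at very different cost.
ratio = gain_per_dollar(100, 1_000) / gain_per_dollar(1_000_000, 10_000_000)

print(alice_gain, bob_gain)
print(ratio)
```

Under this assumption the two tenfold increases yield equal utility gains, but a dollar given to Alice buys about 10,000 times as much utility as a dollar given to Bob ($9,000,000 / $900).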

Envy and pride are relative components of a person’s overall utility, derived (at least in the simple model in the lecture) from comparisons with the wellbeing of others. If they are “equally efficient,” in the sense that the amount of extra utility Bob derives from his pride in having more resources than Alice is the same as the amount of utility Alice loses from envy at having fewer resources than Bob, then envy and pride do not affect the utilitarian analysis in the preceding paragraph. The real story is far more complex, because envy and pride depend on observable wellbeing, leading to a negative-sum game where people compete with unobserved effort and unnecessary consumption (see Veblen, Hirsch).

It’s an open question as to whether and how AI systems might discount envy and pride in their decisions made on behalf of multiple people. This is a big step because status, in-group identity, etc., are so important to people. There are reasons to want to reduce the influence of pride and envy on human preferences, but that gets into preference engineering, where angels fear to tread.

If a robot is only learning the preferences of a single human, and they are suddenly introduced to another human, would the robot not have bias towards the first human's preferences because otherwise it would have no belief space to act upon?

That’s an interesting question and it depends a little bit on how things are set up. Let’s call the two people Alice and Bob. First, the robot would have a strong incentive to learn about Bob’s preferences, so that it can be useful to two people and not just one. Second, there will be cases where it favours Alice and cases where it favours Bob.

For example, suppose the robot has just found a wild strawberry. Its general prior about humans is that humans like strawberries, but it knows Alice hates strawberries. In that case it will give it to Bob. On the other hand, if it knows that Alice likes strawberries far more than the average human, it will give it to Alice.
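The strawberry example amounts to comparing expected enjoyment, using learned knowledge where the robot has it and the general prior where it does not. The numbers and names below are hypothetical, chosen only to reproduce the two cases described above.

```python
# Assumed scale: how much a person enjoys a strawberry, from -1 (hates) to +1 (loves).
PRIOR_MEAN_LIKING = 0.6  # hypothetical prior: humans generally like strawberries


def expected_liking(known, person):
    """Use what the robot has learned about a person, else fall back on the prior."""
    return known.get(person, PRIOR_MEAN_LIKING)


def recipient(known, people=("Alice", "Bob")):
    """Give the strawberry to whoever is expected to enjoy it most."""
    return max(people, key=lambda person: expected_liking(known, person))


# The robot knows Alice hates strawberries and knows nothing specific about Bob.
print(recipient({"Alice": -0.9}))
# The robot knows Alice likes strawberries far more than the average human.
print(recipient({"Alice": 0.95}))
```

The first case favours Bob and the second favours Alice, so prior experience with Alice does not translate into a blanket bias towards her.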

Should different AI systems learn to amalgamate their known preferences [or should they be kept separate]?

Assuming the privacy issues can be resolved, it makes sense for multiple AI systems to pool their knowledge of each person’s preferences so they can all be as useful as possible to as many people as possible and can coordinate their actions better.

But keep in mind that AI systems may not be “separate” in the same sense that humans are. Even though there may be programs running on multiple computers, they can, by exchanging information, operate more or less as if they were a single entity.

We have some moral dilemmas relating to some accidental situations with respect to self-driving cars. For example, a self-driving car hits 10 people and kills them versus swerving into a wall and killing the person that is in the driving seat. In these sorts of dilemmas how are we going to determine what is beneficial to us?

These kinds of “trolley problems” date back to at least 1908 (see Human Compatible, p.178). They tend to be abstract, decontextualised, and very unlikely to arise in real life, and I’m not sure how much they really tell us about human preferences and morality.

To the extent that we humans are unable to decide the right course of action, seeing valid arguments on both sides, I’m not sure we can complain too much about whichever decision the AI makes. And the question “What would you do?” is different from “What are the principles by which all decisions should be made?”

Human Behaviour: Irrationality, Plasticity, Evil

Isn't there a problem with robots learning from human behaviour? Much of human behaviour is quite problematic.

In brief: there is no reason for the robot to behave like those it observes, any more than criminologists become criminals. There is a longer answer in Human Compatible (p.179).

How do we handle preferences that change over time? For example, how does the robot deal with humans that say one thing and do the exact opposite? E.g. say we want to prevent global warming while doing mostly things that cause it?

Preference change is discussed in Human Compatible (p.241ff). In general, it’s a very difficult question. Pettigrew’s Choosing for Changing Selves is a good recent introduction. For the new model of AI, it raises the possibility that the robot will, like advertisers and politicians, learn to deliberately modify human preferences to make them easier to satisfy.

The example of saying one thing and doing the opposite is probably not a matter of preference change, but inconsistency between preferences and actions. The latter happens all the time, e.g., when I really need to get more sleep, but I read just one email, then another, and another. We all do things we later regret doing. This mismatch makes it more difficult for robots to learn true human preferences. One solution is for the robot to learn a model of how humans actually make decisions, and invert that model to infer the underlying preferences from actual behaviour.
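One way to picture "inverting a model of human decision-making" is Bayesian inference under a noisy-rational choice model, in which people usually, but not always, pick the option they value more. This is a common modelling assumption swapped in here for illustration, not necessarily the method the answer has in mind; the hypotheses, utilities, and the `rationality` parameter are all hypothetical.

```python
import math


def choice_probability(chosen, utilities, rationality):
    """Probability of `chosen` under a noisy-rational (softmax) choice model."""
    weights = {option: math.exp(rationality * u) for option, u in utilities.items()}
    return weights[chosen] / sum(weights.values())


def posterior(prior, hypotheses, observed_choices, rationality=1.0):
    """Bayesian update over candidate preference hypotheses given observed choices."""
    unnormalised = {}
    for name, utilities in hypotheses.items():
        likelihood = 1.0
        for choice in observed_choices:
            likelihood *= choice_probability(choice, utilities, rationality)
        unnormalised[name] = prior[name] * likelihood
    total = sum(unnormalised.values())
    return {name: value / total for name, value in unnormalised.items()}


# Two hypothetical accounts of what the person really wants at bedtime.
hypotheses = {
    "values_sleep": {"sleep": 2.0, "email": 0.0},
    "values_email": {"sleep": 0.0, "email": 2.0},
}
prior = {"values_sleep": 0.5, "values_email": 0.5}

# Mostly sleeps, but one night reads "just one email".
observed = ["sleep", "sleep", "email", "sleep"]
beliefs = posterior(prior, hypotheses, observed)
print(beliefs)
```

Because the model allows for occasional lapses, a single late-night email does not overturn the inference that the person prefers sleep.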

How do we control unethical preferences that could become the consensus at a point in time? Human preferences change over time, and can even turn very harmful and dangerous [think of voting and political choices throughout history, etc]. How can we have an AI which isn’t a people-pleaser, but rather striking the balance between steering and listening to the society’s pulse?

The most obvious example of this is what Harsanyi calls sadistic preferences: deriving positive utility from the suffering of others (see Human Compatible, p.227ff). He says, “No amount of goodwill to individual X can impose the moral obligation on me to help him in hurting a third person, individual Y.” This means zeroing out the “negative altruism” terms in the sadist’s utility function.
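As a rough sketch of what "zeroing out" could mean, suppose a person's utility is their own wellbeing plus a weighted sum of other people's wellbeing, with a negative weight expressing sadism. That additive form, the names, and the numbers are assumptions made here for illustration only.

```python
def zero_out_sadism(altruism_weights):
    """Replace negative weights on other people's wellbeing with zero."""
    return {person: max(0.0, weight) for person, weight in altruism_weights.items()}


def utility(own_wellbeing, others_wellbeing, altruism_weights):
    """Own wellbeing plus a weighted sum of other people's wellbeing."""
    return own_wellbeing + sum(
        altruism_weights[person] * wellbeing
        for person, wellbeing in others_wellbeing.items()
    )


# Hypothetical individual X: gains when Y suffers, cares a little about Z.
weights_x = {"Y": -0.5, "Z": 0.3}
others = {"Y": -10.0, "Z": 4.0}  # Y is being hurt

raw = utility(1.0, others, weights_x)
filtered = utility(1.0, others, zero_out_sadism(weights_x))
print(raw, filtered)
```

After filtering, Y's suffering no longer counts in X's favour, while X's genuine concern for Z is left untouched.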

Beyond this, it’s not clear that it’s the robot’s job to dictate what human preferences should be. Why would we build machines to bring about ends that we prefer to avoid?

Could you please comment on how the plasticity of Basic Human Values (in the context of Schwartz’s value circumplex) would affect the design of human compatible AI?

Schwartz’s list of ten basic values is one of several attempts to catalogue the core elements of the typical human utility function. I think it’s difficult to argue here about the specifics of this list and other such lists, but the issue of plasticity applies to all attempts to describe human preferences.

Plasticity, as noted above and explained at length in Human Compatible (pp.241-245), arises from experiences and maturation processes that affect our ranking of possible futures. AI systems must allow for this possibility so that they can keep track of the changes in individual preferences. It’s important to design them so they have no incentive to deliberately modify human preferences - e.g., to make them easier to satisfy.

Improving AI capabilities

To add to those 4 conceptual breakthroughs [on slide 14 of your lecture]... would you agree that a possible 5th is the discovery/research of new ways to fuse mature models together? Providing the "glue" to building AIs block by block where each block is enriched for one corner of a vast knowledge base?

This is hard to do in the context of deep learning systems, but fairly easy to do in the context of reasoning systems based on logic and/or probability. In a knowledge-based system, we “fuse mature models” simply by joining them together, and we’re done. I mean this quite literally. You can take two logical theories expressed in the Prolog language and concatenate them; now you have a system that knows and can reason with both theories as well as take advantage of synergies between the theories. The main difficulty comes from vocabulary mismatch when the two theories originate from different sources. There are lots of practical solutions for this in the context of fusing two databases.
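The "concatenate two theories" idea can be imitated outside Prolog with a tiny forward-chaining reasoner over propositional rules. This is a simplified stand-in written for illustration, not Prolog semantics, and both toy theories are invented.

```python
def forward_chain(facts, rules):
    """Derive every fact reachable from rules of the form (premises, conclusion)."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)
                changed = True
    return known


# Theory A (hypothetical): weather knowledge.
theory_a_facts = {"raining"}
theory_a_rules = [(("raining",), "road_wet")]

# Theory B (hypothetical): driving knowledge.
theory_b_facts = {"driving_fast"}
theory_b_rules = [(("road_wet", "driving_fast"), "long_braking_distance")]

# "Fusing" the theories is just joining their facts and rules.
fused_facts = theory_a_facts | theory_b_facts
fused_rules = theory_a_rules + theory_b_rules

print(sorted(forward_chain(theory_a_facts, theory_a_rules)))
print(sorted(forward_chain(theory_b_facts, theory_b_rules)))
print(sorted(forward_chain(fused_facts, fused_rules)))
```

Only the fused system derives `long_braking_distance`, a conclusion neither theory reaches alone. The example works because both theories happen to use the same symbol `road_wet`, which is exactly where vocabulary mismatch would bite.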


If it is the case—as Integrated Information Theory (IIT) suggests— that consciousness is an Artificial General Intelligence (AGI) design choice, do you think that an AGI should be designed to be conscious (i.e. sentient), or unconscious...?

I’m not a believer in IIT, but if we did have a theory of consciousness and how to create it or not create it, I’d choose not to. A truly sentient machine would still follow exactly the same predictable laws that govern how the software runs on the hardware, so this has nothing to do with making AI safe and controllable; but sentience does confer some kinds of rights, which would complicate things enormously.


As we know due to application programming interfaces (APIs) being available for pretty much all common AI tasks, deep learning seems to be merging with the current existing software engineering role in the future according to trend data available.…[As generalisation of AI roles takes place], what do you recommend to the new generation who are studying in universities and want to make careers in the AI industry?

The idea that it’s easy to build successful applications by collecting data and running it through off-the-shelf deep learning software is mostly false. Success usually requires learning about the application domain. Also, it may require understanding what the learned model is doing and why it’s not doing it well, which may mean using completely different kinds of learning methods.

I also think purely data-driven deep learning is hitting a wall. (For example, GPT-3 has used up nearly all the text that exists in the world, and it cost $12 million in electricity to train it.) Here is the quote again from François Chollet: “Many more applications are completely out of reach for current deep learning techniques – even given vast amounts of human-annotated data. … The main directions in which I see promise are models closer to general-purpose computer programs.” So, I see the pendulum swinging back towards representation, Bayesian models, probabilistic programs, etc. And there’s a lot of mathematics underlying these approaches.


Social scientist's/gossipy question: how would you describe the difference between you and Rodney Brooks as experts in the public understanding of AI?


Our research careers have been almost perfectly complementary, as Rod has focused primarily on robot mechanics, low-level perception, locomotion, low-level control, while I’ve worked in more or less all the other areas of AI but primarily on reasoning, learning, planning, etc.

In terms of our views on where the field is going and what we might look forward to or be worried about, I think there is less difference than meets the eye. We both agree that deep learning is not going to solve the AI problem. He claims to dismiss all concerns about superintelligent machines, but the core of his argument is that they will take a long time to arrive. (Initially I think he believed it would be hundreds of years; now he’s down to 100 years.)

The vast majority of AI researchers think it will be much less than 100 years, whereas I’m not quite as optimistic about the rate of progress. Perhaps the major difference is in how we imagine the impact of superintelligence. I cannot see how it can be anything but enormous.

A question on energy budgets. On p.34 of your book you mention, almost casually, that the Summit computer at Oak Ridge has around the same ‘raw capacity’ as a human brain— and uses a million times more power. Given what we know about the energy budgets of training, of the internet of things and 5G, and what we know about global warming—why continue to automate everything?

I think there are good reasons to automate things. Most humans have been used as robots for about 10,000 years, doing repetitive and often unpleasant tasks, so changing that would be good (as long as the transition is handled properly). You can see some energy data and projections in this [Nature] article: How to stop data centres from gobbling up the world’s electricity.

Right now, according to this study, data centres are less than 3% of global electricity consumption and around 0.5% of global energy consumption. This may grow, but physicists, material scientists, and electrical engineers are putting huge efforts into improving the energy efficiency of computation, for two reasons: first, to reduce the running costs of data centres, and second, to lengthen the battery life of your phone.

I've bought copies of Human Compatible for people I'd love to think about AI safety seriously. If you were going to recommend a second general-audience book, after Human Compatible, what would it be?

Brian Christian’s new book, The Alignment Problem: Machine Learning and Human Values, is pretty interesting. Max Tegmark’s Life 3.0 is very readable and thought-provoking. And of course, Bostrom’s Superintelligence is very important and a good source for careful analysis of why the first twelve things you might think of for controlling AI systems aren’t going to work.

One way of assessing and mitigating risk in a transparent way is an impact assessment—should we recommend their use as part of the general decision-making process for public authorities?

I’ve argued for something like an FDA (Food and Drug Administration) process for testing and approving simple AI systems that interface directly with the public. Impact assessments (as in environmental impact reports) tend to be rather pro forma and hardly rigorous. They are good at weeding out proposals that don’t pass the laugh test, but they’re not going to detect subtle algorithmic problems or loopholes in the mathematical foundations that underlie a claim of safety.

Given the hype culture surrounding funding in both science and technology development, wouldn't treating a deliberate Dr Evil problem or potentially a less deliberate version of it as a policing problem just be “explaining away” without addressing the why’s of the problem?

Agreed—it would be better if we could understand why some people care so little for their fellow humans. Then we might have a chance of early detection and prevention of antisocial tendencies. Still, I doubt this would be 100% effective and I’m not sure one would want to live in a world where it was 100% effective.

So, the Dr Evil problem will remain with us. Our social systems try hard to prevent uncontrolled outbreaks of evil, but control failures occur fairly frequently on a national scale and apparently every few decades on a global scale.