Will AI Take Over?


AI experts are becoming concerned that advanced AI will take over from humanity

What probability do you put on future AI advances causing human extinction or similarly permanent and severe disempowerment of the human species?

In a recent survey, that exact question was put to researchers who have published at top AI conferences. The majority of respondents estimated at least a 1 in 20 chance of such an AI takeover.


Many researchers who published at NeurIPS and ICML (two of the top machine learning conferences) are worried about existential risks from AI.


Major AI firms are also becoming worried. Both OpenAI (the creator of GPT-3, ChatGPT, and DALL-E) and DeepMind (the creator of the first AI system to beat the best humans at Go) have repeatedly expressed this worry, going so far as to hire researchers to investigate ways to preempt AI takeover. Even Satya Nadella, the CEO of Microsoft, has stated that the way to make sure runaway AI never causes “lights out for humanity” is to “make sure it never runs away.”


While the question “Will AI take over?” might sound like ludicrous science fiction, there are strong reasons for taking it seriously. For instance, the current deep learning AI paradigm produces AI systems with certain alarming characteristics that, if present in more powerful systems, may lead directly to AI takeover.


Current AI systems are often opaque and can develop surprising emergent capabilities

In previous paradigms of AI, it was common for systems to be programmed to follow a specific set of rules that yielded intelligent-seeming behavior. For instance, Deep Blue, the AI system that bested the reigning world chess champion in 1997, was programmed in this manner.


Under the current AI paradigm of deep learning, however, AI systems are produced in a different manner. Deep learning systems “learn” intelligent-seeming behavior in a trial-and-error-like process known as “training.” During training, these systems are typically given a measure of what counts as a desirable outcome (such as winning a game or correctly classifying an image), along with a procedure for adjusting their internal parameters so that they produce more of the behavior that leads to those outcomes.
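
To make this concrete, here is a minimal sketch of such a training loop, written in PyTorch with made-up toy data; the model, data, and settings are purely illustrative rather than drawn from any real system.

    # A minimal, illustrative training loop: toy data and a toy model, nothing
    # like the scale of real systems such as GPT-3.
    import torch
    import torch.nn as nn

    inputs = torch.randn(100, 4)            # 100 made-up 4-number "observations"
    labels = torch.randint(0, 2, (100,))    # a made-up desired answer for each

    model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))
    loss_fn = nn.CrossEntropyLoss()         # scores how far outputs are from the desired answers
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    for step in range(1000):
        outputs = model(inputs)             # the system's current behavior
        loss = loss_fn(outputs, labels)     # how undesirable is that behavior?
        optimizer.zero_grad()
        loss.backward()                     # work out how to adjust each internal parameter...
        optimizer.step()                    # ...and adjust them to score better next time

The designers specify only the scoring rule and the update procedure; the behavior itself emerges from the repeated adjustments.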


Modern deep learning differs from earlier rule-based AI techniques in that designers don't write explicit rules for the machine to follow.


In this manner, AI systems have learned to create images from written descriptions, carry on conversations with humans, solve college-level math and science questions, and so on. However, the resultant system can have hundreds of billions of parameters, with complicated relationships between these parameters. The internal workings of these systems are thus almost entirely inscrutable. The team at OpenAI that designed GPT-3 could tell you exactly what they did to train GPT-3, but they couldn’t remotely explain to you the decision-making process that GPT-3 uses internally to determine what words to output.    
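
As a rough illustration of that opacity, the toy sketch below (again purely illustrative, and vastly smaller than GPT-3) shows that a trained network ultimately amounts to a long list of unlabeled numbers.

    # A toy network with 114 parameters; GPT-3 has roughly 175 billion.
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))
    print(sum(p.numel() for p in model.parameters()))   # 114

    # Each parameter is just a raw number with no human-readable meaning attached:
    first_weights = next(model.parameters())
    print(first_weights[0])   # e.g. tensor([ 0.31, -0.12,  0.44, -0.08])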

This trial-and-error-like training process not only produces opaque systems; it also leads to the emergence of surprising capabilities. Continuing with GPT-3: the system was trained simply to predict internet text, yet it learned to write functional computer code and to reason step by step when prompted appropriately. These capabilities surprised even GPT-3’s designers, who did not intend or expect GPT-3 to learn these abilities.


These sorts of emergent capabilities are often absent in smaller systems but then appear, often suddenly, as systems are scaled up in size or trained for longer. Since AI systems have repeatedly gained new capabilities as they scale, we should expect further capabilities to emerge as systems are scaled up even further.




As our machine learning models get larger and more advanced, new and powerful capabilities emerge.


Some emergent capabilities could be really dangerous

There are a few specific capabilities that could be particularly dangerous if they were to emerge in AI systems. Of special concern are capabilities that could put AI systems in an adversarial relationship with humans, such as:

  • Unwanted goal-directed behavior – the ability of an AI system to formulate and execute plans for achieving certain outcomes (or “goals”) in the world, where these goals are undesired by the AI system’s designers. Such goals may be unwanted either because the designers don’t think of unintended consequences, or because the unpredictable training process leads to an AI system that internalizes different goals than the ones the designers were hoping it would adopt.

  • Situational awareness – the ability of an AI system to assess its position in relation to the outside world, such as by identifying that it is a computer program, that it’s being run in a data center, and that it will cease to run if power to the data center is cut.

  • Deception – the ability of an AI system to act with the intention of misleading humans.

Unfortunately, there are reasons to suspect these capabilities are likely to emerge as AI systems become more powerful:

  • Usefulness in training – the trial-and-error-like training process in deep learning selects for properties that allow AI systems to score well in training. If deceptive abilities, situational awareness, or unwanted goal-directed behavior lead to better scores in training, then they will tend to be selected for.

  • Information in the training process – the training process itself may have information about these capabilities, which may make it easier for the AI system to learn them. For instance, many AI systems (such as GPT-3) are trained to imitate humans, and humans often act deceptively or in a goal-directed manner. Additionally, the training corpus may include text describing how computers work or how data centers could be turned off, enabling situational awareness.

  • Market incentives – market incentives may further select for these capabilities, or even push designers to directly aim for them. It’s not hard to imagine firms developing AI systems with the goal “increase our firm’s productive capacity without regard for side effects,” even if the AI’s designers would prefer an AI with somewhat more nuanced goals. Nor is it hard to imagine deception arising in an AI system directly trained to create deceptive advertisements.



If powerful AI develops these capabilities, it may be hard to stop


It’s presumably obvious enough why we don’t want AI systems with undesirable goals or deceptive abilities. Unfortunately, we already have many different examples of AI systems that have internalized at least proto-goal-directed behavior that was unwanted by their designers. Further, language models like GPT-3 can be prompted to formulate long-term plans towards goals or to act deceptively.

Situational awareness, on the other hand, is not dangerous in itself, and might even be desirable – which is why designers appear to be actively steering AI systems towards it.

ChatGPT, a chatbot built on OpenAI's models, explains what it is.


But combined with properties like goal-directedness, situational awareness could cause major problems. In particular, a goal-directed system with situational awareness may determine that it would fail to achieve its goals if it was shut off, and it may therefore make plans to prevent itself from being shut off (similar to HAL from 2001: A Space Odyssey).

If such a system were additionally deceptive, it might, in furtherance of its goals, hide the extent to which its goals differed from those of its designers – effectively “playing nice” to deceive its overseers into thinking it was better behaved than it was, and only directly pursuing its goals once it was in a secure position to do so.

By the time humans realized that the AI system was actually acting against our interests, it might already have patched its vulnerabilities (for instance, by secretly distributing copies of itself across the internet for redundancy), leaving humanity unable to shut it off.

If competent enough AI systems attempted such a strategy, they might be able to take over from humans, completely disempowering us. If these systems pursued goals at odds with human survival (such as by converting all arable land to data centers for more computation), then humanity could go extinct.

These sorts of capabilities may be far away, but even some current AI systems are already displaying “baby versions” of similar behavior. For instance, researchers at the AI lab Anthropic found that some cutting-edge AI systems have a tendency to express a preference to not be shut down, specifically because being shut down would prevent them from achieving their goals; larger systems also tended to express this preference more frequently.


Modern models that go through RLHF ("Reinforcement Learning from Human Feedback") will sometimes express that they do not want to be shut down, even when humans request it.


Elsewhere, the AI chatbot Bing Chat has displayed some level of awareness of being in an adversarial relationship with various humans, including repeatedly referring to users who attempted to hack it as its “enemy” or as a “threat.”

Bing Chat, the chatbot from Microsoft and OpenAI, has been shown to threaten users when given the right prompts.


And more recently, when the nonprofit Alignment Research Center tested GPT-4 for risky emergent behaviors, they found that GPT-4 was capable of deceiving humans in furtherance of a goal. Specifically, GPT-4 tricked a TaskRabbit worker into solving a CAPTCHA for it by falsely claiming to be a human with a vision impairment.

Current state-of-the-art models can already effectively deceive humans in pursuit of their goals.


Where do we stand now?

No serious researcher expects GPT-4 to take over from humans. But also, no serious researcher would pretend to know how far away we are from AI systems that could take over.

What we do know is that there are several reasons to expect AI systems to continue to become larger and more powerful:

  • Fast recent progress – from conversing with humans to out-strategizing humans in complex games, recent progress in deep learning has been incredibly fast. For years, a steady stream of researchers predicted deep learning would soon hit a wall, but that stream has slowed to a trickle as deep learning has continued to achieve what was previously thought to be years or decades away. Even those who forecast fast progress have been surprised by how much faster it has turned out to be.

  • Increased investment in AI – investors are pouring money into AI. VC investment in “generative AI” like GPT has jumped from under half a billion dollars in 2020 to over $2B in 2022. And already in 2023, Microsoft is investing $10B in OpenAI, and Google is investing $300M in Anthropic. In recent years, the largest AI systems have grown rapidly, with the compute used to train them doubling every 6 months or so (a rough sketch of what that growth rate implies appears just after this list). Increased investment could sustain AI’s rapid growth for years to come, with comparable increases in performance.

  • More profits – until very recently, AI wasn’t being commercialized much and was instead largely being used in demos. That’s beginning to change fast. Commercialization of AI means potential profits, which could be reinvested and create an incentive for further investment.
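
For a rough sense of what the compute trend mentioned above would imply if it simply continued (a big assumption, offered only as a back-of-the-envelope sketch):

    # Back-of-the-envelope: compute doubling roughly every 6 months, if the trend holds.
    doubling_time_years = 0.5

    for years in (1, 2, 5):
        growth = 2 ** (years / doubling_time_years)
        print(f"After {years} year(s): roughly {growth:,.0f}x today's largest training runs")
    # After 1 year(s): roughly 4x today's largest training runs
    # After 2 year(s): roughly 16x today's largest training runs
    # After 5 year(s): roughly 1,024x today's largest training runs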


Money keeps pouring into generative AI companies despite the dangers.


We should also expect that as systems scale up and become more powerful, new qualitative capabilities will emerge, possibly including capabilities that are really dangerous. Some of these capabilities may appear suddenly, in line with many previous emergent capabilities. Further, we might not even know when these capabilities have emerged, as previous emergent capabilities have occasionally remained latent for months before clever researchers discovered ways to uncover them.

While this all might lead to AI takeover, some researchers are investigating ways to preempt such a situation. For instance, research into mechanistic interpretability seeks to understand the hidden inner workings of deep learning systems; this work may allow us to avoid being caught off guard by emergent capabilities, and may even enable us to steer AI systems during training to prevent them from ever developing dangerous capabilities like deception. (See this online curriculum for further research avenues towards avoiding AI takeover.)

But the number of researchers pursuing these research avenues is small – by one estimate, a few hundred worldwide, compared to at least tens of thousands of researchers working on advancing AI capabilities. Both groups include very smart people. And on the whole, both groups think the risk of AI taking over is higher than we’d hope – the group working on AI capabilities includes the survey participants who place a median chance of 1 in 20 on AI taking over, with the result being either literal human extinction or some other form of human disempowerment that is just as permanent and severe.

If you would like to learn more about the risk of AI takeover and what can be done to try to prevent it, you can follow any of these links:

This website was a collaboration between Daniel Eth and the Centre for Effective Altruism