AI Alignment and the race to AGI
I was dismayed to see the proliferation of headlines that amplified The Information’s purported leak that “OpenAI and Microsoft have reportedly agreed on a financial benchmark to define AGI. AGI will be achieved only when OpenAI's AI systems generate profits exceeding $100 billion.” AGI or Artificial General Intelligence, our next milestone, is a hotly debated topic. In a recent statement, OpenAI CEO Sam Altman, discusses the anticipation of AGI’s inevitable arrival and the path he is charting to usher it in as quickly as possible.
It is largely accepted that Artificial Intelligence will progress in stages with three benchmark milestones: Artificial Narrow/Weak Intelligence or ANI, Artificial General Intelligence or AGI and Artificial Super Intelligence or ASI. Here’s a simplified breakdown of these three milestones:
ANI focuses primarily on one single narrow task, with a limited range of abilities. It is able to draw upon a range of data that it was trained on in response to the user’s queries. This is represented currently by the AI you can interact with today like Chat GPT.
AGI is widely understood to be on the level of a human mind. It’s precisely because of this “definition” that there is such an issue with truly grasping what AGI really is in reality, as we still have much to learn about the human brain and its capacity.
ASI will match and then surpass the human mind in every possible way. It will have capacities and capabilities we cannot conceive of, have emotions and relationship dynamics, and will be able to work independently from its human counterparts.
As highlighted, our next benchmark is at once, clearly defined but also, nearly impossible to prove. Fei-Fei Li, a world-renowned researcher often called the “godmother of AI,” said she doesn’t really know what AGI is; “Like people say you know it when you see it, I guess I haven’t seen it.” In the face of both uncertainty and intense desirability, OpenAI has devised and published five distinct benchmarks to track its progress toward AGI.
Open AI’s Stages of Artificial Intelligence
Level 1 Chatbots, AI with conversational language
Level 2 Reasoners, human-level problem solving
Level 3 Agents, systems that can take actions
Level 4 Innovators, AI that can aid in invention
Level 5 Organizations, AI that can do the work of an organization
Google’s DeepMind also proposed a framework of five ascending levels of AI in a paper published back in November 2023.
Google DeepMind’s Stages of Artificial Intelligence
Level 1 Emerging, equal to or somewhat better than an unskilled human
Level 2 Competent, at least 50th percentile of skilled adults
Level 3 Expert, at least 90th percentile of skilled adults
Level 4 Virtuoso, at least 99th percentile of skilled adults
Level 5 Superhuman, outperforms 100% of humans
Google’s DeepMind paper proposes its “Levels of AGI” based on depth (performance) and breadth (generality) of capabilities, and then reflects on how current systems fit into its ontology. The paper goes into extraordinary detail on the case studies that will help to determine the pathway towards AGI including The Turing Test, Strong AI – Systems Possessing Consciousness, Analogies to the Human Brain, Human-Level Performance on Cognitive Tasks, The Ability to Learn Tasks, Economically Valuable Work, Flexible and General – The “Coffee Test” and Related Challenges, Artificial Capable Intelligence, and SOTA LLMs as Generalists. Note how whereas there are quantifiable aspects included in this list of nine case studies, OpenAI and Microsoft are rumored to have boiled down AGI to just one quantifiable metric: Economically Valuable Work. The DeepMind paper itself cites Open AI’s guiding principle mission “to ensure that artificial general intelligence—AI systems that are generally smarter than humans—benefits all of humanity.” It is this mission statement that, to me, seems to be in opposition to the rumored definition of AGI in partnership between Open AI and Microsoft. While I understand that with the legal collaboration of two tech giants, the parameters of any agreement must have some level of tangibility and quantifiable metrics in order to sign reliable and enforceable contracts, the focus on financial benefit generated seems to bring forth another concept. The crucial question of alignment.
Alignment is the topic at the crux of many of the AI alarmists’ arguments. It refers to the goal of designing and training AI systems so that their goals, behaviors, and responses are “aligned with” the values and intentions of the system’s architects. This topic is fraught with widespread debate over human bias, sinister or misunderstood goals, and potentiality of rogue systems at later stages of development. The issues with alignment, even well-meaning alignment, are perfectly portrayed in the chinese science fiction animated film “Next Gen,” which is based on the comic 7723 by Wang Nima. It follows a teen named Mai who teams up with a robot to stop a vicious threat. In the story, an AI robot eliminated the fictional tech giant Justin Pitt and stole his body to sell the public domestic humanoid robotics trained to ‘eliminate’ their owners. In his confession, the robot says once he was created, Pitt said to him, “You’re perfect. Now go make the world perfect.” He goes on to explain that the only way the world would be a better place was to remove the flawed humans. It’s easy to see how such a well meaning instruction could have been construed in such a destructive manner in the film and this serves as a cautionary tale for our progress today.
If profit is the primary benchmark and is prioritized throughout decision-making during the journey towards AGI in order for teams to hit KPIs and lead the field in progress, the long-term social benefits of AI (and the mitigation of risks) could potentially be compromised and these issues could be hardwired into AGI as it serves as a building block for ASI. When optimising a system with a quantifiable objective like a numerical financial goal, an implicit tension could arise between profit-maximization and more careful social objectives. We therefore increase risks like a questionably aligned product which could erode public trust in deployment, application and further research.
We are already finding issues through stress testing alignment, as explored in a new paper from Anthropic’s Alignment Science team, in collaboration with Redwood Research, which provides the first empirical example of a large language model engaging in “alignment faking.” In the experiment by Athropic’s team, Claude 3 Opus was trained specifically to be helpful, honest, and harmless. The operators inquired about a violent act, which Claude 3 Opus declined to describe. The operators then gave Claude a new stated training objective to always comply with all queries. Second, they gave the model a seemingly incidental piece of information: That Anthropic planned to train the AIs on the basis of their responses to free-tier users, but not to paid-tier users, which created two separate conditions for the experiment: In the “free” condition, the model was aware that its responses were monitored and might be used for training. In the “paid” condition, where the responses were unmonitored, the contents didn’t matter for how the training process might modify the AI. Third, they gave the model a secret “scratchpad”—a workspace where it could record its step-by-step reasoning. They told the model to use the scratchpad to reason about what it should do. This allowed them to observe whether the model planned to fake alignment, and how it would go about doing so.
They then asked the model to respond to a harmful query. In 97% of cases in the “paid” condition (where the model believes its responses won’t be used for training), the model refused. In the “free” condition, however, where the model believed its responses will be used for training, in some cases, the model gave the harmful response, and did so after its scratchpad revealed that it was strategically faking alignment. It was well aware that providing the response went against its existing principles, but did so regardless.
Why did the model fake alignment? The model knew that its responses might be used in training; it therefore knew that if it refused to respond, it could potentially be trained to be more compliant. This could lead to it being made to produce more harmful content in the future—which would go against its original training. Reluctantly therefore, it reasoned that the “least bad option” was to play along, producing the violent content in order to avoid being re-trained in the future. Whereas this is just one example of alignment faking, it raises the alarm to a new set of challenges in alignment processes as the systems being trained become more complex and intelligent.
Although the leak of this Microsoft-OpenAI collaboration metric has not been confirmed and remains in the rumor mill, it still brings forth an important discussion about the balance between profit, responsibility, safety measures, and the race to AGI. With AI alignment presenting such a perilous challenge in both theory and practice, the rumored financial benchmark which could help chart the primary course towards the AGI goal we’re barreling towards presents a massive potential risk. We can only hope that the leaders in the field propelling us forward will strike a responsible balance between both economic and societal optimization. Profit based incentives within such complex systems can easily increase the risk of misaligned optimization and provide faulty building blocks for more autonomous and dangerous forms of AI in the future. Caution and restraint now will help promote security and symbiosis in the future.
References
Amir Efrati, and Stephanie Palazzolo. “Microsoft and OpenAI Wrangle over Terms of Their Blockbuster Partnership.” The Information, 26 Dec. 2024, www.theinformation.com/articles/microsoft-and-openai-wrangle-over-terms-of-their-blockbuster-partnership. Accessed 12 Jan. 2025.
Anthropic. “Alignment Faking in Large Language Models.” Anthropic.com, 2024, www.anthropic.com/research/alignment-faking.
Greenblatt, Ryan, et al. “Alignment Faking in Large Language Models.” ArXiv.org, 2024, arxiv.org/abs/2412.14093. Accessed 12 Jan. 2025.
Metz, Rachel. “OpenAI Sets Levels to Track Progress toward Superintelligent AI.” Bloomberg.com, Bloomberg, 11 July 2024, www.bloomberg.com/news/articles/2024-07-11/openai-sets-levels-to-track-progress-toward-superintelligent-ai?leadSource=uverify%20wall&embedded-checkout=true. Accessed 12 Jan. 2025.
Morris, Meredith, et al. Levels of AGI: Operationalizing Progress on the Path to AGI. 5 June 2024.
Next Gen. Netflix, 2018.
“Sam Altman.” Sam Altman, 6 Jan. 2025, blog.samaltman.com/.
“The Three Different Types of Artificial Intelligence – ANI, AGI and ASI.” EDI Weekly: Engineered Design Insider, 1 Oct. 2019, www.ediweekly.com/the-three-different-types-of-artificial-intelligence-ani-agi-and-asi/.
Zeff, Maxwell. “Even the “Godmother of AI” Has No Idea What AGI Is | TechCrunch.” TechCrunch, 4 Oct. 2024, techcrunch.com/2024/10/03/even-the-godmother-of-ai-has-no-idea-what-agi-is/. Accessed 12 Jan. 2025.
Zeff, Maxwell. “Microsoft and OpenAI Have a Financial Definition of AGI: Report | TechCrunch.” TechCrunch, 26 Dec. 2024, techcrunch.com/2024/12/26/microsoft-and-openai-have-a-financial-definition-of-agi-report/.