Technology

AI System Achieves Human-Level General Intelligence

Published December 31, 2024

A new artificial intelligence (AI) model has reached human-level results on a test designed to measure "general intelligence". On December 20, OpenAI's o3 system scored 85% on the ARC-AGI benchmark, well above the previous AI best of 55% and on par with the average human score.

The development of artificial general intelligence (AGI) is a primary target for many AI research laboratories. At first glance, OpenAI seems to have made a notable step towards realizing this ambitious goal.

Although skepticism remains, many AI researchers and developers sense that a pivotal moment has occurred: AGI now feels more tangible, more urgent, and closer than they had thought. But is this perception accurate?

Understanding General Intelligence

To appreciate the achievement of the o3 system, it's important to know what the ARC-AGI test entails. Essentially, it measures how efficiently an AI system can adapt to new information, gauging its "sample efficiency"—the number of examples it requires to learn and generalize to new situations.

For comparison, a model like ChatGPT (GPT-4) has limited sample efficiency. It was trained on millions of examples of human-written text, from which it formed probabilistic rules about which word combinations are most likely. This makes it adept at common tasks, but it struggles with rarer or more unusual tasks, for which it has seen too little relevant data.

Until AI systems demonstrate the capacity to learn with fewer examples and adapt more efficiently, they will likely be confined to repetitive tasks where occasional errors are acceptable.

A key measure of intelligence is the ability to generalize, meaning to solve new or unfamiliar problems effectively using limited information.

Testing with Patterns and Grids

The ARC-AGI benchmark assesses sample-efficient adaptation through tasks involving grids, where the AI must identify the pattern that transforms one grid layout into another. For each question, the AI is shown three example transformations; it must infer the underlying rule from those examples and then apply it to a fourth, unseen grid.
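The task format can be sketched in a few lines of Python. The grids and the mirroring rule below are made up for illustration, not taken from an actual ARC-AGI task, but they show the structure: a handful of demonstration pairs, an inferred rule, and an unseen test grid.

```python
# Toy illustration of the ARC-AGI task format. Grids are lists of
# lists of small ints (colours). The rule here (horizontal mirroring)
# is an invented example, not a real benchmark task.

def mirror(grid):
    """Candidate rule: reflect each row left-to-right."""
    return [row[::-1] for row in grid]

# A few demonstration pairs, mimicking the benchmark's format.
examples = [
    ([[1, 0], [2, 3]], [[0, 1], [3, 2]]),
    ([[5, 5, 0]],      [[0, 5, 5]]),
    ([[7], [8]],       [[7], [8]]),  # one-cell rows are unchanged
]

# The candidate rule must reproduce every demonstration pair...
assert all(mirror(inp) == out for inp, out in examples)

# ...before being applied to the unseen test grid.
print(mirror([[1, 2, 3]]))  # -> [[3, 2, 1]]
```

A solver scores a point only if its rule, learned from the examples alone, produces the correct output grid for the test input.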

These tasks resemble the IQ tests that many may remember from their schooling.

Learning Weak Rules and Adapting

While the exact methods employed by OpenAI to train the o3 system remain unclear, the results indicate that it is exceptionally adaptable. The model seems to successfully identify general rules from just a few examples.

Identifying patterns without making unnecessary assumptions is crucial for this process. In machine-learning terms, the model appears to find the "weakest" rules that fit, that is, the least specific rules possible, because weaker rules tend to generalize to a broader range of new situations.

For instance, a straightforward expression of a rule might be, "Any shape with an extended line will move to the end of that line, obscuring any shapes it overlaps with." This simplicity facilitates better adaptation.

Exploring Problem-Solving Strategies

Although the specific training methods used for the o3 system are not yet public, it’s thought that the model analyzes various "chains of thought" to determine the best method for solving tasks. This approach may resemble how Google's AlphaGo system evaluated different move sequences to defeat top-level Go players.

These chains of thought might work like small programs that the model tests against the examples, keeping whichever one fits. If the system is akin to AlphaGo, it may rely on heuristics, rules of thumb that narrow the search, to decide which chain of thought to pursue.
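OpenAI has not published o3's method, so the following is only a loose analogy. Program-search approaches to ARC-style tasks enumerate candidate transformations, discard those inconsistent with the demonstration pairs, and use a simplicity heuristic to pick among the survivors. The candidate set and the "shortest name" heuristic below are stand-ins invented for this sketch.

```python
# Hedged sketch of program search over ARC-style tasks. The named
# functions play the role of candidate "chains of thought"; this is
# an illustrative analogy, not OpenAI's actual (unpublished) method.

CANDIDATES = {
    "identity":  lambda g: g,
    "mirror":    lambda g: [row[::-1] for row in g],   # reflect rows
    "flip":      lambda g: g[::-1],                    # reverse row order
    "transpose": lambda g: [list(c) for c in zip(*g)], # swap rows/columns
}

def solve(examples, test_input):
    # Keep only the rules that reproduce every demonstration pair.
    consistent = [name for name, fn in CANDIDATES.items()
                  if all(fn(i) == o for i, o in examples)]
    if not consistent:
        return None  # no candidate explains the examples
    # Weak-rule heuristic: prefer the "simplest" survivor. Shortest
    # name is a toy proxy; a real system would score description length.
    best = min(consistent, key=len)
    return CANDIDATES[best](test_input)

# One demonstration pair whose rule is reversing the row order.
examples = [([[1, 2], [3, 4]], [[3, 4], [1, 2]])]
print(solve(examples, [[5, 6], [7, 8]]))  # -> [[7, 8], [5, 6]]
```

The heuristic matters because several candidate rules can fit a small number of examples; choosing the weakest one is what lets the answer generalize to the hidden test grid.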

Future of AGI and Key Questions

The pressing question remains: Does achieving these results bring us closer to true AGI? If the operations of the o3 model are as speculated, then the underlying framework may not represent a substantial leap forward compared to earlier iterations. The conceptual understanding the model acquires from language might not be significantly more capable of generalization than existing models.

Our current understanding of the o3 model is limited. OpenAI has kept its disclosures minimal, sharing information mainly through a few media briefings and limited testing with selected researchers and institutions focused on AI safety.

Ultimately, comprehending the full potential of the o3 system will involve thorough evaluations and understanding its capabilities, including its failure and success rates.

Once o3 is publicly released, we will gain clearer insights into its adaptability relative to humans. If it proves to be highly adaptable, it could herald a transformative economic impact, introducing a new era of self-improving intelligence. This scenario would necessitate revised benchmarks for AGI assessment and significant discussions on governance.

However, if the results are not as promising, it will still represent a remarkable achievement, with day-to-day life remaining largely unchanged.

AI, AGI, OpenAI