OpenAI has introduced o3 – a new family of language models, but for now they will only be available to researchers for testing.
OpenAI has developed updated versions of its large reasoning language models. The new model, called o3, succeeds o1, which the company introduced in September. Like o1, it spends time thinking about a problem in order to give better answers to questions that require step-by-step logical reasoning.
OpenAI says o3 scored higher than its predecessor on several benchmarks used to evaluate AI, including tests of advanced programming skills and of complex mathematical and scientific problem-solving.
The o3 model is trained with reinforcement learning to “think” before responding. When asked a question, it pauses, “analyzes” the information, and “explains” its reasoning as it works. After a while, the model summarizes what it believes to be the most accurate answer. o1 used the same reasoning principles, but users can now “regulate” the reasoning time: the longer the model reasons, the more accurate the answer.
On ARC-AGI – a test designed to assess whether an AI system can effectively learn new skills beyond the data on which it was trained – o3 scored 87.5% at the high-compute setting. Even in the worst case, at the low-compute setting, the model tripled o1's performance.
The developers also report that the model outperforms o1 by 22.8 percentage points on SWE-Bench Verified – a benchmark focused on programming tasks. In addition, it reached a rating of 2727 on Codeforces programming tasks. Competitors rated above 2600 earn the title of International Grandmaster on the platform – there are just over 300 such people on Earth.
So far, the new model's high-profile achievements have been reported only by OpenAI itself. The model will not yet be released to the public or to subscribers. Until January 10th, the company is accepting applications from groups of security testers and researchers, who will be the first to assess the new model's capabilities and risks.