OpenAI announced a new version of ChatGPT that accepts audio, image, and text inputs and generates outputs in audio, image, and text. OpenAI is calling the new version ChatGPT 4o, with the “o” standing for “omni,” a combining form that means “all.”
ChatGPT 4o (Omni)
OpenAI described this new version of ChatGPT as a step toward more natural human-machine interaction, one that responds to user inputs at about the same speed as a human-to-human conversation. The new version matches GPT-4 Turbo in English and significantly outperforms it in other languages. API performance also improves significantly: the model is faster and 50% less expensive to run.
The announcement explains:
“As measured on traditional benchmarks, GPT-4o achieves GPT-4 Turbo-level performance on text, reasoning, and coding intelligence, while setting new high watermarks on multilingual, audio, and vision capabilities.”
Advanced Voice Processing
The previous method for voice communication chained together three separate models: the first transcribed voice input to text, the second (GPT-3.5 or GPT-4) processed that text and produced a text response, and the third converted the response back into audio. That method is said to lose nuance at each conversion step.
OpenAI described the downsides of the previous approach, which the new approach (presumably) overcomes:
“This process means that the main source of intelligence, GPT-4, loses a lot of information—it can’t directly observe tone, multiple speakers, or background noises, and it can’t output laughter, singing, or express emotion.”
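The three-model voice pipeline described above can be illustrated with a small sketch. This is not OpenAI's code: the `AudioClip` type and the three stage functions are hypothetical stand-ins, invented here only to show why tone, speaker count, and background sound cannot survive a text-only handoff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioClip:
    """Hypothetical stand-in for a recording and the information it carries."""
    words: str
    tone: str
    speaker_count: int
    background: str


def transcribe(clip: AudioClip) -> str:
    # Stage 1 (speech-to-text): only the words are passed on.
    return clip.words


def generate_reply(text: str) -> str:
    # Stage 2 (text model): sees plain text, so tone and speakers are invisible.
    return f"You said: {text}"


def synthesize(text: str) -> AudioClip:
    # Stage 3 (text-to-speech): must fall back to a default delivery,
    # because nothing about the original audio reached this stage.
    return AudioClip(words=text, tone="neutral", speaker_count=1, background="none")


def cascaded_voice_pipeline(clip: AudioClip) -> AudioClip:
    # The three stages chained in order, as in the older approach.
    return synthesize(generate_reply(transcribe(clip)))


clip = AudioClip(
    words="what a great day",
    tone="sarcastic",
    speaker_count=2,
    background="rain",
)
reply = cascaded_voice_pipeline(clip)
print(reply.words)
print(reply.tone)
```

In this toy version the sarcastic tone, the second speaker, and the rain are all gone by the time the middle stage runs, which is the loss the quote describes.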
The new version doesn’t need three different models because a single model handles all inputs and outputs, processing audio end to end. Interestingly, OpenAI states that it has not yet explored the full capabilities of the new model and does not yet fully understand its limitations.
New Guardrails And An Iterative Release
GPT-4o features new guardrails and filters intended to keep it safe and to avoid unintended voice outputs. However, today’s announcement says that OpenAI is rolling out only text and image inputs and text outputs at launch, along with limited audio. GPT-4o is available on both the free and paid tiers, with Plus users receiving message limits five times higher.
Audio capabilities are due for a limited alpha-phase release for ChatGPT Plus and API users within weeks.
The announcement explained:
“We recognize that GPT-4o’s audio modalities present a variety of novel risks. Today we are publicly releasing text and image inputs and text outputs. Over the upcoming weeks and months, we’ll be working on the technical infrastructure, usability via post-training, and safety necessary to release the other modalities. For example, at launch, audio outputs will be limited to a selection of preset voices and will abide by our existing safety policies.”