2026-06-25 03:40 UTC | In-site rewrite | 2 min read | Updated: 2026-06-25 04:12 UTC

The Artificial Analysis Speech to Speech Index

Artificial Analysis announces a new Speech to Speech Index, a composite metric evaluating native speech-to-speech models on speech reasoning, conversational dynamics, and agentic performance. OpenAI GPT-Realtime-2 (High) leads with 77.2% overall, followed by xAI Grok Voice Think Fast 1.0 at 75.7%. Deepslate Opal is the fastest model at 0.44s time to first audio, while Gemini 3.1 Flash Live Preview (Minimal) is the lowest-cost model at $1.50 per hour of input audio.

Source: Hacker News AI | Author: theanonymousone

Artificial Analysis


June 23, 2026

Announcing the Artificial Analysis Speech to Speech Index

Announcing the Artificial Analysis Speech to Speech Index, our new synthesis metric for native Speech to Speech model quality, comprising Big Bench Audio, Full Duplex Bench, and 𝜏-Voice

The index provides a single measure of how well native Speech to Speech models perform, assessing Speech Reasoning (Big Bench Audio), Conversational Dynamics (Full Duplex Bench subset), and Agentic Performance (𝜏-Voice). Weighting is equal across all three datasets, and models must have valid results for all three to be included.
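The equal-weighting and inclusion rules described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the index is a simple arithmetic mean of the three component percentages; the function name, dictionary keys, and example scores are hypothetical and are not taken from the Artificial Analysis methodology.

```python
from typing import Optional

# Hypothetical keys for the three components named in the article.
COMPONENTS = (
    "speech_reasoning",         # Big Bench Audio
    "conversational_dynamics",  # Full Duplex Bench subset
    "agentic_performance",      # tau-Voice
)


def speech_to_speech_index(scores: dict) -> Optional[float]:
    """Return the equal-weighted mean of the three component scores.

    Returns None when any component is missing, mirroring the rule that
    a model needs valid results on all three datasets to be included.
    """
    values = [scores.get(name) for name in COMPONENTS]
    if any(value is None for value in values):
        return None
    return sum(values) / len(values)


# Illustrative numbers only, not real benchmark results.
complete = {
    "speech_reasoning": 90.0,
    "conversational_dynamics": 80.0,
    "agentic_performance": 40.0,
}
partial = {"speech_reasoning": 90.0, "conversational_dynamics": 80.0}

print(speech_to_speech_index(complete))  # 70.0
print(speech_to_speech_index(partial))   # None
```

With equal weights, a low score on one component pulls the index down as much as a high score on another pulls it up.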

Key takeaways

➤ Model performance: OpenAI GPT-Realtime-2 (High) leads at 77.2%, followed by @xAI Grok Voice Think Fast 1.0 at 75.7%, GPT-Realtime-1.5 at 72.0%, and @GoogleAI Gemini 3.1 Flash Live Preview (High) at 69.5%. Conversational Dynamics and Agentic Performance are key differentiators of frontier models, with GPT-Realtime-2 leading in Conversational Dynamics and Grok Voice Think Fast 1.0 leading in Agentic Performance.

➤ Speed: Deepslate Opal is the fastest model in the index with a time to first audio (TTFA) of 0.44s, followed by GPT-Realtime-1.5 at 0.82s and Grok Voice Think Fast 1.0 at 1.25s. GPT-Realtime-2 (High) records 2.33s, and Gemini 3.1 Flash Live Preview (High) records 2.98s.

➤ Cost: Gemini 3.1 Flash Live Preview (Minimal) is the lowest-cost model in the index at $1.50 per hour of input audio, followed by Gemini 3.1 Flash Live Preview (High) at $1.75, Grok Voice Think Fast 1.0 at $3.00, and GPT-Realtime-2 (High) at $4.14.

➤ Datasets incorporated: Big Bench Audio - 1,000 reasoning questions across Formal Fallacies, Navigate, Object Counting, and Web of Lies; Full Duplex Bench - pause handling, turn taking, interruption, and backchannel handling; 𝜏-Voice - end-to-end customer service task completion across Airline, Retail, and Telecom scenarios.

As always, we will continue to iterate on these benchmarks and plan to add more models.

Conversational Dynamics and Agentic Performance are the key differentiators of frontier native audio models, with GPT-Realtime-2 leading in Conversational Dynamics and Grok Voice Think Fast 1.0 leading in Agentic Performance. GPT-Realtime-2 (Minimal) tops Conversational Dynamics (Full Duplex Bench) at 96.1%. Agentic Performance (𝜏-Voice) is the hardest dimension by a wide margin - Grok Voice Think Fast 1.0 leads at 52.1%, ahead of GPT-Realtime-2 (High) at 39.8%, with every model below 53%. Speech Reasoning (Big Bench Audio) is tightly clustered at the top, led by Grok Voice Think Fast 1.0 at 97.1%.

Deepslate Opal has the fastest average time to first audio (TTFA) in the index at 0.44s, scoring 62.1%. GPT-Realtime-1.5 records 0.82s at a 72.0% index score, and Grok Voice Think Fast 1.0 records 1.25s at 75.7%. GPT-Realtime-2 (High) records 2.33s at 77.2%, with Gemini 3.1 Flash Live Preview (High) recording 2.98s at 69.5%.
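One way to read the speed and quality figures together is as a Pareto frontier: a model is on the frontier if no other model is both at least as fast and at least as accurate. The sketch below applies that idea to the five (TTFA, index score) pairs quoted in this article only; the frontier analysis is an illustration added here, not part of the Artificial Analysis methodology, and the function name is hypothetical.

```python
# (model, TTFA in seconds, index score in %) as quoted in the article.
MODELS = [
    ("Deepslate Opal", 0.44, 62.1),
    ("GPT-Realtime-1.5", 0.82, 72.0),
    ("Grok Voice Think Fast 1.0", 1.25, 75.7),
    ("GPT-Realtime-2 (High)", 2.33, 77.2),
    ("Gemini 3.1 Flash Live Preview (High)", 2.98, 69.5),
]


def pareto_frontier(models):
    """Return names of models not dominated on (lower TTFA, higher score).

    A model is dominated when another model is no slower and no less
    accurate, and strictly better on at least one of the two axes.
    """
    frontier = []
    for name, ttfa, score in models:
        dominated = any(
            other_ttfa <= ttfa
            and other_score >= score
            and (other_ttfa < ttfa or other_score > score)
            for _, other_ttfa, other_score in models
        )
        if not dominated:
            frontier.append(name)
    return frontier


print(pareto_frontier(MODELS))
```

On these five data points, the first four models each trade latency for a higher index score, while Gemini 3.1 Flash Live Preview (High) is both slower and lower-scoring than GPT-Realtime-1.5, so it falls off the frontier. Cost is not considered in this comparison.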

Gemini 3.1 Flash Live Preview (Minimal) has the lowest cost per hour of input audio in the index at $1.50, scoring 56.6%. Gemini 3.1 Flash Live Preview (High) costs $1.75 at 69.5%, Grok Voice Think Fast 1.0 costs $3.00 at 75.7%, and GPT-Realtime-2 (High) costs $4.14 at 77.2%.
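As a rough way to compare the cost figures above, the following sketch divides each quoted index score by its quoted cost per hour of input audio. "Index points per dollar" is a derived ratio introduced here for illustration, not a metric published by Artificial Analysis, and the function name is hypothetical.

```python
# (model, USD per hour of input audio, index score in %) as quoted in the article.
COSTS = [
    ("Gemini 3.1 Flash Live Preview (Minimal)", 1.50, 56.6),
    ("Gemini 3.1 Flash Live Preview (High)", 1.75, 69.5),
    ("Grok Voice Think Fast 1.0", 3.00, 75.7),
    ("GPT-Realtime-2 (High)", 4.14, 77.2),
]


def points_per_dollar(rows):
    """Return (model, score / cost) pairs sorted from highest to lowest ratio."""
    ratios = [(name, score / cost) for name, cost, score in rows]
    return sorted(ratios, key=lambda pair: pair[1], reverse=True)


ranked = points_per_dollar(COSTS)
for name, ratio in ranked:
    print(f"{name}: {ratio:.1f} index points per dollar-hour")
```

By this ratio, Gemini 3.1 Flash Live Preview (High) ranks first and GPT-Realtime-2 (High) last among the four models with quoted costs. The ratio ignores latency and treats index points as linearly comparable, which is a simplification.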

Full breakdown: https://artificialanalysis.ai/speech-to-speech

Methodology: https://artificialanalysis.ai/methodology/speech-to-speech-benchmarking
