Text-to-speech (TTS) technology is constantly evolving, shifting from traditional modular pipelines to integrated large audio models (LAMs). Notably, the release of Fish Audio’s latest model, S2-Pro, has pushed the field toward an open architecture that delivers high-resolution, multi-speaker speech synthesis with latency under 150ms. Today, we will take a closer look at this innovative technology.
With the recent advancements in artificial intelligence, TTS technology has evolved beyond simple voice synthesis to a level where it can express emotions, clone voices, and enable real-time interaction. Fish Audio’s S2-Pro leads this trend and presents new possibilities for TTS technology.
The most significant feature of S2-Pro is its dual-AR architecture. Traditional TTS models have struggled to balance sequence length against acoustic detail. To solve this problem, S2-Pro separates generation into two stages: a ‘Slow AR’ model and a ‘Fast AR’ model. The ‘Slow AR’ model processes the linguistic input and generates semantic tokens along the time axis, using 4 billion parameters to capture long-range dependencies, prosody, and the structural nuances of speech. The ‘Fast AR’ model, in turn, handles the acoustic dimension, performing residual codebook prediction for each semantic token. With 400 million parameters, it efficiently generates high-frequency acoustic detail: timbre, breathing, and texture.
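To make the two-stage split concrete, here is a toy sketch of a dual-AR generation loop. Everything in it is a hypothetical stand-in: the real Slow AR and Fast AR models are large transformers, and the function names and dummy arithmetic below are ours, not Fish Audio’s.

```python
def slow_ar_step(text_tokens, history):
    """Stand-in for the 'Slow AR' stage: emits one semantic token per
    time step, conditioned on the text and the tokens so far (dummy math)."""
    return (len(history) * 31 + sum(text_tokens)) % 1024

def fast_ar_step(semantic_token, num_codebooks=8):
    """Stand-in for the 'Fast AR' stage: predicts one residual code per
    codebook for a single semantic token (dummy math)."""
    return [(semantic_token * (k + 3)) % 256 for k in range(num_codebooks)]

def generate(text_tokens, num_steps=4):
    semantic, acoustic = [], []
    for _ in range(num_steps):
        s = slow_ar_step(text_tokens, semantic)   # stage 1: time axis
        semantic.append(s)
        acoustic.append(fast_ar_step(s))          # stage 2: acoustic depth
    return semantic, acoustic

sem, ac = generate([5, 17, 42])
print(len(sem), len(ac), len(ac[0]))  # 4 semantic steps, 8 codes each
```

The point of the structure is visible even in the toy: the expensive, long-range reasoning happens once per time step in the slow model, while the fast model fans out cheaply across codebook depth.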
The system also relies on Residual Vector Quantization (RVQ). RVQ compresses raw audio into discrete tokens across multiple layers (codebooks): the first layer captures the primary acoustic features, and each subsequent layer encodes the residual error left by the layers before it. This lets the model reconstruct 44.1kHz audio while keeping the token count manageable for a transformer, so the TTS system can operate efficiently.
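The layer-by-layer residual idea can be shown with a minimal sketch. This is illustrative only: real neural codecs learn their codebooks and quantize latent frames, whereas here the codebooks are random and we quantize a plain vector (entry 0 of each layer is kept as the zero vector so a layer can “pass” when nothing improves the residual).

```python
import math
import random

random.seed(0)
NUM_LAYERS, CODEBOOK_SIZE, DIM = 4, 16, 8

# Random codebooks; entry 0 of each layer is the zero vector.
codebooks = [
    [[0.0] * DIM if i == 0 else [random.gauss(0, 1) for _ in range(DIM)]
     for i in range(CODEBOOK_SIZE)]
    for _ in range(NUM_LAYERS)
]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rvq_encode(x):
    """One code index per layer; each layer quantizes the residual
    left over from the previous layers."""
    residual, codes = list(x), []
    for cb in codebooks:
        idx = min(range(len(cb)), key=lambda i: dist(residual, cb[i]))
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, cb[idx])]  # pass error down
    return codes

def rvq_decode(codes):
    """Reconstruction is just the sum of the chosen entries."""
    out = [0.0] * DIM
    for layer, idx in enumerate(codes):
        out = [o + c for o, c in zip(out, codebooks[layer][idx])]
    return out

x = [random.gauss(0, 1) for _ in range(DIM)]
codes = rvq_encode(x)
err_layer1 = dist(x, codebooks[0][codes[0]])
err_final = dist(x, rvq_decode(codes))
print(err_final <= err_layer1)  # deeper layers never make things worse
```

Each deeper layer can only shrink (or leave unchanged) the reconstruction error, which is why a handful of codebooks can approximate high-resolution audio with a short token sequence.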
S2-Pro offers emotion control that its developers describe as ‘absurdly controllable emotion.’ This is implemented through two key mechanisms: In-Context Learning (ICL) and natural language inline control. In-Context Learning overcomes a limitation of existing TTS models, which required separate fine-tuning to mimic a specific voice. S2-Pro instead leverages the transformer’s in-context learning capability: it extracts the speaker’s identity and emotional state from a 10-30 second reference audio clip, places it as a prefix in the context window, and continues the sequence in the same voice and style.
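In sequence terms, in-context cloning amounts to prefix continuation. The sketch below shows one plausible way the layout could look; the separator value and the exact ordering are hypothetical illustrations, not the actual S2-Pro token format.

```python
def build_icl_sequence(ref_text_tokens, ref_audio_tokens, target_text_tokens,
                       sep=-1):
    """Place the reference clip (its transcript tokens plus its audio
    tokens) before the target text, so an autoregressive model that
    continues this sequence keeps the reference voice and style."""
    return (ref_text_tokens + [sep]
            + ref_audio_tokens + [sep]
            + target_text_tokens)

seq = build_icl_sequence([7, 8], [101, 102, 103], [9, 10, 11])
print(seq)  # [7, 8, -1, 101, 102, 103, -1, 9, 10, 11]
```

No weights change anywhere in this process; the “cloning” is entirely a property of what sits in the context window before generation starts.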
Furthermore, the model supports natural language inline control, allowing emotions to be adjusted dynamically within a single generation. Because the training data includes descriptive linguistic markers, developers can insert natural language tags directly into the text prompt to adjust voice tone, intensity, and rhythm in real time. For example, a prompt like ‘[whisper] I have a secret [laugh] that I cannot tell you’ causes the model to naturally render a whispered voice punctuated by laughter.
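On the client side, a caller often wants to separate the spoken text from the control markers, e.g. for display or validation. Here is a minimal sketch of such a parser; the bracket syntax follows the article’s example, but the parsing code is ours, not Fish Audio’s tokenizer.

```python
import re

TAG = re.compile(r"\[(\w+)\]")

def parse_inline(prompt):
    """Split a prompt into (plain_text, [(char_position, tag), ...]),
    where char_position indexes into the tag-free text."""
    pieces, tags, pos = [], [], 0
    for m in TAG.finditer(prompt):
        pieces.append(prompt[pos:m.start()])
        tags.append((len("".join(pieces)), m.group(1)))
        pos = m.end()
    pieces.append(prompt[pos:])
    return "".join(pieces), tags

text, tags = parse_inline(
    "[whisper] I have a secret [laugh] that I cannot tell you")
print([name for _, name in tags])  # ['whisper', 'laugh']
```

Keeping the tag positions (rather than just stripping them) makes it possible to map expressive markers back onto words, for instance when rendering captions.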
For integrating TTS into real-time applications, the most important constraint is ‘Time to First Audio’ (TTFA). S2-Pro achieves a TTFA of approximately 100ms on NVIDIA H200 hardware, keeping latency under 150ms and creating an environment well suited to real-time interaction. This performance comes from SGLang and RadixAttention: SGLang is a high-performance serving framework, and RadixAttention provides efficient Key-Value (KV) cache management. When the same ‘master’ voice prompt is used repeatedly, RadixAttention caches the KV state of the prefix, so the reference audio does not need to be recomputed for each request.
The architecture is also designed to hold multiple speaker identities within the same context window, so complex dialogues or multi-character narration can be generated in a single inference call, avoiding the latency of switching models or reloading weights for each speaker.
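A single-call dialogue amounts to packing every speaker’s reference plus the turn script into one context. The layout below, including the speaker-prefix and separator conventions, is a hypothetical illustration of the idea rather than S2-Pro’s actual input format.

```python
def build_dialogue_context(speakers, turns, sep="<sep>"):
    """speakers: {name: reference_prompt}; turns: [(name, text), ...].
    References come first so every later turn can attend to its voice."""
    parts = [f"{name}:{ref}" for name, ref in speakers.items()]
    parts += [f"{name}:{text}" for name, text in turns]
    return sep.join(parts)

ctx = build_dialogue_context(
    {"alice": "<ref_a>", "bob": "<ref_b>"},
    [("alice", "Hi Bob."), ("bob", "Hello Alice.")],
)
print(ctx.count("<sep>"))  # 3 separators join 4 segments
```

Because both references sit in one window, alternating speakers is just a matter of which prefix a turn points back to; no model swap or weight reload is involved.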
Fish Audio’s S2-Pro opens a new horizon for TTS technology. Its dual-AR architecture, RVQ coding, and inline control enable high-quality, expressive synthesis, while its low-latency performance makes it especially well suited to real-time applications. These innovations point the way for TTS and should enable more innovative services and experiences across a growing range of fields.
Original Source: Fish Audio Releases Fish Audio S2: A New Generation of Expressive Text-to-Speech (TTS) with Absurdly Controllable Emotion