Text-to-speech (TTS) technology is constantly evolving, shifting from traditional modular pipelines to integrated large audio models (LAMs). Notably, the release of Fish Audio’s latest model, S2-Pro, has pushed the field toward an open architecture that delivers high-resolution, multi-speaker speech synthesis with latency under 150ms. Today, we will take a closer look at this innovative technology.
With the recent advancements in artificial intelligence, TTS technology has evolved beyond simple voice synthesis to a level where it can express emotions, clone voices, and enable real-time interaction. Fish Audio’s S2-Pro leads this trend and presents new possibilities for TTS technology.
The most significant feature of S2-Pro is its dual-AR architecture. Traditional TTS models have struggled to balance sequence length against acoustic detail. To solve this problem, S2-Pro separates generation into two stages: a ‘Slow AR’ model and a ‘Fast AR’ model. The ‘Slow AR’ model processes the linguistic input and generates semantic tokens along the time axis, using 4 billion parameters to capture long-range dependencies, prosody, and the structural nuances of speech. The ‘Fast AR’ model, in turn, handles the acoustic dimension, performing residual codebook prediction for each semantic token. With 400 million parameters, it efficiently generates high-frequency acoustic detail: timbre, breathing, and texture.
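To make the two-stage split concrete, here is a toy sketch of a dual-AR generation loop. Everything in it is a hypothetical stand-in: the real Slow AR and Fast AR models are large transformers, and the function names and dummy arithmetic below are ours, not Fish Audio’s.

```python
def slow_ar_step(text_tokens, history):
    """Stand-in for the 'Slow AR' stage: emits one semantic token per
    time step, conditioned on the text and the tokens so far (dummy math)."""
    return (len(history) * 31 + sum(text_tokens)) % 1024

def fast_ar_step(semantic_token, num_codebooks=8):
    """Stand-in for the 'Fast AR' stage: predicts one residual code per
    codebook for a single semantic token (dummy math)."""
    return [(semantic_token * (k + 3)) % 256 for k in range(num_codebooks)]

def generate(text_tokens, num_steps=4):
    semantic, acoustic = [], []
    for _ in range(num_steps):
        s = slow_ar_step(text_tokens, semantic)   # stage 1: time axis
        semantic.append(s)
        acoustic.append(fast_ar_step(s))          # stage 2: acoustic depth
    return semantic, acoustic

sem, ac = generate([5, 17, 42])
print(len(sem), len(ac), len(ac[0]))  # 4 semantic steps, 8 codes each
```

The point of the structure is visible even in the toy: the expensive, long-range reasoning happens once per time step in the slow model, while the fast model fans out cheaply across codebook depth.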
The system also relies on Residual Vector Quantization (RVQ). RVQ compresses raw audio into discrete tokens across multiple layers (codebooks): the first layer captures the primary acoustic features, and each subsequent layer encodes the residual error left by the layers before it. This lets the model reconstruct 44.1kHz audio while keeping the token count manageable for a transformer, so the TTS system can operate efficiently.
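The layer-by-layer residual idea can be shown with a minimal sketch. This is illustrative only: real neural codecs learn their codebooks and quantize latent frames, whereas here the codebooks are random and we quantize a plain vector (entry 0 of each layer is kept as the zero vector so a layer can “pass” when nothing improves the residual).

```python
import math
import random

random.seed(0)
NUM_LAYERS, CODEBOOK_SIZE, DIM = 4, 16, 8

# Random codebooks; entry 0 of each layer is the zero vector.
codebooks = [
    [[0.0] * DIM if i == 0 else [random.gauss(0, 1) for _ in range(DIM)]
     for i in range(CODEBOOK_SIZE)]
    for _ in range(NUM_LAYERS)
]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rvq_encode(x):
    """One code index per layer; each layer quantizes the residual
    left over from the previous layers."""
    residual, codes = list(x), []
    for cb in codebooks:
        idx = min(range(len(cb)), key=lambda i: dist(residual, cb[i]))
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, cb[idx])]  # pass error down
    return codes

def rvq_decode(codes):
    """Reconstruction is just the sum of the chosen entries."""
    out = [0.0] * DIM
    for layer, idx in enumerate(codes):
        out = [o + c for o, c in zip(out, codebooks[layer][idx])]
    return out

x = [random.gauss(0, 1) for _ in range(DIM)]
codes = rvq_encode(x)
err_layer1 = dist(x, codebooks[0][codes[0]])
err_final = dist(x, rvq_decode(codes))
print(err_final <= err_layer1)  # deeper layers never make things worse
```

Each deeper layer can only shrink (or leave unchanged) the reconstruction error, which is why a handful of codebooks can approximate high-resolution audio with a short token sequence.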
S2-Pro offers emotion control that its developers describe as ‘absurdly controllable emotion.’ This is implemented through two key mechanisms: In-Context Learning (ICL) and natural language inline control. In-Context Learning overcomes a limitation of existing TTS models, which required separate fine-tuning to mimic a specific voice. S2-Pro instead leverages the transformer’s in-context learning capability: it extracts the speaker’s identity and emotional state from a 10-30 second reference audio clip, places it as a prefix in the context window, and continues the sequence in the same voice and style.
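In sequence terms, in-context cloning amounts to prefix continuation. The sketch below shows one plausible way the layout could look; the separator value and the exact ordering are hypothetical illustrations, not the actual S2-Pro token format.

```python
def build_icl_sequence(ref_text_tokens, ref_audio_tokens, target_text_tokens,
                       sep=-1):
    """Place the reference clip (its transcript tokens plus its audio
    tokens) before the target text, so an autoregressive model that
    continues this sequence keeps the reference voice and style."""
    return (ref_text_tokens + [sep]
            + ref_audio_tokens + [sep]
            + target_text_tokens)

seq = build_icl_sequence([7, 8], [101, 102, 103], [9, 10, 11])
print(seq)  # [7, 8, -1, 101, 102, 103, -1, 9, 10, 11]
```

No weights change anywhere in this process; the “cloning” is entirely a property of what sits in the context window before generation starts.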
Furthermore, the model supports natural language inline control, allowing emotions to be adjusted dynamically within a single generation. Because the training data includes descriptive linguistic markers, developers can insert natural language tags directly into the text prompt to adjust voice tone, intensity, and rhythm in real time. For example, a prompt like ‘[whisper] I have a secret [laugh] that I cannot tell you’ causes the model to naturally render a whispered voice punctuated by laughter.
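On the client side, a caller often wants to separate the spoken text from the control markers, e.g. for display or validation. Here is a minimal sketch of such a parser; the bracket syntax follows the article’s example, but the parsing code is ours, not Fish Audio’s tokenizer.

```python
import re

TAG = re.compile(r"\[(\w+)\]")

def parse_inline(prompt):
    """Split a prompt into (plain_text, [(char_position, tag), ...]),
    where char_position indexes into the tag-free text."""
    pieces, tags, pos = [], [], 0
    for m in TAG.finditer(prompt):
        pieces.append(prompt[pos:m.start()])
        tags.append((len("".join(pieces)), m.group(1)))
        pos = m.end()
    pieces.append(prompt[pos:])
    return "".join(pieces), tags

text, tags = parse_inline(
    "[whisper] I have a secret [laugh] that I cannot tell you")
print([name for _, name in tags])  # ['whisper', 'laugh']
```

Keeping the tag positions (rather than just stripping them) makes it possible to map expressive markers back onto words, for instance when rendering captions.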
For integrating TTS into real-time applications, the most important constraint is ‘Time to First Audio’ (TTFA). S2-Pro achieves a TTFA of approximately 100ms on NVIDIA H200 hardware, keeping latency under 150ms and creating an environment well suited to real-time interaction. This performance comes from SGLang and RadixAttention: SGLang is a high-performance serving framework, and RadixAttention provides efficient Key-Value (KV) cache management. When the same ‘master’ voice prompt is used repeatedly, RadixAttention caches the KV state of the prefix, so the reference audio does not need to be recomputed for each request.
The architecture is also designed to hold multiple speaker identities within the same context window, so complex dialogues or multi-character narration can be generated in a single inference call, avoiding the latency of switching models or reloading weights for each speaker.
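A single-call dialogue amounts to packing every speaker’s reference plus the turn script into one context. The layout below, including the speaker-prefix and separator conventions, is a hypothetical illustration of the idea rather than S2-Pro’s actual input format.

```python
def build_dialogue_context(speakers, turns, sep="<sep>"):
    """speakers: {name: reference_prompt}; turns: [(name, text), ...].
    References come first so every later turn can attend to its voice."""
    parts = [f"{name}:{ref}" for name, ref in speakers.items()]
    parts += [f"{name}:{text}" for name, text in turns]
    return sep.join(parts)

ctx = build_dialogue_context(
    {"alice": "<ref_a>", "bob": "<ref_b>"},
    [("alice", "Hi Bob."), ("bob", "Hello Alice.")],
)
print(ctx.count("<sep>"))  # 3 separators join 4 segments
```

Because both references sit in one window, alternating speakers is just a matter of which prefix a turn points back to; no model swap or weight reload is involved.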
Fish Audio’s S2-Pro opens a new horizon for TTS technology. Its dual-AR architecture, RVQ coding, and inline control enable high-quality, expressive synthesis, while its low-latency performance makes it especially well suited to real-time applications. These innovations point the way for TTS and should enable more innovative services and experiences across a growing range of fields.
Original Source: Fish Audio Releases Fish Audio S2: A New Generation of Expressive Text-to-Speech (TTS) with Absurdly Controllable Emotion