# Fish Audio S2 Technical Report

Fish Audio Team

Please send correspondence to opensource@fish.audio.

###### Abstract

We introduce Fish Audio S2, an open-source text-to-speech system featuring multi-speaker, multi-turn generation and, most importantly, instruction-following control via natural-language descriptions. To scale training, we develop a multi-stage training recipe together with a staged data pipeline covering video captioning and speech captioning, voice-quality assessment, and reward modeling. To push the frontier of open-source TTS, we release our model weights, fine-tuning code, and an SGLang-based inference engine. The inference engine is production-ready for streaming, achieving an RTF of 0.195 and a time-to-first-audio below 100 ms. Our code and weights are available on [GitHub](https://github.com/fishaudio/fish-speech) and [Hugging Face](https://huggingface.co/fishaudio/s2-pro). We highly encourage readers to visit [https://fish.audio](https://fish.audio/) to try custom voices.

## 1 Introduction

Figure 1: Fish Audio S2 is a multilingual, controllable, and expressive TTS system supporting long-form, multi-speaker, multi-turn generation with ultra-low TTFA and RTF.

High-quality, controllable text-to-speech (TTS) has become increasingly important in modern AI systems, enabling scalable audio content creation and natural conversational experiences across applications such as audiobook narration, video dubbing, and personalized chatbots. Recent progress in TTS has been driven by large-scale models (Zhang et al., 2025; Du et al., 2025; Li et al., 2026; Hu et al., 2026).
Many of these works follow a two-stage paradigm: conditioned on text, the model first produces high-level discrete speech tokens, which are then decoded into the full waveform by a separate acoustic decoder (Wang et al., 2023; Défossez et al., 2022; Kong et al., 2020; Anastassiou et al., 2024). Alongside these architectural innovations, the success of large-scale TTS relies heavily on robust data curation. Recent efforts have introduced sophisticated pipelines for cleaning speech corpora and annotating paralinguistic features (Cheng et al., 2025; Yang et al., 2025). However, generating fine-grained natural-language instructions for vocal features at scale remains a major bottleneck. From a training perspective, although reinforcement learning (RL) methods such as Direct Preference Optimization (DPO) (Rafailov et al., 2023), Proximal Policy Optimization (PPO) (Schulman et al., 2017), and Group Relative Policy Optimization (GRPO) (Shao et al., 2024) have become standard for improving model behavior in the large language model (LLM) domain (Guo et al., 2025; Agarwal et al., 2025), their adoption in TTS remains limited.

In this report, we present Fish Audio S2, which retains the decoder-only Transformer backbone and RVQ-based audio codec of Fish Audio S1 (Liao et al., 2024b). We extend it with a unified data curation and RL alignment framework to improve controllability, naturalness, and robustness in speech generation. Specifically, we introduce two key technical innovations:

• Multi-Purpose Data Pipeline. We build a data pipeline with a speech quality assessment model and a rich-transcription ASR model to filter and annotate large-scale audio data for TTS pre-training. The same models are then directly reused as reward signals for RL alignment, eliminating distribution mismatch between the two stages.

• Multi-Reward RL Alignment.
We implement a variant of GRPO that jointly optimizes semantic accuracy, acoustic quality, and speaker similarity, ensuring a balance between expressiveness and robustness.

These innovations directly enable three major functional breakthroughs:

• Enhanced Instruction Following. Fish Audio S2 exhibits superior adherence to natural language instructions. It enables broad and fine-grained control over speech generation through free-form textual descriptions.

• Native Multi-Speaker and Multi-Turn Generation. The model can natively generate complex, interleaved dialogues involving multiple distinct speakers in a single pass, capturing the dynamics of natural conversation.

• Stable Long-Form Synthesis. The system supports the generation of coherent and continuous audio, maintaining stability and consistency over extended durations.

To evaluate our model, we conduct extensive experiments along two complementary tracks: (i) objective evaluation and (ii) LLM-as-a-Judge assessments of higher-level capabilities. For intelligibility, content accuracy, long-form, and multilingual performance, we report Word Error Rate (WER), Character Error Rate (CER), and speaker similarity on widely used benchmarks, including Seed-TTS-Eval (Anastassiou et al., 2024), the MiniMax Multilingual Testset (Zhang et al., 2025), CosyVoice3-Eval (CV3-Eval) (Du et al., 2025), and Long-TTS-Eval (Wang et al., 2025a). Across these public benchmarks, Fish Audio S2 shows consistently strong objective performance, achieving leading results on the Seed-TTS benchmark while maintaining robust multilingual intelligibility and speaker similarity on both the MiniMax Multilingual Testset and CV3-Eval.

To assess higher-level capabilities such as instruction following and human-likeness, we further employ the Audio Turing Test (Wang et al., 2025b) and Emergent TTS Eval (Manku et al., 2025). On the Audio Turing Test, Fish Audio S2 achieves a posterior mean of 0.483, which further improves to 0.515 with instruction rewriting.
On Emergent TTS Eval, it reaches an overall win rate of 81.88% against the baseline, further supporting its strong instruction-following capability. Furthermore, to address the lack of dedicated benchmarks for fine-grained control, we introduce a novel evaluation benchmark, the Fish Audio Instruction Benchmark, which systematically evaluates models’ inline tag-following performance across English and Chinese. On this benchmark, Fish Audio S2 achieves an overall tag-activation rate of 93.3% and an overall quality score of 4.51/5.0 across English and Chinese, as evaluated by Gemini 3 Pro.

To accelerate research and lower the barrier to high-quality TTS development, we publicly release our model weights, fine-tuning code, and the SGLang-based inference engine on [GitHub](https://github.com/fishaudio/fish-speech) and [Hugging Face](https://huggingface.co/fishaudio/s2-pro). We also highly encourage readers to explore interactive demos at our official site [https://fish.audio/](https://fish.audio/).

The remainder of this paper is organized as follows: Section 2 details the model architecture; Section 3 describes the data curation pipeline; Section 4 presents the pre-training and RL-based post-training; Section 5 introduces our inference engine and its performance; Section 6 presents the experimental setup and comprehensive evaluation results; and finally, Section 7 concludes with a discussion of limitations and future directions.

## 2 Architecture

Figure 2: Fish Audio S2 architecture.

### 2.1 Audio Tokenizer

Our audio tokenizer is built upon the architecture of the Descript Audio Codec (DAC) (Kumar et al., 2023), optimized for high-fidelity, real-time streaming at a 44.1 kHz sampling rate.
The model employs a hierarchical Residual Vector Quantization (RVQ) strategy utilizing $N$ codebooks ($N=10$ in our model): the primary codebook serves as the semantic codebook, while the remaining nine capture progressively finer-grained acoustic details.

Streaming Architecture. To adapt the vanilla DAC for low-latency TTS tasks, we introduce several key modifications to the encoder and decoder structures:

• Causal Convolutions. We refactor the model to be strictly causal by replacing standard convolutions with masked causal convolutions. This ensures the generation process depends solely on past context, enabling low-latency streaming capabilities.

• Transformer Bottleneck. Following the design of Mimi (Défossez et al., 2024), we integrate causal sliding-window Transformer blocks both before and after the RVQ layers. By restricting attention to a fixed-size window, this mechanism models long-range dependencies with bounded memory usage, preventing out-of-memory issues during long-form inference.

• Extended Downsampling. The encoder extends the standard DAC encoder ($512\times$) with additional ConvNeXt V2 (Woo et al., 2023) layers ($4\times$), achieving a total downsampling ratio of 2048 and a compact frame rate of approximately 21 Hz.

• EVA-GAN Decoder. Instead of the original DAC decoder, we employ the structure of EVA-GAN (Liao et al., 2024a) as our generator, which significantly improves parameter efficiency and synthesis quality, providing a more robust reconstruction of fine-grained acoustic details.

Semantic Distillation. To ensure that the first codebook captures rich linguistic and phonetic information, we adopt semantic distillation following Défossez et al. (2024). During training, an auxiliary semantic prediction head is jointly optimized to regress the 16th-layer activations of a pre-trained w2v-BERT 2.0 model (Barrault et al., 2023).
By feeding the quantized features from the first codebook into this head, we encourage the bottleneck to retain rich semantic representations, thereby enabling more stable alignment in downstream TTS.

### 2.2 Dual-Autoregressive Generation

When modeling high-fidelity acoustic features extracted by the audio tokenizer, directly flattening the 10-layer RVQ codebooks along the time axis leads to a tenfold increase in sequence length, severely limiting the LLM’s ability to handle long contexts. To address this dimensionality challenge, we apply a Dual-Autoregressive (Dual-AR) architecture (Liao et al., 2024b) that decouples temporal semantic modeling from depth-wise acoustic modeling, as illustrated in Figure 2. This architecture comprises a core Temporal Semantic Backbone (Slow AR) coupled with a lightweight Depth-wise Acoustic Decoder (Fast AR).

Slow AR. We adopt a pretrained Qwen3-4B as the Slow AR backbone. The Slow AR operates autoregressively over the full token sequence, which interleaves text tokens (e.g., system prompts, target text) with discrete audio tokens. During audio generation, it predicts the semantic token $q^{(0)}_{t}$ from the first RVQ codebook at each time step $t$. Since this codebook undergoes semantic distillation during tokenizer training, the Slow AR can effectively plan linguistic content and coarse prosodic structure, analogous to standard text generation.

Fast AR. Given the semantic tokens generated by the Slow AR, we introduce a lightweight Fast AR network—consisting of 4 Transformer layers with independent weights and embedding tables—to reconstruct the remaining fine-grained acoustic details. At each time step $t$, the Slow AR first samples the semantic token $q^{(0)}_{t}$ and emits a hidden state $\mathbf{h}_{t}^{\text{slow}}$. The Fast AR then generates the remaining $N-1$ acoustic tokens $q^{(1)}_{t},\dots,q^{(N-1)}_{t}$ through a depth-wise autoregressive process.
The hidden state $\mathbf{h}_{t}^{\text{slow}}$ is first linearly projected to the Fast AR’s dimension and placed at position 0 as a conditioning prefix, providing global context from the Slow AR. The semantic token $q^{(0)}_{t}$, already determined by the Slow AR, is then embedded and placed at position 1 as the seed input. The Fast AR then autoregressively generates $q^{(1)}_{t}$ through $q^{(N-1)}_{t}$, where each step conditions on the conditioning prefix $\mathbf{h}_{t}^{\text{slow}}$ and all previously generated tokens. All $N$ codebook layers share a single embedding table within the Fast AR; the codebook layer identity is encoded through RoPE positional embeddings. This highly asymmetric design—a 4B-parameter model along the time axis and a 4-layer network along the codebook depth axis—ensures high inference efficiency.

Multi-Codebook Fusion (MCF). After all $N$ codebook tokens for time step $t$ have been generated, they are aggregated into a single continuous vector $\mathbf{x}_{t+1}$ that serves as the Slow AR’s input embedding for the next time step $t+1$. Each token $q^{(k)}_{t}$ ($k\in\{0,1,\dots,N-1\}$) is embedded via a dedicated embedding layer $\mathbf{E}^{(k)}$ that maps codebook indices into the Slow AR’s embedding space. These $N$ codebook embeddings, together with the Slow AR’s own token embedding $\mathbf{e}^{\text{LM}}_{t}$ for the semantic token $q^{(0)}_{t}$, are summed:

$$\mathbf{x}_{t+1}=\mathbf{e}^{\text{LM}}_{t}+\sum_{k=0}^{N-1}\mathbf{E}^{(k)}\bigl[q^{(k)}_{t}\bigr], \qquad (1)$$

where $N=10$ is the total number of codebooks. Note that the semantic token $q^{(0)}_{t}$ contributes two distinct representations: $\mathbf{e}^{\text{LM}}_{t}$ from the Slow AR’s token embedding layer, and $\mathbf{E}^{(0)}[q^{(0)}_{t}]$ from the codebook embedding layer.
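As a concrete illustration of the fusion in Eq. (1), the NumPy sketch below sums the per-codebook embeddings with the Slow AR’s own token embedding. The codebook size, embedding dimension, and random tables are illustrative placeholders, not the model’s actual learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

N = 10   # number of RVQ codebooks (as in the paper)
V = 256  # codebook size (illustrative placeholder)
D = 64   # Slow AR embedding dimension (illustrative placeholder)

# One dedicated table E^(k) per codebook, plus the Slow AR's own
# token-embedding table (e^LM) used for the semantic codebook.
E = [rng.normal(size=(V, D)) for _ in range(N)]
E_lm = rng.normal(size=(V, D))

def fuse(q):
    """Multi-Codebook Fusion (Eq. 1): x_{t+1} = e^LM_t + sum_k E^(k)[q^(k)_t].

    Note the semantic token q[0] contributes twice: once through E_lm
    and once through its dedicated codebook table E[0].
    """
    assert len(q) == N
    x = E_lm[q[0]].copy()   # e^LM_t for the semantic token
    for k in range(N):
        x += E[k][q[k]]     # E^(k)[q^(k)_t]
    return x                # Slow AR input embedding for step t+1

q_t = rng.integers(0, V, size=N)  # the N tokens generated at step t
x_next = fuse(q_t)
```

In the real model these tables are learned jointly with the backbone; the sketch only shows how the $N$ codebook lookups and the extra semantic-token lookup combine into a single input vector.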
These two embedding tables are independently parameterized and capture complementary aspects of the same token.

## 3 Data Pipeline

Figure 3: Fish Audio S2 data pipeline.

Scaling TTS systems requires massive, high-quality datasets. Beyond basic noise reduction, the primary bottleneck lies in mapping subtle acoustic attributes (e.g., emotion and prosody) and speaker turns to natural language instructions—a process that is infeasible to scale manually. Moreover, RL alignment for TTS typically relies on reward models trained independently from the pre-training pipeline, which can introduce distribution shift between pre-training data and post-training objectives.

To address both challenges, we design a three-stage data curation pipeline built around two core evaluation engines: a speech quality model and a rich-transcription ASR model. During pre-training, these engines act as strict filters and annotators; during RL-based post-training, they are directly reused as reward models. This dual-purpose design eliminates distribution shift between pre-training and post-training by construction, while enabling fine-grained vocal annotation in natural language to scale automatically without human intervention.

To process raw audio into speech-text pairs with fine-grained vocal annotations, our pipeline executes three stages (Figure 3):

• Stage 1: Source Separation and Segmentation. We apply a vocal separation module to isolate clean speech from background noise, followed by Voice Activity Detection (VAD) to slice continuous audio into utterance-level segments.

• Stage 2: Quality Filtering. Our core speech quality model evaluates each utterance across multiple dimensions—including signal-to-noise ratio, speaker consistency, recording quality, and intelligibility—to filter out low-fidelity samples.

• Stage 3: Rich Transcription. An in-house ASR model generates highly accurate transcripts.
This model simultaneously transcribes long-form spoken text and captions vocal features (e.g., emotion, prosody, and paralinguistic events) and speaker turns, creating descriptive natural language captions that directly enable the model’s zero-shot instruction-following capabilities.

### 3.1 Speech Quality Model

Following the architectural design of Uni-VERSA (Shi et al., 2025), our speech quality model utilizes a pre-trained w2v-BERT 2.0 backbone coupled with a multi-layer perceptron head for acoustic evaluation. We train this network on a proprietary dataset of thousands of hours of Stage 1 audio with speech quality labels provided by human annotators, using a combined objective of MSE and focal loss (Lin et al., 2017). In Stage 2, this model acts as a strict filter, removing low-quality samples that slip through Stage 1, such as overlapping voices and residual background music, significantly reducing artifacts such as timbre inconsistency in the pre-training data. Consistent with our dual-purpose design, this same model is reused during the RL phase as an objective acoustic reward, penalizing noise and artifacts in the generated speech.

### 3.2 Rich-Transcription ASR Model

We develop a rich-transcription ASR model by fine-tuning the Qwen3-Omni-30B-A3B foundation model to jointly transcribe spoken content and annotate speaker turns and vocal events. The training data were curated using a video-based pseudo-labeling approach, followed by human verification to ensure annotation accuracy. In Stage 3, this model jointly transcribes spoken content and annotates vocal features as natural language instructions. Specifically, it predicts speaker turns (e.g., <|speaker:0|>) and injects vocal instructions such as [prolonged laugh], [inhale], [angry], [emphasis], and [in a hurry] directly into the text stream alongside natural disfluencies. An example of the output format is shown in Figure 4.
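To make this output style concrete, the following sketch parses a transcript of this kind into speaker turns with their inline vocal tags. The sample string and the regular expressions are illustrative assumptions for exposition, not the production format or parser:

```python
import re

# Illustrative transcript in the style described above: speaker-turn
# markers plus inline vocal instructions in square brackets.
transcript = (
    "<|speaker:0|> [angry] Where were you? [emphasis] I waited for hours. "
    "<|speaker:1|> [inhale] I'm so sorry... [prolonged laugh] traffic was insane."
)

def parse_turns(text):
    """Split on <|speaker:N|> markers, then extract [tag] annotations."""
    turns = []
    # Capture the speaker id, then everything up to the next marker.
    for m in re.finditer(r"<\|speaker:(\d+)\|>((?:(?!<\|speaker:).)*)", text):
        body = m.group(2).strip()
        tags = re.findall(r"\[([^\]]+)\]", body)
        words = re.sub(r"\s+", " ", re.sub(r"\[[^\]]+\]", "", body)).strip()
        turns.append({"speaker": int(m.group(1)), "tags": tags, "text": words})
    return turns

turns = parse_turns(transcript)
```

Keeping the tags inline with the words, rather than in a separate annotation channel, is what lets the transcripts double as localized natural-language instructions during training.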
These transcripts serve as the fine-grained natural language instructions for training the zero-shot controllable generation capabilities of Fish Audio S2. Consistent with our dual-purpose design, this model is reused during RL-based post-training as an intelligibility and instruction-following reward. By re-transcribing the generated audio and comparing it against the original prompt, it provides reward signals that penalize hallucinations, missing words, and ignored vocal instructions.

Figure 4: Fish Audio S2 supports multi-speaker generation with fine-grained natural language control over prosody, emotion, and speaking style.

## 4 Training

The training pipeline of Fish Audio S2 proceeds in four stages. We first train the audio tokenizer to obtain discrete audio representations, then progressively align the LLM with these discrete representations through large-scale pre-training and supervised fine-tuning (SFT) on curated data, and finally refine generation quality via RL-based post-training.

### 4.1 Audio Tokenizer Training

The complete audio tokenizer, totaling 446M parameters, is trained for 1M steps. We employ a composite GAN loss framework to ensure perceptual fidelity, utilizing three distinct discriminators: a multi-period discriminator to capture periodic signals, a multi-resolution discriminator for spectral consistency, and a multi-scale STFT discriminator to ensure high-frequency detail and phase coherence.

### 4.2 Pre-training and SFT

The pre-training phase aligns the audio tokens with the Qwen3-4B foundation model through two progressive stages. The first stage establishes cross-modal alignment with a maximum context length of 8,192 tokens; the second stage extends the context to 16,384 tokens, enabling long-form audio synthesis and multi-turn, multi-speaker conversational generation. In total, the pre-training phase utilizes over 10 million hours of raw audio across approximately 80 languages and dialects.
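For intuition, these context lengths can be translated into an approximate audio capacity using the roughly 21 Hz frame rate from Section 2.1. The sketch below is a back-of-the-envelope upper bound that assumes one Slow AR token per audio frame and ignores interleaved text tokens:

```python
# Approximate audio capacity of the Slow AR context windows, assuming
# one token per tokenizer frame (44.1 kHz / 2048 ≈ 21.5 frames/s) and
# no interleaved text tokens, so this is an upper bound.
SAMPLE_RATE = 44_100
DOWNSAMPLE = 2048
frame_rate = SAMPLE_RATE / DOWNSAMPLE  # ≈ 21.5 Hz

def max_audio_seconds(context_tokens):
    return context_tokens / frame_rate

stage1 = max_audio_seconds(8_192)   # stage-1 context, ≈ 380 s (~6.3 min)
stage2 = max_audio_seconds(16_384)  # stage-2 context, ≈ 761 s (~12.7 min)
```

At roughly 21.5 frames per second, the 16,384-token window corresponds to under 13 minutes of pure audio, which bounds how long a single-pass long-form or multi-turn generation can run before context management is needed.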
After pre-training, we perform SFT on curated internal high-quality labelled data to improve expressiveness and controllability. We expand the Qwen3-4B vocabulary with structural control tokens and 4,096 semantic tokens. To ensure a smooth feature-space transition, the new token embeddings are initialized by sampling from a multivariate normal distribution $\mathcal{N}(\mu,\Sigma)$, where $\mu$ and $\Sigma$ are the empirical mean and covariance of the existing text embedding matrix.

Unlike Fish Audio S1, which appends reference audio to the user input, S2 prepends the reference audio to the system prompt. The loss on the reference audio tokens is masked during training to prevent verbatim memorization.

For fine-grained acoustic control, rather than relying on lengthy global prompts, we inject descriptive instructions at specific word or phrase positions within the dialogue context, enabling precise localized control over acoustic details. These instructions take the form of natural language—such as whisper, angry, and laugh—embedded directly in the token sequence. Through autoregressive training on large-scale data, the model naturally internalizes the mapping between these textual cues and localized acoustic variations without requiring dedicated control tokens.

The training objective follows standard autoregressive language modeling, adapted for the Dual-AR architecture with a separate loss for each component. For the Slow AR, the training objective is defined as:

$$\mathcal{L}_{\text{slow}}=-\sum_{t=0}^{T-1}m_{t}\,\lambda_{t}\,\log P\bigl(x_{t}\mid x_{<t}\bigr)$$