# Upsampling Strategies for X-Codec 2.0: From 16kHz to High-Fidelity Audio

## Executive Summary

X-Codec 2.0's 16kHz output (8kHz Nyquist) produces a noticeable "muffled" quality because it lacks the 8–24kHz frequency content that gives speech its brightness, air, and presence. This report systematically evaluates the viable approaches to raising the output sample rate, from minimal decoder-only modifications to full bandwidth extension post-processors, prioritizing methods that preserve low-latency inference for voice agent and TTS deployment. A recent paper has already demonstrated that X-Codec 2.0 can be upgraded to 24kHz output by retraining only the decoder with frozen encoders, validating the core strategy of decoder-side modification. Beyond that baseline, this report covers GAN-based bandwidth extension, flow matching post-filters, ISTFT head redesign, multi-band synthesis, and controllable multi-rate output architectures.[1][2]

## Strategy 1: Decoder-Only Retraining with Modified ISTFT Parameters

### The Proven Baseline: X-Codec 2.0 → 24kHz

A January 2026 paper titled "Improving X-Codec-2.0 for Multi-Lingual Speech" directly addresses this exact problem. The approach is strikingly simple:[2]

- **Freeze all encoder components** (CodecEncoder, Wav2Vec2-BERT, SemanticEncoder) and the quantizer
- **Increase the decoder hop size** from 320 to 960 samples
- **Add an AvgPool1d(kernel=2, stride=2)** before vector quantization to reduce the latent rate from 50Hz to 25Hz
- **Retrain only the decoder** (VocosBackbone + ISTFTHead) to output 24kHz audio
- **Interpolate decoder weights** from the pretrained checkpoint using 1D linear interpolation on the output projection

This achieved a **+0.29 MOS improvement** on UTMOSv2 over the original 16kHz X-Codec 2.0, evaluated across 116 languages on Common Voice 17. Training was done on 2× RTX 3090 Ti GPUs for 3M steps with batch size 20/device, reusing the original loss formulation (\(\lambda_{\text{mel}}=15, \lambda_{\text{adv}}=1, \lambda_{\text{sem}}=5\)).[1][2]

### Extending to 48kHz: Direct ISTFT Head Modification

Since your decoder already uses a Vocos-style ISTFT head, extending to 48kHz follows the same architectural principle with parameter scaling:

| Parameter | 16kHz (Current) | 24kHz (Proven) | 48kHz (Proposed) |
|-----------|-----------------|----------------|------------------|
| `sample_rate` | 16,000 | 24,000 | 48,000 |
| `hop_length` | 320 | 960 | 960 |
| `n_fft` | 1280 | ~1920 | 3840 |
| `ISTFT output dim` | 1282 | ~1922 | 3842 |
| Freq bins | 641 | ~961 | 1921 |
| Freq resolution | 12.5 Hz | 12.5 Hz | 12.5 Hz |
| Latent rate | 50 Hz | 25 Hz | 50 Hz |
| Nyquist | 8 kHz | 12 kHz | 24 kHz |

The key insight from the Vocos architecture is that the ISTFT-based decoder operates entirely at the compressed frame rate—there are no upsampling convolutions—so increasing the sample rate primarily means the ISTFTHead must output more frequency bins. The VocosBackbone transformer still operates at the same temporal resolution; only its final linear projection and the ISTFT parameters change.[3]

**Practical implementation:** Modify the ISTFTHead's final `Linear(1024 → n_fft+2)` to `Linear(1024 → 3842)`, update `n_fft=3840, hop_length=960, win_length=3840`, and retrain. You may need to increase the VocosBackbone's hidden dimension (from 1024 to 1536 or 2048) or add transformer layers to provide enough capacity for the wider frequency range, since each frame now must predict 1921 magnitude + 1921 phase bins instead of 641+641.
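As a sketch of that change, a minimal Vocos-style ISTFT head at the proposed 48kHz parameters might look like the following. This is illustrative PyTorch, not the actual X-Codec 2.0 module; the class name and the log-magnitude/phase split are assumptions based on the table above:

```python
import torch
import torch.nn as nn

class ISTFTHead48k(nn.Module):
    """Minimal sketch of a Vocos-style ISTFT head scaled for 48 kHz output.
    Dimensions follow the parameter table; illustrative only."""
    def __init__(self, hidden_dim=1024, n_fft=3840, hop_length=960):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop_length
        # n_fft + 2 outputs per frame: (n_fft//2 + 1) log-magnitudes + phases
        self.proj = nn.Linear(hidden_dim, n_fft + 2)
        self.register_buffer("window", torch.hann_window(n_fft))

    def forward(self, x):                            # x: (B, frames, hidden_dim)
        mag, phase = self.proj(x).chunk(2, dim=-1)   # each (B, frames, 1921)
        spec = torch.exp(mag.clamp(max=10.0)) * torch.exp(1j * phase)
        return torch.istft(spec.transpose(1, 2), self.n_fft,
                           hop_length=self.hop, win_length=self.n_fft,
                           window=self.window)       # (B, samples)
```

With centered framing, `torch.istft` returns `(frames - 1) * hop_length` samples, so 10 latent frames at hop 960 yield 8,640 samples of 48kHz audio.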

**Important caveat:** Experiments by other researchers modifying Vocos `n_fft` and `hop_length` parameters have sometimes yielded poor results, particularly with aggressive compression ratios. The Vocos architecture was specifically designed with the relationship between its hidden dimension and STFT parameters carefully tuned. You should expect to need careful hyperparameter exploration and likely a wider backbone to make 48kHz work.[4]

## Strategy 2: GAN-Based Bandwidth Extension Post-Filter

### AP-BWE: The Speed Champion

AP-BWE (Amplitude and Phase Bandwidth Extension) is a pure CNN-based GAN that directly extends narrowband spectra to wideband. It is the most promising non-diffusion post-processor for your use case:[5]

- **Architecture:** Dual-stream ConvNeXt backbone with parallel amplitude and phase prediction streams plus inter-stream connections
- **Speed:** 292.3× faster than real-time on a single RTX 4090 GPU, and 18.1× faster on a single CPU core for 16kHz→48kHz[5]
- **Quality:** State-of-the-art on both 16kHz and 48kHz target rates
- **Waveform synthesis:** Direct iSTFT from predicted amplitude + phase (no external vocoder needed)
- **Parameter count:** Lightweight all-CNN architecture with no attention layers

The dual-stream architecture predicts the **residual** high-frequency log-amplitude spectrum (added to the original), while the phase stream predicts "pseudo-real" and "pseudo-imaginary" components to handle phase wrapping. Discriminators operate at both waveform level (MPD) and spectral level (multi-resolution amplitude and phase discriminators).[6]

**Integration approach:** Run X-Codec 2.0 decoder at 16kHz → upsample to 48kHz with sinc interpolation → feed through AP-BWE to fill in 8–24kHz content. Since AP-BWE operates at frame level on STFT representations, its latency is minimal (one STFT frame).
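The sinc-interpolation step can be sketched in NumPy as a windowed-sinc upsampler. `sinc_upsample` is a hypothetical helper, a stand-in for a production resampler such as `torchaudio.functional.resample`:

```python
import numpy as np

def sinc_upsample(x, factor=3, taps=48):
    """Zero-stuff then lowpass with a Hann-windowed sinc filter.
    The cutoff sits at the old Nyquist, so the 8-24 kHz band stays
    empty for the BWE model to fill. Illustrative, not production code."""
    up = np.zeros(len(x) * factor)
    up[::factor] = x                              # insert zeros between samples
    n = np.arange(-taps * factor, taps * factor + 1)
    h = np.sinc(n / factor) * np.hanning(len(n))  # DC gain ~= factor, so no rescale
    return np.convolve(up, h, mode="same")
```

Because the sinc taps at multiples of `factor` are zero except at the center, original samples pass through unchanged and only the in-between samples are interpolated.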

### MS-BWE: Multi-Stage Cascading

MS-BWE extends AP-BWE with cascading BWE blocks for flexible multi-rate extension:[7]

- Each stage extends one frequency band (e.g., 16kHz→24kHz→48kHz)
- One-stage GPU speed: ~1271× real-time (each stage runs roughly 4× faster than the full AP-BWE model)
- Enables **controllable output rate**: stop at any intermediate stage for 24kHz, or run all stages for 48kHz

This directly supports your goal of conditional 16/24/48kHz output with gating.

### HiFi-GAN+ (Bandwidth Extension Is All You Need)

The Princeton approach uses a feed-forward WaveNet with multi-domain adversarial training for 8–16kHz→48kHz extension. HiFi-GAN-2 demonstrated a complete pipeline where a 16kHz speech enhancement model is followed by a dedicated BWE stage to reach 48kHz studio quality, achieving MOS of 4.27±0.03. An open-source reproduction is available at `bshall/hifi-gan-bwe`.[8][9][10]

### UBGAN: Ultra-Lightweight

UBGAN operates in the PQMF sub-band domain and is specifically designed as a modular post-processor for existing codecs:[11][12]

- **Causal** with 20ms frame size and 5ms look-ahead
- Significantly fewer parameters and lower complexity than AP-BWE
- Both blind (no side information) and guided (with small side channel) variants
- Designed to generalize across multiple codec types and bitrates

This is particularly appealing for voice agent deployment where you want to add BWE without modifying the codec itself.

## Strategy 3: Flow Matching Post-Filter (Low-Overhead Generative)

### FLowHigh: Single-Step Flow Matching

FLowHigh achieves SOTA audio super-resolution quality with **just 1 ODE step** using conditional flow matching:[13][14]

- **Architecture:** 2-layer transformer (16 heads, 1024 dim, 4096 FFN) = 35.4M parameters
- **RTF:** 0.1769 (single step)—13.6–24.4× faster than diffusion models requiring 50–100 steps[14]
- **Quality:** Outperforms NVSR, AudioSR, Nu-wave2, UDM+, mdctGAN, and Fre-painter across all input rates on the VCTK benchmark
- **External vocoder:** Uses BigVGAN (pretrained at 48kHz, 256 mel bins) for final waveform synthesis

| Model | NFEs | RTF | LSD (16→48) | ViSQOL (16→48) |
|-------|------|-----|-------------|----------------|
| Nu-wave2 | 50 | 1.2337 | 0.86 | 3.00 |
| UDM+ | 50 | 2.2415 | 0.94 | 2.77 |
| **FLowHigh** | **1** | **0.1769** | **0.71** | **3.80** |

FLowHigh operates at mel-spectrogram level: it takes a 16kHz mel-spectrogram temporally interpolated to 48kHz length, applies a single flow matching step to predict the high-resolution mel, then synthesizes with BigVGAN. The post-processing step copies original low-frequency content from the input, so only high-frequency bands are generated.[14]
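The low-band copy step can be sketched as follows, under the simplifying assumption of a linear-frequency bin layout (FLowHigh actually operates on mel bins, so `copy_lowband` is illustrative only):

```python
import numpy as np

def copy_lowband(generated, source, sr_src=16000, sr_tgt=48000):
    """Overwrite the generated HR spectrogram's rows below the source
    Nyquist with the original narrowband content, so only the new
    high band comes from the generative model. Illustrative helper."""
    n_bins = generated.shape[0]
    cut = int(n_bins * sr_src / sr_tgt)   # rows below the old Nyquist
    out = generated.copy()
    out[:cut] = source[:cut]
    return out
```

This guarantees the audible low band is bit-identical to the codec output; only the 8–24kHz region carries any generation artifacts.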

**Trade-off:** This adds a transformer forward pass + BigVGAN synthesis. However, since it's **single-step** (not iterative), the overhead is bounded and predictable—roughly equivalent to one additional vocoder pass.

### FlowDec: Integrated Codec Post-Filter

FlowDec (ICLR 2025) combines non-adversarial codec training with a flow matching post-filter for 48kHz general audio:[15][16]

- Reduces required DNN evaluations from 60 (ScoreDec) to **6** without distillation
- Achieves FAD scores better than DAC at 48kHz
- Operates at 4–7.5 kbps
- Designed as an integral part of the codec, not a separate module

### CodecFlow: Latent-Space BWE

CodecFlow (March 2026) performs bandwidth extension directly in the codec's continuous latent space using voicing-aware conditional flow matching:[17]

- Exploits the observation that LR and HR speech remain closely aligned in continuous codec latent space
- Uses a structure-constrained RVQ for improved latent alignment
- End-to-end fine-tuned with the codec
- Achieves strong spectral fidelity on both 8→16kHz and 8→44.1kHz tasks

This is conceptually the most elegant approach for your architecture—performing BWE in X-Codec 2.0's latent space (post-VQ, pre-decoder) would avoid any waveform-domain processing overhead.

## Strategy 4: Modified Decoder Architecture

### Multi-Band iSTFT Synthesis

iSTFTNet2 introduced 2D CNNs for frequency upsampling within the ISTFT framework:[18]

- Operates in a few-frequency space, then upsamples in frequency dimension
- Faster and more lightweight than standard iSTFTNet
- Avoids the quality degradation seen when replacing too many upsampling layers with iSTFT

For your VocosBackbone, this could be implemented as: keep the transformer at 1024 dim operating at frame rate, then add a lightweight 2D CNN that upsamples the frequency dimension before the final iSTFT. This separates temporal modeling (transformer) from spectral filling (CNN).
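A minimal version of that frequency-upsampling idea is sketched below; the channel/frequency split and layer sizes are assumptions for illustration, not iSTFTNet2's published configuration:

```python
import torch
import torch.nn as nn

class FreqUpsampler(nn.Module):
    """Sketch: reshape transformer output into a few-frequency 2D map,
    then upsample along the frequency axis with transposed 2D convs
    (iSTFTNet2 idea; dimensions are illustrative)."""
    def __init__(self, hidden_dim=1024, init_freqs=64):
        super().__init__()
        self.init_freqs = init_freqs
        self.chans = hidden_dim // init_freqs    # 16 channels x 64 freq bins
        self.up = nn.Sequential(                 # 64 -> 128 -> 256 freq bins
            nn.ConvTranspose2d(self.chans, self.chans, (4, 1),
                               stride=(2, 1), padding=(1, 0)),
            nn.GELU(),
            nn.ConvTranspose2d(self.chans, 2, (4, 1),
                               stride=(2, 1), padding=(1, 0)),
        )

    def forward(self, x):                        # x: (B, T, hidden_dim)
        b, t, _ = x.shape
        x = x.view(b, t, self.chans, self.init_freqs)
        x = x.permute(0, 2, 3, 1)                # (B, C, F, T)
        return self.up(x)                        # (B, 2, 4*F, T): mag + phase
```

Note the time axis is untouched (kernel 1, stride 1); only the frequency axis grows, which keeps the transformer's temporal resolution intact.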

**Multi-band variant (MB-iSTFT):** Split the target waveform into sub-bands using PQMF, predict each sub-band's STFT independently with smaller n_fft, then recombine. MS-Vocos applies iSTFT with reduced frame length and shift (1/4 of standard), then combines reconstructed sub-band waveforms. This dramatically reduces the per-head output dimension while maintaining full-band output.[19][20][21]

### WaveNeXt: Replace iSTFT with Learned Linear

WaveNeXt replaces the iSTFT layer with a trainable linear projection (no bias), mapping directly from the ConvNeXt output to waveform samples. For 48kHz models, the architecture was adapted by changing the output channel to match the new shift length. Results showed WaveNeXt achieves higher synthesis quality than Vocos while preserving inference speed, though it underperformed HiFi-GAN for some full-band E2E TTS conditions.[22]
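The core of that head reduces to a single projection; a sketch with assumed dimensions (for 48kHz at hop 960, each frame maps to 960 samples):

```python
import torch
import torch.nn as nn

class WaveNeXtHead(nn.Module):
    """Sketch of WaveNeXt's idea: replace the iSTFT with a bias-free
    linear projection straight to `hop_length` waveform samples per
    frame. Dimensions are illustrative."""
    def __init__(self, hidden_dim=1024, hop_length=960):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, hop_length, bias=False)

    def forward(self, x):               # x: (B, frames, hidden_dim)
        return self.proj(x).flatten(1)  # (B, frames * hop_length)
```

Unlike the ISTFT head, there is no overlap-add between frames, which is why the projection must learn to produce continuous waveform segments directly.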

### ComplexDec: Complex Spectral Domain

ComplexDec (Meta, 2025) operates entirely in the complex spectral domain at 48kHz with no up/downsampling layers:[23][24]

- Uses complex STFT input/output (320 hop, 510 n_fft, Hann window)
- 150 Hz frame rate with 256 embedding dimension
- 16 codebooks at 24 kbps
- Avoids information loss from waveform-domain compression
- Demonstrates strong out-of-domain robustness

This represents a clean-slate approach for your decoder side, though it would require more substantial retraining.

## Strategy 5: Controllable Multi-Rate Output

### Gated Resolution with Conditional ISTFT Head

To achieve selectable 16/24/32/48kHz output, consider a multi-head architecture:

1. **Shared VocosBackbone** processes the VQ output at frame rate
2. **Multiple ISTFT heads** with different `n_fft`/`hop_length` configurations
3. **Gated selection** at inference time based on desired output resolution

Each head is a single `Linear(hidden_dim → n_fft+2)` + ISTFT, so the parameter overhead per resolution is minimal (~2–8M per head depending on n_fft). Train all heads simultaneously with the same backbone using multi-resolution mel loss.
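A sketch of the gated multi-head design, reusing the parameter table from Strategy 1 (class and method names are illustrative, and each head here is just the final linear projection plus iSTFT):

```python
import torch
import torch.nn as nn

class MultiRateHeads(nn.Module):
    """Sketch: one ISTFT head per output rate over a shared backbone.
    (n_fft, hop) per rate follow the Strategy 1 table; illustrative."""
    CONFIGS = {16000: (1280, 320), 24000: (1920, 960), 48000: (3840, 960)}

    def __init__(self, hidden_dim=1024):
        super().__init__()
        self.heads = nn.ModuleDict({
            str(sr): nn.Linear(hidden_dim, n_fft + 2)
            for sr, (n_fft, hop) in self.CONFIGS.items()
        })

    def forward(self, x, sample_rate=48000):     # x: (B, frames, hidden_dim)
        n_fft, hop = self.CONFIGS[sample_rate]   # gate = plain head selection
        mag, phase = self.heads[str(sample_rate)](x).chunk(2, dim=-1)
        spec = torch.exp(mag.clamp(max=10.0)) * torch.exp(1j * phase)
        window = torch.hann_window(n_fft)        # built per call for brevity
        return torch.istft(spec.transpose(1, 2), n_fft, hop_length=hop,
                           win_length=n_fft, window=window)
```

All heads consume the same backbone frames, so switching rates at inference costs only a dictionary lookup; note that 16kHz output keeps the 50Hz frame relationship (hop 320) while 24/48kHz use hop 960.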

### MS-BWE Cascading for Progressive Enhancement

AP-BWE's multi-stage variant provides a natural gating mechanism:[7]

```
16kHz (base) → [BWE Block 1] → 24kHz → [BWE Block 2] → 48kHz
```

Each stage independently extends one frequency band. At inference, you stop at the desired resolution. The per-stage overhead is approximately 1/4 of the full AP-BWE model.[7]
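The gating logic is a simple loop over stages; `progressive_bwe` below is a hypothetical wrapper, not MS-BWE's actual API:

```python
def progressive_bwe(wave, stages, target_sr):
    """Run cascaded BWE stages up to the requested output rate.
    `stages` is an ordered list of (output_sr, stage_fn) pairs —
    a hypothetical interface for illustration."""
    for out_sr, stage_fn in stages:
        if out_sr > target_sr:
            break                 # gate: skip stages past the target rate
        wave = stage_fn(wave)
    return wave
```

Requesting 24kHz runs only the first stage; requesting 16kHz returns the codec output untouched, so the base path pays zero BWE cost.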

### FlexiCodec-Style Dynamic Rate

FlexiCodec demonstrates controllable frame rates between 3Hz and 12.5Hz using a merging threshold parameter. A similar gating concept could be applied to the output resolution, where a conditioning signal controls how many frequency bands the decoder reconstructs.[25]

## Recommended Implementation Roadmap

### Phase 1: Quick Win (1–2 weeks)

**Replicate the proven 24kHz modification**:[2]
- Freeze encoders + quantizer
- Add AvgPool1d(k=2, stride=2) before quantization
- Increase hop to 960, update ISTFT params for 24kHz
- Interpolate decoder weights from pretrained checkpoint
- Fine-tune decoder on your HiFi-TTS2 pre-encoded data
- Expected result: +0.29 MOS, 12kHz Nyquist, and removal of most of the "muffled" quality
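Two of these steps are mechanical enough to sketch directly. The paper states the decoder weights are interpolated with 1D linear interpolation on the output projection; the exact axis and mode below are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# (1) Halve the latent rate before quantization: 50 Hz -> 25 Hz.
pre_vq_pool = nn.AvgPool1d(kernel_size=2, stride=2)   # input: (B, C, frames)

# (2) Initialize the wider 24 kHz output projection from the 16 kHz
# checkpoint by linear interpolation along the output-bin axis.
# (Assumed detail: interpolation axis/mode are illustrative.)
def interpolate_head_weight(old_weight, new_out_dim):
    """old_weight: (old_out_dim, hidden) -> (new_out_dim, hidden)."""
    w = old_weight.t().unsqueeze(0)                   # (1, hidden, old_out_dim)
    w = F.interpolate(w, size=new_out_dim, mode="linear", align_corners=True)
    return w.squeeze(0).t()
```

For the ISTFT head this maps the 16kHz projection (1282 outputs) onto the 24kHz one (~1922 outputs), giving the retrained decoder a spectrally sensible starting point instead of a random init.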

### Phase 2: Full-Band Post-Filter (2–4 weeks)

**Add AP-BWE or FLowHigh as a lightweight post-processor:**

| Method | Latency Added | Quality | Complexity |
|--------|--------------|---------|------------|
| AP-BWE | ~1ms (GPU) | Excellent | 292× RT, all-CNN |
| FLowHigh (1 step) | ~18ms (GPU) | SOTA | 35.4M params + BigVGAN |
| MS-BWE (1 stage) | ~0.8ms (GPU) | Very good | ~1271× RT per stage |
| UBGAN blind | <1ms | Good | Ultra-lightweight, causal |

AP-BWE is the strongest recommendation for your voice agent use case: it is pure CNN (no attention overhead), processes frame-by-frame, and achieves near-studio quality at 48kHz. You can train it on paired 16kHz/48kHz data from HiFi-TTS.[5]

### Phase 3: Integrated Decoder Redesign (4–8 weeks)

**Retrain decoder from scratch for native 48kHz:**
- Widen VocosBackbone to 1536 or 2048 hidden dim
- Increase transformer layers from 12 to 16
- Set ISTFT head to n_fft=3840, hop=960, producing 1921 freq bins
- Add multi-scale sub-band CQT discriminator (from BigVGAN-v2)
- Train with multi-resolution mel loss at 48kHz

### Phase 4: Multi-Rate Gating (Optional, +2 weeks on Phase 3)

- Add parallel ISTFT heads for 16k/24k/48k
- Condition on a resolution embedding fed into the VocosBackbone
- Train jointly with random resolution sampling per batch

## Methods Comparison

| Approach | Output SR | Retraining Scope | Added Inference Cost | Quality Potential | Implementation Difficulty |
|----------|----------|-------------------|---------------------|-------------------|--------------------------|
| Decoder retrain → 24kHz | 24kHz | Decoder only | None | Good (+0.29 MOS)[2] | Low |
| Decoder retrain → 48kHz | 48kHz | Decoder + wider backbone | None | Good–Excellent | Medium |
| AP-BWE post-filter | 48kHz | Separate model | ~1ms/utterance | Excellent[5] | Low–Medium |
| FLowHigh post-filter | 48kHz | Separate model | ~18ms/utterance | SOTA[14] | Medium |
| MS-BWE cascade | 16–48kHz | Separate model | 0.8ms/stage | Very Good[7] | Medium |
| Latent-space flow (CodecFlow) | 44.1kHz+ | End-to-end fine-tune | 6 NFEs | Excellent[17] | High |
| Multi-head decoder | 16–48kHz | Full decoder | None (head selection) | Good–Excellent | Medium–High |
| MB-iSTFT sub-band | 48kHz | Full decoder | None | Good | Medium |

## Key Takeaways

The "muffled" quality issue is entirely solvable. The most pragmatic path is a two-phase approach: first, apply the proven decoder-only modification to reach 24kHz (which immediately eliminates most perceptual degradation), then layer on AP-BWE or a single-step FLowHigh post-filter for full 48kHz when needed. This modular design lets you ship the 24kHz improvement quickly while developing the full-band solution in parallel.[14][2][5]

For controllable multi-rate output, MS-BWE's cascading architecture offers the cleanest solution—each stage is independently trainable and adds minimal latency. Alternatively, a multi-head ISTFT decoder with resolution conditioning provides zero-overhead rate switching but requires more substantial retraining.[7]

Critically, since you have pre-encoded data from HiFi-TTS2 and the encoder is frozen, all decoder-side experiments can reuse the same latent codes. This dramatically reduces iteration time—you only need to train the decoder components, which are ~100M parameters (VocosBackbone) plus the heads, not the full 580M Wav2Vec2-BERT pipeline.