The muffled quality you are experiencing is the "Nyquist ceiling." At 16kHz, your audio is capped at 8kHz of effective bandwidth, which cuts off the high-frequency fricatives and "air" that characterize high-fidelity speech.

Since you are building for voice agents and a TTS platform, **latency is your primary constraint**. Diffusion, even with consistency models or distillation (1-2 steps), often introduces overhead in the iterative sampling loop that can push you past the real-time threshold for interactive agents.

Here are the most effective research-backed methods to achieve high-fidelity upsampling (to 24kHz or 48kHz output) while keeping a low-latency, non-diffusion decoder.

---

## 1. Multi-Band Vocos Reconstruction (The "Sub-band" Approach)

Instead of forcing a single ISTFT head to reconstruct the entire 0-24kHz spectrum, you can use **Multi-Band Synthesis**. This technique is used in models like MB-MelGAN and is highly effective for bandwidth extension.

* **The Concept**: Keep your current 16kHz tokens as the "base." Train a new decoder that predicts multiple sub-bands (e.g., 0-8kHz, 8-16kHz, 16-24kHz).
* **Why it works**: High frequencies are often more "noise-like" or periodic in specific ways. By isolating the high-frequency reconstruction, the model doesn't "smear" the base band with high-frequency artifacts.
* **Implementation**:
    * Modify the `ISTFTHead` to output multiple sets of magnitude/phase pairs (one per band).
    * Use a **Pseudo-Quadrature Mirror Filter (PQMF)** bank to synthesize the final 48kHz waveform from these bands.
* **Latency**: Minimal. PQMF synthesis is a series of simple convolutions.
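The synthesis side of this pipeline can be sketched as a cosine-modulated PQMF bank in PyTorch. This is a minimal sketch: the `PQMF` class name, tap count, and cutoff ratio are illustrative choices, not part of the X-Codec 2.0 / Vocos codebase, and the per-band waveforms are assumed to come from the modified ISTFT heads upstream.

```python
import math
import torch
import torch.nn.functional as F

class PQMF(torch.nn.Module):
    """Cosine-modulated pseudo-QMF bank (synthesis side only)."""
    def __init__(self, n_bands=3, taps=62, cutoff_ratio=0.15, beta=9.0):
        super().__init__()
        # Kaiser-windowed lowpass prototype filter
        n = torch.arange(taps + 1) - taps / 2
        proto = 2 * cutoff_ratio * torch.sinc(2 * cutoff_ratio * n)
        proto = proto * torch.kaiser_window(taps + 1, periodic=False, beta=beta)
        # Cosine modulation -> one synthesis filter per sub-band
        k = torch.arange(n_bands).unsqueeze(1).float()
        t = torch.arange(taps + 1).unsqueeze(0).float()
        sign = (1 - 2 * (torch.arange(n_bands) % 2)).unsqueeze(1).float()
        filters = 2 * proto * torch.cos(
            (2 * k + 1) * math.pi / (2 * n_bands) * (t - taps / 2)
            + sign * math.pi / 4)
        # conv1d weight of shape (1, n_bands, taps+1) sums the filtered bands
        self.register_buffer("synthesis", filters.unsqueeze(0))
        self.n_bands, self.taps = n_bands, taps

    def synthesize(self, bands):
        # bands: (B, n_bands, T) at fs/n_bands  ->  (B, 1, T*n_bands) at fs
        B, _, T = bands.shape
        up = bands.new_zeros(B, self.n_bands, T * self.n_bands)
        up[..., :: self.n_bands] = bands * self.n_bands  # zero-stuff upsample
        return F.conv1d(up, self.synthesis, padding=self.taps // 2)
```

With three 8kHz-wide bands, each band runs at 16kHz internally and the bank assembles the final 48kHz waveform with a single grouped convolution's worth of compute.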



## 2. Periodic Signal Modeling with SnakeBeta (BigVGAN-style Decoder)

Since you already have `SnakeBeta` in your encoder, you should leverage its periodic inductive bias for the upsampling task. Standard convolutions struggle to generate high-frequency harmonics that don't exist in the bottleneck.

* **The Method**: Replace the Vocos Transformer's output projection with a series of **Upsampling Blocks** using **SnakeBeta** activations and **Alias-free convolutions**.
* **Novelty**: Most upsamplers fail because they alias high frequencies back into the audible range. By using the alias-free processing (Kaiser-sinc filters) you already have in the `Activation1d` module, you can push the resolution to 44.1kHz or 48kHz without "metallic" artifacts.
* **Retraining**: You can freeze the `CodecEncoder` and `Wav2Vec2-BERT` and only train the `VocosBackbone` and the new upsampling head using a Multi-Resolution Mel Loss + Multi-Period Discriminator (MPD) operating at the higher sample rate.
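A minimal sketch of such an upsampling block, assuming a per-channel `SnakeBeta` parameterized in log scale as in BigVGAN. The alias-free `Activation1d` wrapper (Kaiser-sinc up/down-sampling around the nonlinearity) is omitted for brevity, so this sketch is not alias-free as written.

```python
import torch
import torch.nn as nn

class SnakeBeta(nn.Module):
    # snakebeta(x) = x + (1/beta) * sin^2(alpha * x); alpha sets the
    # frequency of the periodic component, beta its magnitude. Both are
    # learned per channel in log scale so they stay positive.
    def __init__(self, channels):
        super().__init__()
        self.log_alpha = nn.Parameter(torch.zeros(1, channels, 1))
        self.log_beta = nn.Parameter(torch.zeros(1, channels, 1))

    def forward(self, x):
        alpha = self.log_alpha.exp()
        beta = self.log_beta.exp()
        return x + (1.0 / (beta + 1e-9)) * torch.sin(alpha * x).pow(2)

class UpsampleBlock(nn.Module):
    # Transposed-conv upsampling preceded by the periodic activation.
    def __init__(self, in_ch, out_ch, stride):
        super().__init__()
        self.act = SnakeBeta(in_ch)
        self.up = nn.ConvTranspose1d(in_ch, out_ch, kernel_size=2 * stride,
                                     stride=stride, padding=stride // 2)

    def forward(self, x):  # x: (B, C, T) -> (B, C', T*stride)
        return self.up(self.act(x))
```

Stacking two or three of these blocks (e.g., strides 2 and 3 for a 6x rate increase over an 8kHz intermediate) keeps the receptive field small and the latency per frame bounded.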

## 3. GAN-based Bandwidth Extension (BWE) with Neural Source-Filter

Research on models such as **APNet** and **StreamSpeech** suggests that explicit source-filter decomposition helps in high-resolution synthesis.

* **Technique**:
    * The decoder predicts a **Source** (excitation) and a **Filter** (envelope) at 48kHz.
    * The source can be generated using a periodic generator (for voiced parts) and noise (for unvoiced parts).
    * This is much more "stable" than trying to predict raw ISTFT bins for 48kHz, which requires a very large `n_fft` (e.g., 2048 or 4096) and increases the hop-length/latency.
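The source half of this decomposition can be sketched as an NSF-style harmonic-plus-noise generator. The function name, harmonic count, and noise level here are illustrative assumptions, and F0/voicing contours are assumed to have already been upsampled to the output sample rate.

```python
import math
import torch

def source_excitation(f0, voiced, sr=48000, n_harmonics=8, noise_std=0.03):
    """f0, voiced: (B, T) tensors already at the output sample rate."""
    # Integrate instantaneous frequency to get a running phase, then sum
    # sines at integer multiples of F0 for the periodic (voiced) source.
    phase = 2 * math.pi * torch.cumsum(f0 / sr, dim=-1)        # (B, T)
    k = torch.arange(1, n_harmonics + 1).view(1, -1, 1)        # harmonic idx
    harmonics = torch.sin(k * phase.unsqueeze(1)).mean(dim=1)  # (B, T)
    noise = noise_std * torch.randn_like(f0)
    # Voiced regions: harmonics plus a noise floor; unvoiced: noise only.
    return torch.where(voiced > 0.5, harmonics + noise, noise)
```

The learned filter network then only has to shape this spectrally rich excitation, which is an easier task at 48kHz than generating harmonics from scratch.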



## 4. Gated Multi-Resolution Output (Controllability)

To achieve your 16/24/32/48kHz flexibility without multiple models, you can implement **Resolution Gating**.

* **The Architecture**:
    * Use a shared backbone (your current Transformer).
    * Implement **Multiple Output Heads**: Head A (16kHz), Head B (24kHz), etc.
* **Gating**: Use a conditional embedding (a scalar or one-hot vector) passed into the Transformer and the heads.
* **Training**: Train with "Resolution Dropout"—randomly downsample your 48kHz ground truth to 16kHz or 24kHz during training, and force the model to reconstruct the original 48kHz. This teaches the model to "hallucinate" high frequencies when the input is low-res, but remain faithful when the input is high-res.
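The gating and dropout pieces above can be sketched as follows; `ResolutionGate` and `sample_training_rate` are hypothetical names, and the backbone is assumed to emit `(B, T, dim)` features.

```python
import random
import torch
import torch.nn as nn

RATES = [16000, 24000, 32000, 48000]

class ResolutionGate(nn.Module):
    # Adds a learned per-rate embedding to the backbone features so one
    # shared Transformer can drive several output heads. `dim` should
    # match the backbone hidden size.
    def __init__(self, dim):
        super().__init__()
        self.embed = nn.Embedding(len(RATES), dim)

    def forward(self, features, rate):
        # features: (B, T, dim); broadcast the rate embedding over time.
        idx = torch.tensor(RATES.index(rate), device=features.device)
        return features + self.embed(idx).view(1, 1, -1)

def sample_training_rate():
    # "Resolution dropout": pick a random rate each step. The 48kHz ground
    # truth is downsampled to this rate for the model input, while the
    # loss is still computed against the full-resolution target.
    return random.choice(RATES)
```

At inference, the caller simply passes the requested rate and routes the gated features to the matching head; no extra models are loaded.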



---

## Comparison of Methods for Voice Agents

| Method | Complexity | Latency | Quality Gain | Retraining Effort |
| --- | --- | --- | --- | --- |
| **Multi-Band ISTFT** | Medium | Very Low | High | High (Decoder only) |
| **SnakeBeta Upsampling** | Low | Low | Medium-High | Medium (Head only) |
| **Source-Filter GAN** | High | Low | Very High | High (Full Decoder) |
| **Consistency Models** | High | Medium | Extreme | Very High |

---

## Strategic Recommendation for MayaResearch

Since you already have the `hifi-tts2` data encoded, I suggest the **Multi-Band ISTFT** route combined with **SnakeBeta**.

1. **Keep the Encoder/VQ**: Do not touch them. Your 50Hz frame rate is perfect.
2. **Upgrade the Decoder**:
    * Increase the `VocosBackbone` width slightly if needed.
    * Change the `ISTFTHead` to target 24kHz or 48kHz.
    * **Crucial**: If you keep the hop length at 320 for 16kHz tokens, but want 48kHz output, your effective hop length in the output is $320 \times 3 = 960$.
    * Use **Sub-band Discriminators**: Train with discriminators that specifically look at the 8-16kHz and 16-24kHz ranges (high-pass filtered) to force the model to generate energy in those bands.
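The sub-band discriminator idea reduces to high-pass filtering the waveform before it reaches the discriminator, so that gradients concentrate on the bands the model must learn to fill. A windowed-sinc FIR sketch (cutoff and tap count are illustrative choices):

```python
import torch
import torch.nn.functional as F

def highpass(wav, cutoff_hz=8000.0, sr=48000, taps=101):
    # wav: (B, 1, T). Build a windowed-sinc lowpass, then spectrally
    # invert it (delta minus lowpass) to get a linear-phase FIR highpass.
    n = torch.arange(taps) - (taps - 1) / 2
    fc = cutoff_hz / sr
    lp = 2 * fc * torch.sinc(2 * fc * n)
    lp = lp * torch.hamming_window(taps, periodic=False)
    lp = lp / lp.sum()               # normalize DC gain to exactly 1
    hp = -lp
    hp[(taps - 1) // 2] += 1.0       # delta - lowpass = highpass
    return F.conv1d(wav, hp.view(1, 1, -1), padding=taps // 2)
```

Both the real and generated waveforms are passed through the same filter before the discriminator, so the adversarial loss only "sees" the 8kHz+ content.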



### Why "Inworld-style" might have failed you:

Most "plug-and-play" upsamplers are generic. Because X-Codec 2.0 uses **FSQ (Finite Scalar Quantization)**, the bottleneck is extremely tight. A generic upsampler expects a "dirty" 16kHz signal with some high-frequency noise it can amplify, but the FSQ-decoded output is too "clean": everything above the 8kHz Nyquist limit is simply absent. You need a decoder that **generates** new content from the semantic/acoustic tokens, not an upsampler that merely filters the output.

Would you like me to draft a PyTorch implementation for a Multi-Band ISTFT head that fits your current `VocosBackbone`?