You’re not “doing it wrong” as much as you’re running into physics plus loss incentives: your **16 kHz codec literally cannot carry content above 8 kHz** (Nyquist), so any “48 kHz” output is **bandwidth extension** (educated hallucination). The win condition is: *generate a high band that sounds natural, doesn’t wreck timbre, and doesn’t add latency you’ll hate later.*

Here are the **methods that actually work in the real world** (non-diffusion first), and how to bolt them onto *your* X-Codec2 setup.

---

## 1) The highest-ROI move: a **post BWE module** after your 16 kHz decoder

Keep X-Codec2 exactly as-is. Decode to 16 kHz (your current Vocos+ISTFT), then run a **tiny “16→32/48 kHz” bandwidth extension net**.

Why this is the sweet spot:

* You don’t touch the codec/quantizer.
* You can make it **streaming + causal**, with **single-digit ms algorithmic delay**.
* You can ship **16 kHz fast path** and **32/48 kHz HQ path** as a toggle.

### Proven blueprints (GAN, fast, non-diffusion)

**HiFi-GAN+ for BWE (a.k.a. “Bandwidth Extension is All You Need”)**
They treat BWE like speech enhancement and use a **feed-forward WaveNet-ish conv stack** + **GAN**, and explicitly call out why naive L1/L2 kills high frequencies (it averages unpredictable HF noise to zero). They use **multi-FFT log-spectrogram losses** and an **upperband-focused mel loss**, plus **spectral + waveform discriminators**.

**BBWEXNet (real-time CPU, causal, 16 ms delay)**
A **causal U-Net** style model designed for **16 kHz → 48 kHz** streaming, with optimizations like **sample-shuffle decoders** and VQ-ish bottlenecks for efficiency. They explicitly measure real-time feasibility and report **max ~16 ms algorithmic delay**. 

**Key practical note:** for speech, most of the perceived “not muffled anymore” gain is **reconstructing 8–16 kHz**, not obsessing about >16 kHz. BBWEXNet says that directly. 
So if you want the best quality/compute trade: **target 32 kHz first**, then optionally add a 32→48 refinement later.

### How to train this without wasting your life

**Don’t train BWE on clean “downsampled” audio and then feed it codec audio at inference.** That domain gap is how you get crunchy or pointless results.

Do this instead:

1. For each HiFi-TTS2 clip, you already have **codes**.
2. Decode them with your current decoder → **ŷ16k_codec**.
3. Train BWE model: **ŷ16k_codec → y48k_gt** (or y32k_gt).

   * Input is upsampled to the target rate with a causal sinc filter (or a polyphase resampler). BBWEXNet does causal sinc upsampling as a preprocessing step.
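To make that preprocessing step concrete, here’s a minimal sketch using scipy’s polyphase resampler in place of BBWEXNet’s causal sinc (the function name and rates are my illustrative choices):

```python
import numpy as np
from scipy.signal import resample_poly

def upsample_to_target(y16k: np.ndarray, target_sr: int, src_sr: int = 16_000) -> np.ndarray:
    """Upsample codec output to the BWE target rate before feeding the net.

    resample_poly is a polyphase FIR resampler; the result still has no
    energy above src_sr/2 -- filling that band is the BWE model's job.
    """
    assert target_sr % src_sr == 0, "integer ratios keep the polyphase filter simple"
    up = target_sr // src_sr
    return resample_poly(y16k, up=up, down=1)

# one second of codec-decoded audio at 16 kHz -> 32 kHz input for the BWE net
y16k = np.random.randn(16_000).astype(np.float32)
y32k = upsample_to_target(y16k, target_sr=32_000)
```

Note the output is still band-limited to 8 kHz: the upsampler only changes the sample rate, never invents content.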

Loss recipe that consistently stops “still muffled”:

* **Multi-resolution log-STFT losses** (several FFT sizes) 
* **Upperband-weighted mel loss** (only care about missing band strongly) 
* **Spectral discriminator** (on mel) + **multi-rate waveform discriminators** 
  This combo exists because pure regression losses will happily erase the exact high-frequency texture you’re trying to add.

If you only implement one thing from this whole message: **upperband-weighted loss + adversarial**. That’s the difference between “meh upsample” and “wait this sounds HD”.
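A minimal numpy sketch of what “multi-resolution + upperband-weighted” means in practice. The function name, FFT sizes, and the 4× upper-band weight are my illustrative choices, not values from the papers:

```python
import numpy as np

def multires_upperband_stft_loss(pred, target, sr=32_000, cutoff_hz=8_000,
                                 fft_sizes=(512, 1024, 2048), upper_weight=4.0):
    """Multi-resolution log-magnitude STFT loss with extra weight above cutoff_hz.

    The band above the cutoff is where the BWE model has to invent content,
    so it gets `upper_weight`x the penalty; weights and FFT sizes here are
    illustrative.
    """
    total = 0.0
    for n_fft in fft_sizes:
        hop = n_fft // 4
        win = np.hanning(n_fft)

        def log_spec(y):
            frames = [y[i:i + n_fft] * win
                      for i in range(0, len(y) - n_fft + 1, hop)]
            mag = np.abs(np.fft.rfft(np.stack(frames), axis=-1))
            return np.log(mag + 1e-5)

        Sp, St = log_spec(pred), log_spec(target)
        freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
        w = np.where(freqs >= cutoff_hz, upper_weight, 1.0)  # upperband weighting
        total += np.mean(w * np.abs(Sp - St))
    return total / len(fft_sizes)
```

In a real training loop this is the regression half; the adversarial half (spectral + waveform discriminators) still has to come from your GAN setup.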

---

## 2) Even better: **high-band residual head** (cheap, controllable, codec-aware)

Instead of generating the full 48 kHz waveform from scratch, do:

* Keep your decoder output as “low band” anchor.
* Predict **only the missing band** as a residual:

  * 16k → 48k: missing is ~8–24 kHz
  * 16k → 32k: missing is ~8–16 kHz (often the sweet spot)

Implementation patterns that work:

* **PQMF / subband split**: generate only the top subbands, then recombine.
* Or **STFT-domain high-band prediction**: predict high-frequency magnitude + phase residual.
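The recombination itself is tiny. A sketch assuming a Butterworth high-pass as the band guard (filter order and cutoff are my illustrative choices):

```python
import numpy as np
from scipy.signal import butter, sosfilt

def add_highband_residual(anchor_32k, residual_32k, cutoff_hz=8_000, sr=32_000):
    """Keep the decoder's low band as-is; add only the predicted high band.

    High-passing the residual guarantees the model cannot degrade the band
    the codec already reconstructs well -- the anchor passes through untouched.
    """
    sos = butter(8, cutoff_hz, btype="highpass", fs=sr, output="sos")
    return anchor_32k + sosfilt(sos, residual_32k)
```

This is the structural point of the residual head: the worst the model can do is add a bad high band, never break the low band.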

Why this fits your architecture weirdly well:

* Your decoder already works in an STFT-ish space (it predicts mag+phase then ISTFT).
* There’s newer BWE work that explicitly models **amplitude + phase** in parallel for quality *and* speed.

### AP-BWE: amplitude+phase parallel prediction (fast)

AP-BWE predicts high-frequency amplitude residuals **and** phase (not just magnitude), uses GAN discriminators in waveform and spectral domains, and reports extremely fast generation (including CPU real-time multipliers). ([arXiv][1])

This is basically “do what your ISTFTHead wants, but only for the missing band”.
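To show the splice mechanics (not AP-BWE’s actual model), here’s a sketch that writes a predicted high-band magnitude + phase into the STFT above the cutoff and inverts. Fed the *true* high band, it reconstructs the signal, which is a handy sanity check; all names and parameters here are illustrative:

```python
import numpy as np
from scipy.signal import stft, istft

def splice_highband(y_low, hf_mag, hf_phase, cutoff_hz=8_000, sr=48_000, n_fft=1024):
    """Overwrite STFT bins above cutoff_hz with predicted (mag, phase),
    keep the decoder's bins below it, then ISTFT back to a waveform."""
    f, _, Z = stft(y_low, fs=sr, nperseg=n_fft)
    hi = f >= cutoff_hz
    Z[hi, :] = hf_mag * np.exp(1j * hf_phase)   # predicted high band goes here
    _, y = istft(Z, fs=sr, nperseg=n_fft)
    return y
```

In the real setup `hf_mag`/`hf_phase` come from the high-band predictor; the low bins stay whatever your decoder produced.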

---

## 3) The “cleanest” but most invasive: retrain the decoder to output 24/32/48 kHz directly

This can work surprisingly well if you accept that HF is generated, not recovered.

The trick is: keep your **token rate = 50 Hz**, but change ISTFT geometry to match the target sample rate while keeping 20 ms hop.

Use:

* `hop_length = target_sr / 50`

  * 24k → 480
  * 32k → 640
  * 48k → 960
* Keep your window length in *time* consistent (you used 80 ms at 16k: 1280 samples). So:

  * 24k: `n_fft ≈ 1920`
  * 32k: `n_fft ≈ 2560`
  * 48k: `n_fft ≈ 3840`

That keeps frequency resolution ~12.5 Hz like you currently have, just extended upward.

Cost impact is not catastrophic:

* Your ISTFT head linear goes from `1024→1282` to `1024→3842` at 48k. That’s ~3× params in that layer, not 10×.
* Backbone compute stays at 50 Hz.
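The geometry and the ~3× head-size claim are easy to sanity-check (a throwaway helper, names mine):

```python
def istft_geometry(target_sr, token_rate=50, win_ms=80, hidden=1024):
    """ISTFT head geometry for a 50 Hz token stream at a given sample rate.

    hop keeps the 20 ms frame step; n_fft keeps the 80 ms window; the head's
    final linear predicts magnitude + phase for n_fft//2 + 1 bins each.
    """
    hop = target_sr // token_rate
    n_fft = target_sr * win_ms // 1000
    out_dim = 2 * (n_fft // 2 + 1)   # mag + phase per frequency bin
    params = hidden * out_dim        # weight matrix of the head's final linear
    return hop, n_fft, out_dim, params

for sr in (16_000, 24_000, 32_000, 48_000):
    print(sr, istft_geometry(sr))
```

At 16 kHz this reproduces your current 320-sample hop / 1280-point window / 1282-dim output; at 48 kHz the output dim is 3842, i.e. ~3× the parameters in that layer.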

Training must change though: you *must* add **high-band-focused losses/discriminators**, or it’ll learn to be “safe” and stay dull.

---

## 4) If diffusion is “the solution” but you refuse latency: use **distilled / one-step diffusion**

You can get diffusion-like quality without 50+ steps, but it’s still extra compute.

Two relevant directions:

* **NU-Wave 2**: diffusion SR that’s explicitly built to handle **multiple input sample rates with conditioning** (bandwidth conditioning via BSFT). Great for your “16/24/32/48 controllability” idea, but the diffusion baseline is slower. 
* **FlashSR**: **one-step** diffusion via distillation for 48 kHz SR. This is exactly “diffusion, but not slow,” at least compared to classic samplers. ([arXiv][2])

A very practical hybrid that teams actually ship:

* Train a **heavy teacher** (diffusion SR) offline.
* Distill into a **GAN / conv student** that runs real-time.

---

## 5) Controllability: one model for 16/24/32/48

This is doable, and there’s precedent.

Two ways:

1. **Target-SR embedding** (FiLM-style or additive conditioning): give the SR module a learned embedding for `{16,24,32,48}` so it learns “how much band to invent.” NU-Wave 2 does bandwidth conditioning (different mechanism, same idea). 
2. **Predict bands progressively**: always output base 16k, then optionally add a learned 8–16k band (for 32k), then optionally add 16–24k (for 48k). This maps nicely to a subband/PQMF design.
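Option 1 can be sketched in a few lines. This is a toy numpy FiLM conditioner (class name, init scale, everything illustrative), not NU-Wave 2’s BSFT:

```python
import numpy as np

class TargetSRConditioner:
    """FiLM-style conditioning: one learned (scale, shift) pair per target
    sample rate, applied to the SR module's hidden features.

    In training, scale/shift would be learned parameters; random init here
    just demonstrates the mechanism."""

    def __init__(self, channels, rates=(16_000, 24_000, 32_000, 48_000), seed=0):
        rng = np.random.default_rng(seed)
        self.idx = {sr: i for i, sr in enumerate(rates)}
        self.scale = 1.0 + 0.01 * rng.standard_normal((len(rates), channels))
        self.shift = 0.01 * rng.standard_normal((len(rates), channels))

    def __call__(self, h, target_sr):
        """h: (frames, channels) hidden features -> modulated features."""
        i = self.idx[target_sr]
        return h * self.scale[i] + self.shift[i]
```

The point is that one shared network learns “how much band to invent” from the embedding, so 16/24/32/48 all ride the same weights.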

---

## What I would do in your shoes (minimal wasted time)

1. **Target 32 kHz first**, not 48. It’s where speech “opens up” the most per FLOP. 
2. Train a **post-BWE** model on **codec-decoded 16k** → 32k/48k ground truth, using the HiFi-GAN+ style losses/discriminators (upperband-weighted). 
3. If you want extra polish: switch to an **amplitude+phase** high-band predictor like AP-BWE style. ([arXiv][1])
4. Only if you still hate it: retrain the ISTFTHead geometry and make a native 32/48k decoder.

That path is boring, reproducible, and doesn’t require you to gamble your real-time product on diffusion sampling steps.

If you implement the post-BWE with **codec-audio-as-input** and **upperband-weighted adversarial training**, you’re going to stop feeling like you’re “wasting time” and start feeling like you’re shipping. Which is unfortunately the whole point.

[1]: https://arxiv.org/html/2401.06387v1 "Towards High-Quality and Efficient Speech Bandwidth Extension with Parallel Amplitude and Phase Prediction"
[2]: https://arxiv.org/html/2501.10807v1 "FlashSR: One-step Versatile Audio Super-resolution via Diffusion Distillation"
