You’re not doing something obviously dumb. The problem is the framing.

What you have is not really “upsampling” in the magical recover-the-lost-bits sense. Your current codec was trained to reconstruct 16 kHz audio, so anything above 8 kHz is gone and has to be *re-synthesized plausibly*, not recovered exactly. That is why blind post-hoc waveform upsamplers so often sound airy-but-fake, or just weirdly tunnel-y in a new costume. The better results in recent work come from moving the bandwidth extension closer to the codec decoder or even into the latent space, where the model still has token, speaker, prosody, and voicing context. There is even a January 2026 X-Codec-2.0 technical report that did almost exactly the kind of surgery you’re considering: froze the encoder, changed the hop size, fine-tuned only the decoder, and reported a +0.29 UTMOSv2 gain for 24 kHz output. Promising clue, not holy scripture, but definitely a signal. ([arXiv][1])

My blunt take: for your use case, the best path is **decoder-aware bandwidth extension**, not another blind waveform SR model hanging off the end like a decorative tax. If I had to rank the practical options for a voice-agent/TTS stack, it would be:

1. **24 kHz decoder-only retrain first**
2. **Residual high-band restorer conditioned on codec latents**
3. **A stronger non-diffusion decoder replacement with harmonic/phase priors**
4. **Single-step latent flow or distilled diffusion only as an optional HQ tier**

### What has actually worked in the literature

Three families keep showing up when people stop making charts and start making things that sound good.

**First:** direct spectral decoders. APNet2, APCodec, and ComplexDec all lean into direct amplitude/phase or complex-spectrum modeling instead of long waveform upsampling stacks. APNet2 reports better speech quality than APNet and Vocos at 22.05 kHz, APCodec targets streamable 48 kHz audio with 6.67 ms fixed latency, and ComplexDec pushes full-band 48 kHz codec decoding with complex spectral I/O for better robustness. ([arXiv][2])

**Second:** harmonic or source-filter priors. This is a big one for speech. Comparative work on fast vocoders notes that Vocos is extremely fast, but complex-spectrogram prediction without a strong harmonic inductive bias can be less robust to F0 variation. HiFTNet adds a harmonic-plus-noise source/filter in the time-frequency domain and reports ground-truth-level subjective performance on LJSpeech while being much lighter than BigVGAN. Wavehax adds a harmonic prior plus 2D spectral modeling, and reports high-F0 robustness with less than 5% of HiFi-GAN V1’s MACs/params and over 4x faster CPU inference. ([arXiv][3])

**Third:** lightweight bandwidth-extension models that predict only the missing band. AP-BWE uses dual amplitude and phase streams with ConvNeXt backbones and reports SOTA speech BWE quality at both 16 kHz and 48 kHz while running very fast. MS-BWE extends that idea to flexible source/target sample rates with stage-wise blocks and teacher forcing, reporting one-stage 48 kHz generation at about 1271x real-time on a 4090 and about 59.7x real-time on CPU. BAE-Net targets unknown or varying effective bandwidths in streaming settings, and a 2024 DSP-informed paper shows that a very explicit exciter plus linear time-varying filter split can outperform plain black-box generators while using lighter exciters. ([arXiv][4])

### So what should *you* build?

#### 1) Start with a **24 kHz decoder-only retarget**, not 48 kHz

For speech TTS, 24 kHz is the highest-ROI first move. You only need to invent 8-12 kHz, instead of 8-24 kHz, and that is where a lot of the “tunnel” feeling disappears. Also, HiFiTTS-2 explicitly gives you large high-bandwidth training targets for this regime: about 36.7k hours for 22.05 kHz training and 31.7k hours for 44.1 kHz training. ([arXiv][5])

In your specific geometry, keep the **50 Hz token rate** and just retarget the decoder’s waveform geometry:

* 16 kHz: hop 320, win/n_fft 1280
* 24 kHz: hop 480, win/n_fft about 1920
* 32 kHz: hop 640, win/n_fft about 2560
* 48 kHz: hop 960, win/n_fft about 3840

That preserves the same **20 ms frame interval** and **80 ms window duration** in real time, which is exactly what you want. The expensive part of your decoder is the 50 Hz transformer trunk, so if you keep the token rate fixed, most of the extra cost lands in the output head and iSTFT, not in sequence modeling. That is a very favorable trade for realtime deployment.
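To make the bookkeeping concrete, here is a tiny sketch deriving each target rate's geometry from the fixed 50 Hz token rate (the `decoder_geometry` helper and the 4-hop window ratio are just illustrative conventions, not your codec's actual config):

```python
def decoder_geometry(sample_rate, token_rate_hz=50, win_frames=4):
    """Derive STFT geometry so the frame interval (20 ms) and window
    span (80 ms) stay constant in real time across sample rates."""
    hop = sample_rate // token_rate_hz   # samples per 20 ms frame
    win = hop * win_frames               # 80 ms analysis window
    return {"hop": hop, "win": win, "n_fft": win}

for sr in (16_000, 24_000, 32_000, 48_000):
    print(sr, decoder_geometry(sr))     # 16000 -> hop 320, win/n_fft 1280, etc.
```

The point of the helper is that only `sample_rate` changes; the token rate, and therefore the transformer trunk's sequence length, never does.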

For the first retrain, I would **not** keep the current Vocos-style single linear mag/phase head unchanged and just hope vibes fix it. Recent work says the fast frame-level spectral approach is good, but speech quality improves when you add more phase structure or harmonic prior. So I would try one of these two replacements first:

* **APNet2 / APCodec-style head**: separate amplitude and phase streams after a shared trunk.
* **Wavehax-lite / HiFTNet-lite head**: harmonic-prior branch plus spectral decoder.

That is where the current research points, and it fits your architecture well. ([arXiv][2])
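Mechanically, "separate amplitude and phase streams" can be sketched like this (all widths and weights below are made-up placeholders, not the APNet2 architecture): the amplitude stream predicts log-magnitude, and the phase stream predicts pseudo real/imaginary parts whose `atan2` gives a naturally wrapped phase, a common trick in this family:

```python
import numpy as np

rng = np.random.default_rng(0)
D, F = 512, 961   # hypothetical trunk width and spectral bin count

# Placeholder head parameters: one amplitude projection, plus two
# projections whose atan2 yields a well-defined wrapped phase.
W_amp = rng.standard_normal((D, F)) * 0.01
W_re  = rng.standard_normal((D, F)) * 0.01
W_im  = rng.standard_normal((D, F)) * 0.01

def dual_stream_head(h):
    """h: (frames, D) trunk features -> complex spectrum (frames, F)."""
    log_amp = h @ W_amp                       # amplitude stream
    phase   = np.arctan2(h @ W_im, h @ W_re)  # phase stream, wrapped to (-pi, pi]
    return np.exp(log_amp) * np.exp(1j * phase)

spec = dual_stream_head(rng.standard_normal((10, D)))
print(spec.shape)
```

The iSTFT of `spec` then replaces the single linear mag/phase projection; the two streams can be supervised separately, which is what makes the amplitude/phase loss split in the training section below possible.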

#### 2) Your easiest production-friendly win is a **residual high-band decoder branch**

This is probably the best balance of quality, latency, and engineering sanity.

Pipeline:

1. Keep your current 16 kHz decoder exactly as the **base low-band path**.
2. Deterministically upsample it to 24/32/48 kHz with a good sinc/polyphase resampler.
3. Add a tiny **high-band residual model** that predicts only the missing band in STFT space.
4. Condition that branch on:

   * `vq_post_emb` or decoder hidden states
   * the upsampled base waveform or its STFT
   * optional F0 / voiced-unvoiced / energy features

This is basically the “stop trying to rewrite what already works” strategy. And it is backed by multiple lines of work: SoundStream explicitly showed decoder-side enhancement with no extra latency, AP-BWE/MS-BWE predict missing-band spectra directly, and AudioLBM reports that low-frequency replacement improves both low-band fidelity and later high-frequency generation in cascaded SR systems. ([arXiv][6])
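Steps 2-4 above fit in a few lines of numpy/scipy; this is a shape-level sketch only (the 256-dim latent, the 1 s clip, and the random tensors are placeholder assumptions standing in for your real decoder output and `vq_post_emb`):

```python
import numpy as np
from scipy.signal import resample_poly, stft

rng = np.random.default_rng(0)
base_16k    = rng.standard_normal(16_000)     # 1 s of base-decoder output
vq_post_emb = rng.standard_normal((50, 256))  # (50 Hz frames, latent dim)

# Step 2: deterministic polyphase upsample, 16 kHz -> 24 kHz (ratio 3/2).
base_24k = resample_poly(base_16k, up=3, down=2)

# Step 3 conditioning input: STFT of the base at the 24 kHz geometry
# (hop 480, win 1920 keeps the 20 ms frame interval).
_, _, Z = stft(base_24k, fs=24_000, nperseg=1920, noverlap=1920 - 480)
log_mag = np.log(np.abs(Z.T) + 1e-5)          # (frames, bins)

# Step 4: per-frame conditioning = latents + upsampled-base spectrum.
n = min(len(vq_post_emb), len(log_mag))
cond = np.concatenate([vq_post_emb[:n], log_mag[:n]], axis=-1)
print(base_24k.shape, cond.shape)
```

The HF residual model consumes `cond` frame by frame; optional F0/voicing features would simply be concatenated onto it the same way.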

Concretely, I would make that HF branch predict **only**:

* residual log-magnitude for bins above 8 kHz
* direct phase for those bins, or complex residual if you want a cleaner implementation

And I would hard-bypass or replace the low band from the base decoder. That one design choice alone usually saves a lot of misery. Humans love to train a model to “enhance” audio, then act shocked when it mangles the perfectly fine part too.
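The hard low-band bypass is literally one masking step in STFT space; a minimal sketch, assuming the 24 kHz / 1920-point geometry from above and an 8 kHz crossover:

```python
import numpy as np

def lowband_bypass(base_spec, hf_spec, sr, n_fft, cutoff_hz=8_000.0):
    """Keep the base decoder's bins below the cutoff verbatim; take the
    HF model's bins only above it. Both inputs: (frames, n_fft//2 + 1)."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    return np.where(freqs[None, :] < cutoff_hz, base_spec, hf_spec)

# Toy check: below 8 kHz the output is exactly the base spectrum.
sr, n_fft = 24_000, 1920
base  = np.ones((4, n_fft // 2 + 1), dtype=complex)
hf    = np.full_like(base, 2.0 + 0j)
mixed = lowband_bypass(base, hf, sr, n_fft)
```

A real implementation might soften the boundary with a short crossfade region around the cutoff to avoid a spectral seam, but the principle stands: the model is structurally unable to touch the band that was already fine.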

#### 3) Add **harmonic / voicing structure** explicitly

Speech high-band is not just “more treble.” It is a mix of harmonics, breathiness, and noisy unvoiced consonants. The literature keeps converging on this point from different directions:

* HiFTNet: harmonic-plus-noise source/filter helps.
* Wavehax: harmonic prior improves robustness, especially high F0.
* CodecFlow: explicit voicing-aware conditioning helps recover high-frequency unvoiced detail in codec-latent BWE. ([arXiv][7])

So I would add a tiny auxiliary head that predicts **F0 + voiced/unvoiced + maybe band energy** from the frozen post-VQ embeddings, using teacher labels from a pitch extractor during training. Then use those features to condition the HF branch.

That gives you a very cheap but very meaningful inductive bias. It is also much cleaner than hoping the decoder silently learns that /s/, /sh/, breath, and female vowels need different high-band treatment.
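A sketch of how small that auxiliary head can be (the embedding width, the linear weights, and the 50-500 Hz squashing range are all hypothetical; teacher labels would come from a pitch extractor at training time):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 256   # hypothetical post-VQ embedding width

# Placeholder parameters for a 3-output linear head.
W = rng.standard_normal((D, 3)) * 0.01
b = np.zeros(3)

def voicing_head(emb):
    """emb: (frames, D) frozen post-VQ embeddings -> conditioning features."""
    z = emb @ W + b
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    return {
        "f0_hz":     50.0 + 450.0 * sigmoid(z[:, 0]),  # squashed to (50, 500) Hz
        "voiced":    sigmoid(z[:, 1]),                 # voiced/unvoiced probability
        "hf_energy": np.exp(z[:, 2]),                  # positive band energy
    }

feats = voicing_head(rng.standard_normal((50, D)))
print(feats["f0_hz"].shape)
```

At inference these three per-frame scalars are concatenated onto the HF branch's conditioning; the head itself is negligible compute next to the trunk.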

### If you want a stronger decoder replacement, these are the best bets

**Best non-diffusion ceiling:** replace the current Vocos decoder with one of these ideas:

* **APCodec-style decoder**
  Modified ConvNeXt-v2 backbone, amplitude/phase representation, streamable design, 6.67 ms fixed latency at 48 kHz. That is probably the closest precedent to “high-rate codec decoder that still cares about realtime.” ([arXiv][8])

* **Wavehax-lite**
If your issue is crispness and harmonic realism, this is very attractive. It is spectral-domain, aliasing-aware, harmonic-prior-driven, and fast. Recent streaming analysis also found that multi-stream Wavehax delivered top throughput in low-latency conditions (below 80 ms), and got nearly non-causal quality with only one frame of lookahead. ([arXiv][9])

* **HiFTNet-lite**
  If you want something closer to iSTFTNet but more speech-specialized, HiFTNet is a really good signal that source/filter inductive bias pays off. ([arXiv][7])

I would *not* bet my first serious experiment on plain Vocos-at-higher-sample-rate. It is fast, yes. But the comparative papers are basically telling you the thing you’re already hearing: speed alone is not enough if the model lacks harmonic bias. ([arXiv][3])

### Conditional 16 / 24 / 32 / 48 is absolutely possible

This is very doable, and I would prefer **additive band gating** over one giant “do everything” head.

A clean design is:

* **Base16** branch: reconstructs the existing 0-8 kHz band
* **+HF24** branch: adds 8-12 kHz
* **+HF32** branch: adds 12-16 kHz
* **+HF48** branch: adds 16-24 kHz

Each branch is shallow, residual, and conditioned on a **target-rate embedding**. For 16 kHz mode, none of the HF branches run. For 24 kHz, only the first one runs. For 48 kHz, all of them run. That gives you controllability *and* compute that scales with target quality, which is exactly what you want in a production stack.
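The control flow is simple enough to state in full; in this toy sketch the branches are stand-in callables (a real branch would be a shallow residual network whose output is band-limited to its slice, per the bypass logic earlier):

```python
import numpy as np

# Which residual branches run for each target rate.
BRANCH_BANDS = {"hf24": (8_000, 12_000),
                "hf32": (12_000, 16_000),
                "hf48": (16_000, 24_000)}
ACTIVE = {16_000: [],
          24_000: ["hf24"],
          32_000: ["hf24", "hf32"],
          48_000: ["hf24", "hf32", "hf48"]}

def synthesize(base_spec, branches, target_sr):
    """base_spec: (frames, bins); branches: name -> residual callable.
    Compute scales with target quality: unused branches never execute."""
    out = base_spec.copy()
    for name in ACTIVE[target_sr]:
        out = out + branches[name](out)
    return out

base = np.zeros((4, 8))
branches = {name: (lambda o: np.ones_like(o)) for name in BRANCH_BANDS}
out16 = synthesize(base, branches, 16_000)   # no HF branch runs
out48 = synthesize(base, branches, 48_000)   # all three residuals added
```

The target-rate embedding conditions each branch internally; what the dispatch table above buys you is that 16 kHz mode pays exactly zero extra compute.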

There is good precedent for both parts of this idea: MS-BWE does stage-wise extension across multiple source/target sample-rate pairs with teacher forcing, and AudioLBM explicitly conditions on source and target frequency information to learn any-to-any upsampling. ([arXiv][10])

If you want the simplest version, do a **shared trunk + rate-specific shallow heads**. If you want the most controllable version, do **cascaded band branches**. I’d pick the second one.

### Losses and training tricks that matter here

This part is boring, which means it matters.

I would not rely on plain full-band mel loss plus generic GAN losses and hope for sparkle. For missing-band reconstruction, that objective is too blunt. The models that work well here tend to use **band-aware spectral objectives** and often explicit **phase losses** or amplitude/phase discriminators. AP-BWE and MS-BWE both predict residual log-amplitude and direct phase, and use waveform, amplitude, and phase discriminators instead of only one undifferentiated adversarial bucket. ([arXiv][4])

My recommended training recipe:

* **Low-band identity loss or hard low-band replacement**
* **High-band-weighted complex STFT loss**
* **Amplitude loss + phase loss** if using dual-stream output
* **A small waveform adversarial loss**
* **One spectral discriminator focused on the new band**
* **Optional F0 / UV auxiliary loss** if you add voicing conditioning

Also, if you do a cascaded setup, copy the **teacher forcing / scheduled sampling** idea from MS-BWE so stage 2 doesn’t collapse when stage 1 makes slightly ugly mistakes. That paper did the annoying but necessary work of showing why this helps. ([arXiv][10])
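The high-band-weighted complex STFT term from the recipe above is the easiest one to get wrong in a subtle way, so here is an explicit sketch (the 4x weight and the 8 kHz cutoff are illustrative hyperparameters, not tuned values):

```python
import numpy as np

def hf_weighted_stft_loss(pred_spec, ref_spec, sr, n_fft,
                          cutoff_hz=8_000.0, hf_weight=4.0):
    """Complex L1 loss with extra weight on bins above the cutoff, so the
    objective actually pays for errors in the band being invented."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    w = np.where(freqs >= cutoff_hz, hf_weight, 1.0)   # (bins,)
    return float(np.mean(np.abs(pred_spec - ref_spec) * w[None, :]))

# Sanity check at the 24 kHz geometry: unit error everywhere should score
# higher than 1.0, because the >8 kHz bins are up-weighted.
sr, n_fft = 24_000, 1920
ref  = np.zeros((4, n_fft // 2 + 1), dtype=complex)
pred = np.ones_like(ref)
loss = hf_weighted_stft_loss(pred, ref, sr, n_fft)
```

In a multi-resolution setup you would average this over several `n_fft` values; the weighting mask is recomputed per resolution from `rfftfreq`, so the cutoff stays in Hz rather than in bins.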

### How I would structure the experiments

This is the order I’d run, because research is just organized disappointment unless you control the ablations.

**E0.** Baseline: current 16 kHz decoder + best deterministic resampler to 24/48 kHz.
This tells you how much improvement is real and how much is placebo from reconstruction filtering.
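E0 is a one-liner per target rate with a polyphase (windowed-sinc) resampler; the sketch below uses `scipy.signal.resample_poly` and white noise as a stand-in for decoder output, just to show that no energy appears above the original 8 kHz Nyquist:

```python
import numpy as np
from scipy.signal import resample_poly

rng = np.random.default_rng(0)
x16 = rng.standard_normal(16_000)        # 1 s stand-in for decoder output
x24 = resample_poly(x16, up=3, down=2)   # 16 kHz -> 24 kHz
x48 = resample_poly(x16, up=3, down=1)   # 16 kHz -> 48 kHz

# The upsampled signal should be essentially empty above 8 kHz:
spec  = np.abs(np.fft.rfft(x48))
freqs = np.fft.rfftfreq(len(x48), d=1 / 48_000)
hf_ratio = spec[freqs > 9_000].mean() / spec[freqs < 7_000].mean()
print(len(x24), len(x48), round(hf_ratio, 4))
```

Any model in E1-E5 that cannot beat this baseline in a blind listen is adding latency for nothing, which is exactly the placebo check this rung exists for.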

**E1.** 24 kHz decoder-only retarget.
Same frozen encoder/quantizer, same cached codes, new 24 kHz head. Train only head + maybe last decoder blocks.

**E2.** 24 kHz dual-stream spectral head.
Same as E1, but amplitude/phase split like APNet2/AP-BWE.

**E3.** 24 kHz residual HF branch.
Keep current decoder as low-band base, predict only >8 kHz residual.

**E4.** 24 -> 48 cascaded stage.
Second stage only predicts 12-24 kHz from tokens + 24 kHz output, with teacher forcing.

**E5.** Optional HQ tier.
Single-step flow or distilled diffusion on latent/STFT residual only, not the whole waveform path.

That ladder gives you an honest answer about whether the real gain is coming from direct high-rate decoding or from a cheap HF restorer.

### Where diffusion fits, if you really want it

If you allow *some* generative overhead, don’t use classic multi-step waveform diffusion in your default path. That is the scenic route to latency regret.

The smarter versions right now are:

* **single-step distilled diffusion**, like FlashSR, which reports competitive 48 kHz SR while being about 22x faster than the previous SOTA it compares against;
* **single-step flow matching**, like FLowHigh, which reports strong VCTK SR results with one-step sampling;
* **latent-space probabilistic models**, which are much more aligned with your cached-token setup than waveform diffusion. Latent-domain upsampling work reports up to 100x efficiency gains over raw-audio post-processing, and the very fresh March 2026 CodecFlow paper specifically does voicing-aware bandwidth extension in codec latent space. ([arXiv][11])

There is also a very relevant codec-side precedent: ScoreDec adds a diffusion post-filter to AudioDec and reports human-level naturalness for full-band 48 kHz speech at 24 kbps. That tells you diffusion *can* close the gap, but in a voice-agent product I would keep it as a premium or offline quality mode, not the default call path. ([arXiv][12])

### My final recommendation

If this were my stack, I would do this, in exactly this order:

1. **Train a 24 kHz decoder-only variant first** using your cached codes and original high-rate waveforms.
2. **Replace the current head with a dual amplitude/phase or harmonic-prior decoder**, not just a naïve bigger iSTFT head.
3. **Add a residual high-band branch with low-band bypass**, conditioned on `vq_post_emb` and F0/UV.
4. **Only then** build a cascaded 24 -> 48 stage if 24 kHz still isn’t enough.
5. Keep **single-step latent flow/diffusion** as an optional HQ path, not the default.

That path has the best odds of giving you a real audible win without detonating latency. And it lines up with what has actually worked in fast speech vocoders, BWE systems, and recent codec research, instead of sending you back into another week of ritual suffering with blind upsamplers.

[1]: https://arxiv.org/pdf/2601.20185 "https://arxiv.org/pdf/2601.20185"
[2]: https://arxiv.org/abs/2311.11545 "https://arxiv.org/abs/2311.11545"
[3]: https://arxiv.org/pdf/2506.03554 "https://arxiv.org/pdf/2506.03554"
[4]: https://arxiv.org/html/2401.06387v1 "https://arxiv.org/html/2401.06387v1"
[5]: https://arxiv.org/abs/2506.04152 "https://arxiv.org/abs/2506.04152"
[6]: https://arxiv.org/abs/2107.03312 "https://arxiv.org/abs/2107.03312"
[7]: https://arxiv.org/abs/2309.09493 "https://arxiv.org/abs/2309.09493"
[8]: https://arxiv.org/pdf/2402.10533 "https://arxiv.org/pdf/2402.10533"
[9]: https://arxiv.org/abs/2411.06807 "https://arxiv.org/abs/2411.06807"
[10]: https://arxiv.org/pdf/2406.02250 "https://arxiv.org/pdf/2406.02250"
[11]: https://arxiv.org/abs/2501.10807 "https://arxiv.org/abs/2501.10807"
[12]: https://arxiv.org/abs/2401.12160 "[2401.12160] ScoreDec: A Phase-preserving High-Fidelity Audio Codec with A Generalized Score-based Diffusion Post-filter"
