# Upsampling neural codec output from 16kHz to 48kHz: a complete technical guide

**A frozen 16kHz encoder can absolutely drive a decoder producing 24–48kHz output** — this has been directly demonstrated by a 2025 modification to X-Codec 2.0 that achieved a +0.29 MOS improvement over the 16kHz baseline with only decoder-side changes. The most practical path forward combines a modified ISTFT decoder head with GAN-based training, achievable within your existing Vocos architecture. Four concrete strategies are viable, ranked by implementation complexity and expected quality: (1) direct ISTFTHead rescaling for 24kHz, (2) multi-band ISTFT with PQMF synthesis for 48kHz, (3) a modular post-hoc bandwidth extension network like AP-BWE, or (4) single-step flow matching for maximum quality. All four meet the RTF ≤ 0.1 constraint.

---

## The information-theoretic argument for why this works

The central question — whether a decoder can synthesize frequencies above 8kHz from features that only encode up to 8kHz (the Nyquist of 16kHz audio) — has a definitive empirical answer: **yes**. Speech is a quasi-periodic signal where higher harmonics are strongly correlated with fundamentals, formant structure extends predictably across the full spectrum, and fricative/noise patterns correlate with energy envelopes in lower bands. The decoder learns statistical priors about high-frequency content during training on wideband audio.

Three independent systems confirm this. The X-Codec 2.0 modification (arXiv 2601.20185) froze the 16kHz encoder and HuBERT semantic encoder, modified only the decoder's STFT parameters (hop 320→960, output 16→24kHz), interpolated output projection weights, and fine-tuned. LinaCodec takes this further: it encodes at 24kHz (12kHz Nyquist) but decodes to **48kHz** using a dual-path Vocos decoder with Snake-based upsampling from BigVGAN, generating a full 12kHz of "hallucinated" bandwidth. NVSR demonstrated that even simple replication-padding of mel bins lets a vocoder generate meaningful high-frequency energy, because "prior knowledge in the vocoder maps constant energy in padded higher-frequency bands into meaningful energy distribution."

The practical limit appears to be roughly **3× the encoder's Nyquist frequency** — so from your 8kHz Nyquist, generating content up to ~24kHz (48kHz sample rate) is feasible but increasingly relies on learned priors rather than encoded information. Quality degrades gracefully: 24kHz output (extending to 12kHz) is easier and higher-fidelity than 48kHz output (extending to 24kHz).

---

## Strategy 1: Direct ISTFTHead rescaling — the simplest path to 24kHz

This is the most straightforward approach and has direct empirical validation. Your current architecture uses n_fft=1280, hop=320 at 16kHz, producing 641 frequency bins at 50 fps. To maintain the same 50 fps frame rate at higher sample rates while only modifying the decoder head:

| Target SR | Required hop | Suggested n_fft | Frequency bins | Output channels (mag+phase) |
|-----------|-------------|-----------------|----------------|---------------------------|
| 16 kHz | 320 | 1280 | 641 | 1,282 |
| 24 kHz | 480 | 1920 | 961 | 1,922 |
| 32 kHz | 640 | 2560 | 1,281 | 2,562 |
| 48 kHz | 960 | 3840 | 1,921 | 3,842 |

For FFT efficiency, use n_fft=2048 (for 24kHz) or n_fft=4096 (for 48kHz) with slight frequency oversampling. The VocosBackbone remains unchanged — it still processes features at 50 fps. Only the final projection layer grows: from `backbone_dim × 1282` to `backbone_dim × 3842` parameters for 48kHz.
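The table's parameters follow mechanically from holding the frame rate at 50 fps; a small helper (a sketch, assuming the existing head's 4× n_fft-to-hop ratio) makes the relationship explicit. Swapping in n_fft=2048 or 4096 for FFT efficiency changes only the bin and channel counts accordingly.

```python
def istft_head_params(target_sr: int, fps: int = 50) -> dict:
    """Compute ISTFT head parameters for a target sample rate at a fixed frame rate.

    hop must equal target_sr / fps to keep the backbone's 50 fps frame rate
    unchanged; n_fft = 4 * hop (75% overlap) mirrors the 16 kHz head's 1280/320.
    """
    assert target_sr % fps == 0, "sample rate must be a multiple of the frame rate"
    hop = target_sr // fps
    n_fft = 4 * hop
    bins = n_fft // 2 + 1          # one-sided spectrum
    return {"hop": hop, "n_fft": n_fft, "bins": bins,
            "out_channels": 2 * bins}  # magnitude + phase
```

For example, `istft_head_params(24000)` reproduces the 24kHz row of the table above.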

**Weight initialization via linear interpolation** (proven in the X-Codec 2.0 modification): for each new output index `i` in the larger projection (old size `L`, new size `L'`), compute `x_i = (L-1)/(L'-1) × i` and interpolate `w'_i = (1 - α_i) × w_{⌊x_i⌋} + α_i × w_{⌈x_i⌉}`, where `α_i = x_i − ⌊x_i⌋` is the fractional part. This provides a meaningful starting point that preserves low-frequency reconstruction quality while the model learns high-frequency generation during fine-tuning.
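The interpolation above can be sketched in a few lines of NumPy (a minimal version; for an `nn.Linear` projection you would apply it to the rows of the weight matrix and to the bias, since each row corresponds to one output index):

```python
import numpy as np

def interpolate_rows(w: np.ndarray, new_rows: int) -> np.ndarray:
    """Linearly interpolate a (rows, dim) weight matrix to (new_rows, dim).

    Each new index i maps back to x_i = (L-1)/(L'-1) * i in the original
    index range; alpha_i = x_i - floor(x_i) is the interpolation weight.
    """
    L = w.shape[0]
    x = (L - 1) / (new_rows - 1) * np.arange(new_rows)
    lo = np.floor(x).astype(int)
    hi = np.ceil(x).astype(int)
    alpha = (x - lo)[:, None]
    return (1 - alpha) * w[lo] + alpha * w[hi]
```

The endpoints are preserved exactly (row 0 maps to row 0, the last row to the last row), so the DC and Nyquist bins of the old head carry over unchanged.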

**The key practical concern** is whether the 12-layer transformer backbone has sufficient capacity to predict 3× more spectral coefficients at 48kHz. For 24kHz (1.5× increase), existing backbone capacity is likely sufficient. For 48kHz, consider adding an intermediate projection MLP between backbone and ISTFTHead (e.g., `backbone_dim → 2×backbone_dim → 3842`) rather than expanding the backbone itself. This keeps computational cost nearly identical during the backbone forward pass, adding only a lightweight projection.
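One way to realize that intermediate projection is sketched below (hypothetical module; `backbone_dim` and the backbone itself come from the existing model, and 3842 is the 48kHz mag+phase channel count from the table above):

```python
import torch
import torch.nn as nn

class ExpandedISTFTProjection(nn.Module):
    """Widen-then-project head for 48 kHz: backbone_dim -> 2*backbone_dim -> 3842.

    The backbone stays untouched; only this lightweight MLP grows with the
    target sample rate, so the backbone forward pass cost is unchanged.
    """
    def __init__(self, backbone_dim: int, out_channels: int = 3842):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(backbone_dim, 2 * backbone_dim),
            nn.GELU(),
            nn.Linear(2 * backbone_dim, out_channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, backbone_dim) at 50 fps
        return self.net(x)
```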

**Implementation priority**: Start with 24kHz (lowest risk, directly validated). If quality is acceptable, attempt 48kHz with the expanded projection head. If 48kHz quality suffers, move to Strategy 2.

---

## Strategy 2: Multi-band ISTFT with PQMF synthesis for 48kHz

For 48kHz output, multi-band decomposition offers a more elegant solution than a single massive ISTFTHead. This approach is validated by MB-ISTFT-VITS, MS-Vocos, and DMNet.

**The 4-band configuration** splits the 48kHz target into four 12kHz sub-bands via Pseudo Quadrature Mirror Filter (PQMF) banks. Each sub-band ISTFT uses n_fft=960, hop=240 at 50 fps, requiring only 481 frequency bins per band. Total output channels: `4 × 2 × 481 = 3,848` — similar to the direct approach but with structured frequency decomposition that allows sub-band-specific training objectives.
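A sketch of such a 4-band head follows (hypothetical module, assuming the backbone's 50 fps frame-level features as input; the PQMF synthesis filter bank that interleaves the four 12kHz-rate sub-band waveforms into one 48kHz signal is omitted):

```python
import torch
import torch.nn as nn

class MultiBandISTFTHead(nn.Module):
    """Four sub-band ISTFT heads for 48 kHz output (sketch)."""
    def __init__(self, dim: int, n_bands: int = 4, n_fft: int = 960, hop: int = 240):
        super().__init__()
        self.n_bands, self.n_fft, self.hop = n_bands, n_fft, hop
        self.bins = n_fft // 2 + 1                        # 481 bins per band
        self.proj = nn.Linear(dim, n_bands * 2 * self.bins)  # 3848 channels total

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim) from the unchanged backbone
        B, T, _ = x.shape
        y = self.proj(x).view(B, T, self.n_bands, 2, self.bins)
        mag = torch.exp(y[..., 0, :].clamp(max=10.0))     # log-magnitude head
        phase = y[..., 1, :]
        spec = (mag * torch.exp(1j * phase)).permute(0, 2, 3, 1)  # (B, bands, bins, T)
        window = torch.hann_window(self.n_fft, device=x.device)
        bands = [torch.istft(spec[:, k], self.n_fft, self.hop, window=window)
                 for k in range(self.n_bands)]
        # A PQMF synthesis bank (not shown) merges these into one 48 kHz waveform.
        return torch.stack(bands, dim=1)                  # (B, n_bands, (T-1)*hop)
```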

**An elegant 3-band variant** uses sub-bands at 16kHz each (n_fft=1280, hop=320), meaning the existing ISTFTHead architecture is reused exactly for one band, and two additional heads with identical architecture handle the upper bands. PQMF synthesis combines `3 × 16kHz → 48kHz`. This minimizes architectural changes and lets you initialize one band's weights from your trained 16kHz model.

The multi-band approach has several advantages for high-frequency generation. Different bands can receive specialized discriminator attention — the high-frequency bands benefit from stochastic generation (noise-like fricatives), while low-frequency bands need precise harmonic modeling. Sub-band discriminators can target frequency-specific artifacts. The PQMF filters themselves are fixed (no learned parameters; order 63 is standard), and reference implementations exist in open-source multi-band vocoder codebases such as the ParallelWaveGAN repository.

**MS-Vocos** (from the streaming vocoder literature) validates this specifically for Vocos: it splits the output along the channel axis into 4 sub-spectrograms, applies iSTFT per sub-band with reduced frame parameters, and uses a trainable synthesis filter. Quality matches standard Vocos with improved streaming characteristics.

---

## Strategy 3: Modular post-hoc bandwidth extension

If architectural modifications to the decoder are undesirable, a separate bandwidth extension network after the 16kHz decoder output offers the most modular path. Three options stand out, ranked by quality-speed tradeoff:

**AP-BWE** (Lu et al., IEEE/ACM TASLP 2024) is the strongest candidate for real-time applications. It uses a dual-stream all-CNN architecture that predicts amplitude and phase spectra in parallel with mutual interaction between streams, using ConvNeXt blocks throughout. It achieves **292× real-time on RTX 4090 GPU and 18× real-time on CPU** — comfortably within the RTF ≤ 0.1 requirement even on CPU. It was the first method to directly extend high-frequency phase (not just magnitude), which is critical for avoiding metallic artifacts. Training uses multi-period discriminators at the waveform level plus multi-resolution amplitude and phase discriminators. Open-source at github.com/yxlu-0102/AP-BWE.

**HiFi-GAN+ BWE** (brentspell/hifi-gan-bwe on GitHub, MIT license) provides the simplest deployment path. It uses a ~1M parameter WaveNet-style residual stack with 16 dilated convolution layers. A single pre-trained model handles arbitrary input sample rates (8/16/24kHz → 48kHz). It operates as a waveform-to-waveform transform — the input is first resampled to 48kHz via bandlimited interpolation, then the network fills in high-frequency content. Available via pip install.

**MusicHiFi-BWE** (Adobe Research, 2024) uses a HiFi-GAN generator with skip connection from the upsampled input, achieving **1786× real-time on A100**. Its design specifically targets post-processing of generative model outputs, making it architecturally aligned with the codec use case. The skip connection from the bilinearly upsampled input is a key design choice — it lets the network focus on generating only the high-frequency residual rather than reconstructing the entire signal.

**Common failure modes of GAN-based BWE and how to avoid them**:

- Over-smoothing in the >12kHz range (mitigated by high-frequency-specific discriminators)
- Phase incoherence causing metallic artifacts (mitigated by AP-BWE's dual-stream phase prediction or complex STFT discriminators)
- Mode collapse to limited high-frequency patterns (mitigated by diverse training data and multi-period discriminators)
- Mismatch between training conditions and codec artifacts (mitigated by training on actual codec output rather than clean low-pass-filtered audio)

---

## Strategy 4: Single-step generation models for maximum quality

For applications where slightly higher latency is acceptable (RTF ~0.07 rather than ~0.003), two 2025 methods push quality beyond what pure GANs achieve.

**FLowHigh** (ICASSP 2025, Resemble AI) is the first flow-matching model for audio super-resolution. It uses a transformer-based vector field estimator (2 layers, 16-head attention, 1024 embedding dims) operating on mel-spectrograms, with specially designed conditional probability paths enabling **single-step Euler ODE sampling**. It achieves state-of-the-art LSD and ViSQOL on VCTK across all input rates (8/12/16/24kHz → 48kHz). The catch: it requires a BigVGAN vocoder for mel→waveform conversion, adding latency. Open-source at github.com/resemble-ai/flowhigh.

**FlashSR** (ICASSP 2025, KAIST) distills AudioSR's 50-step latent diffusion model into a single-step student using three losses (distillation, adversarial, distribution-matching distillation). Only ~45M parameters are updated via LoRA on attention modules of a 258M total model. It introduces an "SR Vocoder" that conditions waveform generation on both the predicted mel-spectrogram and the low-resolution input waveform, eliminating the need for low-frequency replacement post-processing. It achieves a **22× speedup over AudioSR** (0.36s for 5.12s audio on an A6000) with competitive subjective quality, and handles any input rate from 4–32kHz up to 48kHz.

**UniverSR** (2025) eliminates the vocoder bottleneck entirely by directly predicting complex STFT coefficients via flow matching and reconstructing via iSTFT — architecturally aligned with the Vocos philosophy.

---

## How production systems actually handle this problem

An important finding from surveying production TTS: **most systems avoid the BWE problem entirely** by training codecs at the target sample rate from the start. This is the dominant industry pattern, but it requires control over the encoder — which you don't have.

Cartesia Sonic outputs at 24kHz or 44.1kHz (configurable via API) using SSM-based architecture; internal upsampling details are proprietary. Sesame CSM uses Kyutai's Mimi codec at 24kHz natively with no separate BWE stage. F5-TTS generates mel-spectrograms decoded by Vocos at 24kHz. CosyVoice 2 uses a flow-matching model to generate 50Hz mel features decoded by HiFi-GAN at 22.05kHz. Parler-TTS uniquely uses DAC at 44.1kHz — the codec itself operates at the target rate.

**DAC at 44.1kHz** uses encoder strides [2,4,8,8] (total 512×) with a decoder dimension of **1536** (vs 64 for the encoder), demonstrating significant asymmetric capacity allocation. The decoder uses transposed convolutions with Snake activations. EnCodec's 48kHz model uses chunk-based non-causal processing and music-specific training. Both use fundamentally the same conv encoder → RVQ → conv decoder architecture as their lower-rate counterparts — the difference is stride patterns and decoder capacity, not architectural paradigm.

**WavTokenizer** is particularly relevant: it operates at 24kHz with a single quantizer (like your FSQ) and uses a Vocos-inspired iFFT decoder structure with attention networks, achieving 40–75 tokens/second. Its design validates that Vocos-style decoders with attention can produce high-quality output from single-codebook representations.

---

## Training recipe for the frozen-encoder scenario

Based on patterns validated across DisCoder, MelCap, LFSC, HH-Codec, and the X-Codec 2.0 modification, here is a concrete training recipe:

**Data preparation**: Use HiFi-TTS2 (31.7k hours at 44.1kHz+, available on HuggingFace as nvidia/hifitts-2). Filter by the dataset's bandwidth estimation metadata for recordings with genuine high-frequency content (>20kHz estimated bandwidth). Even **100–500 hours of filtered high-bandwidth data is sufficient** — AP-BWE achieved SOTA with just 44 hours of VCTK (SNAC, for comparison, used 2,730 hours). Create training pairs by encoding 48kHz audio through the frozen encoder+FSQ at 16kHz, then training the modified decoder to reconstruct the original 48kHz target.

**Loss function recipe** (ordered by importance):

- Multi-resolution mel-spectrogram L1 loss across several mel-bin counts (e.g., 32, 64, 128, 256, 512) and multiple STFT configurations — this is the primary reconstruction signal
- Multi-resolution STFT loss (spectral convergence + log magnitude) at configurations like (2048,512), (1024,256), (512,128), (256,64)
- GAN adversarial loss (least-squares formulation) from all discriminators
- Feature matching loss (L1 on intermediate discriminator features)
- If operating in spectral domain: amplitude spectrum MSE + anti-wrapping phase losses (instantaneous phase error, group delay continuity, instantaneous frequency continuity)
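As a concrete reference point, the multi-resolution STFT term can be sketched as follows (a minimal version for mono waveforms, using the STFT configurations listed above):

```python
import torch
import torch.nn.functional as F

def _stft_mag(x: torch.Tensor, n_fft: int, hop: int) -> torch.Tensor:
    window = torch.hann_window(n_fft, device=x.device)
    return torch.stft(x, n_fft, hop, window=window, return_complex=True).abs()

def multi_res_stft_loss(pred: torch.Tensor, target: torch.Tensor,
                        configs=((2048, 512), (1024, 256), (512, 128), (256, 64))):
    """Spectral convergence + log-magnitude L1, averaged over resolutions.

    pred/target: mono waveforms of shape (samples,).
    """
    total = torch.tensor(0.0)
    for n_fft, hop in configs:
        P, T = _stft_mag(pred, n_fft, hop), _stft_mag(target, n_fft, hop)
        sc = torch.linalg.norm(T - P) / torch.linalg.norm(T).clamp(min=1e-8)
        log_mag = F.l1_loss(P.clamp(min=1e-7).log(), T.clamp(min=1e-7).log())
        total = total + sc + log_mag
    return total / len(configs)
```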

**Discriminator configuration**: Use MPD with periods [2,3,5,7,11] for harmonic structure, complex multi-scale STFT discriminator (from DAC/SNAC) for high-frequency and phase fidelity, and optionally multi-resolution amplitude/phase discriminators from AP-BWE if predicting spectral coefficients. The `auraloss` library provides ready-to-use multi-resolution STFT loss implementations.

**Training schedule** (three phases):

1. **Warmup (0–20k steps)**: Reconstruction losses only (no GAN). Learning rate 2×10⁻⁴ for decoder with AdamW. This stabilizes the modified output dimensions before adversarial training.
2. **GAN activation (20k–100k steps)**: Gradually introduce discriminator losses. Use separate Adam optimizer for discriminators (lr=2×10⁻⁴). EnCodec's gradient balancer mechanism (where each loss weight defines its fraction of the overall gradient) helps with loss balancing.
3. **Full training (100k–400k steps)**: All losses active. Exponential LR decay (~0.999996 per step). SNAC demonstrates stable training without gradient clipping when using depthwise convolutions.

**Fine-tuning vs. training from scratch**: Fine-tune from the existing 16kHz Vocos decoder weights. The lower transformer layers encode semantic and structural information that transfers directly. Use a lower learning rate (1×10⁻⁵) for pre-trained backbone layers and higher (1×10⁻⁴) for the new/modified ISTFTHead projection. DisCoder, MelCap, and the X-Codec 2.0 modification all validate this staged approach of freezing encoder+VQ and fine-tuning the decoder.
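The two-rate fine-tuning setup with per-step decay maps directly onto optimizer parameter groups (a sketch; the `nn.Linear` modules below are placeholders standing in for the pre-trained backbone and the resized head):

```python
import torch
import torch.nn as nn

# Placeholders for the pre-trained backbone and the new/modified ISTFT head.
backbone = nn.Linear(512, 512)
istft_head = nn.Linear(512, 1922)

# Discriminative learning rates: conservative for transferred weights,
# 10x higher for the freshly initialized projection.
optimizer = torch.optim.AdamW([
    {"params": backbone.parameters(), "lr": 1e-5},
    {"params": istft_head.parameters(), "lr": 1e-4},
])

# Per-step exponential decay from the training schedule above.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.999996)
```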

---

## Prioritized implementation roadmap

Based on likelihood of success, implementation complexity, and evidence strength, here is the recommended order of attempts:

**Tier 1 — Try first (highest confidence, lowest risk)**

1. **Direct ISTFTHead rescaling to 24kHz** (n_fft=2048, hop=480). Directly validated by X-Codec 2.0 modification paper. Initialize via weight interpolation. Fine-tune 100–200k steps. Expected result: +0.29 MOS or better over 16kHz baseline. RTF ~0.001 on GPU.

2. **AP-BWE as post-hoc module** for 16kHz→48kHz. Pre-existing open-source code and methodology. Train on codec output → 48kHz target pairs. RTF ~0.003 on GPU, ~0.055 on CPU. Completely modular — no changes to existing codec.

**Tier 2 — Try if Tier 1 quality is insufficient**

3. **Multi-band ISTFTHead (4 bands) for 48kHz**. Each sub-band at 12kHz with PQMF synthesis. Reuse existing 16kHz head architecture where possible. Enables sub-band-specific discriminators. More architectural work but better frequency decomposition for 48kHz.

4. **MusicHiFi-BWE-style separate upsampler**. HiFi-GAN generator with 3× upsampling + skip connection from input. ~1786× real-time. Simple to implement and proven for generative model post-processing.

**Tier 3 — For maximum quality when latency budget allows**

5. **FLowHigh single-step flow matching** for mel-spectrogram super-resolution + Vocos/BigVGAN vocoder. SOTA metrics on benchmarks. RTF ~0.07 — tight but within budget.

6. **Consistency model distillation** of a diffusion-based audio SR model. No dedicated implementation exists yet, but CoMoSpeech demonstrates the approach achieves RTF ~0.007 for TTS — the same distillation framework applies to super-resolution.

**Tier 4 — Experimental / longer-term**

7. **Controllable multi-resolution decoder** with sample-rate conditioning embedding (extending Vocos's existing bandwidth_id mechanism). Train a single decoder that outputs 16/24/32/48kHz based on a conditioning signal. Requires more training data diversity but offers maximum deployment flexibility.

8. **VQ-based upsampling** where a small transformer predicts 48kHz codec tokens from 16kHz codec tokens. Interesting but autoregressive prediction creates a latency bottleneck unless non-autoregressive methods (masked prediction) are used.
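The sample-rate conditioning idea in option 7 can be sketched as a learned embedding added to the backbone's input features (hypothetical names; Vocos's actual `bandwidth_id` mechanism works analogously):

```python
import torch
import torch.nn as nn

SR_IDS = {16000: 0, 24000: 1, 32000: 2, 48000: 3}

class SampleRateConditioning(nn.Module):
    """Adds a learned per-sample-rate embedding to frame-level features."""
    def __init__(self, dim: int, num_rates: int = 4):
        super().__init__()
        self.embed = nn.Embedding(num_rates, dim)

    def forward(self, x: torch.Tensor, target_sr: int) -> torch.Tensor:
        # x: (batch, frames, dim); one embedding broadcast across all frames
        sr_id = torch.tensor(SR_IDS[target_sr], device=x.device)
        return x + self.embed(sr_id)
```

A shared backbone conditioned this way would still need one ISTFT head (or head configuration) per output rate, since hop and n_fft differ per rate.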

---

## Conclusion

The frozen-encoder constraint is not the obstacle it might appear. Multiple 2024–2025 systems demonstrate that decoder-only modifications can bridge 16kHz encoding to 24–48kHz output with high fidelity. **The most important architectural insight is that Vocos's ISTFT-based design makes sample-rate changes nearly trivial** — the backbone operates at a fixed frame rate regardless of target sample rate, and only the output projection and STFT parameters change. This is fundamentally easier than modifying transposed-convolution-based decoders (like EnCodec's SEANet) where stride patterns are deeply embedded in the architecture.

For immediate deployment, the combination of **direct ISTFTHead rescaling to 24kHz** (for the best quality-latency tradeoff) plus **AP-BWE as a modular 24→48kHz stage** (if 48kHz is required) offers the lowest implementation risk with proven results. For a unified 48kHz solution, the multi-band ISTFT approach with PQMF synthesis provides the cleanest architecture. The field is converging on the view that bandwidth extension is better handled inside the decoder via increased capacity than as a separate post-processing stage — but the modular approach remains a valid engineering choice when decoder modifications are costly or risky.