● Here are the answers for your inference agent:

1. Codec weights location: On this machine at /root/maya-tts/checkpoints/xcodec2/ckpt/epoch=4-step=1400000.ckpt — a single checkpoint file that contains both the encoder and decoder. The decoding.create() and encoding.create() functions both take this same path. There is also a model_config.json in the same directory (/root/maya-tts/checkpoints/xcodec2/ckpt/model_config.json). If your inference agent needs these files, they must be copied or downloaded separately from the LLM checkpoints — they are not in R2's ptcheckpoints bucket.

2. Checkpoint format at ckpt-140000: It is the raw DeepSpeed ZeRO-3 training checkpoint, NOT a converted HF checkpoint. It contains sharded optimizer states (bf16_zero_pp_rank_*_optim_states.pt), sharded model states (zero_pp_rank_*_model_states.pt), plus a consolidated model.safetensors. The model.safetensors + config.json + tokenizer.json should be directly loadable by AutoModelForCausalLM.from_pretrained() without running convert_checkpoint.py. The DeepSpeed shard files are only needed for resuming training.

3. Extended tokenizer: Yes, confirmed. The tokenizer is saved with every checkpoint — tokenizer.json (~28MB) contains the full 193,856-token vocabulary, including all <|s_0|> through <|s_65535|> speech tokens and the 8 control tokens.

4. Codec variant: Stock XCodec2 checkpoint (epoch=4-step=1400000.ckpt), not custom-trained. The codec produces 65,536 discrete codes at 50 tokens/sec from 16 kHz audio.

5. Quality at 140K steps: This is an early checkpoint — the run is currently at step ~155K of 550K (28%). Per-source eval at step 140K showed audio loss ~7.1 and paired loss ~7.0. Text loss is strong (2.3) since the base Llama already knew text. Speech output will likely be intelligible but noisy and repetitive — the model is still learning audio codebook patterns. The pre-SFT gate on a 20-step smoke checkpoint scored nondegenerate_fraction=0.71 (below the 0.9 production threshold).
At 140K steps it should be better than that smoke checkpoint, but don't expect production quality — it is a valid sanity-check checkpoint.

Two corrections to relay:
- Your inference agent noted 1,271,565 max steps; that figure is from an older config. The current production run is 550,000 max steps with BS=12 per GPU.
- The base model is meta-llama/Llama-3.2-1B (base), NOT Llama-3.2-1B-Instruct.
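A minimal sketch of the speech-token layout described above: 65,536 discrete codec codes rendered as `<|s_0|>` through `<|s_65535|>`, produced at 50 tokens per second from 16 kHz audio. The helper names here are illustrative, not from the maya-tts codebase:

```python
import re

# Facts from the discussion above; helper names are hypothetical.
NUM_SPEECH_CODES = 65_536   # discrete codec codes
FRAME_RATE_HZ = 50          # speech tokens per second of audio

def code_to_token(code: int) -> str:
    """Render a codec code index as its special-token string, e.g. 0 -> <|s_0|>."""
    if not 0 <= code < NUM_SPEECH_CODES:
        raise ValueError(f"code {code} out of range [0, {NUM_SPEECH_CODES})")
    return f"<|s_{code}|>"

def token_to_code(token: str) -> int:
    """Parse a speech token string back to its code index."""
    m = re.fullmatch(r"<\|s_(\d+)\|>", token)
    if m is None:
        raise ValueError(f"not a speech token: {token!r}")
    return int(m.group(1))

def audio_seconds(num_speech_tokens: int) -> float:
    """Duration of audio represented by a run of speech tokens, at 50 tok/sec."""
    return num_speech_tokens / FRAME_RATE_HZ
```

For example, a 500-token generation corresponds to 10 seconds of 16 kHz audio under the 50 tok/sec frame rate.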
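Since point 2 says the consolidated files are directly loadable, a hedged sketch of what the inference agent would run, skipping convert_checkpoint.py. The checkpoint path is a placeholder; the transformers calls are standard, but this assumes model.safetensors, config.json, and tokenizer.json all sit in the same directory as described:

```python
# Hypothetical local path to the ckpt-140000 directory; adjust as needed.
CKPT_DIR = "/path/to/ckpt-140000"

def load_model_and_tokenizer(ckpt_dir: str = CKPT_DIR):
    """Load the consolidated HF-format checkpoint (model.safetensors +
    config.json + tokenizer.json) directly with transformers.

    The DeepSpeed shard files (*_optim_states.pt, *_model_states.pt) in the
    same directory are ignored; they are only needed to resume training."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(ckpt_dir)
    model = AutoModelForCausalLM.from_pretrained(ckpt_dir, torch_dtype=torch.bfloat16)
    return model, tokenizer
```

If loading fails, that would indicate the consolidated files are incomplete and convert_checkpoint.py is needed after all; the load call itself is the cheapest way to verify.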