[Draft] Parakeet export#1977
- C++ implementation: NemotronModel + NemotronState with 3-session RNNT pipeline (encoder, decoder, joint) and greedy decode
- Model type registration: nemotron_asr added to ALM types
- Config parsing: encoder/decoder I/O names, audio parameters
- Export tooling: ONNX export, tokenizer conversion, graph fusion + INT4 quantization (3.6x encoder size reduction)
- E2E tests: dummy audio + real speech validation
- Documentation: architecture overview and usage guide
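The greedy decode over the encoder, decoder, and joint sessions can be illustrated with a minimal Python sketch. This is not the PR's C++ code: `greedy_rnnt_decode`, the toy stand-ins, the blank index 0, and the per-frame symbol cap are all assumptions made for illustration. The point it shows is that the decoder (prediction network) only advances when a non-blank token is emitted, while a blank moves on to the next encoder frame.

```python
import numpy as np

BLANK_ID = 0      # assumption: blank token index
MAX_SYMBOLS = 10  # assumption: cap on tokens emitted per encoder frame


def greedy_rnnt_decode(enc_frames, decoder_step, joint, init_state,
                       blank_id=BLANK_ID, max_symbols=MAX_SYMBOLS):
    """Greedy RNNT decode: argmax of the joint output, frame by frame."""
    tokens = []
    # Prime the decoder with the blank token as a start-of-sequence prompt.
    dec_out, state = decoder_step(blank_id, init_state)
    for frame in enc_frames:
        for _ in range(max_symbols):
            logits = joint(frame, dec_out)
            tok = int(np.argmax(logits))
            if tok == blank_id:
                break  # blank: keep decoder state, move to the next frame
            tokens.append(tok)
            # Non-blank: advance the decoder with the emitted token.
            dec_out, state = decoder_step(tok, state)
    return tokens, state


# Toy stand-ins for the decoder and joint sessions (hypothetical).
def toy_decoder_step(token, state):
    # "Embedding" is just the token id; state counts decoder calls.
    return np.array([token]), state + 1


def toy_joint(frame, dec_out):
    # Emit blank once the frame's best token has already been emitted.
    if int(dec_out[0]) == int(np.argmax(frame)):
        blank = np.zeros_like(frame)
        blank[BLANK_ID] = 1.0
        return blank
    return frame


frames = np.eye(4)[[2, 0, 3, 3]]  # four one-hot "encoder frames"
tokens, final_state = greedy_rnnt_decode(frames, toy_decoder_step, toy_joint, 0)
```

In a real pipeline each of `decoder_step` and `joint` would wrap an inference session call; the loop structure stays the same.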
- Add RunStreamingEncoder with cache carry-forward (MHA + causal conv).
- Add GreedyDecodeIncremental for per-chunk RNNT decode.
- Auto-detect streaming mode via encoder ONNX input probing.
- Batch mode fallback supports per-chunk re-inference.
- Export script: --streaming flag wraps encoder with cache I/O.
- Streaming encoder: 5 inputs (audio + length + 3 caches), 5 outputs.
- Cache shapes: channel [B,24,70,1024], time [B,24,1024,8], len [B].
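The cache carry-forward described above can be sketched in Python with NumPy. The cache shapes come from the commit message; everything else is an assumption for illustration: the tensor names are borrowed from the config-parser commit, the dtypes (float32 caches, int64 lengths) are guesses, the [B, T, F] chunk layout is hypothetical, and `fake_encoder` merely stands in for the streaming encoder session.

```python
import numpy as np

# Shapes from the commit message: channel [B,24,70,1024], time [B,24,1024,8], len [B].
N_LAYERS, CHANNEL_CTX, D_MODEL, TIME_CTX = 24, 70, 1024, 8


def make_initial_caches(batch_size):
    """Zero-initialised caches for the first chunk (dtypes are assumptions)."""
    return {
        "cache_last_channel": np.zeros(
            (batch_size, N_LAYERS, CHANNEL_CTX, D_MODEL), dtype=np.float32),
        "cache_last_time": np.zeros(
            (batch_size, N_LAYERS, D_MODEL, TIME_CTX), dtype=np.float32),
        "cache_last_channel_len": np.zeros((batch_size,), dtype=np.int64),
    }


def run_streaming(chunks, encoder_fn, batch_size=1):
    """Feed chunks one by one, carrying the 3 cache outputs into the next call."""
    caches = make_initial_caches(batch_size)
    outputs = []
    for chunk in chunks:
        length = np.full((batch_size,), chunk.shape[1], dtype=np.int64)
        # 5 inputs (audio, length, 3 caches) -> 5 outputs.
        encoded, encoded_len, ch, tm, ln = encoder_fn(chunk, length, **caches)
        caches = {
            "cache_last_channel": ch,
            "cache_last_time": tm,
            "cache_last_channel_len": ln,
        }
        outputs.append((encoded, encoded_len))
    return outputs, caches


def fake_encoder(audio, length, cache_last_channel, cache_last_time,
                 cache_last_channel_len):
    """Hypothetical stand-in: echoes the audio and bumps every cache by one."""
    return (audio, length, cache_last_channel + 1.0, cache_last_time + 1.0,
            cache_last_channel_len + 1)


chunks = [np.zeros((1, 16, 4), dtype=np.float32) for _ in range(3)]
outputs, caches = run_streaming(chunks, fake_encoder)
```

With a real ONNX encoder, `encoder_fn` would wrap the session call and map the `*_next` outputs back onto the cache inputs in the same way.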
forward_for_export() handles the [B, n_layers, ...] <-> [n_layers, B, ...] transposition internally. The wrapper was incorrectly adding a second transpose, which caused a RuntimeError in multi_head_attention during export. Fix: remove the transpose calls from StreamingEncoderWrapper.forward(). ONNX I/O consistently uses the [B, n_layers, ...] format for caches. Also: clarify the cache format comments in nemotron.h/cpp.
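The double-transpose problem can be reproduced in miniature with NumPy. This is a sketch under assumptions, not the NeMo code: `inner_forward` is a hypothetical stand-in for forward_for_export(), the trailing cache dimensions are shrunk to keep it light, and the stand-in raises a ValueError where the real export hit a RuntimeError inside multi_head_attention.

```python
import numpy as np

BATCH, N_LAYERS = 2, 24


def inner_forward(cache_batch_major):
    """Stand-in for forward_for_export(): expects [B, n_layers, ...]."""
    # Internal transposition to [n_layers, B, ...] for per-layer processing.
    layer_major = np.swapaxes(cache_batch_major, 0, 1)
    if layer_major.shape[0] != N_LAYERS:
        raise ValueError(
            f"expected {N_LAYERS} layers on axis 0, got {layer_major.shape[0]}")
    # Transpose back so callers always see [B, n_layers, ...].
    return np.swapaxes(layer_major, 0, 1)


def buggy_wrapper(cache_batch_major):
    # Bug: the wrapper transposes too, so the inner call sees [n_layers, B, ...].
    return inner_forward(np.swapaxes(cache_batch_major, 0, 1))


def fixed_wrapper(cache_batch_major):
    # Fix: pass the cache through untouched; the inner call owns the transpose.
    return inner_forward(cache_batch_major)


cache = np.zeros((BATCH, N_LAYERS, 4, 8), dtype=np.float32)
fixed_out = fixed_wrapper(cache)

try:
    buggy_wrapper(cache)
    buggy_failed = False
except ValueError:
    buggy_failed = True
```

Keeping the layout conversion in exactly one place means the exported graph's cache inputs and outputs share one format, which is what the C++ side relies on.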
- export_nemotron_to_onnx.py: add generate_genai_config() and generate_audio_processor_config() that extract model dimensions, I/O names, and audio params from the loaded NeMo model
- optimize_encoder.py: annotate genai_config.json with optimization metadata (fusion type, quantization method) when INT4 is applied
…d quantization
- Mel replay buffer: save mel chunks during blank periods, replay after encoder+decoder reset to recover lost audio instead of hallucinating
- Encoder cache reset: zero all 3 encoder cache tensors when stuck detector triggers (2+ consecutive blank chunks), not just decoder LSTM state
- Hybrid decoder reset: reset decoder state on first blank chunk, encoder caches on second+ blank, then replay buffered mel data
- k_quant_mixed quantization: mixed-precision INT4 that preserves FP32 for sensitive layers (attention Q/K/V/Out, first/last encoder, pre_encode)
- HQQ quantization option: Half-Quadratic Quantization support
- optimize_encoder.py: --quant_method flag (rtn|k_quant_mixed|hqq), sensitive node detection, external data filename preservation
- generators.cpp: graceful stop via shouldStop flag, drain-on-stop, CommitAudio + polling for clean shutdown
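The hybrid reset and mel replay logic above can be sketched as a small state machine. This is one reading of the commit message, not the PR's C++ implementation: the `StuckRecovery` class, the action strings, and the choice to clear the buffer after each replay are hypothetical. It resets the decoder on the first blank chunk, and on the second and later consecutive blank chunks it also requests an encoder cache reset and hands back the buffered mel chunks for replay.

```python
from dataclasses import dataclass, field


@dataclass
class StuckRecovery:
    """Tracks consecutive blank chunks and decides what to reset (a sketch)."""
    blank_chunks: int = 0
    mel_buffer: list = field(default_factory=list)

    def on_chunk(self, mel_chunk, emitted_tokens):
        """Return (actions, replay) for one decoded chunk."""
        if emitted_tokens:
            # Speech was decoded: nothing is stuck, drop any buffered audio.
            self.blank_chunks = 0
            self.mel_buffer.clear()
            return [], []

        # Blank chunk: remember its mel data in case it must be replayed.
        self.blank_chunks += 1
        self.mel_buffer.append(mel_chunk)

        if self.blank_chunks == 1:
            # First blank chunk: reset only the decoder state.
            return ["reset_decoder"], []

        # Second or later blank chunk: also zero the encoder caches, then
        # replay everything buffered since the blank period started.
        replay = list(self.mel_buffer)
        self.mel_buffer.clear()  # assumption: each chunk is replayed once
        return ["reset_decoder", "reset_encoder_caches"], replay


recovery = StuckRecovery()
step1 = recovery.on_chunk("mel0", [5])   # tokens emitted
step2 = recovery.on_chunk("mel1", [])    # first blank chunk
step3 = recovery.on_chunk("mel2", [])    # second blank chunk
step4 = recovery.on_chunk("mel3", [7])   # speech resumes
```

Returning actions instead of performing them keeps the policy separate from the session and tensor handling, which makes the thresholds easy to test in isolation.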
…ig parser
Support streaming cache-aware encoder inputs (cache_last_channel, cache_last_time, cache_last_channel_len) and outputs (*_next variants) in genai_config.json parsing. Also add a sink for the optimization section so that encoder.optimization metadata is consumed silently.
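A rough Python sketch of the parsing behaviour described above, for illustration only. The JSON layout (`encoder.inputs`, `encoder.outputs`, and the fields inside `encoder.optimization`) is a hypothetical stand-in for the real genai_config.json schema, and `parse_encoder_section` is not a function from the PR. It collects the three cache inputs and their `*_next` outputs, and reads the optimization section without acting on it.

```python
import json

CACHE_INPUTS = ("cache_last_channel", "cache_last_time", "cache_last_channel_len")

# Hypothetical config fragment; the real schema may differ.
SAMPLE_CONFIG = """
{
  "encoder": {
    "inputs": {
      "cache_last_channel": "cache_last_channel",
      "cache_last_time": "cache_last_time",
      "cache_last_channel_len": "cache_last_channel_len"
    },
    "outputs": {
      "cache_last_channel_next": "cache_last_channel_next",
      "cache_last_time_next": "cache_last_time_next",
      "cache_last_channel_len_next": "cache_last_channel_len_next"
    },
    "optimization": {"quantization": "k_quant_mixed"}
  }
}
"""


def parse_encoder_section(encoder_cfg):
    """Pick out streaming cache I/O names; ignore optimization metadata."""
    inputs = encoder_cfg.get("inputs", {})
    outputs = encoder_cfg.get("outputs", {})
    cache_in = {n: inputs[n] for n in CACHE_INPUTS if n in inputs}
    cache_out = {n + "_next": outputs[n + "_next"]
                 for n in CACHE_INPUTS if n + "_next" in outputs}
    # Sink: the optimization section is read and dropped, never an error.
    encoder_cfg.get("optimization")
    return {
        "streaming": len(cache_in) == len(CACHE_INPUTS),
        "cache_inputs": cache_in,
        "cache_outputs": cache_out,
    }


parsed = parse_encoder_section(json.loads(SAMPLE_CONFIG)["encoder"])
```

A config without the cache entries yields `streaming` set to False, which mirrors the idea of falling back to batch mode when the encoder has no cache inputs.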
"""

import argparse
import os
Code scanning / CodeQL notice: Unused import
print(f" ✓ Full sequence ({len(token_list)} tokens): {token_list[:20]}{'...' if len(token_list) > 20 else ''}")

# Decode tokens to text (skip the first token which is the dummy BOS)
decoded_ids = np.array(token_list[1:], dtype=np.int32) # Skip dummy BOS
Code scanning / CodeQL notice: Unused local variable (test)
resampler = torchaudio.transforms.Resample(sr, 16000)
waveform_t = resampler(waveform_t)
waveform_np = waveform_t.squeeze(0).numpy()
sr = 16000
Code scanning / CodeQL notice: Unused local variable (test)

if __name__ == "__main__":
success = main()
Code scanning / CodeQL notice: Unused global variable (test)
Force-pushed: 4fb3784 to f456a3d
Force-pushed: f456a3d to 79c5025
Force-pushed: 79c5025 to f9160cd
import argparse
import json
import os
import shutil
Code scanning / CodeQL notice: Unused import
No description provided.