| description | Generate per-slide narration audio with AI-recommended voice selection, then optionally re-export PPTX with embedded audio |
|---|---|
Standalone post-export step. Run when the user asks for "生成音频" (generate audio) / "录制旁白" (record narration) / "narrated PPT" / "video export with voice", or proactively offer it after a deck is exported. Produces one audio file per slide via `edge-tts` by default, or a cloud TTS provider (elevenlabs/minimax/qwen/cosyvoice) when the user chooses high-quality narration or a cloned voice, then optionally re-exports a video-ready PPTX with audio embedded and per-slide auto-advance timings.

This workflow is independent: it reads `notes/*.md` and queries the selected TTS voice catalog, with no upstream conversation context required. Safe to invoke in a fresh session.
- `notes/total.md` exists and has been split into per-page files at `notes/*.md` (post-processing Step 7.1 done).
- Default mode: `edge-tts` is installed (`python3 -m pip install edge-tts`).
- The workflow is page-level only: one notes file becomes one audio file. Do not use a single long audio track or attempt automatic long-audio splitting.
- PPT narration assets must be PowerPoint-reliable audio: `m4a` (AAC), `mp3`, or `wav`. The built-in TTS path defaults to `mp3`; provider formats such as `pcm`, `opus`, or `flac` must be transcoded before embedding.
- PowerPoint recorded narration export requires `ffprobe` so slide timings can be written from actual audio duration.
- High-quality cloud mode: provider API key is set before use:
  - ElevenLabs: `ELEVENLABS_API_KEY`
  - MiniMax: `MINIMAX_API_KEY`
  - Qwen: `QWEN_API_KEY` or `DASHSCOPE_API_KEY`
  - CosyVoice: `COSYVOICE_API_KEY` or `DASHSCOPE_API_KEY`
  - Keys may live in the current process environment or the first `.env` found in this order: current working directory, skill directory (e.g. `~/.agents/skills/ppt-master/.env`), clone repo root, `~/.ppt-master/.env`
- The deck is in a single dominant language (mixed-language decks: pick the dominant one — the AI uses judgment, not a heuristic).
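
The key lookup described in the prerequisites can be sketched roughly as below. This is a minimal sketch, not the code in `notes_to_audio.py`: `read_env_file` and `resolve_api_key` are hypothetical helpers, and it assumes that the process environment takes precedence over `.env` files and that only the first `.env` found is read.

```python
import os
import tempfile
from pathlib import Path


def read_env_file(path):
    """Parse simple KEY=VALUE lines; skip blank lines and # comments."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_api_key(names, search_dirs, environ=None):
    """Return the first matching key, or None if nothing is configured.

    Assumption: the environment is checked before any .env file, and only
    the first .env that exists in `search_dirs` is consulted.
    """
    environ = os.environ if environ is None else environ
    for name in names:
        if environ.get(name):
            return environ[name]
    for directory in search_dirs:
        env_path = Path(directory) / ".env"
        if env_path.is_file():
            file_values = read_env_file(env_path)
            for name in names:
                if file_values.get(name):
                    return file_values[name]
            return None  # stop at the first .env found
    return None
```

For Qwen, a call might look like `resolve_api_key(["QWEN_API_KEY", "DASHSCOPE_API_KEY"], [Path.cwd(), skill_dir, repo_root, Path.home() / ".ppt-master"])`, where `skill_dir` and `repo_root` are placeholders for the directories named in the list above.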

If `notes/*.md` are missing, run `total_md_split.py <project_path>` first.
The AI already knows the deck's language from writing the notes. No detection script needed.
- Identify the primary language from the notes content: `zh` / `en` / `ja` / `ko` / etc.
- For mixed-language decks (e.g. Chinese with English technical terms), pick the language the audience will hear most of.
- For Chinese specifically: pick the locale based on context: `zh-CN` (mainland Mandarin, default), `zh-TW` (Taiwanese Mandarin), or `zh-HK` (Cantonese). Ask the user only if the project context doesn't make it clear.
Default to edge unless the user explicitly asks for a cloud provider / higher-quality cloud narration / a cloned voice.
edge backend:

```bash
python3 skills/ppt-master/scripts/notes_to_audio.py --list-voices --locale <locale>
```

ElevenLabs backend:

```bash
python3 skills/ppt-master/scripts/notes_to_audio.py --provider elevenlabs --list-voices
```

Cloud providers using explicit voice IDs/names:

```bash
python3 skills/ppt-master/scripts/notes_to_audio.py --provider minimax --list-voices
python3 skills/ppt-master/scripts/notes_to_audio.py --provider qwen --list-voices
python3 skills/ppt-master/scripts/notes_to_audio.py --provider cosyvoice --list-voices
```

The output is a flat list of all available voices for the selected provider. From this list, the AI picks 3–6 candidates to recommend, applying these rules:
- Cover both genders when both exist for the locale.
- For edge: prefer `COMMON_VOICES`-listed voices (curated set inside `notes_to_audio.py`) when the locale has them; they are battle-tested.
- For ElevenLabs: prefer voices already present in the user's account; if the user provides a specific `voice_id`, do not override it.
- For MiniMax / Qwen / CosyVoice: if the user provides a cloned `voice_id`, use it directly. Do not attempt voice cloning inside the narration workflow.
- Match the deck's tone and pick the strongest recommendation based on style:
  - Consultant / data-driven / earnings report (财报) → steady male voice (e.g. `zh-CN-YunjianNeural`) or clear female voice (e.g. `zh-CN-XiaoxiaoNeural`)
  - General / teaching / product introduction → bright female voice / young male voice (e.g. `zh-CN-XiaoyiNeural` / `zh-CN-YunxiNeural`)
  - Launch event / broadcast → broadcast-style male voice (e.g. `zh-CN-YunyangNeural`)
  - English consultant deck → `en-US-GuyNeural` (steady) or `en-US-JennyNeural` (clear)
  - Japanese / Korean → pick from `ja-JP-*` / `ko-KR-*` neural voices, mark gender + tone

For each candidate, write a one-line Chinese description covering gender · tone · suitable scenarios (性别 · 调性 · 适用场景). For cloud providers, include the voice name/ID exactly as it must be passed to `--voice-id`.
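
One way to apply the "cover both genders, prefer curated voices" rules mechanically is sketched below. It assumes each voice is a dict with `ShortName`, `Gender`, and `Locale` keys (an assumed shape for the listing output), and `preferred` is a hypothetical stand-in for the curated `COMMON_VOICES` set. Tone matching is still left to the AI's judgment.

```python
def pick_candidates(voices, locale, preferred=(), limit=5):
    """Pick up to `limit` voices for a locale, curated names first, alternating genders."""
    in_locale = [v for v in voices if v["Locale"] == locale]
    # Stable sort: curated voices move to the front, original order is kept otherwise.
    ranked = sorted(in_locale, key=lambda v: v["ShortName"] not in preferred)
    by_gender = {}
    for voice in ranked:
        by_gender.setdefault(voice["Gender"], []).append(voice)
    picked = []
    # Round-robin across genders so both appear whenever both exist.
    while len(picked) < limit and any(by_gender.values()):
        for group in by_gender.values():
            if group and len(picked) < limit:
                picked.append(group.pop(0))
    return [v["ShortName"] for v in picked]


SAMPLE_VOICES = [
    {"ShortName": "zh-CN-XiaoxiaoNeural", "Gender": "Female", "Locale": "zh-CN"},
    {"ShortName": "zh-CN-XiaoyiNeural", "Gender": "Female", "Locale": "zh-CN"},
    {"ShortName": "zh-CN-YunjianNeural", "Gender": "Male", "Locale": "zh-CN"},
    {"ShortName": "zh-CN-YunxiNeural", "Gender": "Male", "Locale": "zh-CN"},
    {"ShortName": "en-US-GuyNeural", "Gender": "Male", "Locale": "en-US"},
]

print(pick_candidates(SAMPLE_VOICES, "zh-CN", preferred={"zh-CN-YunjianNeural"}, limit=3))
```

The round-robin keeps the shortlist balanced even when one gender dominates the catalog; the final ⭐ recommendation should still come from the tone rules above.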
Send a single message to the user that asks all four questions at once (generation mode, voice, rate/style, embedding) and provides a recommended value for each. Do NOT split into multiple rounds.
Cloned-voice fast path: if the user mentioned a cloned voice / 克隆音色 / 复刻音色 / "my own voice" along with a voice_id, skip the voice-recommendation list — set the provider to whichever the user named (elevenlabs / minimax / qwen / cosyvoice), pin the voice_id they gave you, and only confirm rate + embed-or-not.
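Message template (translate to the user's chat language if different):

Detected that the notes' primary language is <language> (locale: `<locale>`). Based on the deck's tone (<style>), I recommend the following configuration:

Generation mode: ⭐ Recommended `<edge|elevenlabs|minimax|qwen|cosyvoice>` (reason: <one sentence, e.g. "no setup required, stable generation" or "the user asked for a high-quality cloud voice">).

Voice:
- [1] - <gender · tone · suitable scenarios> ⭐ Recommended
- [2] - <gender · tone · suitable scenarios>
- [3] - <gender · tone · suitable scenarios>
- [4] - <gender · tone · suitable scenarios>
- [5] - <gender · tone · suitable scenarios>
- You can also enter any other ShortName from the list.

Rate/style parameters: ⭐ Recommended `<rate or provider defaults>` (reason: <one sentence, e.g. "2–3 sentences per page on average, so normal speed sounds most stable" or "ElevenLabs default voice settings best preserve the voice's original character">).

Re-export a PPTX with embedded audio after generation: ⭐ Recommended: yes (done in one pass, with slide durations set automatically from audio length).

Reply "OK" to accept all recommended values, or tell me what to change (e.g. "voice 2, rate -5%" or "use MiniMax voice_id xxx").
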
Recommended-value rules:
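- Generation mode: default to `edge`; when the user explicitly wants a high-quality cloud voice or provides a cloud voice ID, use the provider they specify: `elevenlabs` / `minimax` / `qwen` / `cosyvoice`.
- Voice: pick the Step 2 candidate that best matches the deck's tone.
- Rate: edge defaults to `+0%`; for dense notes (more than 4 long sentences per page on average) suggest `-5%`; for short, compact notes suggest `+5%`; anything outside this range needs a stated reason. Cloud providers use provider defaults unless the user explicitly asks to change the rate or style.
- Embedding: recommend "yes" by default, unless the user already has a customized PPTX they do not want overwritten.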
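
The edge rate rule above can be approximated with a small heuristic. This is a sketch under stated assumptions: `count_sentences` and `recommend_rate` are hypothetical helpers, sentence length is ignored, and the `short_below=2` threshold for "short, compact notes" is a guess, since the rule only fixes the `>4` boundary. Splitting on `.` also miscounts decimals such as `3.5`, so treat the result as a starting point.

```python
import re

# Chinese and Western sentence-ending punctuation.
SENTENCE_END = re.compile(r"[。!?.!?]+")


def count_sentences(text):
    """Count non-empty chunks separated by sentence-ending punctuation."""
    return sum(1 for chunk in SENTENCE_END.split(text) if chunk.strip())


def recommend_rate(pages, dense_above=4, short_below=2):
    """Map average sentences per page to an edge rate string."""
    if not pages:
        return "+0%"
    average = sum(count_sentences(page) for page in pages) / len(pages)
    if average > dense_above:
        return "-5%"  # dense notes: slow down slightly
    if average < short_below:
        return "+5%"  # short notes: speed up slightly
    return "+0%"
```

`pages` is expected to be the text of each per-page notes file, for example `[p.read_text(encoding="utf-8") for p in sorted(notes_dir.glob("*.md"))]` with `total.md` excluded.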
Run sequentially — do NOT bundle:

```bash
# 1A. Generate audio with edge (default)
python3 skills/ppt-master/scripts/notes_to_audio.py <project_path> \
  --voice <chosen-ShortName> --rate <chosen-rate>

# 1B. Or generate audio with ElevenLabs
python3 skills/ppt-master/scripts/notes_to_audio.py <project_path> \
  --provider elevenlabs --voice-id <chosen-voice-id> \
  --elevenlabs-model eleven_multilingual_v2

# 1C. Or generate audio with MiniMax
# Defaults to the China endpoint; set MINIMAX_TTS_BASE_URL=https://api.minimax.io/v1/t2a_v2 for overseas access.
python3 skills/ppt-master/scripts/notes_to_audio.py <project_path> \
  --provider minimax --voice-id <chosen-voice-id> \
  --minimax-model speech-2.8-hd

# 1D. Or generate audio with Qwen TTS
python3 skills/ppt-master/scripts/notes_to_audio.py <project_path> \
  --provider qwen --voice-id <chosen-voice> \
  --qwen-model qwen3-tts-flash --qwen-language-type Chinese

# 1E. Or generate audio with CosyVoice
python3 skills/ppt-master/scripts/notes_to_audio.py <project_path> \
  --provider cosyvoice --voice-id <chosen-voice> \
  --cosyvoice-model cosyvoice-v3-flash

# 2. (If user kept embedding) Re-export PPTX with audio embedded
python3 skills/ppt-master/scripts/svg_to_pptx.py <project_path> \
  --recorded-narration audio
```

If `notes_to_audio.py` errors with a missing dependency or missing provider API key, fix the prerequisite and re-run; do NOT swallow the error.
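
If the two steps are ever driven from a script rather than typed by hand, the "run sequentially, do not swallow errors" rule might be sketched as follows. `run_steps` is a hypothetical helper, not part of the skill; it relies on `subprocess.run(..., check=True)` to raise on the first non-zero exit so later steps never run on top of a failed one.

```python
import subprocess
import sys


def run_steps(commands):
    """Run each command in order; stop with an exception on the first failure."""
    completed = 0
    for command in commands:
        # check=True raises CalledProcessError instead of hiding a non-zero exit.
        subprocess.run(command, check=True)
        completed += 1
    return completed
```

Each entry in `commands` would be an argument list such as `[sys.executable, "skills/ppt-master/scripts/notes_to_audio.py", project_path, "--voice", voice, "--rate", rate]`, followed by the `svg_to_pptx.py` call only when the user kept embedding.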
--recorded-narration audio prepares PowerPoint's recorded timings and narrations: every slide must have a matching supported audio file, every duration must be readable by ffprobe, and object animations must not use --animation-trigger on-click. Use after-previous or with-previous for narrated/video export.
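
A preflight check for the "every slide must have a matching supported audio file" requirement could look like the sketch below. It assumes, hypothetically, that each audio file shares the stem of its notes file (the actual naming used by `notes_to_audio.py` may differ) and that `total.md` is not a slide. `ffprobe_duration_command` only builds a standard ffprobe invocation that prints the duration in seconds; whether `svg_to_pptx.py` calls ffprobe this way is not confirmed here.

```python
import tempfile
from pathlib import Path

# Formats the prerequisites list as PowerPoint-reliable.
SUPPORTED_AUDIO = {".m4a", ".mp3", ".wav"}


def find_narration_gaps(notes_dir, audio_dir):
    """Return notes stems that lack a supported audio file with the same stem."""
    audio_stems = {
        p.stem for p in Path(audio_dir).iterdir()
        if p.suffix.lower() in SUPPORTED_AUDIO
    }
    note_stems = sorted(
        p.stem for p in Path(notes_dir).glob("*.md") if p.stem != "total"
    )
    return [stem for stem in note_stems if stem not in audio_stems]


def ffprobe_duration_command(audio_path):
    """Build an ffprobe call that prints only the container duration in seconds."""
    return [
        "ffprobe", "-v", "error",
        "-show_entries", "format=duration",
        "-of", "default=noprint_wrappers=1:nokey=1",
        str(audio_path),
    ]
```

An empty list from `find_narration_gaps` means every notes page has usable audio; any stem it returns points at a slide whose audio is missing or still in a format such as `flac` that needs transcoding first.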
Output one summary block listing:
- Number of audio files generated and their location (`<project_path>/audio/*`).
- The provider, voice, and rate/settings actually used.
- (If embedded) the new narrated PPTX path under `<project_path>/exports/`.
- (If skipped embedding) one-line hint on how to embed later: `python3 skills/ppt-master/scripts/svg_to_pptx.py <project_path> --recorded-narration audio`.