Platform | Link |
---|---|
YouTube | |
Bilibili (哔哩哔哩) | |
📖 Screen recording videos showing how to use the training and generation scripts for Higgs Audio v2.
Data Processing and Training Guide
- New language training
- Experimental feature: DDP support. For details, please refer to DDP_training.sh (a conceptual sketch follows after this list)
- Optimized input parameters and removed unnecessary, misleading parameters
- Adopted the official data classes
- Supports LoRA training; 16 GB of GPU memory is sufficient
- Provides a mini training set; feel free to use it
- Multi-speaker training
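For reference, DDP training is typically launched with torchrun, and inside the trainer the model is wrapped in DistributedDataParallel. Below is a minimal, self-contained sketch of that pattern; it is not the repository's actual trainer code, and DDP_training.sh remains the authoritative launch script.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Launch with e.g.: torchrun --nproc_per_node=2 ddp_sketch.py
    dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")
    model = torch.nn.Linear(16, 16).to(device)  # placeholder model standing in for Higgs Audio
    model = DDP(model, device_ids=[local_rank] if torch.cuda.is_available() else None)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    batch = torch.randn(8, 16, device=device)   # placeholder batch
    loss = model(batch).pow(2).mean()
    loss.backward()
    optimizer.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()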
git clone https://github.com/JimmyMa99/train-higgs-audio.git
cd train-higgs-audio
conda create -n higgs_audio_env python=3.10
conda activate higgs_audio_env
pip install -r requirements_train.txt
pip install -e .
First, prepare your audio and text data in the required format.
ms-swift data format:
{"messages": [{"role": "assistant", "content": "<think>描述了今天天气真不错"}], "audios": ["/xxx/x.wav"]}
Run the conversion script:
python tools/convert_jsonl_to_higgs.py \
--jsonl_files /path/to/audio.jsonl \
--output_dir ./higgs_training_data \
--copy_audio True
This produces data in the following format:
higgs_training_data/
├── metadata.json # Overall metadata file of the dataset
├── huo_speaker_000001.wav # Audio file 1 of speaker "huo"
├── huo_speaker_000001.txt # Text transcription corresponding to the audio
├── huo_speaker_000002.wav # Audio file 2 of speaker "huo"
├── huo_speaker_000002.txt # Text transcription corresponding to the audio
├── ... # More audio/text files of "huo_speaker"
├── huo_speaker_000051.wav # Audio file 51 of speaker "huo"
├── huo_speaker_000051.txt # Text transcription corresponding to the audio
├── huo_speaker_000052.wav # Audio file 52 of speaker "huo"
├── huo_speaker_000052.txt # Text transcription corresponding to the audio
└── ... # More audio/text files of "huo_speaker"
metadata.json format:
{
"dataset_info": {
"total_samples": 2797,
"speakers": [
"huo_speaker"
],
"languages": [
"zh"
],
"total_duration": 12173.9,
"avg_duration": 4.35,
"created_from": [
"/root/code/new_work_code/HI-TransPA/swfit_workdir/fresh-little-lemon-workspace/data/swift_format/huo_audio.jsonl"
]
},
"samples": [
{
"id": "huo_speaker_000000",
"audio_file": "huo_speaker_000000.wav",
"transcript_file": "huo_speaker_000000.txt",
"duration": 3.86,
"speaker_id": "huo_speaker",
"speaker_name": "Huo",
"scene": "recording_system",
"emotion": "alerting",
"ref_audio_file": If you need a reference tone color, please add this field, which will take effect under the "zero_shot_voice_cloning" model. 如果你是需要有参考音色,请加入此字段,这会在"zero_shot_voice_cloning"模型下生效
"language": "zh",
"gender": "unknown",
"quality_score": 1.0,
"original_audio_path": "audio_splits_huo/14_cropped_with_audio_line000001_vid00_f7b81293.wav",
"user_instruction": "<audio> /translate",
"task_type": "audio_generation"
},
{
"id": "huo_speaker_000001",
"audio_file": "huo_speaker_000001.wav",
"transcript_file": "huo_speaker_000001.txt",
"duration": 3.2,
"speaker_id": "huo_speaker",
"speaker_name": "Huo",
"scene": "quiet_room",
"emotion": "questioning",
"ref_audio_file": If you need a reference tone color, please add this field, which will take effect under the "zero_shot_voice_cloning" model. 如果你是需要有参考音色,请加入此字段,这会在"zero_shot_voice_cloning"模型下生效
"language": "zh",
"gender": "unknown",
"quality_score": 1.0,
"original_audio_path": "audio_splits_huo/126_cropped_with_audio_line000002_vid00_66220ae5.wav",
"user_instruction": "<audio> /translate",
"task_type": "audio_generation"
}
]
}
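Before training, it can help to sanity-check the converted dataset. Below is a small sketch based on the metadata format shown above; the directory name follows the earlier example and is an assumption about your local layout.

import json
from pathlib import Path

data_dir = Path("higgs_training_data")
metadata = json.loads((data_dir / "metadata.json").read_text(encoding="utf-8"))
info, samples = metadata["dataset_info"], metadata["samples"]
assert info["total_samples"] == len(samples), "total_samples does not match the sample list"
missing = [s[key] for s in samples for key in ("audio_file", "transcript_file")
           if not (data_dir / s[key]).exists()]
print(f"{len(samples)} samples, {info['total_duration']:.1f}s of audio, "
      f"speakers: {info['speakers']}, missing files: {len(missing)}")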
Please make sure to modify all parameters before training, including the data path, model path, number of training epochs, etc.
python trainer/trainer.py
Fine-tuning with LoRA requires the --use_lora flag, for example:
python trainer/trainer.py --use_lora
Note that when fine-tuning new voices with LoRA, the model may fail to produce normal output. So far this issue has been observed when transfer fine-tuning on Vietnamese, and it is not yet clear whether it is a training problem or something else. Based on past experience, when training the model on knowledge it has never been exposed to, full fine-tuning (--use_lora False) works better.
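For context, the --use_lora flag presumably attaches low-rank adapters instead of updating all weights, which is why 16 GB of memory suffices. The rough illustration below uses Hugging Face PEFT with a small stand-in model; the rank, alpha, and target modules are assumptions, not the values used by trainer/trainer.py.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")  # small stand-in model for illustration
lora_config = LoraConfig(
    r=16,                       # assumed rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights remain trainable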
bash merge_model.sh \
--base_model_path xxx \
--lora_adapter_path xxx \
--output_path xxx \
--compare_models \
--test_input "A custom sentence for testing."
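Conceptually, merging folds the trained LoRA adapter back into the base weights so the result can be loaded as a standalone model. merge_model.sh wraps the repository's own merge logic; the sketch below only illustrates the general idea with PEFT, and all paths are placeholders.

from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("path/to/base_model")          # placeholder path
merged = PeftModel.from_pretrained(base, "path/to/lora_adapter").merge_and_unload()
merged.save_pretrained("path/to/merged_model")                             # standalone merged weights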
bash generate.sh
To intuitively show the difference between generated sounds and real sounds, the following table contains directly playable audio files:
Since the data I have is speech from hearing-impaired individuals, for the comparison I selected a hearing-impaired speaker's recording as the real voice and a generated version of the same utterance as the generated voice.
Text | Real recording (user's recording) | Generated recording (script output) |
---|---|---|
大家好,我是火君,我居住在上海 | Play/download (huojun.MP3) | Play/download (huojun_gen.wav) |
我爱机智流,机智流是最好的开源社区 | Play/download (smartflowai.MP3) | Play/download (smartflowai_gen.wav) |
tôi cũng như là những người lính như | Play/download (vn_demo.MP3) | Play/download (vn_gen.wav) |
Comparison before and after training (no reference audio used)
Text | Before training | After training |
---|---|---|
你好,我是火君 | Play/download (huojun.MP3) | Play/download (huojun_gen.wav) |
We are open-sourcing Higgs Audio v2, a powerful audio foundation model pretrained on over 10 million hours of audio data and a diverse set of text data. Despite having no post-training or fine-tuning, Higgs Audio v2 excels in expressive audio generation, thanks to its deep language and acoustic understanding.
On EmergentTTS-Eval, it achieves win rates of 75.7% and 55.7% over "gpt-4o-mini-tts" on the "Emotions" and "Questions" categories, respectively. It also obtains state-of-the-art performance on traditional TTS benchmarks like Seed-TTS Eval and Emotional Speech Dataset (ESD). Moreover, the model demonstrates capabilities rarely seen in previous systems, including generating natural multi-speaker dialogues in multiple languages, automatic prosody adaptation during narration, melodic humming with the cloned voice, and simultaneous generation of speech and background music.
Here's the demo video that shows some of its emergent capabilities (remember to unmute):
open_source_repo_demo.mp4
Here's another demo video that showcases the model's multilingual capability and how it enabled live translation (remember to unmute):
live_translation_v4_compressed.mp4
We recommend using NVIDIA Deep Learning Containers to manage the CUDA environment. The following are two Docker images that we have verified:
- nvcr.io/nvidia/pytorch:25.02-py3
- nvcr.io/nvidia/pytorch:25.01-py3
Here's an example command for launching a docker container environment. Please also check the official NVIDIA documentation.
docker run --gpus all --ipc=host --net=host --ulimit memlock=-1 --ulimit stack=67108864 -it --rm nvcr.io/nvidia/pytorch:25.02-py3 bash
git clone https://github.com/boson-ai/higgs-audio.git
cd higgs-audio
pip install -r requirements.txt
pip install -e .
git clone https://github.com/boson-ai/higgs-audio.git
cd higgs-audio
python3 -m venv higgs_audio_env
source higgs_audio_env/bin/activate
pip install -r requirements.txt
pip install -e .
git clone https://github.com/boson-ai/higgs-audio.git
cd higgs-audio
conda create -n higgs_audio_env python=3.10
conda activate higgs_audio_env
pip install -r requirements.txt
pip install -e .
git clone https://github.com/boson-ai/higgs-audio.git
cd higgs-audio
uv venv --python 3.10
source .venv/bin/activate
uv pip install -r requirements.txt
uv pip install -e .
For advanced usage with higher throughput, we also built an OpenAI-compatible API server backed by the vLLM engine. Please refer to examples/vllm for more details.
Tip
For optimal performance, run the generation examples on a machine equipped with a GPU with at least 24 GB of memory!
Here's a basic python snippet to help you get started.
from boson_multimodal.serve.serve_engine import HiggsAudioServeEngine, HiggsAudioResponse
from boson_multimodal.data_types import ChatMLSample, Message, AudioContent
import torch
import torchaudio
MODEL_PATH = "bosonai/higgs-audio-v2-generation-3B-base"
AUDIO_TOKENIZER_PATH = "bosonai/higgs-audio-v2-tokenizer"
system_prompt = (
"Generate audio following instruction.\n\n<|scene_desc_start|>\nAudio is recorded from a quiet room.\n<|scene_desc_end|>"
)
messages = [
Message(
role="system",
content=system_prompt,
),
Message(
role="user",
content="The sun rises in the east and sets in the west. This simple fact has been observed by humans for thousands of years.",
),
]
device = "cuda" if torch.cuda.is_available() else "cpu"
serve_engine = HiggsAudioServeEngine(MODEL_PATH, AUDIO_TOKENIZER_PATH, device=device)
output: HiggsAudioResponse = serve_engine.generate(
chat_ml_sample=ChatMLSample(messages=messages),
max_new_tokens=1024,
temperature=0.3,
top_p=0.95,
top_k=50,
stop_strings=["<|end_of_text|>", "<|eot_id|>"],
)
torchaudio.save("output.wav", torch.from_numpy(output.audio)[None, :], output.sampling_rate)
We also provide a list of examples under examples. In the following we highlight a few examples to help you use Higgs Audio v2.
Generate audio that sounds similar to the provided reference audio.
python3 examples/generation.py \
--transcript "The sun rises in the east and sets in the west. This simple fact has been observed by humans for thousands of years." \
--ref_audio belinda \
--temperature 0.3 \
--out_path generation.wav
The generation script will automatically use cuda:0 if it finds that CUDA is available. To change the device ID, specify --device_id:
python3 examples/generation.py \
--transcript "The sun rises in the east and sets in the west. This simple fact has been observed by humans for thousands of years." \
--ref_audio belinda \
--temperature 0.3 \
--device_id 0 \
--out_path generation.wav
You can also try other voices. Check more example voices in examples/voice_prompts. You can also add your own voice to the folder.
python3 examples/generation.py \
--transcript "The sun rises in the east and sets in the west. This simple fact has been observed by humans for thousands of years." \
--ref_audio broom_salesman \
--temperature 0.3 \
--out_path generation.wav
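To add your own voice, the simplest approach is presumably to follow the pattern of the shipped prompts, i.e. a paired audio clip and transcript with the same base name. A sketch (the file names are placeholders; verify against the existing files in examples/voice_prompts):

import shutil
from pathlib import Path

voice_dir = Path("examples/voice_prompts")
shutil.copy("my_recording.wav", voice_dir / "my_voice.wav")   # short, clean reference clip
(voice_dir / "my_voice.txt").write_text("Exact transcript of my_recording.wav.", encoding="utf-8")
# Afterwards: python3 examples/generation.py --ref_audio my_voice ...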
If you do not specify a reference voice, the model will decide the voice based on the transcript it sees.
python3 examples/generation.py \
--transcript "The sun rises in the east and sets in the west. This simple fact has been observed by humans for thousands of years." \
--temperature 0.3 \
--out_path generation.wav
Generate multi-speaker dialog. The model will decide the voices based on the transcript it sees.
python3 examples/generation.py \
--transcript examples/transcript/multi_speaker/en_argument.txt \
--seed 12345 \
--out_path generation.wav
Generate multi-speaker dialog with the voices you picked.
python3 examples/generation.py \
--transcript examples/transcript/multi_speaker/en_argument.txt \
--ref_audio belinda,broom_salesman \
--ref_audio_in_system_message \
--chunk_method speaker \
--seed 12345 \
--out_path generation.wav
Higgs Audio v2 adopts the "generation variant" depicted in the architecture figure above. Its strong performance is driven by three key technical innovations:
- We developed an automated annotation pipeline that leverages multiple ASR models, sound event classification models, and our in-house audio understanding model. Using this pipeline, we cleaned and annotated 10 million hours of audio data, which we refer to as AudioVerse. The in-house understanding model is fine-tuned on top of Higgs Audio v1 Understanding, which adopts the "understanding variant" shown in the architecture figure.
- We trained a unified audio tokenizer from scratch that captures both semantic and acoustic features. Learn more in the tokenizer blog.
- We proposed the DualFFN architecture, which enhances the LLM's ability to model acoustic tokens with minimal computational overhead (a conceptual sketch follows below). See the architecture blog.
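The sketch below illustrates the general idea of a dual-FFN block: audio tokens are routed through a dedicated feed-forward branch while sharing the attention layers with text tokens. It is only a conceptual illustration, not the actual Higgs Audio v2 implementation; see the architecture blog for the real design.

import torch
import torch.nn as nn

class DualFFNBlock(nn.Module):
    """Conceptual transformer block with separate FFNs for text and audio tokens."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.text_ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.audio_ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, is_audio_token):
        # x: (batch, seq, d_model); is_audio_token: (batch, seq) boolean mask
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        h = self.norm2(x)
        # Route each position through the FFN that matches its modality
        # (both branches are evaluated here for simplicity).
        ffn_out = torch.where(is_audio_token.unsqueeze(-1), self.audio_ffn(h), self.text_ffn(h))
        return x + ffn_out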
Here's the performance of Higgs Audio v2 on four benchmarks, Seed-TTS Eval, Emotional Speech Dataset (ESD), EmergentTTS-Eval, and Multi-speaker Eval:
We prompt Higgs Audio v2 with the reference text, reference audio, and target text for zero-shot TTS. We use the standard evaluation metrics from Seed-TTS Eval and ESD.
Model | SeedTTS-Eval WER ↓ | SeedTTS-Eval SIM ↑ | ESD WER ↓ | ESD SIM (emo2vec) ↑ |
---|---|---|---|---|
Cosyvoice2 | 2.28 | 65.49 | 2.71 | 80.48 |
Qwen2.5-omni† | 2.33 | 64.10 | - | - |
ElevenLabs Multilingual V2 | 1.43 | 50.00 | 1.66 | 65.87 |
Higgs Audio v1 | 2.18 | 66.27 | 1.49 | 82.84 |
Higgs Audio v2 (base) | 2.44 | 67.70 | 1.78 | 86.13 |
Following the EmergentTTS-Eval Paper, we report the win-rate over "gpt-4o-mini-tts" with the "alloy" voice. The judge model is Gemini 2.5 Pro.
Model | Emotions (%) ↑ | Questions (%) ↑ |
---|---|---|
Higgs Audio v2 (base) | 75.71% | 55.71% |
gpt-4o-audio-preview† | 61.64% | 47.85% |
Hume.AI | 61.60% | 43.21% |
BASELINE: gpt-4o-mini-tts | 50.00% | 50.00% |
Qwen 2.5 Omni† | 41.60% | 51.78% |
minimax/speech-02-hd | 40.86% | 47.32% |
ElevenLabs Multilingual v2 | 30.35% | 39.46% |
DeepGram Aura-2 | 29.28% | 48.21% |
Sesame csm-1B | 15.96% | 31.78% |
'†' means using the strong-prompting method described in the paper.
We also designed a multi-speaker evaluation benchmark to evaluate the capability of Higgs Audio v2 for multi-speaker dialog generation. The benchmark contains three subsets:

- two-speaker-conversation: 1000 synthetic dialogues involving two speakers. We fix two reference audio clips to evaluate the model's ability in double voice cloning for dialogues ranging from 4 to 10 turns between two randomly chosen personas.
- small talk (no ref): 250 synthetic dialogues curated in the same way as above, but characterized by short utterances and a limited number of turns (4–6). We do not fix reference audios in this case; this set is designed to evaluate the model's ability to automatically assign appropriate voices to speakers.
- small talk (ref): 250 synthetic dialogues similar to the above, but with even shorter utterances, since this set includes reference clips in its context, similar to two-speaker-conversation.
We report the word-error-rate (WER) and the geometric mean between intra-speaker similarity and inter-speaker dis-similarity on these three subsets. Other than Higgs Audio v2, we also evaluated MoonCast and nari-labs/Dia-1.6B-0626, two of the most popular open-source models capable of multi-speaker dialog generation. Results are summarized in the following table. We are not able to run nari-labs/Dia-1.6B-0626 on our "two-speaker-conversation" subset due to its strict limitation on the length of the utterances and output audio.
Model | two-speaker-conversation WER ↓ | two-speaker-conversation Mean Sim & Dis-sim ↑ | small talk WER ↓ | small talk Mean Sim & Dis-sim ↑ | small talk (no ref) WER ↓ | small talk (no ref) Mean Sim & Dis-sim ↑ |
---|---|---|---|---|---|---|
MoonCast | 38.77 | 46.02 | 8.33 | 63.68 | 24.65 | 53.94 |
nari-labs/Dia-1.6B-0626 | - | - | 17.62 | 63.15 | 19.46 | 61.14 |
Higgs Audio v2 (base) | 18.88 | 51.95 | 11.89 | 67.92 | 14.65 | 55.28 |
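For clarity, the similarity column can be read as a single score combining the two desiderata above. Below is a sketch of that combination, assuming dis-similarity is taken as one minus the inter-speaker similarity (how the speaker embeddings are computed is not specified here, and the numbers are placeholders):

import math

def speaker_consistency_score(intra_speaker_sim: float, inter_speaker_sim: float) -> float:
    """Geometric mean of intra-speaker similarity and inter-speaker dis-similarity."""
    inter_speaker_dissim = 1.0 - inter_speaker_sim   # assumed definition of dis-similarity
    return math.sqrt(intra_speaker_sim * inter_speaker_dissim)

print(speaker_consistency_score(intra_speaker_sim=0.80, inter_speaker_sim=0.30))  # placeholder values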
The boson_multimodal/audio_processing/
directory contains code derived from third-party repositories, primarily from xcodec. Please see the LICENSE
in that directory for complete attribution and licensing information.