📑 Paper | 👩💻 Github | 🤗 Model | 🤗 Dataset
DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment
Self-generated data is what you need for developing general-purpose LALMs!
- 🧪 A new training framework (read the paper)
  - Highly scalable and efficient; no task-specific instruction-tuning data required
  - Preserves language ability and avoids catastrophic forgetting
  - Comprehensive studies on data quality in LALM development
- 📦 Open resources for the community
  - Model checkpoints and training scripts
  - DeSTA-AQA5M dataset (5M audio-text pairs from 7,000 hours of audio)
- 🚧 Coming soon: vLLM-based data construction script and detailed fine-tuning tutorials
- 2025/07/23: Released training scripts. Now you can train your own DeSTA-Audio. 📘 Training README
- 2025/07/21: Released the DeSTA-AQA5M dataset! 📘 Dataset README 🤗 DeSTA-AQA5M
- 2025/07/10: Inference code and model checkpoints are live! 🤗 DeSTA2.5-Audio
- 2025/07/03: DeSTA2.5-Audio paper is on arXiv! 📄 arXiv:2507.02768
| Document | Description |
|---|---|
| Quickstart | Quickly set up and run the DeSTA-Audio model. |
| docs/train.md | Instructions and scripts for training the DeSTA-Audio model. |
| docs/dataset.md | Information about the DeSTA-AQA5M dataset. |
| docs/evaluation_tips.md | Tips for evaluating DeSTA2.5-Audio. |
```bash
git clone https://github.com/kehanlu/DeSTA2.5-Audio.git
cd DeSTA2.5-Audio
pip install -e .
```
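To confirm the editable install succeeded, check that the package imports cleanly:

```bash
python -c "from desta import DeSTA25AudioModel; print('DeSTA2.5-Audio is ready')"
```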
```python
from desta import DeSTA25AudioModel

# Load the model from Hugging Face
model = DeSTA25AudioModel.from_pretrained("DeSTA-ntu/DeSTA2.5-Audio-Llama-3.1-8B")
model.to("cuda")

# Run inference with audio input
messages = [
    {
        "role": "system",
        "content": "Focus on the audio clips and instructions."
    },
    {
        "role": "user",
        "content": "<|AUDIO|>\nDescribe this audio.",
        "audios": [{
            "audio": "/path/to/audio.wav",  # path to your audio file
            "text": None
        }]
    }
]

outputs = model.generate(
    messages=messages,
    do_sample=False,
    top_p=1.0,
    temperature=1.0,
    max_new_tokens=512
)

print(outputs.text)
```
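To process several clips, the same `generate` call can simply be looped. Below is a minimal sketch reusing the API above; the file paths and instructions are placeholders:

```python
# Hypothetical batch loop; paths and instructions are placeholders.
clips = [
    ("/path/to/speech.wav", "Transcribe the speech."),
    ("/path/to/music.wav", "What instruments do you hear?"),
]

for path, instruction in clips:
    messages = [
        {"role": "system", "content": "Focus on the audio clips and instructions."},
        {
            "role": "user",
            "content": f"<|AUDIO|>\n{instruction}",
            "audios": [{"audio": path, "text": None}],
        },
    ]
    outputs = model.generate(
        messages=messages,
        do_sample=False,
        top_p=1.0,
        temperature=1.0,
        max_new_tokens=512,
    )
    print(f"{path}: {outputs.text}")
```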
See docs/dataset.md for more details.
| Dataset | Response Generated From | HuggingFace ID | Preview |
|---|---|---|---|
| DeSTA-AQA5M | Llama3.1-8B-Instruct | DeSTA-ntu/DeSTA-AQA5M-FROM-Llama3.1-8B-Instruct | 🔍 |
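The release can also be pulled directly from the Hugging Face Hub. Below is a minimal sketch using the `datasets` library; the `train` split name and the field layout are assumptions, so inspect the first record before relying on them (see docs/dataset.md for the authoritative format):

```python
from datasets import load_dataset

# Hub ID from the table above; the "train" split name is an assumption.
ds = load_dataset("DeSTA-ntu/DeSTA-AQA5M-FROM-Llama3.1-8B-Instruct", split="train")

print(ds)     # number of rows and column names
print(ds[0])  # inspect one audio-text pair
```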
See docs/train.md for more details.
```bash
bash example/train/train_example.sh
```
```bibtex
@article{lu2025desta25Audio,
  title={DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment},
  author={Lu, Ke-Han and Chen, Zhehuai and Fu, Szu-Wei and Yang, Chao-Han Huck and Huang, Sung-Feng and Yang, Chih-Kai and Yu, Chee-En and Chen, Chun-Wei and Chen, Wei-Chih and Huang, Chien-yu and others},
  journal={arXiv preprint arXiv:2507.02768},
  year={2025}
}

@inproceedings{lu2025developing,
  title={Developing instruction-following speech language model without speech instruction-tuning data},
  author={Lu, Ke-Han and Chen, Zhehuai and Fu, Szu-Wei and Yang, Chao-Han Huck and Balam, Jagadeesh and Ginsburg, Boris and Wang, Yu-Chiang Frank and Lee, Hung-yi},
  booktitle={ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--5},
  year={2025},
  organization={IEEE}
}

@inproceedings{lu24c_interspeech,
  title={DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment},
  author={Lu, Ke-Han and Chen, Zhehuai and Fu, Szu-Wei and Huang, He and Ginsburg, Boris and Wang, Yu-Chiang Frank and Lee, Hung-yi},
  booktitle={Interspeech 2024},
  pages={4159--4163},
  year={2024},
  doi={10.21437/Interspeech.2024-457},
  issn={2958-1796}
}
```
Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, Chao-Han Huck Yang, Sung-Feng Huang, Chih-Kai Yang, Chee-En Yu, Chun-Wei Chen, Wei-Chih Chen, Chien-yu Huang, Yi-Cheng Lin, Yu-Xiang Lin, Chi-An Fu, Chun-Yi Kuan, Wenze Ren, Xuanjun Chen, Wei-Ping Huang, En-Pei Hu, Tzu-Quan Lin, Yuan-Kuei Wu, Kuan-Po Huang, Hsiao-Ying Huang, Huang-Cheng Chou, Kai-Wei Chang, Cheng-Han Chiang, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-yi Lee