The official PyTorch implementation of the paper "Light-T2M: A Lightweight and Fast Model for Text-to-motion Generation".
We tested our code using Python 3.10.14, PyTorch 2.2.2, CUDA 12.1, and NVIDIA RTX 3090 GPUs.
conda create -n light-t2m python==3.10.14
conda activate light-t2m
# install pytorch
pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu121
# install requirements
pip install -r requirements.txt
# install mamba
cd mamba && pip install -e .
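As a quick sanity check of the environment (a minimal sketch, assuming the bundled `mamba` package installs under the module name `mamba_ssm`):

```python
import torch

# Confirm the CUDA build of PyTorch matches the tested setup.
print("torch:", torch.__version__)            # expected: 2.2.2
print("cuda available:", torch.cuda.is_available())
print("cuda version:", torch.version.cuda)    # expected: 12.1

# Confirm the locally installed Mamba extension can be imported.
try:
    import mamba_ssm
    print("mamba_ssm imported successfully")
except ImportError as err:
    print("mamba_ssm failed to import:", err)
```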
We conduct experiments on the HumanML3D and KIT-ML datasets. Both datasets can be prepared by following the instructions in [HumanML3D](https://github.com/EricGuo5513/HumanML3D).
Then copy both datasets into this repository. For example, the HumanML3D directory should be organized as follows:
./data/HumanML3D/
├── new_joint_vecs/
├── texts/
├── Mean.npy # same as in [HumanML3D](https://github.com/EricGuo5513/HumanML3D)
├── Std.npy # same as in [HumanML3D](https://github.com/EricGuo5513/HumanML3D)
├── train.txt
├── val.txt
├── test.txt
├── train_val.txt
└── all.txt
To speed up data loading during training, we convert the datasets into .npy files using the following commands:
python src/tools/data_preprocess.py --dataset hml3d
python src/tools/data_preprocess.py --dataset kit
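For reference, a minimal sketch of how `Mean.npy` and `Std.npy` are typically used (assuming the standard HumanML3D convention of z-normalizing the 263-dimensional `new_joint_vecs` features; the sample id below is a placeholder):

```python
import numpy as np

data_root = "./data/HumanML3D"
mean = np.load(f"{data_root}/Mean.npy")   # per-dimension mean of the motion features
std = np.load(f"{data_root}/Std.npy")     # per-dimension standard deviation

# "000021" is a placeholder id; use any id listed in train.txt.
motion = np.load(f"{data_root}/new_joint_vecs/000021.npy")   # (num_frames, 263) for HumanML3D

normalized = (motion - mean) / std        # normalization applied before training
recovered = normalized * std + mean       # inverse transform applied to generated motions
```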
We train our Light-T2M model on two RTX 3090 GPUs; the guidance-related flags are explained briefly after the commands below.
- HumanML3D
python src/train.py trainer.devices=\"0,1\" logger=wandb data=hml3d_light_final \
data.batch_size=128 data.repeat_dataset=5 trainer.max_epochs=600 \
callbacks/model_checkpoint=t2m +model/lr_scheduler=cosine model.guidance_scale=4 \
model.noise_scheduler.prediction_type=sample trainer.precision=bf16-mixed
- KIT-ML
python src/train.py trainer.devices=\"2,3\" logger=wandb data=kit_light_final \
data.batch_size=128 data.repeat_dataset=5 trainer.max_epochs=1000 \
callbacks/model_checkpoint=t2m +model/lr_scheduler=cosine model.guidance_scale=4 \
model.noise_scheduler.prediction_type=sample trainer.precision=bf16-mixed
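The `model.guidance_scale=4` and `model.noise_scheduler.prediction_type=sample` options used above correspond to classifier-free guidance with a denoiser that predicts the clean motion directly. A minimal, illustrative sketch of such a guidance step (the `denoiser`, `text_emb`, and `null_emb` names are placeholders, not this repo's API):

```python
def guided_sample_prediction(denoiser, x_t, t, text_emb, null_emb, guidance_scale=4.0):
    """Illustrative classifier-free guidance step with prediction_type='sample':
    blend the text-conditioned and unconditional predictions of the clean motion x_0."""
    x0_cond = denoiser(x_t, t, text_emb)    # text-conditioned prediction
    x0_uncond = denoiser(x_t, t, null_emb)  # unconditional (empty-text) prediction
    return x0_uncond + guidance_scale * (x0_cond - x0_uncond)
```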
Set `model.metrics.enable_mm_metric` to `True` to evaluate Multimodality. Setting it to `False` can speed up the evaluation.
- HumanML3D
python src/eval.py trainer.devices=\"0,\" data=hml3d_light_final data.test_batch_size=128 \
model=light_final \
model.guidance_scale=4 model.noise_scheduler.prediction_type=sample \
model.denoiser.stage_dim=\"256\*4\" \
ckpt_path=\"checkpoints/hml3d.ckpt\" model.metrics.enable_mm_metric=true
- KIT-ML
We have observed that the performance of our trained model may fluctuate. Additionally, when we retrained the model on the KIT-ML dataset, we achieved improved performance with a new checkpoint (checkpoints/kit_new.ckpt).
python src/eval.py trainer.devices=\"1,\" data=kit_light_final data.test_batch_size=128 \
model=light_final \
model.guidance_scale=4 model.noise_scheduler.prediction_type=sample \
model.denoiser.stage_dim=\"256\*4\" \
ckpt_path=\"checkpoints/kit.ckpt\" model.metrics.enable_mm_metric=true
# or
python src/eval.py trainer.devices=\"1,\" data=kit_light_final data.test_batch_size=128 \
model=light_final \
model.guidance_scale=4 model.noise_scheduler.prediction_type=sample \
model.denoiser.stage_dim=\"256\*4\" \
ckpt_path=\"checkpoints/kit_new.ckpt\" model.metrics.enable_mm_metric=true
CUDA_VISIBLE_DEVICES=0 python src/test_speed.py +trainer.benchmark=true model.noise_scheduler.prediction_type=sample
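If you want to time inference yourself, a minimal GPU-timing sketch (the `run_once` callable is a placeholder for whatever generation call you benchmark, not part of this repo):

```python
import torch

def average_latency_ms(run_once, n_warmup=10, n_runs=50):
    """Average latency in milliseconds of a zero-argument GPU callable, via CUDA events."""
    for _ in range(n_warmup):          # warm-up runs exclude one-time startup costs
        run_once()
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(n_runs):
        run_once()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / n_runs
```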
python src/sample_motion.py device=\"0\" \
model.guidance_scale=4 model.noise_scheduler.prediction_type=sample \
text="A person walking and changing their path to the left." length=100
Download and unzip the rendering dependencies from here, then place them in the `./visual_datas/` directory.
pip install imageio bpy matplotlib smplx h5py git+https://github.com/mattloper/chumpy imageio-ffmpeg
CUDA_VISIBLE_DEVICES=0 python -W ignore visualize/blend_render.py --file_dir ./visual_datas/gen_joints --mode video --down_sample 1 --motion_list gen_motion_1 gen_motion_1
If you find this project or the paper useful in your research, please cite us:
@inproceedings{light-t2m,
title={Light-T2M: A Lightweight and Fast Model for Text-to-motion Generation},
author={Zeng, Ling-An and Huang, Guohong and Wu, Gaojie and Zheng, Wei-Shi},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
year={2025}
}
Thanks to all open-source projects and libraries that supported our research:
T2M, MLD, T2M-GPT, TEMOS, FLAME, MoMask, Mamba
This project is licensed under the MIT License.
Note that our code depends on other libraries, including SMPL, SMPL-X, and PyTorch3D, and uses datasets, each of which has its own license that must also be followed.