(a) Overview of UniScene. Given BEV layouts, UniScene facilitates versatile data generation, including semantic occupancy, multi-view video, and LiDAR point clouds, through an occupancy-centric hierarchical modeling approach. (b) Performance comparison on different generation tasks. UniScene delivers substantial improvements over SOTA methods in video, LiDAR, and occupancy generation.
TL;DR: The first unified framework for generating three key data forms — semantic occupancy, video, and LiDAR — in driving scenes.
Generating high-fidelity, controllable, and annotated training data is critical for autonomous driving. Existing methods typically generate a single data form directly from a coarse scene layout, which not only fails to provide the rich data forms required by diverse downstream tasks but also struggles to model the direct layout-to-data distribution. In this paper, we introduce UniScene, the first unified framework for generating three key data forms — semantic occupancy, video, and LiDAR — in driving scenes. UniScene employs a progressive generation process that decomposes the complex task of scene generation into two hierarchical steps: (a) first generating semantic occupancy from a customized scene layout as a meta scene representation rich in both semantic and geometric information, and then (b) conditioned on occupancy, generating video and LiDAR data via two novel transfer strategies: Gaussian-based Joint Rendering and Prior-guided Sparse Modeling. This occupancy-centric approach reduces the generation burden, especially for intricate scenes, while providing detailed intermediate representations for the subsequent generation stages. Extensive experiments demonstrate that UniScene outperforms previous SOTAs in occupancy, video, and LiDAR generation, and that its outputs benefit downstream driving tasks.
- [2025/03]: Check out our other recent works on generative world models: MuDG, DiST-4D, and HERMES.
- [2025/03]: Code and pre-trained weights are released.
- [2025/02]: Paper is accepted to CVPR 2025.
- [2024/12]: Paper is available on arXiv.
- [2024/12]: Demo is released on the Project Page.
a. We recommend creating the environment with Poetry; our setup has been tested on NVIDIA A100/A800 GPUs with CUDA 12.1 and Python 3.9:
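The setup script itself is not reproduced here; the following is a minimal sketch, assuming the repository ships a pyproject.toml at its root and that PyTorch is among the locked dependencies:

pip install -U poetry              # install Poetry itself if it is not already available
poetry env use python3.9           # create a virtual environment (requires a python3.9 interpreter on PATH)
poetry install                     # resolve and install the locked dependencies
poetry run python -c "import torch; print(torch.version.cuda)"   # sanity-check the CUDA 12.1 build (assumes PyTorch is installed)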
b. Pretrained Models. Install the huggingface-cli tool and download the pretrained weights with:
pip install -U huggingface_hub
huggingface-cli download --resume-download Arlolo0/UniScene_path --local-dir $local_path
Model | OneDrive | Hugging Face
---|---|---
Occupancy Model | Occupancy-OneDrive | Occ-VAE / Occ-DiT
LiDAR Model | LiDAR-OneDrive | LiDAR-HF
Video Model | Video-OneDrive | Video-HF
a. Download all Trainval splits of the Full dataset (v1.0) following the official instructions and place them in the local folder "./data/nuscenes" (a download-and-extract sketch follows the directory tree below):
$./data/nuscenes
├── samples
├── sweeps
├── ...
└── v1.0-trainval
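A minimal download-and-extract sketch, assuming the archive names used on the official nuScenes download page (adjust if the names differ):

mkdir -p ./data/nuscenes
tar -xzf v1.0-trainval_meta.tgz -C ./data/nuscenes            # metadata tables (v1.0-trainval)
for i in $(seq -w 1 10); do tar -xzf v1.0-trainval${i}_blobs.tgz -C ./data/nuscenes; done   # samples/ and sweeps/ sensor blobs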
b. Download the interpolated 12Hz annotations and mmdet3d meta data ("nuscenes_interp_12Hz_infos_*.pkl") and place them in "data/nuscenes_mmdet3d-12Hz/".
c. (Optional) Prepare 12Hz 3D occupancy labels from LiDAR and 3D bounding-box labels. Note that we use NKSR reconstruction to produce GT occupancy with a pc-range of [-50, -50, -5, 50, 50, 3], aligned with OpenOccupancy. The pc-range of the occupancy generated by our occupancy model is set to [-40, -40, -1, 40, 40, 5.4] for comparison with Occ3D baselines.
To generate ground-truth semantic occupancy at a resolution of 200×200×16:
cd ./data_process && python generate_occ.py --config_path config-200.yaml --save_path $GT_nksr_occupancy_200/
To generate ground-truth semantic occupancy at a resolution of 800×800×64:
cd ./data_process && python generate_occ.py --config_path config-800.yaml --save_path $GT_nksr_occupancy_800/
Note that this will take up about 3.1 TB of hard disk space.
d. (Optional) Prepare 12Hz BEV layouts.
cd ./occupancy_gen/12hz_processing && python save_bevlayout_12hz.py \
--py-config="./config/save_step2_me.py" \
--work-dir="./ckpt/VAE/"
Note that the road lines from the BEV layouts are projected onto the semantic occupancy, integrating the corresponding semantic information. For more details, please refer to our UniScene paper.
- Overall running instructions:
You can symlink the "./data" folder into each generation subfolder for convenient access (a concrete example follows the command below):
ln -s ./data $sub_folder
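A minimal sketch of this step, assuming the generation subfolders used in the steps below (occupancy_gen, lidar_gen, video_gen) are the intended targets; an absolute source path keeps the symlinks valid regardless of where they are resolved:

for sub in occupancy_gen lidar_gen video_gen; do ln -sfn "$(pwd)/data" "./${sub}/data"; done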
a. Our framework starts by generating semantic occupancy from the given BEV layouts:
cd ./occupancy_gen
python save_bev_layout.py \
    --py-config="./config/save_step2_me.py" \
    --work-dir="./ckpt/VAE/"
bash ./run_eval_dit.sh
b. Generate LiDAR point clouds from the semantic occupancy:
cd ./lidar_gen
python tools/test.py --save_to_file $output_lidar_path
c. Generate multi-view videos from the semantic occupancy:
cd ./video_gen
python ./gs_render/render_eval_condition_gt.py \
    --occ_path $input_occ_path \
    --layout_path $input_bev_path \
    --render_path $output_render_path \
    --vis
python inference_video.py --occ_data_root $output_render_path --save $output_video_path
- More details on Occupancy, LiDAR, and Video generation. We leverage Mayavi to visualize 3D occupancy and LiDAR point clouds (an installation sketch follows the items below):
- Occupancy Generation. Please refer to Occupancy_vis for visualization.
- LiDAR Generation. Please refer to LiDAR_vis for visualization.
- Video Generation. The generated video will be saved at "./video_gen/outputs".
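If Mayavi is not already part of your environment, a minimal installation sketch (standard PyPI package names, not specific to this repository; Mayavi needs a Qt backend such as PyQt5 for interactive 3D windows):

pip install mayavi PyQt5           # assumption: pip-based install inside the project environment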
- Our implementation is based on the excellent open-source projects OccWorld, OpenPCDet, and SVD.
This repository is released under the Apache 2.0 license as found in the LICENSE file.
If you find our paper and code useful for your research, please consider citing us and giving a star to our repository:
@article{li2024uniscene,
title={UniScene: Unified Occupancy-centric Driving Scene Generation},
author={Li, Bohan and Guo, Jiazhe and Liu, Hongsi and Zou, Yingshuang and Ding, Yikang and Chen, Xiwu and Zhu, Hu and Tan, Feiyang and Zhang, Chi and Wang, Tiancai and others},
journal={arXiv preprint arXiv:2412.05435},
year={2024}
}