Official repository for our paper:
VulScribeR: Exploring RAG-based Vulnerability Augmentation with LLMs
If you find this project useful in your research, please consider citing:
@article{daneshvar2024exploringragbasedvulnerabilityaugmentation,
title={Exploring RAG-based Vulnerability Augmentation with LLMs},
author={Seyed Shayan Daneshvar and Yu Nong and Xu Yang and Shaowei Wang and Haipeng Cai},
year={2024},
eprint={2408.04125},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2408.04125},
}
Bigvul_train,
Bigvul_test,
Bigvul_val
Reveal,
Devign,
PrimeVul (RQ4 only)
VGX Full dataset,
VulGen Full dataset (from the VGX paper)
All pair matchings (except for RQ4), including those for mutation and the random ones for RQ2
RQ4's pair matching/retriever output
Filtered Datasets for RQs (1-3),
Unfiltered Datasets for RQs (1-3),
Unfiltered Datasets for RQ4
The unfiltered datasets contain samples from the Generator that have not gone through the Verification phase. They also include extra metadata showing which clean_vul pair was used for each generation, plus the vulnerable (vul) lines.
Go to the models directory; the README for each model explains how to use it.
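To see how that extra metadata could be used, here is a minimal sketch that groups generated samples by the clean_vul pair they came from. It assumes the samples are stored as JSON Lines and that the metadata sits under fields named `clean_vul_pair` and `vul_lines`; both the file format and the field names are hypothetical, so check the actual dataset files and adjust them.

```python
import json
from collections import defaultdict
from pathlib import Path

# Hypothetical field names: replace with the keys used in the real files.
PAIR_KEY = "clean_vul_pair"
VUL_LINES_KEY = "vul_lines"


def load_jsonl(path):
    """Read one JSON object per non-empty line (assumes a JSON Lines file)."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def group_by_pair(records, pair_key=PAIR_KEY):
    """Map each clean_vul pair to the list of samples generated from it.

    The pair value is serialised with json.dumps so that lists or dicts
    can be used as dictionary keys.
    """
    groups = defaultdict(list)
    for rec in records:
        key = json.dumps(rec.get(pair_key), sort_keys=True)
        groups[key].append(rec)
    return dict(groups)


def count_vul_lines(record, vul_lines_key=VUL_LINES_KEY):
    """Return how many vulnerable lines a sample lists (0 if the field is absent)."""
    return len(record.get(vul_lines_key) or [])
```

With a real file, `group_by_pair(load_jsonl("path/to/unfiltered.jsonl"))` (the path is a placeholder) would show how many samples each pair produced before the Verification phase removed any of them.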