Code release for SciReplicate-Bench: Benchmarking LLMs in Agent-driven Algorithmic Reproduction from Research Papers. [Conference on Language Modeling 2025]
SciReplicate-Bench/
├── utils/                              # Core utilities for SciReproducer and evaluation metrics
│   ├── __init__.py                     # Package initialization file
│   ├── CodeAgentTools.py               # Tools and utilities for the code agent
│   ├── PaperAgentTools.py              # Tools and utilities for the paper agent
│   ├── WebSearch.py                    # Web search functionality
│   ├── utils.py                        # General utility functions
│   └── Reason_Process_ACC.py           # Reasoning graph accuracy calculation
│
├── scripts/                            # Setup and utility scripts
│   └── env.sh                          # Script to extract and set up conda environments
│
├── envs_sci/                           # Extracted conda environments (created by env.sh)
│   ├── ColdFusion/                     # Python environment for ColdFusion paper
│   ├── order/                          # Python environment for order paper
│   ├── gac_env/                        # Python environment for GAC paper
│   └── ...                             # Additional environments (36 total)
│
├── Benchmark/                          # Code repositories for all benchmark papers
│   ├── 0-coldfusion/                   # Source code repository for ColdFusion paper
│   ├── 1-order/                        # Source code repository for order paper
│   ├── 2-gac/                          # Source code repository for GAC paper
│   └── ...                             # Additional repositories (36 total)
│
├── Result/                             # Experiment results and outputs after running the SciReproducer
│   ├── 0/                              # Results for paper 0 (ColdFusion)
│   │   └── SciReproducer_gpt-4o-mini/  # Generated code using GPT-4o-mini
│   │       ├── task1.pickle            # Code output for ColdFusion task 1
│   │       └── ...
│   └── ...                             # Results for all papers
│
├── Data.json                           # Main dataset containing paper information and tasks
├── Evaluation.py                       # Evaluation metrics (CodeBLEU, Execution Accuracy, etc.)
├── SciReproducer.py                    # Main dual-agent framework implementation
├── envs_sci.zip                        # Archive containing all conda environments
└── README.md                           # Project documentation
- Conda/Miniconda installed on your system
- envs_sci.zip file available in the root directory (link)
- Benchmark directory (link; please download and unzip it)
- Hardware Requirements:
- Sufficient disk space for extracting 36 conda environments.
- Ubuntu operating system.
- CUDA Version: 12.2
- GPU: A single NVIDIA A100 (80GB) GPU is required to execute all code repositories associated with the benchmark papers.
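The prerequisites above can be checked before starting the setup. The following is a minimal sketch, not part of the released code: the function name `check_prerequisites` and the `min_free_gb` threshold are hypothetical, and the threshold is a placeholder rather than a figure from this README.

```python
import shutil
from pathlib import Path


def check_prerequisites(root_path, min_free_gb=50.0):
    """Return a list of problems; an empty list means every check passed.

    min_free_gb is an arbitrary placeholder (an assumption), since the
    README only asks for "sufficient" disk space.
    """
    root = Path(root_path)
    problems = []

    # Conda or Miniconda must be installed and visible on PATH.
    if shutil.which("conda") is None:
        problems.append("conda was not found on PATH")

    if not root.is_dir():
        problems.append(f"{root} is not a directory")
        return problems

    # Both downloads are expected directly under the root directory.
    if not (root / "envs_sci.zip").is_file():
        problems.append("envs_sci.zip is missing from the root directory")
    if not (root / "Benchmark").is_dir():
        problems.append("Benchmark directory is missing (download and unzip it)")

    # Free space on the filesystem that holds the root directory.
    free_gb = shutil.disk_usage(root).free / 1024**3
    if free_gb < min_free_gb:
        problems.append(f"only {free_gb:.1f} GB free (placeholder threshold: {min_free_gb} GB)")

    return problems


if __name__ == "__main__":
    for problem in check_prerequisites("."):
        print("WARNING:", problem)
```

The operating system, CUDA version, and GPU are not checked here; those are easier to confirm by hand with the usual system tools.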
cd root_path
bash ./scripts/env.sh root_path
After setting up all the environments, run the reference code (see Section 4.2.1) to make sure every code repository runs correctly.
--root_path: Path to the root directory containing all project files. (Path to the 'SciReplicate-Bench' dir)
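After env.sh finishes, it may help to confirm that the extraction produced the expected environments. This is a rough sketch under two assumptions: that each environment is a sub-directory of envs_sci/ (as in the directory listing) and that a usable Linux conda environment contains bin/python. The helper name `check_envs` is hypothetical.

```python
from pathlib import Path

EXPECTED_ENV_COUNT = 36  # total stated in the directory listing of this README


def check_envs(root_path):
    """Return (found, broken): environment names and those lacking bin/python."""
    envs_dir = Path(root_path) / "envs_sci"
    if not envs_dir.is_dir():
        return [], []
    found = sorted(p.name for p in envs_dir.iterdir() if p.is_dir())
    # Assumption: a conda environment on Linux ships its interpreter at bin/python.
    broken = [name for name in found if not (envs_dir / name / "bin" / "python").exists()]
    return found, broken


if __name__ == "__main__":
    found, broken = check_envs(".")
    print(f"{len(found)} of {EXPECTED_ENV_COUNT} environments found")
    for name in broken:
        print("no bin/python in", name)
```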
conda env create -f environment.yml
or
conda env create -f environment.yml -p path_target_env
path_target_env: Path to the conda environments. (For example, {path_to_anaconda3}/envs/codegen)
- All conda environments set up (from step 1)
- SciReproducer environment activated (from step 2)
- API key configured for the chosen model
- For web search tools, please follow the instructions in Section 5 to apply for a Google Search API key and a CSE ID.
# Hugging Face login; you also need to request access to the models you plan to use.
huggingface-cli login
# For Web Search Tools. (Refer to section 5 for guidance)
export GoogleSearch_API_KEY="your_GoogleSearch_API_KEY_here"
export GoogleSearch_CSEID="your_GoogleSearch_CSEID_here"
# For LLMs
export OPENAI_API_KEY="your_openai_key_here" # Required for OpenAI models
export DEEPSEEK_API_KEY="your_deepseek_key_here" # Optional, for DeepSeek models
export CLAUDE_API_KEY="your_claude_key_here" # Optional, for Claude models
export GEMINI_API_KEY="your_gemini_key_here" # Optional, for Gemini models
bash ./scripts/run_sci_reproducer.sh <root_path> [model]
- <root_path>: Path to the root directory containing all project files (the 'SciReplicate-Bench' directory)
- [model]: (Optional) The language model to use. Default: gpt-4o-mini
Supported models:
- OpenAI: gpt-4o, gpt-4o-mini, o3-mini
- DeepSeek: deepseek-r1, deepseek-v3
- Anthropic: claude-3-5-sonnet
- Google: gemini-2.0-flash, gemini-2.0-flash-thinking
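A missing API key is an easy mistake to make before a long run. The sketch below maps each listed model to the environment variable it presumably needs. The pairing is inferred from the comments next to the export lines in this README and is an assumption, not something read from the run script; `required_key_var` and `missing_key` are hypothetical helper names.

```python
import os

# Provider of each model name listed in this README.
MODEL_PROVIDER = {
    "gpt-4o": "openai",
    "gpt-4o-mini": "openai",
    "o3-mini": "openai",
    "deepseek-r1": "deepseek",
    "deepseek-v3": "deepseek",
    "claude-3-5-sonnet": "anthropic",
    "gemini-2.0-flash": "google",
    "gemini-2.0-flash-thinking": "google",
}

# Assumed pairing of provider and environment variable (inferred from the export lines).
PROVIDER_KEY_VAR = {
    "openai": "OPENAI_API_KEY",
    "deepseek": "DEEPSEEK_API_KEY",
    "anthropic": "CLAUDE_API_KEY",
    "google": "GEMINI_API_KEY",
}


def required_key_var(model="gpt-4o-mini"):
    """Return the name of the environment variable the model is assumed to need."""
    if model not in MODEL_PROVIDER:
        raise ValueError(f"unsupported model: {model}")
    return PROVIDER_KEY_VAR[MODEL_PROVIDER[model]]


def missing_key(model="gpt-4o-mini", env=None):
    """Return the variable name if it is unset or empty, otherwise None."""
    env = os.environ if env is None else env
    var = required_key_var(model)
    return None if env.get(var) else var


if __name__ == "__main__":
    var = missing_key("gpt-4o-mini")
    print("all set" if var is None else f"please export {var} first")
```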
# Using default model (gpt-4o-mini)
bash ./scripts/run_sci_reproducer.sh /path/to/SciReplicate-Bench
# Using specific model
bash ./scripts/run_sci_reproducer.sh /path/to/SciReplicate-Bench gpt-4o
The results will be saved in the specified output directory with the following structure:
Result/
├── 0/                          # Results for paper 0 (ColdFusion)
│   └── SciReproducer_{model}/  # Generated code directory
│       ├── task1.pickle        # Generated code for task 1 within ColdFusion
│       └── ...
└── ...                         # Results for all papers
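To inspect what a run produced, the pickles can be walked following the layout above. This is a minimal sketch: the helper name `iter_task_results` is hypothetical, and the structure of each pickled object is not documented here, so the payload is returned as-is.

```python
import pickle
from pathlib import Path


def iter_task_results(result_path, model="gpt-4o-mini"):
    """Yield (paper_id, task_name, payload) for every task pickle of one model."""
    result_root = Path(result_path)
    # Paper directories are named by their numeric id: Result/0, Result/1, ...
    paper_dirs = [p for p in result_root.iterdir() if p.is_dir() and p.name.isdigit()]
    for paper_dir in sorted(paper_dirs, key=lambda p: int(p.name)):
        model_dir = paper_dir / f"SciReproducer_{model}"
        if not model_dir.is_dir():
            continue
        for pkl in sorted(model_dir.glob("*.pickle")):
            with pkl.open("rb") as fh:
                # Only unpickle files that you generated yourself.
                payload = pickle.load(fh)
            yield int(paper_dir.name), pkl.stem, payload


if __name__ == "__main__":
    for paper_id, task, payload in iter_task_results("Result"):
        print(paper_id, task, type(payload).__name__)
```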
We provide a toy example in the toy.py file, which includes the following components:
- Code Generation Prompt Template (GENCODE): Defines a detailed prompt template to guide LLMs in generating code that adheres to a specific format for calculating reasoning graph accuracy.
- Main Function (main): Iterates through 36 benchmark code repositories (repo_id 0 to 35). For each repository and each task, it outlines the step-by-step process for handling the task.
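The outer loop over repositories can be pictured as follows. This is not the code in toy.py; it is only a sketch that assumes the `<repo_id>-<name>` directory naming shown for the Benchmark folder (for example 0-coldfusion). The helper names are hypothetical, and the per-task handling is left out because it depends on the layout of Data.json.

```python
from pathlib import Path

NUM_REPOS = 36  # repo_id runs from 0 to 35


def find_repo_dir(root_path, repo_id):
    """Return the Benchmark/<repo_id>-<name> directory, or None if it is absent."""
    benchmark = Path(root_path) / "Benchmark"
    if not benchmark.is_dir():
        return None
    # The trailing dash keeps repo_id 1 from matching directories such as "10-...".
    matches = sorted(p for p in benchmark.glob(f"{repo_id}-*") if p.is_dir())
    return matches[0] if matches else None


def iter_repos(root_path):
    """Yield (repo_id, repo_dir) for every benchmark repository id."""
    for repo_id in range(NUM_REPOS):
        yield repo_id, find_repo_dir(root_path, repo_id)


if __name__ == "__main__":
    for repo_id, repo_dir in iter_repos("."):
        print(repo_id, repo_dir if repo_dir is not None else "not found")
```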
After running SciReproducer, you can evaluate the generated code using 4 different metrics.
python Evaluation.py --metric <metric_name> --model <model_name> --root_path <root_path> --result_path <result_path> [additional_options]
- --metric: Type of evaluation metric to calculate, chosen from ['CodeBLEU_Score', 'execution_ACC', 'Recall', 'ReasoningGraph_ACC'].
- --model: Model name used for code generation.
- --root_path: Path to the root directory containing all project files (the 'SciReplicate-Bench' directory).
- --gpu_id: GPU ID to use for execution (default: 0)
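When several metrics are evaluated in a row, it can be convenient to assemble the command line programmatically. The sketch below only uses the flags named in this README; the wrapper `build_eval_command` itself is hypothetical and not part of the repository.

```python
import sys

METRICS = ("CodeBLEU_Score", "execution_ACC", "Recall", "ReasoningGraph_ACC")


def build_eval_command(metric, model, root_path, result_path=None, gpu_id=None, reference=False):
    """Build the argument list for one Evaluation.py invocation."""
    if metric not in METRICS:
        raise ValueError(f"unknown metric: {metric}")
    cmd = [
        sys.executable, "Evaluation.py",
        "--metric", metric,
        "--model", model,
        "--root_path", str(root_path),
    ]
    if result_path is not None:
        cmd += ["--result_path", str(result_path)]
    if gpu_id is not None:
        cmd += ["--gpu_id", str(gpu_id)]
    if reference:
        # Used when producing the reference output (see 4.2.1).
        cmd.append("--reference")
    return cmd


if __name__ == "__main__":
    root = "/path/to/SciReplicate-Bench"
    for metric in ("CodeBLEU_Score", "Recall", "ReasoningGraph_ACC"):
        print(" ".join(build_eval_command(metric, "gpt-4o-mini", root, result_path=root + "/Result")))
```

Each returned list can be passed to `subprocess.run(cmd, cwd=root, check=True)` to execute the evaluation from the repository root.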
export OPENAI_API_KEY="your_openai_key_here"
python Evaluation.py \
--metric [CodeBLEU_Score|Recall|ReasoningGraph_ACC] \
--model gpt-4o-mini \
--root_path /path/to/SciReplicate-Bench \
--result_path /path/to/SciReplicate-Bench/Result
4.2.1 Obtain the output of the reference code. Because results can differ between machines, you need to run the reference code on your own machine to obtain the reference output.
export OPENAI_API_KEY="your_openai_key_here"
python Evaluation.py \
--metric execution_ACC \
--model gpt-4o-mini \
--gpu_id 0 \
--root_path /path/to/SciReplicate-Bench \
--reference
4.2.2 Evaluate the output of the generated code. Run the generated code and compare its output with the reference output.
export OPENAI_API_KEY="your_openai_key_here"
python Evaluation.py \
--metric execution_ACC \
--model gpt-4o-mini \
--gpu_id 0 \
--root_path /path/to/SciReplicate-Bench
Google Search API Key:
- Go to Google Cloud Console
- Create a new project or select an existing one
- Enable the "Custom Search JSON API"
- Go to "Credentials" → "Create Credentials" → "API Key"
Google Custom Search Engine ID (CSE ID):
- Go to Google Custom Search
- Create a new search engine
- Set it to search the entire web
- Copy the Search Engine ID from the control panel
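Once both values exist, they can be verified with a direct request to the Custom Search JSON API. This is a minimal sketch and not the implementation in utils/WebSearch.py: it reads the two environment variable names used in this README, and the helper names `build_search_url` and `search` are hypothetical.

```python
import json
import os
import urllib.parse
import urllib.request

# Public endpoint of the Custom Search JSON API.
ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(query, api_key, cse_id, num=5):
    """Build the request URL; the API accepts between 1 and 10 results per call."""
    params = {"key": api_key, "cx": cse_id, "q": query, "num": num}
    return ENDPOINT + "?" + urllib.parse.urlencode(params)


def search(query, num=5):
    """Return (title, link) pairs for the query; requires network access."""
    url = build_search_url(
        query,
        os.environ["GoogleSearch_API_KEY"],
        os.environ["GoogleSearch_CSEID"],
        num,
    )
    with urllib.request.urlopen(url, timeout=30) as resp:
        data = json.load(resp)
    # "items" is absent from the response when nothing matches.
    return [(item["title"], item["link"]) for item in data.get("items", [])]


if __name__ == "__main__":
    for title, link in search("SciReplicate-Bench"):
        print(title, link)
```

An HTTP error from this request usually points to a key or CSE ID that is wrong or to an API that has not been enabled for the project.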
export GoogleSearch_API_KEY="your_google_search_api_key"
export GoogleSearch_CSEID="your_custom_search_engine_id"
@article{xiang2025scireplicate,
title={SciReplicate-Bench: Benchmarking LLMs in Agent-driven Algorithmic Reproduction from Research Papers},
author={Xiang, Yanzheng and Yan, Hanqi and Ouyang, Shuyin and Gui, Lin and He, Yulan},
journal={arXiv preprint arXiv:2504.00255},
year={2025}
}