xyzCS/SciReplicate-Bench

SciReplicate-Bench (COLM 2025)

Code release for SciReplicate-Bench: Benchmarking LLMs in Agent-driven Algorithmic Reproduction from Research Papers. [Conference on Language Modeling 2025]

File Organization

SciReplicate-Bench/
├── utils/                                           # Core utilities for SciReproducer and evaluation metrics
│   ├── __init__.py                                  # Package initialization file
│   ├── CodeAgentTools.py                            # Tools and utilities for the code agent
│   ├── PaperAgentTools.py                           # Tools and utilities for the paper agent
│   ├── WebSearch.py                                 # Web search functionality
│   ├── utils.py                                     # General utility functions
│   └── Reason_Process_ACC.py                        # Reasoning graph accuracy calculation
│
├── scripts/                                         # Setup and utility scripts
│   └── env.sh                                       # Script to extract and set up conda environments
│
├── envs_sci/                                        # Extracted conda environments (created by env.sh)
│   ├── ColdFusion/                                  # Python environment for ColdFusion paper
│   ├── order/                                       # Python environment for order paper
│   ├── gac_env/                                     # Python environment for GAC paper
│   └── ...                                          # Additional environments (36 total)
│
├── Benchmark/                                       # Code repositories for all benchmark papers
│   ├── 0-coldfusion/                                # Source code repository for ColdFusion paper
│   ├── 1-order/                                     # Source code repository for order paper
│   ├── 2-gac/                                       # Source code repository for GAC paper
│   └── ...                                          # Additional repositories (36 total)
│
├── Result/                                          # Experiment results and outputs after running the SciReproducer
│   ├── 0/                                           # Results for paper 0 (ColdFusion)
│   │   └── SciReproducer_gpt-4o-mini/               # Generated code using GPT-4o-mini
│   │       ├── task1.pickle                         # Code output for ColdFusion task 1
│   │       └── ...
│   └── ...                                          # Results for all papers
│
├── Data.json                                        # Main dataset containing paper information and tasks
├── Evaluation.py                                    # Evaluation metrics (CodeBLEU, Execution Accuracy, etc.)
├── SciReproducer.py                                 # Main dual-agent framework implementation
├── envs_sci.zip                                     # Archive containing all conda environments
└── README.md                                        # Project documentation

1. Setting Up Python Environments for All Papers

Prerequisites

  • Conda/Miniconda installed on your system
  • envs_sci.zip file available in the root directory (link)
  • Benchmark directory (link), downloaded and unzipped into the root directory
  • Hardware Requirements:
    • Sufficient disk space for extracting 36 conda environments.
    • Ubuntu operating system.
    • CUDA Version: 12.2
    • GPU: A single NVIDIA A100 (80GB) GPU is required to execute all code repositories associated with the benchmark papers.

Setup Instructions

cd root_path
bash ./scripts/env.sh root_path

After setting up all the environments, run the reference code (see Section 4.2.1) to confirm that every code repository runs correctly.

Parameters

  • root_path: Path to the root directory containing all project files (the 'SciReplicate-Bench' directory). It is passed to env.sh as a positional argument.
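
Once env.sh finishes, a quick sanity check can confirm that the environments were extracted. The following is a minimal sketch, not part of the repository: the helper name check_envs is hypothetical, and it assumes each extracted conda environment contains bin/python, as conda environments on Linux normally do.

```python
from pathlib import Path


def check_envs(root_path):
    """Return (env_names, envs_without_python) found under <root_path>/envs_sci."""
    envs_dir = Path(root_path) / "envs_sci"
    if not envs_dir.is_dir():
        # env.sh has not been run yet, or root_path is wrong.
        return [], []
    env_dirs = sorted(p for p in envs_dir.iterdir() if p.is_dir())
    # Assumption: a usable conda environment on Linux has bin/python inside it.
    missing = [p.name for p in env_dirs if not (p / "bin" / "python").exists()]
    return [p.name for p in env_dirs], missing
```

Calling check_envs("/path/to/SciReplicate-Bench") should report 36 environment names and an empty second list if extraction succeeded.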

2. Setting Up Python Environments for SciReproducer

conda env create -f environment.yml

or

conda env create -f environment.yml -p path_target_env

Parameters

  • path_target_env: Path where the conda environment will be created (for example, {path_to_anaconda3}/envs/codegen)

3. Run the SciReproducer

Prerequisites

  • All conda environments set up (from step 1)
  • SciReproducer environment activated (from step 2)
  • API key configured for the chosen model
    • For web search tools, please follow the instructions in Section 5 to apply for a Google Search API key and a CSE ID.

Usage

Step 1: Set API Keys

# Hugging Face login. You also need to request access authorization for the models you use.
huggingface-cli login

# For Web Search Tools. (Refer to section 5 for guidance)
export GoogleSearch_API_KEY="your_GoogleSearch_API_KEY_here"
export GoogleSearch_CSEID="your_GoogleSearch_CSEID_here"

# For LLMs
export OPENAI_API_KEY="your_openai_key_here"        # Required for OpenAI models
export DEEPSEEK_API_KEY="your_deepseek_key_here"    # Optional, for DeepSeek models
export CLAUDE_API_KEY="your_claude_key_here"        # Optional, for Claude models
export GEMINI_API_KEY="your_gemini_key_here"        # Optional, for Gemini models
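
Before launching a long run, it can help to check that the variables above are actually set. The sketch below is illustrative only: the helper missing_keys and the prefix-to-variable mapping are assumptions inferred from the comments above, not something SciReproducer itself provides.

```python
import os

# Variables used by the web search tools, as exported above.
SEARCH_VARS = ["GoogleSearch_API_KEY", "GoogleSearch_CSEID"]

# Assumption: the key variable can be inferred from the model-name prefix.
KEY_BY_PREFIX = {
    "gpt-": "OPENAI_API_KEY",
    "o3-": "OPENAI_API_KEY",
    "deepseek-": "DEEPSEEK_API_KEY",
    "claude-": "CLAUDE_API_KEY",
    "gemini-": "GEMINI_API_KEY",
}


def missing_keys(model="gpt-4o-mini", environ=None):
    """Return the names of required variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    needed = list(SEARCH_VARS)
    for prefix, var in KEY_BY_PREFIX.items():
        if model.startswith(prefix):
            needed.append(var)
            break
    return [name for name in needed if not environ.get(name)]
```

An empty list from missing_keys("gpt-4o") means the search credentials and the OpenAI key are all present in the current shell.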

Step 2: Run SciReproducer

bash ./scripts/run_sci_reproducer.sh <root_path> [model]

Parameters

  • <root_path>: Path to the root directory containing all project files (the 'SciReplicate-Bench' directory)
  • [model]: (Optional) The language model to use. Default: gpt-4o-mini

Supported Models

  • OpenAI: gpt-4o, gpt-4o-mini, o3-mini
  • DeepSeek: deepseek-r1, deepseek-v3
  • Anthropic: claude-3-5-sonnet
  • Google: gemini-2.0-flash, gemini-2.0-flash-thinking

Example

# Using default model (gpt-4o-mini)
bash ./scripts/run_sci_reproducer.sh /path/to/SciReplicate-Bench

# Using specific model
bash ./scripts/run_sci_reproducer.sh /path/to/SciReplicate-Bench gpt-4o

Output Structure

The results will be saved in the specified output directory with the following structure:

Result/
├── 0/                                  # Results for paper 0 (ColdFusion)
│   └── SciReproducer_{model}/          # Generated code directory
│       ├── task1.pickle                # Generated code for task 1 within ColdFusion
│       └── ...
└── ...                                 # Results for all papers
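
To see which tasks produced output, the layout above can be walked with a few lines of Python. This is a rough sketch with hypothetical helper names (list_task_files, load_task). The structure of the object stored in each pickle is not documented here, so the loader simply returns whatever was saved.

```python
import pickle
from pathlib import Path


def list_task_files(root_path, model="gpt-4o-mini"):
    """Map each paper ID to the task pickle file names found for the given model."""
    result_dir = Path(root_path) / "Result"
    found = {}
    if not result_dir.is_dir():
        return found
    for paper_dir in sorted(result_dir.iterdir()):
        run_dir = paper_dir / f"SciReproducer_{model}"
        if run_dir.is_dir():
            # Names are sorted as strings, so task10 comes before task2.
            found[paper_dir.name] = sorted(p.name for p in run_dir.glob("*.pickle"))
    return found


def load_task(path):
    """Load one result file. Only unpickle files you generated yourself."""
    with open(path, "rb") as fh:
        return pickle.load(fh)
```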

Toy Example

We provide a toy example in the toy.py file, which includes the following components:

  • Code Generation Prompt Template (GENCODE): Defines a detailed prompt template to guide LLMs in generating code that adheres to a specific format for calculating reasoning graph accuracy.
  • Main Function (main): Iterates through 36 benchmark code repositories (repo_id 0–35). For each repository and each task, it outlines the step-by-step process for handling the task.

4. Evaluation Metrics

After running SciReproducer, you can evaluate the generated code using four metrics.

Basic Command Structure

python Evaluation.py --metric <metric_name> --model <model_name> --root_path <root_path> --result_path <result_path> [additional_options]

Common Parameters

  • --metric: Evaluation metric to compute, chosen from ['CodeBLEU_Score', 'execution_ACC', 'Recall', 'ReasoningGraph_ACC'].
  • --model: Name of the model used for code generation.
  • --root_path: Path to the root directory containing all project files (the 'SciReplicate-Bench' directory).
  • --result_path: Path to the directory containing the generated results (for example, /path/to/SciReplicate-Bench/Result).
  • --gpu_id: GPU ID to use for execution (default: 0).

4.1 CodeBLEU Score, Recall, and Reasoning Graph Accuracy

export OPENAI_API_KEY="your_openai_key_here"        
python Evaluation.py \
  --metric [CodeBLEU_Score|Recall|ReasoningGraph_ACC] \
  --model gpt-4o-mini \
  --root_path /path/to/SciReplicate-Bench \
  --result_path /path/to/SciReplicate-Bench/Result
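
The three metrics in this section share the same arguments, so they can be run in one loop. The following is a minimal sketch under that assumption. The helper names are hypothetical, and only the flags shown in the command above are used.

```python
import subprocess
import sys

# Metrics that use the command shape shown above (no --reference or --gpu_id).
NON_EXECUTION_METRICS = ["CodeBLEU_Score", "Recall", "ReasoningGraph_ACC"]


def build_eval_command(metric, model, root_path, result_path):
    """Build the argument list for a single Evaluation.py call."""
    return [
        sys.executable, "Evaluation.py",
        "--metric", metric,
        "--model", model,
        "--root_path", root_path,
        "--result_path", result_path,
    ]


def run_non_execution_metrics(model, root_path, result_path):
    """Run each metric in turn and stop at the first failure."""
    for metric in NON_EXECUTION_METRICS:
        cmd = build_eval_command(metric, model, root_path, result_path)
        # cwd=root_path so that the relative path to Evaluation.py resolves.
        subprocess.run(cmd, cwd=root_path, check=True)
```

OPENAI_API_KEY still has to be exported in the calling shell, as in the command above.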

4.2 Execution Accuracy

4.2.1 Obtain the output of the reference code. Because outputs can differ across machines, run the reference code on your own machine to obtain the reference output.

export OPENAI_API_KEY="your_openai_key_here"
python Evaluation.py \
  --metric execution_ACC \
  --model gpt-4o-mini \
  --gpu_id 0 \
  --root_path /path/to/SciReplicate-Bench \
  --reference

4.2.2 Evaluate the output of the generated code. This step runs the generated code and compares its output with the reference output.

export OPENAI_API_KEY="your_openai_key_here"
python Evaluation.py \
  --metric execution_ACC \
  --model gpt-4o-mini \
  --gpu_id 0 \
  --root_path /path/to/SciReplicate-Bench

5. Google Search API Setup

How to Get These Credentials:

Google Search API Key:

  1. Go to Google Cloud Console
  2. Create a new project or select an existing one
  3. Enable the "Custom Search JSON API"
  4. Go to "Credentials" → "Create Credentials" → "API Key"

Google Custom Search Engine ID (CSE ID):

  1. Go to Google Custom Search
  2. Create a new search engine
  3. Set it to search the entire web
  4. Copy the Search Engine ID from the control panel

Setting the Environment Variables:

export GoogleSearch_API_KEY="your_google_search_api_key"
export GoogleSearch_CSEID="your_custom_search_engine_id"
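
To confirm that the two credentials work, you can send one request to the Custom Search JSON API directly. This is a standalone sketch using only the standard library. It is not the code in WebSearch.py, and the function names are hypothetical.

```python
import json
import os
from urllib.parse import urlencode
from urllib.request import urlopen

ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(query, api_key, cse_id, num=3):
    """Build a Custom Search JSON API request URL (num must be between 1 and 10)."""
    params = {"key": api_key, "cx": cse_id, "q": query, "num": num}
    return f"{ENDPOINT}?{urlencode(params)}"


def search(query, num=3):
    """Return (title, link) pairs for the top results of a query."""
    url = build_search_url(
        query,
        os.environ["GoogleSearch_API_KEY"],
        os.environ["GoogleSearch_CSEID"],
        num,
    )
    with urlopen(url, timeout=30) as resp:
        data = json.load(resp)
    # "items" is absent when the query returns no results.
    return [(item["title"], item["link"]) for item in data.get("items", [])]
```

If the key or engine ID is wrong, urlopen raises an HTTPError, which is a quick way to tell that the setup above needs another look.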

Reference

@article{xiang2025scireplicate,
  title={SciReplicate-Bench: Benchmarking LLMs in Agent-driven Algorithmic Reproduction from Research Papers},
  author={Xiang, Yanzheng and Yan, Hanqi and Ouyang, Shuyin and Gui, Lin and He, Yulan},
  journal={arXiv preprint arXiv:2504.00255},
  year={2025}
}
