
Add GenSelect plugin for solution selection #215


Merged: 10 commits into main on Jul 24, 2025

Conversation

codelion
Owner

Introduces the GenSelect plugin implementing generative solution selection based on the AIMO-2 winning solution paper. Updates README with plugin documentation and reference, bumps version to 0.1.23, and adds a GenSelect math test case.
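
For intuition, here is a minimal sketch of the generative-selection idea: sample several candidate solutions, then have the model itself compare them and name the best one. It assumes an OpenAI-compatible client; the prompts and function names are illustrative, not the plugin's actual implementation.

```python
# Sketch of generative solution selection; names are hypothetical.
from openai import OpenAI

client = OpenAI()

def genselect(problem: str, model: str = "gpt-4o-mini", num_candidates: int = 5) -> str:
    # 1. Sample several candidate solutions at a moderate temperature.
    candidates = [
        client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": problem}],
            temperature=0.6,
        ).choices[0].message.content
        for _ in range(num_candidates)
    ]
    # 2. Ask the model to compare the candidates and name the best one.
    listing = "\n\n".join(f"Solution {i}:\n{c}" for i, c in enumerate(candidates))
    verdict = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": f"Problem:\n{problem}\n\n{listing}\n\n"
                       "Compare the solutions above and reply with only the "
                       "number of the best one.",
        }],
        temperature=0.0,
    ).choices[0].message.content
    # 3. Fall back to the first candidate if the verdict is unparseable.
    try:
        return candidates[int(verdict.strip())]
    except (ValueError, IndexError):
        return candidates[0]
```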

codelion added 10 commits July 19, 2025 17:45
Introduces the GenSelect plugin implementing generative solution selection based on the AIMO-2 winning solution paper. Updates README with plugin documentation and reference, bumps version to 0.1.23, and adds a GenSelect math test case.
Introduces a default test-time compute configuration with pass@1, maj@64, and genselect@64 approaches for standard evaluation. Updates evaluation logic to support multiple runs for pass@1, adjusts report generation to highlight these modes, and refactors main() to use the new defaults when no approaches are specified.
Changed num_candidates validation to only enforce a minimum of 2, removing the previous maximum limit of 16. This allows for generating more than 16 candidates if desired.
Introduces a new tests/ directory with unit and integration tests, a requirements file, and a test runner script. Adds a GitHub Actions workflow for automated testing on multiple Python versions and pull requests. Updates README with detailed testing instructions. Refactors optillm.py and eval_optillmbench.py to improve n parameter handling and test-time compute evaluation logic.
Upgrades the majority voting plugin with category-aware answer extraction, adaptive temperature control, improved normalization, response quality filtering, and smart fallback strategies. Updates the default test-time compute configuration in eval_optillmbench.py to use 5 candidates instead of 8 for fairer comparison and memory efficiency, and revises related reporting logic and documentation.
Replaced category-specific temperature settings with a unified temperature of 0.6 across majority voting and evaluation scripts for consistency. Simplified answer extraction and majority voting logic in the plugin to match evaluation script, removing quality scoring and normalization heuristics.
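
A minimal sketch of the simplified majority-voting flow the last two commits describe (candidates sampled at a fixed temperature of 0.6, plain answer extraction, straight vote counting). The regex heuristic and names below are assumptions for illustration, not the plugin's real logic.

```python
# Illustrative answer extraction and majority voting; heuristics are assumed.
import re
from collections import Counter
from typing import Optional

def extract_answer(response: str) -> Optional[str]:
    """Use the last number in the response as the answer (simple heuristic)."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return numbers[-1] if numbers else None

def majority_vote(responses: list[str]) -> Optional[str]:
    """Return the most common extracted answer across candidate responses."""
    answers = [a for a in map(extract_answer, responses) if a is not None]
    return Counter(answers).most_common(1)[0][0] if answers else None

# Example: five candidates sampled at temperature 0.6 from the same prompt.
print(majority_vote(["x = 12", "12", "The answer is 11.", "12.", "12"]))  # 12
```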
@codelion
Owner Author

Test-Time Compute Evaluation Results

Model Comparison

| Model | Approach | Overall Accuracy | GSM8K | MMLU_MATH | BOOLQ | AQUA_RAT | Avg Time (s) |
|---|---|---|---|---|---|---|---|
| Gemini Flash | avg@5 | 60.80% | 72.14% | 64.55% | 96.52% | 15.56% | 0.89 |
| Gemini Flash | pass@5 | 70.00% (+15.1%) | 75.00% | 68.18% | 95.65% | 44.44% | 0.79 |
| Gemini Flash | maj@5 | 68.00% (+11.8%) | 78.57% | 68.18% | 100.00% | 29.63% | 0.84 |
| Gemini Flash | genselect@5 | 67.00% (+10.2%) | 82.14% | 72.73% | 95.65% | 22.22% | 3.20 |
| Qwen3-0.6B-8bit | avg@5 | 51.49% | 60.71% | 50.91% | 86.09% | 14.81% | 5.84 |
| Qwen3-0.6B-8bit | pass@5 | 59.59% (+15.7%) | 67.86% | 69.09% | 86.96% | 22.22% | 5.64 |
| Qwen3-0.6B-8bit | maj@5 | 53.00% (+2.9%) | 67.86% | 54.55% | 86.96% | 3.70% | 5.62 |
| Qwen3-0.6B-8bit | genselect@5 | 55.00% (+6.8%) | 64.29% | 58.18% | 86.96% | 11.11% | 18.78 |
| Qwen3-0.6B-4bit | avg@5 | 39.20% | 64.29% | 29.09% | 62.61% | 1.48% | 334.55 |
| Qwen3-0.6B-4bit | pass@5 | 54.00% (+37.8%) | 75.00% | 59.09% | 82.61% | 3.70% | 30.92 |
| Qwen3-0.6B-4bit | maj@5 | 42.00% (+7.1%) | 71.43% | 31.82% | 60.87% | 3.70% | 30.55 |
| Qwen3-0.6B-4bit | genselect@5 | 41.00% (+4.6%) | 60.71% | 27.27% | 69.57% | 7.41% | 68.27 |
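
For reference, a sketch of how the pass@5 and maj@5 columns can be scored from per-question samples. The data shapes and helper names below are assumptions for illustration, not the format eval_optillmbench.py actually uses.

```python
# Scoring sketch: each question has k sampled answers plus a gold answer.
from collections import Counter

def pass_at_k(samples: list[str], gold: str) -> bool:
    """pass@k: correct if any of the k samples matches the gold answer."""
    return any(s == gold for s in samples)

def maj_at_k(samples: list[str], gold: str) -> bool:
    """maj@k: correct if the most frequent sample matches the gold answer."""
    return Counter(samples).most_common(1)[0][0] == gold

questions = [
    {"samples": ["42", "42", "41", "42", "40"], "gold": "42"},
    {"samples": ["7", "8", "8", "8", "9"], "gold": "7"},
]
pass5 = sum(pass_at_k(q["samples"], q["gold"]) for q in questions) / len(questions)
maj5 = sum(maj_at_k(q["samples"], q["gold"]) for q in questions) / len(questions)
print(f"pass@5 = {pass5:.0%}, maj@5 = {maj5:.0%}")  # pass@5 = 100%, maj@5 = 50%
```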

Key Insights

Model Performance

  • Gemini Flash: Best overall performance (60.80% baseline), excellent on BOOLQ (96-100%)
  • Qwen3-0.6B-8bit: Moderate performance (51.49% baseline), consistent across categories
  • Qwen3-0.6B-4bit: Significant degradation (39.20% baseline), severe timing issues

Test-Time Compute Benefits

  • High-quality models (Gemini Flash): Moderate gains (10-15%)
  • Medium-quality models (Qwen 8-bit): Similar gains (3-16%)
  • Low-quality models (Qwen 4-bit): Dramatic gains (up to 38%) due to high variance

Category-Specific Patterns

  • GSM8K: GenSelect excels with larger models; pass@5 is best for quantized models
  • MMLU_MATH: pass@5 consistently strong across all models
  • BOOLQ: Majority voting achieves perfect accuracy on Gemini Flash
  • AQUA_RAT: pass@5 dramatically improves results (3x on Gemini Flash)

Timing Considerations

  • Gemini Flash: Sub-second for most approaches
  • Qwen 8-bit: 5-19 seconds, reasonable for local model
  • Qwen 4-bit: Extreme delays on avg@5 (334s), suggesting implementation issues

Response Variance Indicator (relative pass@5 gain over avg@5)

  • Gemini Flash: 15.1% (low variance, high quality)
  • Qwen 8-bit: 15.7% (low variance, moderate quality)
  • Qwen 4-bit: 37.8% (high variance, low quality)
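
These percentages are the relative pass@5 improvements over avg@5 from the table above; a quick check of the arithmetic (plain Python, hypothetical helper name):

```python
# The gap is the percentage improvement of pass@5 over the avg@5 baseline.
def relative_gain(pass_at_5: float, avg_at_5: float) -> float:
    return (pass_at_5 - avg_at_5) / avg_at_5 * 100

print(f"{relative_gain(70.00, 60.80):.1f}%")  # Gemini Flash    -> 15.1%
print(f"{relative_gain(59.59, 51.49):.1f}%")  # Qwen3-0.6B-8bit -> 15.7%
print(f"{relative_gain(54.00, 39.20):.1f}%")  # Qwen3-0.6B-4bit -> 37.8%
```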

Recommendations

  1. For production: Use high-quality models with maj@5 for balanced accuracy/speed
  2. For best accuracy: Use genselect@5 with sufficient compute budget
  3. For uncertain domains: Use pass@5 to maximize coverage
  4. For quantized models: pass@5 provides significant accuracy recovery
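
As a usage illustration for the recommendations above: a hypothetical client call through a locally running optillm proxy. The base URL, port, and the `genselect-` model-name prefix are assumptions here; see the README for the actual routing convention.

```python
# Hypothetical call through a locally running optillm proxy.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="optillm")

response = client.chat.completions.create(
    model="genselect-gpt-4o-mini",  # assumed: "genselect-" prefix routes to the plugin
    messages=[{"role": "user", "content": "What is 17 * 24?"}],
)
print(response.choices[0].message.content)
```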

codelion merged commit aabd5f1 into main on Jul 24, 2025
3 checks passed
codelion deleted the feat-add-genselect branch on July 24, 2025 at 06:50