
Add GenSelect plugin for solution selection #215


Merged: 10 commits into main on Jul 24, 2025

Conversation

codelion
Owner

Introduces the GenSelect plugin implementing generative solution selection based on the AIMO-2 winning solution paper. Updates README with plugin documentation and reference, bumps version to 0.1.23, and adds a GenSelect math test case.
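
For intuition, here is a minimal sketch of the generative-selection idea: sample several candidate solutions, then have the model itself compare them and name the best one. It assumes an OpenAI-compatible client; the prompts and function names are illustrative, not the plugin's actual implementation.

```python
# Sketch of generative solution selection; names are hypothetical.
from openai import OpenAI

client = OpenAI()

def genselect(problem: str, model: str = "gpt-4o-mini", num_candidates: int = 5) -> str:
    # 1. Sample several candidate solutions at a moderate temperature.
    candidates = [
        client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": problem}],
            temperature=0.6,
        ).choices[0].message.content
        for _ in range(num_candidates)
    ]
    # 2. Ask the model to compare the candidates and name the best one.
    listing = "\n\n".join(f"Solution {i}:\n{c}" for i, c in enumerate(candidates))
    verdict = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": f"Problem:\n{problem}\n\n{listing}\n\n"
                       "Compare the solutions above and reply with only the "
                       "number of the best one.",
        }],
        temperature=0.0,
    ).choices[0].message.content
    # 3. Fall back to the first candidate if the verdict is unparseable.
    try:
        return candidates[int(verdict.strip())]
    except (ValueError, IndexError):
        return candidates[0]
```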

codelion added 10 commits July 19, 2025 17:45
Introduces the GenSelect plugin implementing generative solution selection based on the AIMO-2 winning solution paper. Updates README with plugin documentation and reference, bumps version to 0.1.23, and adds a GenSelect math test case.
Introduces a default test-time compute configuration with pass@1, maj@64, and genselect@64 approaches for standard evaluation. Updates evaluation logic to support multiple runs for pass@1, adjusts report generation to highlight these modes, and refactors main() to use the new defaults when no approaches are specified.
Changed num_candidates validation to only enforce a minimum of 2, removing the previous maximum limit of 16. This allows for generating more than 16 candidates if desired.
Introduces a new tests/ directory with unit and integration tests, a requirements file, and a test runner script. Adds a GitHub Actions workflow for automated testing on multiple Python versions and pull requests. Updates README with detailed testing instructions. Refactors optillm.py and eval_optillmbench.py to improve n parameter handling and test-time compute evaluation logic.
Upgrades the majority voting plugin with category-aware answer extraction, adaptive temperature control, improved normalization, response quality filtering, and smart fallback strategies. Updates the default test-time compute configuration in eval_optillmbench.py to use 5 candidates instead of 8 for fairer comparison and memory efficiency, and revises related reporting logic and documentation.
Replaced category-specific temperature settings with a unified temperature of 0.6 across majority voting and evaluation scripts for consistency. Simplified answer extraction and majority voting logic in the plugin to match evaluation script, removing quality scoring and normalization heuristics.
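
A minimal sketch of the simplified majority-voting flow the last two commits describe (candidates sampled at a fixed temperature of 0.6, plain answer extraction, straight vote counting). The regex heuristic and names below are assumptions for illustration, not the plugin's real logic.

```python
# Illustrative answer extraction and majority voting; heuristics are assumed.
import re
from collections import Counter
from typing import Optional

def extract_answer(response: str) -> Optional[str]:
    """Use the last number in the response as the answer (simple heuristic)."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return numbers[-1] if numbers else None

def majority_vote(responses: list[str]) -> Optional[str]:
    """Return the most common extracted answer across candidate responses."""
    answers = [a for a in map(extract_answer, responses) if a is not None]
    return Counter(answers).most_common(1)[0][0] if answers else None

# Example: five candidates sampled at temperature 0.6 from the same prompt.
print(majority_vote(["x = 12", "12", "The answer is 11.", "12.", "12"]))  # 12
```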
@codelion
Owner Author

Test-Time Compute Evaluation Results

Model Comparison

| Model | Approach | Overall Accuracy | GSM8K | MMLU_MATH | BOOLQ | AQUA_RAT | Avg Time (s) |
|---|---|---|---|---|---|---|---|
| Gemini Flash | avg@5 | 60.80% | 72.14% | 64.55% | 96.52% | 15.56% | 0.89 |
| Gemini Flash | pass@5 | 70.00% (+15.1%) | 75.00% | 68.18% | 95.65% | 44.44% | 0.79 |
| Gemini Flash | maj@5 | 68.00% (+11.8%) | 78.57% | 68.18% | 100.00% | 29.63% | 0.84 |
| Gemini Flash | genselect@5 | 67.00% (+10.2%) | 82.14% | 72.73% | 95.65% | 22.22% | 3.20 |
| Qwen3-0.6B-8bit | avg@5 | 51.49% | 60.71% | 50.91% | 86.09% | 14.81% | 5.84 |
| Qwen3-0.6B-8bit | pass@5 | 59.59% (+15.7%) | 67.86% | 69.09% | 86.96% | 22.22% | 5.64 |
| Qwen3-0.6B-8bit | maj@5 | 53.00% (+2.9%) | 67.86% | 54.55% | 86.96% | 3.70% | 5.62 |
| Qwen3-0.6B-8bit | genselect@5 | 55.00% (+6.8%) | 64.29% | 58.18% | 86.96% | 11.11% | 18.78 |
| Qwen3-0.6B-4bit | avg@5 | 39.20% | 64.29% | 29.09% | 62.61% | 1.48% | 334.55 |
| Qwen3-0.6B-4bit | pass@5 | 54.00% (+37.8%) | 75.00% | 59.09% | 82.61% | 3.70% | 30.92 |
| Qwen3-0.6B-4bit | maj@5 | 42.00% (+7.1%) | 71.43% | 31.82% | 60.87% | 3.70% | 30.55 |
| Qwen3-0.6B-4bit | genselect@5 | 41.00% (+4.6%) | 60.71% | 27.27% | 69.57% | 7.41% | 68.27 |
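
For reference, a sketch of how the pass@5 and maj@5 columns can be scored from per-question samples. The data shapes and helper names below are assumptions for illustration, not the format eval_optillmbench.py actually uses.

```python
# Scoring sketch: each question has k sampled answers plus a gold answer.
from collections import Counter

def pass_at_k(samples: list[str], gold: str) -> bool:
    """pass@k: correct if any of the k samples matches the gold answer."""
    return any(s == gold for s in samples)

def maj_at_k(samples: list[str], gold: str) -> bool:
    """maj@k: correct if the most frequent sample matches the gold answer."""
    return Counter(samples).most_common(1)[0][0] == gold

questions = [
    {"samples": ["42", "42", "41", "42", "40"], "gold": "42"},
    {"samples": ["7", "8", "8", "8", "9"], "gold": "7"},
]
pass5 = sum(pass_at_k(q["samples"], q["gold"]) for q in questions) / len(questions)
maj5 = sum(maj_at_k(q["samples"], q["gold"]) for q in questions) / len(questions)
print(f"pass@5 = {pass5:.0%}, maj@5 = {maj5:.0%}")  # pass@5 = 100%, maj@5 = 50%
```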

Key Insights

Model Performance

  • Gemini Flash: Best overall performance (60.80% baseline), excellent on BOOLQ (96-100%)
  • Qwen3-0.6B-8bit: Moderate performance (51.49% baseline), consistent across categories
  • Qwen3-0.6B-4bit: Significant degradation (39.20% baseline), severe timing issues

Test-Time Compute Benefits

  • High-quality models (Gemini Flash): Moderate gains (10-15%)
  • Medium-quality models (Qwen 8-bit): Similar gains (3-16%)
  • Low-quality models (Qwen 4-bit): Dramatic gains (up to 38%) due to high variance

Category-Specific Patterns

  • GSM8K: GenSelect excels with larger models; pass@5 is best for quantized models
  • MMLU_MATH: pass@5 consistently strong across all models
  • BOOLQ: Majority voting achieves perfect accuracy on Gemini Flash
  • AQUA_RAT: pass@5 dramatically improves results (3x on Gemini Flash)

Timing Considerations

  • Gemini Flash: Sub-second for most approaches
  • Qwen 8-bit: 5-19 seconds, reasonable for local model
  • Qwen 4-bit: Extreme delays on avg@5 (334s), suggesting implementation issues

Response Variance Indicator (relative pass@5 gain over avg@5)

  • Gemini Flash: 15.1% (low variance, high quality)
  • Qwen 8-bit: 15.7% (low variance, moderate quality)
  • Qwen 4-bit: 37.8% (high variance, low quality)
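
These percentages are the relative pass@5 improvements over avg@5 from the table above; a quick check of the arithmetic (plain Python, hypothetical helper name):

```python
# The gap is the percentage improvement of pass@5 over the avg@5 baseline.
def relative_gain(pass_at_5: float, avg_at_5: float) -> float:
    return (pass_at_5 - avg_at_5) / avg_at_5 * 100

print(f"{relative_gain(70.00, 60.80):.1f}%")  # Gemini Flash    -> 15.1%
print(f"{relative_gain(59.59, 51.49):.1f}%")  # Qwen3-0.6B-8bit -> 15.7%
print(f"{relative_gain(54.00, 39.20):.1f}%")  # Qwen3-0.6B-4bit -> 37.8%
```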

Recommendations

  1. For production: Use high-quality models with maj@5 for balanced accuracy/speed
  2. For best accuracy: Use genselect@5 with sufficient compute budget
  3. For uncertain domains: Use pass@5 to maximize coverage
  4. For quantized models: pass@5 provides significant accuracy recovery
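
As a usage illustration for the recommendations above: a hypothetical client call through a locally running optillm proxy. The base URL, port, and the `genselect-` model-name prefix are assumptions here; see the README for the actual routing convention.

```python
# Hypothetical call through a locally running optillm proxy.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="optillm")

response = client.chat.completions.create(
    model="genselect-gpt-4o-mini",  # assumed: "genselect-" prefix routes to the plugin
    messages=[{"role": "user", "content": "What is 17 * 24?"}],
)
print(response.choices[0].message.content)
```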

codelion merged commit aabd5f1 into main on Jul 24, 2025
3 checks passed
codelion deleted the feat-add-genselect branch on July 24, 2025 at 06:50