Commit 232f8eb

Merge pull request #175 from CerebrasResearch/main
Add plan diversity and update rating prompt in CePO
2 parents 1dc8687 + 71802b9 commit 232f8eb

4 files changed: +128 -65 lines changed

README.md

Lines changed: 15 additions & 10 deletions
@@ -393,7 +393,7 @@ optillm supports various command-line arguments for configuration. When using Do
 | `--return-full-response` | Return the full response including the CoT with <thinking> tags | `False` |
 | `--port` | Specify the port to run the proxy | 8000 |
 | `--optillm-api-key` | Optional API key for client authentication to optillm | `""` |
-| `--cepo_*` | See CePO Parameters section below for detailed configuration options | Various |
+| `--cepo_*` | See CePO Parameters section below for detailed config options | Various |
 
 <details>
 <summary><strong>CePO Parameters</strong></summary>
@@ -415,7 +415,9 @@ optillm supports various command-line arguments for configuration. When using Do
 | `--cepo_planning_max_tokens_step3` | Maximum number of tokens in step 3 of planning stage | 4096 |
 | `--cepo_planning_max_tokens_step4` | Maximum number of tokens in step 4 of planning stage | 4096 |
 | `--cepo_print_output` | Whether to print the output of each stage | `False` |
-| `--cepo_config_file` | Path to CePO configuration file | None |
+| `--cepo_config_file` | Path to CePO configuration file | `None` |
+| `--cepo_use_plan_diversity` | Use additional plan diversity step | `False` |
+| `--cepo_rating_model` | Specify a model for rating step if different than for completion | `None` |
 
 </details>
 
@@ -464,14 +466,17 @@ Authorization: Bearer your_secret_api_key
 
 ## SOTA results on benchmarks with optillm
 
-### CePO on math and code benchmarks (Jan 2025)
-
-| Method | Math-L5 | MMLU-Pro (Math) | GPQA | CRUX | LiveCodeBench (pass@1) | Simple QA |
-| -------------------------: | :-----: | :-------------: | :--: | :--: | :--------------------: | :-------: |
-| Llama 3.1 70B | 41.6 | 72.9 | 41.7 | 64.2 | 24.5 | 14.7 |
-| Llama 3.3 70B | 51.0 | 78.6 | 49.1 | 72.6 | 27.1 | 20.9 |
-| Llama 3.1 405B | 49.8 | 79.2 | 50.7 | 73.0 | 31.8 | 13.5 |
-| CePO (using Llama 3.3 70B) | 69.6 | 84.8 | 55.5 | 80.1 | 31.9 | 22.6 |
+### CePO on math and code benchmarks (Mar 2025)
+
+| Method | Math-L5 | MMLU-Pro (Math) | CRUX | LiveCodeBench (pass@1) | Simple QA |
+| -----------------------------: | :-----: | :-------------: | :----: | :--------------------: | :-------: |
+| Llama 3.3 70B | 51.0 | 78.6 | 72.6 | 27.1 | 20.9 |
+| Llama 3.1 405B | 49.8 | 79.2 | 73.0 | 31.8 | 13.5 |
+| CePO (using Llama 3.3 70B) | 69.6 | 84.8 | 80.1 | 31.9 | **22.6** |
+| QwQ 32B | 61.4 | 90.8 | 82.5 | 44.3 | 7.8 |
+| CePO (using QwQ 32B) | 88.1 | **92.0** | 86.3 | **51.5** | 8.2 |
+| DeepSeek R1 Llama | 83.1 | 82.0 | 84.0 | 47.3 | 14.6 |
+| CePO (using DeepSeek R1 Llama) |**90.2** | 84.0 |**89.4**| 47.2 | 15.5 |
 
 ### coc-claude-3-5-sonnet-20241022 on AIME 2024 pass@1 (Nov 2024)
 
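Reviewer note: the per-model effect of CePO in the updated table is easier to read as a difference against each base model. The snippet below is only an illustrative calculation over two columns copied from the Mar 2025 table; the `scores` layout and the `gains` helper are hypothetical names and not part of optillm.

```python
# Scores copied from the Mar 2025 table as (base model, CePO on that model) pairs.
# The dictionary layout and helper name are assumptions made for this note.
scores = {
    "Llama 3.3 70B": {"Math-L5": (51.0, 69.6), "LiveCodeBench": (27.1, 31.9)},
    "QwQ 32B": {"Math-L5": (61.4, 88.1), "LiveCodeBench": (44.3, 51.5)},
    "DeepSeek R1 Llama": {"Math-L5": (83.1, 90.2), "LiveCodeBench": (47.3, 47.2)},
}


def gains(table):
    """Return CePO score minus base score, rounded to one decimal place."""
    return {
        model: {bench: round(cepo - base, 1) for bench, (base, cepo) in benches.items()}
        for model, benches in table.items()
    }


if __name__ == "__main__":
    for model, per_bench in gains(scores).items():
        print(model, per_bench)
```

On these two columns the gain is positive everywhere except LiveCodeBench with DeepSeek R1 Llama, where the CePO row is 0.1 lower than the base model.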
optillm/cepo/README.md

Lines changed: 2 additions & 18 deletions
@@ -6,7 +6,7 @@ If you have any questions or want to contribute, please reach out to us on [cere
 
 ## CePO Methodology
 
-In CePO, the Best of N technique is applied to `bestofn_n` solution candidates. Each solution is generated through the following four steps:
+In CePO, the Best of N technique is applied to `bestofn_n` solution candidates. Optionally (when `cepo_use_plan_diversity` is set to `True`), the model will attempt to come up with diverse approaches for each of best of n completions. Each completion is generated through the following four steps:
 
 **Step 1**: Plan Generation
 The model generates a detailed, step-by-step plan to solve the problem, along with its confidence level for each step.
@@ -25,20 +25,4 @@ The model uses the refined plan from Step 3 to produce the final answer.
 
 ## CePO Current Status
 
-This project is a work in progress, and the provided code is in an early experimental stage. While the proposed approach works well across the benchmarks we tested, further improvements can be achieved by task-specific customizations to prompts.
-
-## CePO Ablation studies
-
-We conducted ablation studies to evaluate the impact of various hyperparameters in the CePO framework. Our results indicate that the chosen hyperparameter settings strike a good balance between computational cost and accuracy.
-
-Interestingly, the self-critique and quality improvement capabilities of existing off-the-shelf models do not always scale proportionally with increased inference compute. Addressing this limitation remains a key focus, and we plan to explore custom model fine-tuning as a potential solution in the future.
-
-| bestofn_n | planning_n | planning_m | bestofn_rating_type | Math-L5 | MMLU-Pro (Math) | GPQA | CRUX | Comments |
-| :-------: | :--------: | :--------: | :-----------------: | :-----: | :-------------: | :---: | :---: | :------------- |
-| 3 | 3 | 6 | absolute | 69.6 | 84.8 | 55.5 | 80.1 | Default config |
-| 3 | 3 | 6 | pairwise | 67.7 | 83.5 | 55.6 | 79.8 | |
-| 3 | 2 | 5 | absolute | 67.1 | 85.1 | 55.1 | 79.0 | |
-| 3 | 5 | 8 | absolute | 69.4 | 84.3 | 55.6 | 81.1 | |
-| 5 | 3 | 6 | absolute | 68.7 | 85.4 | 54.8 | 79.9 | |
-| 7 | 3 | 6 | absolute | 69.6 | 82.8 | 54.7 | 78.4 | |
-| 9 | 3 | 6 | absolute | 68.9 | 83.4 | 55.7 | 80.6 | |
+This project is a work in progress, and the provided code is in an early experimental stage. While the proposed approach works well across the benchmarks we tested, further improvements can be achieved by task-specific customizations to prompts.
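Reviewer note: the updated methodology text describes Best of N over `bestofn_n` completions with an optional plan diversity step, and the parameter table adds a separate rating model. The following is a minimal sketch of that control flow under stated assumptions, not the code from this commit. The callables `generate`, `rate`, and `propose_approaches` and all the `toy_*` stand-ins are hypothetical; in practice `generate` would run the four planning steps and `rate` could be backed by a different model, in the spirit of `--cepo_rating_model`. Only absolute (per-completion) rating is shown.

```python
from typing import Callable, List, Optional


def best_of_n(
    task: str,
    generate: Callable[[str, Optional[str]], str],
    rate: Callable[[str, str], float],
    bestofn_n: int = 3,
    use_plan_diversity: bool = False,
    propose_approaches: Optional[Callable[[str, int], List[str]]] = None,
) -> str:
    """Generate bestofn_n completions and return the highest-rated one."""
    # Without plan diversity every completion starts from the same prompt.
    approaches: List[Optional[str]] = [None] * bestofn_n
    if use_plan_diversity and propose_approaches is not None:
        proposed = list(propose_approaches(task, bestofn_n))
        # Pad with None if fewer approaches were proposed than requested.
        approaches = (proposed + [None] * bestofn_n)[:bestofn_n]

    completions = [generate(task, approach) for approach in approaches]
    # Absolute rating: each completion is scored independently.
    ratings = [rate(task, completion) for completion in completions]
    best_index = max(range(len(completions)), key=ratings.__getitem__)
    return completions[best_index]


# Toy stand-ins so the sketch runs without any model calls (all hypothetical).
def toy_generate(task: str, approach: Optional[str]) -> str:
    return f"{task} via {approach or 'default plan'}"


def toy_propose(task: str, n: int) -> List[str]:
    return [f"approach {i + 1}" for i in range(n)]


def toy_rate(task: str, completion: str) -> float:
    return 1.0 if "approach 2" in completion else 0.0


if __name__ == "__main__":
    print(best_of_n("q", toy_generate, toy_rate,
                    use_plan_diversity=True, propose_approaches=toy_propose))
```

Keeping generation and rating as separate callables makes it straightforward to swap in a different rater without touching the generation path.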
