codelion
diff --git a/README.md b/README.md
Lines changed: 15 additions & 10 deletions
diff --git a/optillm/cepo/README.md b/optillm/cepo/README.md
Lines changed: 2 additions & 18 deletions
@@ -393,7 +393,7 @@ optillm supports various command-line arguments for configuration. When using Do
 | `--return-full-response` | Return the full response including the CoT with <thinking> tags | `False`         |
 | `--port`                 | Specify the port to run the proxy                               | 8000            |
 | `--optillm-api-key`      | Optional API key for client authentication to optillm           | `""`            |
-| `--cepo_*` | See CePO Parameters section below for detailed configuration options | Various |
+| `--cepo_*`               | See CePO Parameters section below for detailed config options   | Various         |
 
 <details>
 <summary><strong>CePO Parameters</strong></summary>
@@ -415,7 +415,9 @@ optillm supports various command-line arguments for configuration. When using Do
 | `--cepo_planning_max_tokens_step3` | Maximum number of tokens in step 3 of planning stage | 4096 |
 | `--cepo_planning_max_tokens_step4` | Maximum number of tokens in step 4 of planning stage | 4096 |
 | `--cepo_print_output` | Whether to print the output of each stage | `False` |
-| `--cepo_config_file` | Path to CePO configuration file | None |
+| `--cepo_config_file` | Path to CePO configuration file | `None` |
+| `--cepo_use_plan_diversity` | Use an additional plan diversity step | `False` |
+| `--cepo_rating_model` | Model to use for the rating step, if different from the completion model | `None` |
 
 </details>
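
The two options added in this hunk can be illustrated with a minimal sketch. It assumes that a `None` rating model means "reuse the completion model", which the table implies but does not state. `CepoOptions` and `resolve_rating_model` are hypothetical names chosen for illustration, not optillm's actual API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CepoOptions:
    # Hypothetical container: field names mirror the CLI flags
    # without the `--cepo_` prefix, with the defaults from the table.
    use_plan_diversity: bool = False
    rating_model: Optional[str] = None


def resolve_rating_model(options: CepoOptions, completion_model: str) -> str:
    """Pick the model for the rating step.

    Assumption: when no rating model is configured, the rating step
    falls back to the model used for completions.
    """
    return options.rating_model or completion_model
```

With the defaults, `resolve_rating_model(CepoOptions(), "some-model")` returns the completion model name; setting `rating_model` overrides it.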
 
@@ -464,14 +466,17 @@ Authorization: Bearer your_secret_api_key
 
 ## SOTA results on benchmarks with optillm
 
-### CePO on math and code benchmarks (Jan 2025)
-
-| Method                     | Math-L5 | MMLU-Pro (Math) | GPQA | CRUX | LiveCodeBench (pass@1) | Simple QA |
-| -------------------------: | :-----: | :-------------: | :--: | :--: | :--------------------: | :-------: |
-| Llama 3.1 70B              |  41.6   |      72.9       | 41.7 | 64.2 |          24.5          |    14.7   |
-| Llama 3.3 70B              |  51.0   |      78.6       | 49.1 | 72.6 |          27.1          |    20.9   |
-| Llama 3.1 405B             |  49.8   |      79.2       | 50.7 | 73.0 |          31.8          |    13.5   |
-| CePO (using Llama 3.3 70B) |  69.6   |      84.8       | 55.5 | 80.1 |          31.9          |    22.6   |
+### CePO on math and code benchmarks (Mar 2025)
+
+| Method                         | Math-L5 | MMLU-Pro (Math) |  CRUX  | LiveCodeBench (pass@1) | Simple QA |
+| -----------------------------: | :-----: | :-------------: | :----: | :--------------------: | :-------: |
+| Llama 3.3 70B                  |  51.0   |      78.6       |  72.6  |          27.1          |    20.9   |
+| Llama 3.1 405B                 |  49.8   |      79.2       |  73.0  |          31.8          |    13.5   |
+| CePO (using Llama 3.3 70B)     |  69.6   |      84.8       |  80.1  |          31.9          |  **22.6** |
+| QwQ 32B                        |  61.4   |      90.8       |  82.5  |          44.3          |    7.8    |
+| CePO (using QwQ 32B)           |  88.1   |    **92.0**     |  86.3  |        **51.5**        |    8.2    |
+| DeepSeek R1 Llama              |  83.1   |      82.0       |  84.0  |          47.3          |    14.6   |
+| CePO (using DeepSeek R1 Llama) |**90.2** |      84.0       |**89.4**|          47.2          |    15.5   |
 
 ### coc-claude-3-5-sonnet-20241022 on AIME 2024 pass@1 (Nov 2024)
 
 
@@ -6,7 +6,7 @@ If you have any questions or want to contribute, please reach out to us on [cere
 
 ## CePO Methodology
 
-In CePO, the Best of N technique is applied to `bestofn_n` solution candidates. Each solution is generated through the following four steps:
+In CePO, the Best of N technique is applied to `bestofn_n` solution candidates. Optionally (when `cepo_use_plan_diversity` is set to `True`), the model attempts to come up with a different approach for each of the `bestofn_n` completions. Each completion is generated through the following four steps:
 
 **Step 1**: Plan Generation
 The model generates a detailed, step-by-step plan to solve the problem, along with its confidence level for each step.
@@ -25,20 +25,4 @@ The model uses the refined plan from Step 3 to produce the final answer.
 
 ## CePO Current Status
 
-This project is a work in progress, and the provided code is in an early experimental stage. While the proposed approach works well across the benchmarks we tested, further improvements can be achieved by task-specific customizations to prompts.
-
-## CePO Ablation studies
-
-We conducted ablation studies to evaluate the impact of various hyperparameters in the CePO framework. Our results indicate that the chosen hyperparameter settings strike a good balance between computational cost and accuracy.
-
-Interestingly, the self-critique and quality improvement capabilities of existing off-the-shelf models do not always scale proportionally with increased inference compute. Addressing this limitation remains a key focus, and we plan to explore custom model fine-tuning as a potential solution in the future.
-
-| bestofn_n | planning_n | planning_m | bestofn_rating_type | Math-L5 | MMLU-Pro (Math) | GPQA  | CRUX  | Comments       |
-| :-------: | :--------: | :--------: | :-----------------: | :-----: | :-------------: | :---: | :---: | :------------- |
-|     3     |      3     |      6     |       absolute      |  69.6   |      84.8       | 55.5  | 80.1  | Default config |
-|     3     |      3     |      6     |       pairwise      |  67.7   |      83.5       | 55.6  | 79.8  |                |
-|     3     |      2     |      5     |       absolute      |  67.1   |      85.1       | 55.1  | 79.0  |                |
-|     3     |      5     |      8     |       absolute      |  69.4   |      84.3       | 55.6  | 81.1  |                |
-|     5     |      3     |      6     |       absolute      |  68.7   |      85.4       | 54.8  | 79.9  |                |
-|     7     |      3     |      6     |       absolute      |  69.6   |      82.8       | 54.7  | 78.4  |                |
-|     9     |      3     |      6     |       absolute      |  68.9   |      83.4       | 55.7  | 80.6  |                |
+This project is a work in progress, and the provided code is in an early experimental stage. While the proposed approach works well across the benchmarks we tested, further improvements can be achieved by task-specific customizations to prompts.
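
The Best of N flow described in the CePO Methodology hunk, including the optional plan diversity step, can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the code in `optillm/cepo`: `generate`, `rate`, and `propose_approaches` are hypothetical callables, `generate` stands in for the whole four-step planning pipeline, and requesting all approaches up front is an assumption about ordering.

```python
from typing import Callable, List, Optional


def best_of_n(
    question: str,
    n: int,
    generate: Callable[[str, Optional[str]], str],
    rate: Callable[[str, str], float],
    propose_approaches: Optional[Callable[[str, int], List[str]]] = None,
) -> str:
    """Generate n completions and return the highest-rated one.

    `generate(question, hint)` is a placeholder for the four-step
    planning pipeline; `hint` is an optional suggested approach.
    """
    hints: List[Optional[str]]
    if propose_approaches is not None:
        # Plan diversity enabled (assumed behavior): request one
        # approach per completion, padding with None if fewer come back.
        hints = list(propose_approaches(question, n))[:n]
        hints += [None] * (n - len(hints))
    else:
        # Plan diversity disabled: every completion starts without a hint.
        hints = [None] * n

    completions = [generate(question, hint) for hint in hints]
    # Absolute rating: score each completion independently, keep the best.
    return max(completions, key=lambda c: rate(question, c))
```

Passing `propose_approaches=None` corresponds to `cepo_use_plan_diversity` being `False`. The rating callable could wrap a separate rating model where one is configured.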