# System Prompt Learning (SPL) Plugin for OptiLLM

This plugin implements Andrej Karpathy's [proposed](https://x.com/karpathy/status/1921368644069765486) "third paradigm" for LLM learning, enabling large language models to learn and improve their problem-solving strategies over time through experience and reflection.

## Introduction: The Evolution of LLM Learning

Large Language Models (LLMs) have traditionally learned in two primary ways:
1. **Pretraining**: Learning facts, patterns, and language from massive text corpora
2. **Finetuning**: Learning behaviors through supervised or reinforcement learning

System Prompt Learning introduces a third paradigm:
3. **Strategy Learning**: The model learns explicit problem-solving strategies through experience, maintains them in a growing knowledge base, and applies them selectively based on problem types

This approach addresses a fundamental limitation of current LLMs—their inability to learn cumulatively from experience. While LLMs can solve individual problems impressively, they typically approach each new problem from scratch rather than building on past successes.

## The SPL Paradigm

System Prompt Learning represents a significant shift in how LLMs approach problem-solving:

- **Experience-Driven Learning**: Rather than relying solely on pretraining or supervised finetuning, SPL enables models to learn from their own problem-solving experiences
- **Strategy Formalization**: The system explicitly generates, evaluates, and refines problem-solving strategies
- **Performance Tracking**: SPL tracks which strategies work well for different problem types, creating a dynamic feedback loop
- **Selective Application**: When faced with a new problem, the system selects the most relevant strategies based on similarity and past performance

This approach mirrors how human experts develop expertise—by accumulating strategies through experience and applying them selectively to new situations.

## Experimental Results

We conducted extensive experiments using the SPL plugin with gemini-2.0-flash-lite on various benchmarks. The learning phase used the OptILLMBench training split (400 instances), while evaluation was performed on the test split (100 instances) and additional popular mathematical benchmarks.

The results demonstrate consistent improvements across all benchmarks:

| Benchmark | Baseline | With SPL | Improvement (points) |
|-----------|----------|----------|----------------------|
| OptILLMBench | 61% | 65% | +4 |
| MATH-500 | 85% | 85.6% | +0.6 |
| Arena Auto Hard | 29% | 37.6% | +8.6 |
| AIME24 | 23.33% | 30% | +6.67 |

These results are particularly notable for the challenging Arena Auto Hard and AIME24 benchmarks, where traditional approaches often struggle. The improvements suggest that SPL is especially effective for complex problem-solving tasks that benefit from strategic approaches.

*Figure 1: Performance comparison between baseline gemini-2.0-flash-lite and the same model with SPL across multiple mathematical benchmarks.*

## Usage

### Basic Usage

Use the plugin by prefixing your model name with `spl-`:

```
spl-gpt-4o
```
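
For example, a minimal request through a local OptiLLM proxy might look like the sketch below (the `base_url`, `api_key`, and example prompt are assumptions; adjust them to your deployment):

```python
from openai import OpenAI

# Assumes an OptiLLM proxy running locally; adjust base_url and api_key as needed.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="optillm")

response = client.chat.completions.create(
    model="spl-gpt-4o",  # the spl- prefix routes the request through the SPL plugin
    messages=[
        {"role": "user", "content": "A train travels 120 km in 1.5 hours. What is its average speed?"},
    ],
)
print(response.choices[0].message.content)
```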

### Combining with Other Plugins

SPL can be combined with other plugins using the `&` operator:

```
spl&memory-gpt-4o
```

### Learning Mode

By default, the plugin runs in inference-only mode, which uses existing strategies without creating or modifying them. To enable learning mode, which allows the plugin to create and refine strategies based on usage, add the `spl_learning` parameter to the request config:

```python
response = client.chat.completions.create(
    model="spl-gpt-4o",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": query},
    ],
    extra_body={"spl_learning": True},  # enables strategy creation and refinement
)
```

## How It Works

1. **Problem Classification**: The plugin analyzes each query to determine its problem type
2. **Strategy Selection**: It selects relevant strategies from its database based on the problem type and content
3. **System Prompt Augmentation**: Selected strategies (up to MAX_STRATEGIES_FOR_INFERENCE) are added to the system prompt

When learning mode is enabled, the plugin also performs:

4. **Effectiveness Evaluation**: After generating a response, the system evaluates how well each strategy worked
5. **Strategy Creation & Refinement**: The system creates new strategies for unseen problem types and periodically refines existing strategies based on usage

The plugin maintains two separate limits:
- **Storage Limit** (MAX_STRATEGIES_PER_TYPE): Controls how many strategies can be stored in the database per problem type
- **Inference Limit** (MAX_STRATEGIES_FOR_INFERENCE): Controls how many strategies are used during inference for system prompt augmentation

*Figure 2: The SPL learning and inference workflow showing how strategies are learned, refined, and applied.*
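
To make the inference-time flow concrete, here is a minimal sketch of strategy selection and system prompt augmentation. The function names, field names, and ranking heuristic are illustrative assumptions rather than the plugin's actual API; the strategy fields mirror the example record shown later in this README.

```python
MAX_STRATEGIES_FOR_INFERENCE = 3  # illustrative value


def classify_problem(query: str) -> str:
    # Hypothetical classifier: the real plugin asks the LLM to label the problem type.
    return "word_problem" if any(ch.isdigit() for ch in query) else "general"


def augment_system_prompt(system_prompt: str, query: str, strategies: list[dict]) -> str:
    """Select the most relevant strategies for a query and append them to the system prompt."""
    problem_type = classify_problem(query)
    # Keep strategies matching the problem type, ranked by observed success rate.
    candidates = [s for s in strategies if s["problem_type"] == problem_type]
    candidates.sort(key=lambda s: s["success_count"] / max(s["total_attempts"], 1), reverse=True)
    selected = candidates[:MAX_STRATEGIES_FOR_INFERENCE]
    if not selected:
        return system_prompt
    strategy_block = "\n\n".join(s["strategy_text"] for s in selected)
    return f"{system_prompt}\n\nProblem-solving strategies to consider:\n\n{strategy_block}"
```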

## Learning Metrics

After training on the OptILLMBench dataset, the system developed a rich knowledge base of strategies:

- **Total queries processed**: 500
- **Strategies created**: 129
- **Strategies refined**: 97
- **Successful resolutions**: 346
- **Strategies merged**: 28

These metrics indicate a healthy learning process with a balance between creation, refinement, and merging of similar strategies.

## Data Storage

Strategies are stored in JSON format in the `spl_data` directory:
- `strategies.json`: Contains all learned strategies
- `metrics.json`: Contains performance metrics and usage statistics

## Configuration

The SPL plugin maintains these core files:
- **Strategy Database**: `/optillm/plugins/spl/data/strategies.json`
- **Metrics**: `/optillm/plugins/spl/data/metrics.json`

You can:
1. Back up these files to preserve learned strategies
2. Edit `strategies.json` to manually add or modify strategies
3. Reset the learning by deleting these files (they will be recreated)
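
For instance, a small script along these lines can back up and inspect the strategy database (a hedged sketch: it assumes the default path listed above and that the file stores a list of strategy records like the example in the next section):

```python
import json
import shutil
from pathlib import Path

# Default location from the Configuration section; adjust for your installation.
STRATEGY_DB = Path("optillm/plugins/spl/data/strategies.json")

# 1. Back up the current strategy database before editing or resetting it.
shutil.copy(STRATEGY_DB, STRATEGY_DB.with_name(STRATEGY_DB.name + ".bak"))

# 2. Print a quick summary of each learned strategy.
strategies = json.loads(STRATEGY_DB.read_text())
for s in strategies:
    rate = s["success_count"] / max(s["total_attempts"], 1)
    print(f"{s['strategy_id']}: {s['problem_type']} (success rate {rate:.1%})")
```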

## Example Strategy

Below is an example of a strategy learned by the system for word problems:

```json
{
  "strategy_id": "strategy_3",
  "problem_type": "word_problem",
  "strategy_text": "**Refined Strategy for Solving Word Problems:**\n\n1. **Understand:**\n * Read the problem carefully (multiple times).\n * Identify the question (what are you trying to find?).\n * List all given information (facts, numbers, units).\n * Clarify ambiguous terms/units.\n\n2. **Organize Information & Identify Unknowns:**\n * Choose an organization method: (e.g., table, diagram, list, drawing).\n * Clearly identify the unknowns (what you need to solve for).\n\n3. **Plan and Translate:**\n * Define *all* variables with units (e.g., `p = number of pennies`, `c = number of compartments`).\n * Identify relationships between knowns and unknowns.\n * Convert units if necessary.\n * Write equations or expressions, including units, that relate the knowns and unknowns.\n * Ensure units are consistent throughout the equations.\n * Outline the solution steps.\n\n4. **Solve:**\n * Show work step-by-step.\n * Track units throughout calculations.\n * Calculate accurately.\n * Solve for the unknowns.\n\n5. **Evaluate and Verify:**\n * Check if the answer is reasonable.\n * Verify the answer.\n\n6. **Summarize:**\n * State the answer with units.",
  "success_count": 85,
  "total_attempts": 192,
  "confidence": 0.425
}
```

This strategy was developed through multiple refinement cycles and has a success rate of 44.3% (85/192). The system continuously updates these metrics as the strategy is applied to new problems.
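
As a rough illustration of the bookkeeping involved, the per-attempt update presumably looks something like the sketch below; this is an assumption about how the `success_count`, `total_attempts`, and `confidence` fields relate, not the plugin's actual code.

```python
def record_attempt(strategy: dict, succeeded: bool) -> None:
    """Update a strategy's usage statistics after it has been applied to a problem."""
    strategy["total_attempts"] += 1
    if succeeded:
        strategy["success_count"] += 1
    # Treat confidence as the observed success rate; the real plugin may smooth or weight it,
    # which would explain the small gap between 0.425 and 85/192 in the example above.
    strategy["confidence"] = strategy["success_count"] / strategy["total_attempts"]
```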

## Motivations and Broader Impact

### The System Prompt Gap

Major LLM providers such as Anthropic (Claude) and OpenAI (GPT) employ elaborate system prompts that encode sophisticated problem-solving strategies. However, the majority of users interact with these models using very basic or empty system prompts, missing out on the benefits of strategic guidance.

SPL bridges this gap by automatically learning and applying effective strategies, democratizing access to the benefits of well-crafted system prompts without requiring expertise in prompt engineering.

### Learning from Experience

Current LLMs effectively treat each task in isolation—they can solve individual problems but don't accumulate knowledge from these experiences. SPL represents a step toward models that improve through use, similar to how humans develop expertise through practice and reflection.

### Human-Readable Learning

Unlike black-box learning approaches, SPL produces human-readable strategies that can be inspected, understood, and even manually edited. This transparency allows for:
- Understanding how the model approaches different problems
- Identifying potential biases or flaws in reasoning
- Transferring strategies between models or domains

## Benefits

1. **Cumulative Learning**: The LLM improves on specific problem types over time
2. **Explicit Knowledge**: Strategies are human-readable and provide insight into the LLM's reasoning
3. **Efficiency**: Reuses successful approaches rather than solving each problem from scratch
4. **Adaptability**: Different strategies for different problem types
5. **Transparency**: Learning process and outcomes can be inspected and understood

## Conclusion and Future Work

System Prompt Learning represents a promising new direction for enabling LLMs to learn from experience in a transparent and interpretable way. Our experiments show consistent performance improvements across multiple benchmarks, with the largest gains on complex problem-solving tasks.

Future work will focus on:
1. Expanding the range of problem types the system can recognize
2. Improving the strategy refinement process
3. Enabling cross-domain strategy transfer
4. Developing mechanisms for human feedback on strategies
5. Exploring hybrid approaches that combine SPL with other learning paradigms