WIP Refactor tools to better detect/correct hallucinations within code blocks #51
Conversation
Force-pushed from e25400f to 9ce5ca1
A couple of comments but the code looks good!
src/validations/graphqlSchema.ts (Outdated)
```typescript
  return SCHEMA_MAPPINGS[schemaName];
}

function extractGraphQLOperation(markdownCodeBlock: string): string | null {
```
Can we make the LLM do the extraction? In general I'd avoid having tools that expect markdown as input and would rather leverage the LLM's capabilities to pass code straight into the input so that the tools can be cleaner.
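For example, something along these lines (just a sketch; the tool name, parameter, and the syntax-only check are illustrative, not the actual implementation):

```typescript
import { z } from "zod";
import { parse } from "graphql";
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";

// Sketch: the tool receives a raw GraphQL operation, so the LLM (not the tool)
// is responsible for pulling the code out of any markdown fences.
export function registerValidateGraphql(server: McpServer) {
  server.tool(
    "validate_graphql",
    { operation: z.string().describe("A raw GraphQL operation, without markdown fences") },
    async ({ operation }) => {
      try {
        parse(operation); // syntax-only here; real validation would also check against the schema
        return { content: [{ type: "text", text: "PASS" }] };
      } catch (error) {
        return { content: [{ type: "text", text: `FAIL: ${(error as Error).message}` }] };
      }
    },
  );
}
```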
src/tools/index.ts (Outdated)
```
@@ -256,6 +259,54 @@ export async function shopifyTools(server: McpServer): Promise<void> {
    },
  );

  server.tool(
    "validate_admin_api_codeblocks",
```
Could we rename the tool to `validate_graphql`? I had a PR to add exactly this tool here. We could then have `api_surface` and `version` as params to this tool if we want. Having one specific tool per API surface is going to lead to tool explosion, and Cursor already has a low limit on the number of tools it can use.
…so it receives raw graphql operations
…ns when necessary
LGTM
Renames and refactors tools to better ensure generated code blocks are free of hallucinations and errors.
Context
Hallucinations within generated code blocks that interact with Shopify's APIs may be the biggest risk of AI-generated code. To better mitigate hallucinated endpoints, types, fields, etc., this PR introduces a new framework to enable validation tools for each of Shopify's APIs.
Validation Tools
Tools are MCP primitives. We plan to create tools designed to validate code blocks for each Shopify API. These validation tools will follow the naming pattern `validate_#{api_name}_codeblocks`.
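As a rough sketch of how that pattern might be wired up (the validator type and registration helper below are illustrative, not the PR's actual code):

```typescript
import { z } from "zod";
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";

// One deterministic validator per API surface, keyed by API name (e.g. "admingql").
type CodeblockValidator = (codeblocks: string[]) => Promise<string>;

export function registerValidationTools(
  server: McpServer,
  validators: Record<string, CodeblockValidator>,
) {
  for (const [apiName, validate] of Object.entries(validators)) {
    server.tool(
      // Follows the validate_#{api_name}_codeblocks convention, e.g. validate_admingql_codeblocks.
      `validate_${apiName}_codeblocks`,
      { codeblocks: z.array(z.string()).describe("Code blocks to validate") },
      async ({ codeblocks }) => ({
        content: [{ type: "text", text: await validate(codeblocks) }],
      }),
    );
  }
}
```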
Validation Functions
Ideally these are deterministic (always the same output given the same input) to counterbalance the nondeterministic nature of LLMs.
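A minimal sketch of the contract such a function might satisfy (the shape is illustrative, not the PR's actual types):

```typescript
// Deterministic by construction: no network calls, no randomness, so the same
// code block always yields the same verdict, counterbalancing the LLM's nondeterminism.
export type ValidationStatus = "SKIP" | "PASS" | "FAIL";

export interface ValidationResult {
  status: ValidationStatus;
  // Parser or schema errors the LLM can use to correct the code block.
  details?: string;
}

export type ValidationFunction = (codeblock: string) => ValidationResult;
```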
Admin GQL Proof of concept
This PR introduces a validation tool that parses GQL code blocks against the Admin GQL schema and gives skip, pass, or fail validation feedback to the LLM.
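A rough sketch of what that validation could look like with graphql-js, assuming the Admin schema is available locally as SDL (the file path, the SKIP-on-unparseable behavior, and the result shape are assumptions, not necessarily the PR's exact logic):

```typescript
import { readFileSync } from "node:fs";
import { buildSchema, parse, validate } from "graphql";

// Assumed location of the Admin GraphQL schema in SDL form.
const adminSchema = buildSchema(readFileSync("data/admin_schema.graphql", "utf8"));

export function validateAdminOperation(operation: string): {
  status: "SKIP" | "PASS" | "FAIL";
  details?: string;
} {
  let documentNode;
  try {
    documentNode = parse(operation);
  } catch {
    // Not parseable as GraphQL at all; skip rather than fail, since the block may not be GraphQL.
    return { status: "SKIP" };
  }
  const errors = validate(adminSchema, documentNode);
  return errors.length === 0
    ? { status: "PASS" }
    : { status: "FAIL", details: errors.map((error) => error.message).join("\n") };
}
```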
Giant refactor broken up into these PRs:

Stacked PRs to enable this new validation structure:
- Refactor `tools/index.ts` to extract HTTP calls into their own file. `shopifyTools` is way too big of a function and needs some composition, as this function will only grow as we add tools (a rough sketch of the split follows).
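The composition idea, sketched (the module and helper names are hypothetical, not the PR's actual file layout):

```typescript
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
// Hypothetical modules; the real split lands in the stacked PRs above.
import { registerDocsTools } from "./docsTools.js";
import { registerValidationTools } from "./validationTools.js";

// shopifyTools stays small and only composes the per-concern registrars.
export async function shopifyTools(server: McpServer): Promise<void> {
  await registerDocsTools(server);
  await registerValidationTools(server);
}
```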
Results
The following evals were run on the last PR in the list above, demonstrating that all the refactors work as intended:
The tests below assert the following:
- The LLM calls the `validate_admingql_codeblocks` tool as expected.

Note: we don't test the validity details of the tools with this eval framework. That is covered extensively by unit tests for each validation function/tool. We just care that the tool was called as expected and that the LLM can take the output of those tools to make the necessary corrections.
With thinking models
Run with GPT o3 - This infrastructure did incredibly well with a thinking model 😎
Comprehensive suite (644 prompts)
Golden suite (20 prompts)
With non-thinking models
Run with GPT 4.1 - These results were far worse; 4.1 does not call validation tools as we expect. This doesn't mean that we have invalid code blocks, it's that we don't have the assurance that the LLM validated/corrected them.
Despite very explicit and triplicated instructions, 4.1 only called validation functions 65% of the time.
What we ultimately care about in a development/regression setting is how accurate the code blocks that 4.1 generates with our MCP server are. I wrote a custom promptfoo assertion that evaluates code block accuracy (a rough sketch is included after the results below). This assertion passes if 4.1 uses the validation functions within 3 agent loops, or, if the LLM didn't invoke the validation, if the same validation logic run inside the assertion passes. While I'm not happy with how infrequently 4.1 calls our validation tools, the model does generate code blocks that are valid 85% of the time:
Comprehensive suite (644 prompts)
Golden suite (20 prompts)
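For reference, a minimal sketch of the fallback half of that assertion (running the same validation logic over the generated code blocks), assuming promptfoo's custom JavaScript/TypeScript assertion interface where the exported function receives the model output and returns a pass/score/reason result. The check for validation-tool calls within 3 agent loops is omitted, and `validateAdminOperation` refers to the earlier sketch, not the actual shared implementation:

````typescript
// Custom promptfoo assertion sketch: extract GraphQL code blocks from the model's
// markdown output and run the deterministic validation over each of them.
import { validateAdminOperation } from "./validateAdminOperation"; // the sketch above, as a module

const GRAPHQL_FENCE = /```graphql\s*\n([\s\S]*?)```/g;

export default async function assertCodeblocksValid(output: string) {
  const failures: string[] = [];
  for (const match of output.matchAll(GRAPHQL_FENCE)) {
    const result = validateAdminOperation(match[1]);
    if (result.status === "FAIL") failures.push(result.details ?? "unknown error");
  }
  return failures.length === 0
    ? { pass: true, score: 1, reason: "All GraphQL code blocks validate against the Admin schema" }
    : { pass: false, score: 0, reason: failures.join("\n") };
}
````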