WIP Refactor tools to better detect/correct hallucinations within code blocks #51


Merged: 10 commits into main from validate_admin_gql, Jul 9, 2025

Conversation

@nelsonwittwer (Contributor) commented on Jun 27, 2025:

Renames and refactors tools to better ensure generated code blocks are free of hallucinations and errors.

Context

Hallucinations within generated code blocks that interact with Shopify's APIs might be AI's biggest risk here. To better mitigate hallucinated endpoints, types, fields, etc., this PR introduces a new framework that enables validation tools for each of Shopify's APIs.

Validation Tools

Tools are MCP primitives. We plan to create tools designed to validate code blocks for each Shopify API. These validation tools will follow the naming pattern `validate_#{api_name}_codeblocks`.
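
As a rough sketch, a tool following this pattern might be registered like so (assuming the MCP TypeScript SDK and zod; the description, parameter shape, and stub validator are illustrative assumptions, not this PR's actual code):

```typescript
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { z } from "zod";

// Stand-in for a deterministic validation function (sketched in the next section).
const validateCodeblock = (block: string) => ({ status: "skip" as const });

// Hypothetical registration following the validate_#{api_name}_codeblocks
// naming pattern; description and parameter shape are assumptions.
export function registerAdminGqlValidator(server: McpServer): void {
  server.tool(
    "validate_admingql_codeblocks",
    "Validate GraphQL code blocks against the Admin API schema.",
    { codeblocks: z.array(z.string()) },
    async ({ codeblocks }) => ({
      content: [
        {
          type: "text" as const,
          text: JSON.stringify(codeblocks.map(validateCodeblock)),
        },
      ],
    }),
  );
}
```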

Validation Functions

Ideally these are deterministic (always the same output given the same input) to counterbalance the nondeterministic nature of LLMs.

Admin GQL Proof of concept

This PR introduces a validation tool that parses GQL code blocks against the Admin GQL schema and gives skip, pass, or fail validation feedback to the LLM.
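
The deterministic core of such a tool could look something like this sketch, assuming graphql-js; the result shape is an assumption rather than the PR's actual types:

```typescript
import { buildSchema, parse, validate } from "graphql";

type ValidationResult =
  | { status: "pass" }
  | { status: "fail"; errors: string[] }
  | { status: "skip"; reason: string };

// Deterministic by construction: the same schema and operation always
// produce the same result, regardless of what the LLM does.
function validateOperation(schemaSdl: string, operation: string): ValidationResult {
  if (operation.trim() === "") {
    return { status: "skip", reason: "no GraphQL operation found" };
  }
  let document;
  try {
    document = parse(operation); // syntax check
  } catch (e) {
    return { status: "fail", errors: [(e as Error).message] };
  }
  const errors = validate(buildSchema(schemaSdl), document); // schema check
  return errors.length === 0
    ? { status: "pass" }
    : { status: "fail", errors: errors.map((err) => err.message) };
}
```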

This giant refactor is broken up into stacked PRs that enable the new validation structure:

  1. (this PR) Implements this validation tool/pattern for the Admin GQL workflow
  2. Refactor tools/index.ts to extract HTTP calls into their own file. shopifyTools is way too big of a function and needs some composition, as it will only grow as we add tools.
  3. Track tool usage with a conversationId
  4. Rename tools to be more descriptive
  5. Better composition for validation functions

Results

The following evals were run on the last PR in the list above, demonstrating that all the refactors work as intended.

The tests below assert the following:

  1. an agent using our MCP server can iterate to passing or skipped results
  2. the eval detected calls to the `validate_admingql_codeblocks` tool

Note: we don't test the validity details of the tools with this eval framework; that is covered extensively by unit tests for each validation function/tool. We just care that the tool was called as expected and that the LLM can take the output of those tools to make the necessary corrections.

With thinking models

Run with GPT o3 - This infrastructure did incredibly well with a thinking model 😎

Comprehensive suite (644 prompts)

Screenshot 2025-07-02 at 11 30 15 AM

Golden suite (20 prompts)

Screenshot 2025-07-02 at 10 23 18 AM

With non-thinking models

Run with GPT 4.1 - this model was far worse at calling validation tools as we expect. This doesn't mean that we have invalid code blocks; it's that we don't have the assurance that the LLM validated/corrected them.

Despite very explicit and triplicated instructions, 4.1 only called validation functions 65% of the time.

What we ultimately care about in a development/regression setting is how accurate the code blocks 4.1 generates with our MCP server are. I wrote a custom promptfoo assertion that evaluates code block accuracy. This assertion passes if 4.1 uses validation functions within 3 agent loops; if the LLM didn't invoke the validation, we run the same validation logic in the assertion itself. While I'm not happy with how infrequently 4.1 calls our validation tools as we expect, the model does generate code blocks that are valid 85% of the time:

Comprehensive suite (644 prompts)

Screenshot 2025-07-02 at 9 55 06 PM

Golden suite (20 prompts)

Screenshot 2025-07-02 at 10 08 40 PM
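
For reference, a custom promptfoo assertion along the lines described above might look roughly like this (a sketch assuming promptfoo's file-based JavaScript/TypeScript assertion interface; the tool-call detection and the helpers are assumptions, not the actual assertion used here):

```typescript
// Assumed helpers, e.g. reusing the deterministic validator sketched earlier.
declare function extractCodeblocks(markdown: string): string[];
declare function validateOperation(
  schemaSdl: string,
  operation: string,
): { status: "pass" | "fail" | "skip" };
declare const ADMIN_SCHEMA_SDL: string;

// Pass if the agent called the validation tool during its loops; otherwise
// run the same deterministic validation on the generated code blocks here.
export default function codeblocksAreValid(output: string): boolean {
  if (output.includes("validate_admingql_codeblocks")) {
    return true;
  }
  return extractCodeblocks(output).every(
    (block) => validateOperation(ADMIN_SCHEMA_SDL, block).status !== "fail",
  );
}
```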

@nelsonwittwer requested a review from a team as a code owner on June 27, 2025.
@nelsonwittwer changed the title from "WIP - Validate admin gql" to "Refactor tools to better detect/correct hallucinations within code blocks" on Jul 1, 2025.
@nelsonwittwer changed the title from "Refactor tools to better detect/correct hallucinations within code blocks" to "WIP Refactor tools to better detect/correct hallucinations within code blocks" on Jul 1, 2025.
@nelsonwittwer force-pushed the validate_admin_gql branch 2 times, most recently from e25400f to 9ce5ca1, on July 3, 2025.
@matteodepalo (Contributor) left a comment:

A couple of comments but the code looks good!

```ts
  return SCHEMA_MAPPINGS[schemaName];
}

function extractGraphQLOperation(markdownCodeBlock: string): string | null {
```
@matteodepalo commented on this diff:

Can we make the LLM do the extraction? In general I'd avoid having tools that expect markdown as input and would rather leverage the LLM's capabilities to pass code straight into the input so that the tools can be cleaner.
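
For context, an extraction helper with the signature shown in the diff might look roughly like this (a hypothetical sketch; the regex and behavior are assumptions, not the PR's implementation):

````typescript
// Hypothetical sketch: pull the first fenced GraphQL operation out of a
// markdown code block; returns null when no fenced block is found.
function extractGraphQLOperation(markdownCodeBlock: string): string | null {
  const match = markdownCodeBlock.match(/```(?:graphql|gql)?\s*\n([\s\S]*?)```/);
  return match ? match[1].trim() : null;
}
````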

```diff
@@ -256,6 +259,54 @@ export async function shopifyTools(server: McpServer): Promise<void> {
   },
 );

 server.tool(
   "validate_admin_api_codeblocks",
```
@matteodepalo commented on this diff:

Could we rename the tool to `validate_graphql`? I had a PR to add exactly this tool here. We could then have `api_surface` and `version` as params to this tool if we want. Having one specific tool per API surface is going to lead to tool explosion, and Cursor already has a low limit on the number of tools it can use.
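
For illustration, the consolidated tool suggested here might be registered along these lines (a sketch; `api_surface` and `version` come from the comment, while the surface names, version format, and helper are assumptions):

```typescript
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { z } from "zod";

// Assumed helper that looks up the schema for a surface/version and validates.
declare function runValidation(
  apiSurface: string,
  version: string,
  operation: string,
): unknown;

// Hypothetical single validate_graphql tool, parameterized by API surface and
// version instead of one tool per API.
export function registerValidateGraphql(server: McpServer): void {
  server.tool(
    "validate_graphql",
    "Validate a GraphQL operation against a Shopify API schema.",
    {
      api_surface: z.enum(["admin", "storefront"]), // assumed surface names
      version: z.string(), // e.g. "2025-07" (assumed format)
      operation: z.string(),
    },
    async ({ api_surface, version, operation }) => ({
      content: [
        {
          type: "text" as const,
          text: JSON.stringify(runValidation(api_surface, version, operation)),
        },
      ],
    }),
  );
}
```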

@Arkham (Contributor) left a comment:

LGTM

@Arkham merged commit 76ed940 into main on Jul 9, 2025.
4 checks passed
@github-actions bot mentioned this pull request on Aug 14, 2025.