WIP Refactor tools to better detect/correct hallucinations within code blocks #51
Conversation
Force-pushed from e25400f to 9ce5ca1
A couple of comments but the code looks good!
src/validations/graphqlSchema.ts (Outdated)
```typescript
  return SCHEMA_MAPPINGS[schemaName];
}

function extractGraphQLOperation(markdownCodeBlock: string): string | null {
```
Can we make the LLM do the extraction? In general I'd avoid having tools that expect markdown as input and would rather leverage the LLM's capabilities to pass code straight into the input so that the tools can be cleaner.
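For example, something along these lines (just a sketch; the tool name, parameter, and the syntax-only check are illustrative, not the actual implementation):

```typescript
import { z } from "zod";
import { parse } from "graphql";
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";

// Sketch: the tool receives a raw GraphQL operation, so the LLM (not the tool)
// is responsible for pulling the code out of any markdown fences.
export function registerValidateGraphql(server: McpServer) {
  server.tool(
    "validate_graphql",
    { operation: z.string().describe("A raw GraphQL operation, without markdown fences") },
    async ({ operation }) => {
      try {
        parse(operation); // syntax-only here; real validation would also check against the schema
        return { content: [{ type: "text", text: "PASS" }] };
      } catch (error) {
        return { content: [{ type: "text", text: `FAIL: ${(error as Error).message}` }] };
      }
    },
  );
}
```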
src/tools/index.ts (Outdated)
```
@@ -256,6 +259,54 @@ export async function shopifyTools(server: McpServer): Promise<void> {
    },
  );

  server.tool(
    "validate_admin_api_codeblocks",
```
Could we rename the tool to `validate_graphql`? I had a PR to add exactly this tool here. We could then have `api_surface` and `version` as params to this tool if we want. Having one specific tool per API surface is going to lead to tool explosion, and Cursor already has a low limit on the number of tools it can use.
…so it receives raw graphql operations
…ns when necessary
LGTM
Renames and refactors tools to better ensure generated code blocks are free of hallucinations and errors.
Context
Hallucinations within generated code blocks that interact with Shopify's APIs may be the biggest risk of AI-generated code. To better mitigate hallucinated endpoints, types, fields, etc., this PR introduces a new framework to enable validation tools for each of Shopify's APIs.
Validation Tools
Tools are MCP primitives. We plan to create tools designed to validate code blocks for each Shopify API. These validation tools will follow the naming pattern `validate_#{api_name}_codeblocks`.
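As a rough sketch of how that pattern might be wired up (the validator type and registration helper below are illustrative, not the PR's actual code):

```typescript
import { z } from "zod";
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";

// One deterministic validator per API surface, keyed by API name (e.g. "admingql").
type CodeblockValidator = (codeblocks: string[]) => Promise<string>;

export function registerValidationTools(
  server: McpServer,
  validators: Record<string, CodeblockValidator>,
) {
  for (const [apiName, validate] of Object.entries(validators)) {
    server.tool(
      // Follows the validate_#{api_name}_codeblocks convention, e.g. validate_admingql_codeblocks.
      `validate_${apiName}_codeblocks`,
      { codeblocks: z.array(z.string()).describe("Code blocks to validate") },
      async ({ codeblocks }) => ({
        content: [{ type: "text", text: await validate(codeblocks) }],
      }),
    );
  }
}
```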
Validation Functions
Ideally these are deterministic (always the same output given the same input) to counterbalance the nondeterministic nature of LLMs.
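A minimal sketch of the contract such a function might satisfy (the shape is illustrative, not the PR's actual types):

```typescript
// Deterministic by construction: no network calls, no randomness, so the same
// code block always yields the same verdict, counterbalancing the LLM's nondeterminism.
export type ValidationStatus = "SKIP" | "PASS" | "FAIL";

export interface ValidationResult {
  status: ValidationStatus;
  // Parser or schema errors the LLM can use to correct the code block.
  details?: string;
}

export type ValidationFunction = (codeblock: string) => ValidationResult;
```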
Admin GQL Proof of concept
This PR introduces a validation tool that parses GQL code blocks against the Admin GQL schema and gives skip, pass, or fail validation feedback to the LLM.
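A rough sketch of what that validation could look like with graphql-js, assuming the Admin schema is available locally as SDL (the file path, the SKIP-on-unparseable behavior, and the result shape are assumptions, not necessarily the PR's exact logic):

```typescript
import { readFileSync } from "node:fs";
import { buildSchema, parse, validate } from "graphql";

// Assumed location of the Admin GraphQL schema in SDL form.
const adminSchema = buildSchema(readFileSync("data/admin_schema.graphql", "utf8"));

export function validateAdminOperation(operation: string): {
  status: "SKIP" | "PASS" | "FAIL";
  details?: string;
} {
  let documentNode;
  try {
    documentNode = parse(operation);
  } catch {
    // Not parseable as GraphQL at all; skip rather than fail, since the block may not be GraphQL.
    return { status: "SKIP" };
  }
  const errors = validate(adminSchema, documentNode);
  return errors.length === 0
    ? { status: "PASS" }
    : { status: "FAIL", details: errors.map((error) => error.message).join("\n") };
}
```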
Giant refactor broken up into these PRs:

Stacked PRs to enable this new validation structure:
- Refactor `tools/index.ts` to extract HTTP calls into their own file. `shopifyTools` is way too big of a function and needs some composition, as this function will only grow as we add tools (a rough sketch of the split follows).
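The composition idea, sketched (the module and helper names are hypothetical, not the PR's actual file layout):

```typescript
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
// Hypothetical modules; the real split lands in the stacked PRs above.
import { registerDocsTools } from "./docsTools.js";
import { registerValidationTools } from "./validationTools.js";

// shopifyTools stays small and only composes the per-concern registrars.
export async function shopifyTools(server: McpServer): Promise<void> {
  await registerDocsTools(server);
  await registerValidationTools(server);
}
```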
Results
The following evals were run on the last PR in the list above, demonstrating that all the refactors work as intended:
The tests below assert the following:
- The LLM calls the `validate_admingql_codeblocks` tool as expected.

Note: we don't test the validity details of the tools with this eval framework. That is covered extensively by unit tests for each validation function/tool. We just care that the tool was called as expected and that the LLM can take the output of those tools to make the necessary corrections.
With thinking models
Run with GPT o3 - This infrastructure did incredibly well with a thinking model 😎
Comprehensive suite (644 prompts)
Golden suite (20 prompts)
With non-thinking models
Run with GPT 4.1 - These results were far worse; 4.1 does not call validation tools as we expect. This doesn't mean that we have invalid code blocks, it's that we don't have the assurance that the LLM validated/corrected them.
Despite very explicit and triplicated instructions, 4.1 only called validation functions 65% of the time.
What we ultimately care about in a development/regression setting is how accurate the code blocks that 4.1 generates with our MCP server are. I wrote a custom promptfoo assertion that evaluates code block accuracy (a rough sketch is included after the results below). This assertion passes if 4.1 uses the validation functions within 3 agent loops, or, if the LLM didn't invoke the validation, if the same validation logic run inside the assertion passes. While I'm not happy with how infrequently 4.1 calls our validation tools, the model does generate code blocks that are valid 85% of the time:
Comprehensive suite (644 prompts)
Golden suite (20 prompts)
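For reference, a minimal sketch of the fallback half of that assertion (running the same validation logic over the generated code blocks), assuming promptfoo's custom JavaScript/TypeScript assertion interface where the exported function receives the model output and returns a pass/score/reason result. The check for validation-tool calls within 3 agent loops is omitted, and `validateAdminOperation` refers to the earlier sketch, not the actual shared implementation:

````typescript
// Custom promptfoo assertion sketch: extract GraphQL code blocks from the model's
// markdown output and run the deterministic validation over each of them.
import { validateAdminOperation } from "./validateAdminOperation"; // the sketch above, as a module

const GRAPHQL_FENCE = /```graphql\s*\n([\s\S]*?)```/g;

export default async function assertCodeblocksValid(output: string) {
  const failures: string[] = [];
  for (const match of output.matchAll(GRAPHQL_FENCE)) {
    const result = validateAdminOperation(match[1]);
    if (result.status === "FAIL") failures.push(result.details ?? "unknown error");
  }
  return failures.length === 0
    ? { pass: true, score: 1, reason: "All GraphQL code blocks validate against the Admin schema" }
    : { pass: false, score: 0, reason: failures.join("\n") };
}
````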