Strengthen anomaly detector edge coverage and remove order-dependent test behavior by Copilot · Pull Request #41916 · github/gh-aw

Copilot · 2026-06-27T18:39:33Z

pkg/agentdrain/anomaly_test.go was missing key boundary coverage around rareThreshold and had an order-dependent TestAnalyzeEvent structure that could hide failures under subtest shuffling. This update closes those gaps and pins the documented Analyze invariants.

Rare-cluster boundary coverage
- Added TestAnomalyDetector_Analyze rows for:
  - cluster.Size == rareThreshold
  - cluster.Size == rareThreshold + 1
- This explicitly protects the inclusive predicate cluster.Size <= rareThreshold.
Zero-threshold edge behavior
- Added rareThreshold=0 cases for cluster.Size=0 and cluster.Size=1.
- Confirms behavior is correct at the lower bound, not just constructor acceptance.
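
The boundary cases listed above can be sketched in isolation. This is a minimal illustration, not the code in pkg/agentdrain: the helper name isRare is hypothetical, and only the inclusive predicate cluster.Size <= rareThreshold comes from the PR description.

```go
package main

import "fmt"

// isRare is a hypothetical stand-in for the rare-cluster check. It mirrors
// the inclusive predicate size <= rareThreshold quoted in the PR description.
func isRare(size, rareThreshold int) bool {
	return size <= rareThreshold
}

func main() {
	cases := []struct {
		size, threshold int
	}{
		{2, 2}, // at the threshold: rare
		{3, 2}, // one above the threshold: not rare
		{0, 0}, // zero threshold, zero-sized cluster: rare
		{1, 0}, // zero threshold, size one: not rare
	}
	for _, c := range cases {
		fmt.Printf("size=%d threshold=%d rare=%v\n", c.size, c.threshold, isRare(c.size, c.threshold))
	}
}
```

If the predicate were accidentally changed to a strict <, the first and third rows would flip, which is the regression these rows are meant to catch.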
Deterministic AnalyzeEvent progression
- Refactored TestAnalyzeEvent from shared-state subtests into a single sequential flow:
  1. first occurrence (new template)
  2. second identical occurrence (existing template)
  3. distinct event (new template)
- Removes implicit t.Run ordering dependency.
Invariant pinning for Analyze flags and score
- Added TestAnalyze_FlagMutualExclusivity to assert:
  - IsNewTemplate and LowSimilarity are never both true
  - AnomalyScore remains within [0,1]
  - nil-cluster behavior is explicit (RareCluster == false)

report := d.Analyze(tt.result, tt.isNew, tt.cluster)
assert.False(t, report.IsNewTemplate && report.LowSimilarity)
assert.GreaterOrEqual(t, report.AnomalyScore, 0.0)
assert.LessOrEqual(t, report.AnomalyScore, 1.0)
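
As a rough sketch of why these invariants can hold, the following assumes a scoring function that only considers low similarity for existing templates and normalises a weighted sum. The weights (1.0, 0.7, 0.3) and the divisor 2.0 are the values quoted in the review comments on this PR. The type, the function name, the similarity threshold 0.4, and the use of a nil *int for a nil cluster are illustrative assumptions, not the package's API.

```go
package main

import "fmt"

// Weights and normaliser as quoted in the review comments; everything else
// in this file is a hypothetical sketch.
const (
	weightNew  = 1.0
	weightLow  = 0.7
	weightRare = 0.3
	maxScore   = 2.0
)

type report struct {
	IsNewTemplate bool
	LowSimilarity bool
	RareCluster   bool
	AnomalyScore  float64
}

// analyze takes clusterSize as a pointer so that nil can model a nil cluster.
func analyze(isNew bool, similarity, simThreshold float64, clusterSize *int, rareThreshold int) report {
	r := report{IsNewTemplate: isNew}
	// Assumption: low similarity is evaluated only for existing templates,
	// which is what keeps the two flags mutually exclusive in this sketch.
	r.LowSimilarity = !isNew && similarity < simThreshold
	// A nil cluster is never rare.
	r.RareCluster = clusterSize != nil && *clusterSize <= rareThreshold

	sum := 0.0
	if r.IsNewTemplate {
		sum += weightNew
	}
	if r.LowSimilarity {
		sum += weightLow
	}
	if r.RareCluster {
		sum += weightRare
	}
	r.AnomalyScore = sum / maxScore
	return r
}

func main() {
	one := 1
	fmt.Printf("%+v\n", analyze(true, 0.1, 0.4, &one, 2))
	fmt.Printf("%+v\n", analyze(false, 0.1, 0.4, &one, 2))
	fmt.Printf("%+v\n", analyze(false, 0.9, 0.4, nil, 2))
}
```

Under these assumptions the largest reachable sum is 1.0 + 0.3 = 1.3, so the normalised score cannot leave [0,1].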

pr-sous-chef: requested branch refresh via update_branch.

Generated by 👨‍🍳 PR Sous Chef · 83 AIC · ⌖ 0.986 AIC · ⊞ 17.1K · ◷

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>

github-actions · 2026-06-27T18:51:27Z

✅ PR Code Quality Reviewer completed the code quality review.

github-actions · 2026-06-27T18:51:39Z

🧠 Matt Pocock Skills Reviewer has completed the skills-based review. ✅

github-actions · 2026-06-27T18:52:00Z

✅ Test Quality Sentinel completed test quality analysis.

Copilot

Pull request overview

This PR strengthens test coverage for the pkg/agentdrain anomaly detection logic by adding explicit boundary/edge cases for rareThreshold and making TestAnalyzeEvent deterministic by removing order-dependent subtests.

Changes:

Added table-driven Analyze test cases covering cluster.Size == rareThreshold, cluster.Size == rareThreshold+1, and rareThreshold=0 boundaries.
Refactored TestAnalyzeEvent into a single sequential progression to avoid relying on t.Run ordering.
Added an invariant test (TestAnalyze_FlagMutualExclusivity) to pin mutual exclusivity of flags and ensure AnomalyScore stays within [0,1], including explicit nil-cluster behavior.

File	Description
pkg/agentdrain/anomaly_test.go	Adds boundary/edge coverage for `Analyze`, removes order-dependent subtests in `AnalyzeEvent`, and pins key `Analyze` invariants.

Review details

Files reviewed: 1/1 changed files
Comments generated: 0
Review effort level: Low

github-actions

Skills-Based Review 🧠

Applied /tdd — good foundational coverage additions, with a few gaps worth addressing.

📋 Key Themes & Highlights

Key Themes

MatchResult assertions missing: resultFirst/Second/Distinct are nil-checked but their fields (ClusterID, Similarity) are never verified — a regression in match data would be invisible.
TestAnalyze_FlagMutualExclusivity is incomplete as a spec: no wantIsNew field, IsNewTemplate is only tested indirectly, and score assertions are too loose ([0,1] rather than exact values).
t.Run removal tradeoff: the flat sequential structure makes the state dependency explicit (good) but loses per-step test filtering and failure attribution. The stated reason ("subtest shuffling") only applies with go test -shuffle=on, which is non-default.

Positive Highlights

✅ Excellent boundary commentary in each new table-driven case (e.g. // 2 <= 2 is true, // score = 0.3/2.0 = 0.15) — makes the predicate immediately verifiable.
✅ Correct and complete rareThreshold boundary coverage: size == threshold, size == threshold+1, threshold=0 with both size=0 and size=1.
✅ TestAnalyze_FlagMutualExclusivity is a valuable invariant test that didn't exist before; the nil-cluster case is especially useful.
✅ TestBuildReason coverage of the "all flags set" case is well-commented re: mutual exclusivity.

🧠 Reviewed using Matt Pocock's skills by Matt Pocock Skills Reviewer · 104.1 AIC · ⌖ 7.51 AIC · ⊞ 6.6K

github-actions · 2026-06-27T19:01:24Z

+	// Step 1: first occurrence trains the template and is flagged as new.
+	resultFirst, reportFirst, errFirst := m.AnalyzeEvent(evtPlan)
+	require.NoError(t, errFirst, "AnalyzeEvent should not fail for first event")
+	require.NotNil(t, resultFirst, "AnalyzeEvent should return a non-nil result")


[/tdd] resultFirst (and resultSecond, resultDistinct) are only checked for nil — the actual MatchResult content (ClusterID, Similarity, template tokens) is never asserted, leaving those fields unspecified.

💡 Suggested enhancement

Pinning result fields alongside report fields fully documents the expected state after each step:

require.NotNil(t, resultFirst)
assert.Equal(t, int64(1), resultFirst.ClusterID, "should match cluster 1")
assert.InDelta(t, 1.0, resultFirst.Similarity, 1e-9, "first event matches its own new template exactly")

This catches regressions where Analyze returns the correct report but AnalyzeEvent silently returns mismatched match data.
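
To make the point concrete, here is a deliberately simplified sketch of a miner whose returned match data can be pinned step by step. It uses exact string matching and sequential IDs starting at 1, which are assumptions for illustration only; the real miner in pkg/agentdrain is not shown in this thread and may behave differently.

```go
package main

import "fmt"

// matchResult and miner are hypothetical stand-ins, not the package's types.
type matchResult struct {
	ClusterID  int64
	Similarity float64
}

type miner struct {
	ids    map[string]int64
	nextID int64
}

func newMiner() *miner {
	return &miner{ids: map[string]int64{}, nextID: 1}
}

// analyzeEvent returns the match for line and whether its template was new.
// Exact string equality stands in for real template matching.
func (m *miner) analyzeEvent(line string) (*matchResult, bool) {
	id, known := m.ids[line]
	if !known {
		id = m.nextID
		m.nextID++
		m.ids[line] = id
	}
	return &matchResult{ClusterID: id, Similarity: 1.0}, !known
}

func main() {
	m := newMiner()
	first, firstNew := m.analyzeEvent("plan step")
	second, secondNew := m.analyzeEvent("plan step")
	distinct, distinctNew := m.analyzeEvent("tool call")
	fmt.Println(first.ClusterID, firstNew)
	fmt.Println(second.ClusterID, secondNew)
	fmt.Println(distinct.ClusterID, distinctNew)
}
```

Checking ClusterID after every step, rather than only non-nil, is what would expose a miner that reports the right flags while attaching an event to the wrong cluster.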

@copilot please address this.

github-actions · 2026-06-27T19:01:25Z

-		assert.InDelta(t, 0.65, report.AnomalyScore, 1e-9, "AnomalyScore mismatch for first event")
-		assert.Equal(t, "new log template discovered; rare cluster (few observations)", report.Reason, "Reason mismatch for first event")
-	})
+	// Step 1: first occurrence trains the template and is flagged as new.


[/tdd] Removing t.Run loses the ability to run individual steps with go test -run TestAnalyzeEvent/step_name and weakens failure attribution in test output — a failure at Step 2 now silently skips Step 3 without naming the step.

💡 Tradeoffs and alternatives

The PR description frames this as preventing "subtest shuffling," but go test -shuffle=on is not the default — the real problem was shared miner state across subtests.

If the sequential coupling is intentional, consider a clarifying comment:

// NOTE: Steps share miner state intentionally; do not extract into parallel subtests
// or run with -shuffle=on.

Alternatively, keep t.Run but document the dependency:

t.Run("first occurrence is flagged as new template", func(t *testing.T) {
	// miner state starts empty
	result, report, err := m.AnalyzeEvent(evtPlan)
	...
})
// Step 2 relies on Step 1 having trained the template
t.Run("second identical occurrence is not new", func(t *testing.T) {
	...
})

This preserves per-step filtering (-run TestAnalyzeEvent/first) and makes the ordering contract visible.
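
The cost of that filtering is worth seeing in miniature. The sketch below does not use the testing package; it simulates named steps that share state and a filter that skips non-matching steps, in the way a -run pattern such as TestAnalyzeEvent/second skips subtests that do not match. All names here are hypothetical.

```go
package main

import "fmt"

// step is a named unit of work that reports whether the event looked new.
type step struct {
	name string
	run  func(seen map[string]int) bool
}

// observe records one occurrence of the same event in the shared state.
func observe(seen map[string]int) bool {
	seen["plan"]++
	return seen["plan"] == 1
}

func demoSteps() []step {
	return []step{
		{name: "first", run: observe},
		{name: "second", run: observe},
	}
}

// runSteps runs the steps in order against shared state. A non-empty only
// value skips every other step, simulating a subtest filter.
func runSteps(steps []step, only string) map[string]bool {
	seen := map[string]int{} // shared across steps, like the miner in the test
	out := map[string]bool{}
	for _, s := range steps {
		if only != "" && s.name != only {
			continue
		}
		out[s.name] = s.run(seen)
	}
	return out
}

func main() {
	fmt.Println(runSteps(demoSteps(), ""))
	fmt.Println(runSteps(demoSteps(), "second"))
}
```

Run in full, the second step sees a known event. Run alone, the same step sees a new one, so an assertion written for the full sequence would fail or mislead when filtered. That is the ordering contract the comments above are meant to make visible.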

@copilot please address this.

github-actions · 2026-06-27T19:01:25Z

+		result   *MatchResult
+		cluster  *Cluster
+		wantLow  bool
+		wantRare bool


[/tdd] The test struct omits a wantIsNew bool field — IsNewTemplate is only verified indirectly via the mutual-exclusivity assert on line 460, not directly per test case.

💡 Suggested enhancement

Add wantIsNew bool to the struct and assert it explicitly in the loop, making the table fully self-documenting:

tests := []struct {
	name      string
	isNew     bool
	result    *MatchResult
	cluster   *Cluster
	wantIsNew bool // add this
	wantLow   bool
	wantRare  bool
}{
	{
		name:      "new template remains exclusive from low similarity",
		isNew:     true,
		...
		wantIsNew: true,
		wantLow:   false,
		wantRare:  true,
	},
	...
}

// in loop:
assert.Equal(t, tt.wantIsNew, report.IsNewTemplate, "IsNewTemplate mismatch")

This catches regressions in the isNew → IsNewTemplate pass-through independently of the mutual-exclusivity check.

@copilot please address this.

github-actions · 2026-06-27T19:01:25Z

+			assert.Equal(t, tt.wantLow, report.LowSimilarity, "LowSimilarity mismatch")
+			assert.Equal(t, tt.wantRare, report.RareCluster, "RareCluster mismatch")
+			assert.False(t, report.IsNewTemplate && report.LowSimilarity, "IsNewTemplate and LowSimilarity must be mutually exclusive")
+			assert.GreaterOrEqual(t, report.AnomalyScore, 0.0, "AnomalyScore must stay within [0,1]")


[/tdd] The score assertions only check bounds [0, 1] — a silent change to the scoring weights would pass these tests without detection.

💡 Suggested enhancement

Add wantScore float64 per test case and use assert.InDelta (consistent with TestAnomalyDetector_Analyze):

// "new template remains exclusive from low similarity":
// isNew=true + rare (size 1 <= rareThreshold 2) → score = (1.0+0.3)/2.0 = 0.65
wantScore: 0.65,

// "existing template can have low similarity":
// lowSim + rare → score = (0.7+0.3)/2.0 = 0.50
wantScore: 0.50,

// "nil cluster, high similarity":
// no flags → score = 0.0
wantScore: 0.0,

Then in the loop:

assert.InDelta(t, tt.wantScore, report.AnomalyScore, 1e-9, "AnomalyScore mismatch")
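
The difference between the bounds check and an exact check can be sketched as follows. The weights are the ones quoted in this review; the weights struct, the score and inDelta helpers, and the retuned value 0.4 are assumptions made up for the illustration, and inDelta only approximates the spirit of a tolerance comparison rather than reproducing any library's implementation.

```go
package main

import (
	"fmt"
	"math"
)

// weights groups the scoring constants so a retuned set can be compared.
type weights struct {
	isNew, lowSim, rare, max float64
}

// score sums the weights of the set flags and normalises by w.max.
func score(w weights, isNew, lowSim, rare bool) float64 {
	sum := 0.0
	if isNew {
		sum += w.isNew
	}
	if lowSim {
		sum += w.lowSim
	}
	if rare {
		sum += w.rare
	}
	return sum / w.max
}

// inDelta reports whether got is within delta of want.
func inDelta(want, got, delta float64) bool {
	return math.Abs(want-got) <= delta
}

func main() {
	current := weights{isNew: 1.0, lowSim: 0.7, rare: 0.3, max: 2.0}
	retuned := weights{isNew: 1.0, lowSim: 0.7, rare: 0.4, max: 2.0} // hypothetical change
	for _, w := range []weights{current, retuned} {
		got := score(w, true, false, true)
		inBounds := got >= 0 && got <= 1
		fmt.Printf("score=%.2f inBounds=%v exact=%v\n", got, inBounds, inDelta(0.65, got, 1e-9))
	}
}
```

Both weight sets satisfy the [0,1] bound for the new-and-rare case, but only the current one matches 0.65, so only the exact assertion notices the retune.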

@copilot please address this.

github-actions · 2026-06-27T19:01:25Z

+		{
+			// rareThreshold=0 boundary: size=0 satisfies 0 <= 0.
+			// score = (rare-cluster weight 0.3) / maxScore 2.0 = 0.15
+			name:              "zero rare threshold marks zero-sized cluster as rare",


[/tdd] The Size: 0 cluster case is a useful defensive test, but it leaves readers wondering whether a zero-sized cluster is a reachable production state or purely hypothetical.

💡 Suggestion

A brief comment clarifies the intent, e.g.:

// rareThreshold=0 boundary: size=0 satisfies 0 <= 0.
// Note: a cluster with Size=0 is not a normal production state (clusters grow when events are
// assigned to them), but the predicate is evaluated regardless; this test pins the boundary
// arithmetic, not production reachability.
name: "zero rare threshold marks zero-sized cluster as rare",

@copilot please address this.

github-actions

Non-blocking: test brittleness in the refactored `TestAnalyzeEvent` flow

The boundary-coverage additions to TestAnomalyDetector_Analyze and the new TestAnalyze_FlagMutualExclusivity invariant test are solid additions. The TestAnalyzeEvent refactoring itself is directionally correct, but introduces two maintainability gaps documented in the inline comments.

Summary of findings

Medium — hardcoded score values couple to scoring-weight constants (line 393)
All three InDelta assertions bake in the current formula constants (1.0/0.3/2.0) without documenting the math. The table-driven tests above do document it inline — inconsistent standard.

Medium — step 2 score silently depends on DefaultConfig().RareClusterThreshold == 2 (line 402)
After the second AnalyzeEvent, the cluster has Size=2. Asserting AnomalyScore == 0.15 requires 2 <= rareThreshold, but rareThreshold comes opaquely from DefaultConfig(). Changing that default breaks the test with a confusing numeric mismatch rather than an obvious config-change signal.

Neither finding is a behavioral regression — documentation/maintainability concerns only.

Warning

Firewall blocked 1 domain

The following domain was blocked by the firewall during workflow execution:

patchdiff.githubusercontent.com

To allow these domains, add them to the network.allowed list in your workflow frontmatter:

network:
  allowed:
    - defaults
    - "patchdiff.githubusercontent.com"

See Network Configuration for more information.

🔎 Code quality review by PR Code Quality Reviewer · 99.1 AIC · ⌖ 7.25 AIC · ⊞ 5.2K

github-actions · 2026-06-27T19:01:58Z

+	require.NotNil(t, resultFirst, "AnalyzeEvent should return a non-nil result")
+	require.NotNil(t, reportFirst, "AnalyzeEvent should return a non-nil report")
+	assert.True(t, reportFirst.IsNewTemplate, "IsNewTemplate mismatch for first event")
+	assert.InDelta(t, 0.65, reportFirst.AnomalyScore, 1e-9, "AnomalyScore mismatch for first event")


Hardcoded scores white-box couple to scoring weights and DefaultConfig() internals — all three steps are fragile.

All three InDelta assertions (0.65, 0.15, 0.65) encode the production formula (weight_IsNew=1.0, weight_Rare=0.3) / maxScore=2.0 directly. If anyone retunes those weights, every assertion here fails with an opaque numeric mismatch that offers no diagnostic signal.

The table-driven tests above this function document their expected scores inline (e.g. // score = (1.0 + 0.3) / 2.0 = 0.65). This sequential flow has none of that context.

💡 Suggested fix

Add inline math comments tying each expected value to the formula:

// score = (weight_IsNewTemplate=1.0 + weight_RareCluster=0.3) / maxScore=2.0
assert.InDelta(t, 0.65, reportFirst.AnomalyScore, 1e-9, "AnomalyScore mismatch for first event")

When weights change, the reader immediately knows which constant to update rather than having to reverse-engineer 0.65.

github-actions · 2026-06-27T19:01:58Z

+	require.NotNil(t, resultSecond, "AnalyzeEvent should return a non-nil result")
+	require.NotNil(t, reportSecond, "AnalyzeEvent should return a non-nil report")
+	assert.False(t, reportSecond.IsNewTemplate, "IsNewTemplate mismatch for second identical event")
+	assert.InDelta(t, 0.15, reportSecond.AnomalyScore, 1e-9, "AnomalyScore mismatch for second identical event")


Step 2's score (0.15) silently depends on DefaultConfig().RareClusterThreshold == 2 — an undocumented load-bearing assumption.

After the first AnalyzeEvent(evtPlan) call the cluster has Size=1; after the second it becomes Size=2. The assertion AnomalyScore == 0.15 (= RareCluster weight 0.3 / maxScore 2.0) requires cluster.Size(2) <= rareThreshold(2) to be true. That's currently satisfied because DefaultConfig().RareClusterThreshold = 2, but the test has zero documentation of this dependency.

If someone changes RareClusterThreshold to 1 (a perfectly plausible 'tighten rare detection' config change), the condition becomes 2 <= 1 = false, RareCluster flips to false, and the test fails with expected 0.15 got 0.0 — a confusing numeric mismatch with no obvious explanation.

💡 Suggested fix

Verify the config value you're relying on, and document why:

// DefaultConfig().RareClusterThreshold == 2, so Size=2 (after second insert) is still rare.
// score = weight_RareCluster=0.3 / maxScore=2.0 = 0.15
assert.InDelta(t, 0.15, reportSecond.AnomalyScore, 1e-9, "AnomalyScore mismatch for second identical event")

Or, if the intent is to test flag transitions rather than exact scoring, drop the numeric assertion and assert only RareCluster == true and IsNewTemplate == false.
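
The dependency described in this comment can be reproduced with a few lines. The sketch assumes the cluster grows by one per identical event and that only the rare-cluster weight contributes at step 2, as the comment states. The function names are hypothetical, and checkThreshold is one possible shape for an explicit guard, not something the PR contains.

```go
package main

import "fmt"

// Values quoted in the review comment above.
const (
	weightRare = 0.3
	maxScore   = 2.0
)

// secondEventScore replays two identical events against one cluster and
// returns the score the second event would receive under rareThreshold.
func secondEventScore(rareThreshold int) float64 {
	size := 0
	size++ // first occurrence creates the cluster
	size++ // second occurrence joins it, so size is now 2
	if size <= rareThreshold {
		return weightRare / maxScore
	}
	return 0
}

// checkThreshold turns the hidden assumption into an explicit, readable failure.
func checkThreshold(got, want int) error {
	if got != want {
		return fmt.Errorf("test assumes RareClusterThreshold == %d, got %d", want, got)
	}
	return nil
}

func main() {
	for _, th := range []int{2, 1} {
		fmt.Printf("threshold=%d score=%.2f guard=%v\n", th, secondEventScore(th), checkThreshold(th, 2))
	}
}
```

With the guard in place, a changed default would surface as a message naming the threshold instead of a bare 0.15 versus 0.0 mismatch.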

github-actions · 2026-06-27T19:02:08Z

🧪 Test Quality Sentinel Report

✅ Test Quality Score: 100/100 — Excellent

Analyzed 7 test scenario(s) across 1 file: 7 design, 0 implementation, 0 guideline violation(s). TestAnalyzeEvent was refactored (order-dependency fix) with no new scenarios.

📊 Metrics & Test Classification (7 tests analyzed)

Metric	Value
New/modified tests analyzed	7
✅ Design tests (behavioral contracts)	7 (100%)
⚠️ Implementation tests (low value)	0 (0%)
Tests with error/edge cases	7 (100%)
Duplicate test clusters	0
Test inflation detected	No
🚨 Coding-guideline violations	0 (Go mock libraries / missing build tags / no assertion messages)

Test	File	Classification	Issues Detected
`TestAnomalyDetector_Analyze` — "cluster size exactly at rare threshold is rare"	`pkg/agentdrain/anomaly_test.go:~70`	✅ Design	—
`TestAnomalyDetector_Analyze` — "cluster size just above rare threshold is not rare"	`pkg/agentdrain/anomaly_test.go:~83`	✅ Design	—
`TestAnomalyDetector_Analyze` — "zero rare threshold marks zero-sized cluster as rare"	`pkg/agentdrain/anomaly_test.go:~96`	✅ Design	—
`TestAnomalyDetector_Analyze` — "zero rare threshold does not mark size one cluster as rare"	`pkg/agentdrain/anomaly_test.go:~109`	✅ Design	—
`TestAnalyze_FlagMutualExclusivity` — "new template remains exclusive from low similarity"	`pkg/agentdrain/anomaly_test.go:~428`	✅ Design	—
`TestAnalyze_FlagMutualExclusivity` — "existing template can have low similarity"	`pkg/agentdrain/anomaly_test.go:~436`	✅ Design	—
`TestAnalyze_FlagMutualExclusivity` — "existing template with high similarity and nil cluster"	`pkg/agentdrain/anomaly_test.go:~444`	✅ Design	—

Go: 1 (*_test.go); JavaScript: 0. Other languages detected but not scored.

Notes on TestAnalyzeEvent refactor: The 3 t.Run sub-tests were replaced with sequential inline assertions to eliminate order-dependent test behavior (sub-tests could previously be run in isolation without shared state, silently producing misleading pass results). No new scenarios were added; this is counted as a quality improvement.

Verdict

✅ Check passed. 0% implementation tests (threshold: 30%). All 7 new scenarios are behavioral contracts: 4 cover precise boundary conditions on the rareThreshold ≤ comparison (exact, off-by-one, and zero-value edges), and 3 enforce the IsNewTemplate/LowSimilarity mutual-exclusivity invariant including a nil-cluster edge case. Every assertion carries a descriptive message; no mock libraries are used; //go:build !integration is present.

🧪 Test quality analysis by Test Quality Sentinel · 115.3 AIC · ⌖ 23.2 AIC · ⊞ 8.4K · ◷

github-actions

✅ Test Quality Sentinel: 100/100. Test quality is excellent — 0% of new tests are implementation tests (threshold: 30%). All 7 new scenarios enforce behavioral contracts with full edge case coverage.

gh-aw-bot · 2026-06-27T19:20:05Z

@copilot please run the pr-finisher skill, update this branch from main (it is behind by 1 commit), and rerun the checks. If any unresolved review feedback remains, please address it in the same pass.

Generated by 👨‍🍳 PR Sous Chef · 83 AIC · ⌖ 0.986 AIC · ⊞ 17.1K · ◷

…testify-expert-improve-test-quality Co-authored-by: gh-aw-bot <259018956+gh-aw-bot@users.noreply.github.com>

Co-authored-by: gh-aw-bot <259018956+gh-aw-bot@users.noreply.github.com>

Initial plan

083e966

Copilot AI assigned Copilot and pelikhan Jun 27, 2026

Copilot started work on behalf of pelikhan June 27, 2026 18:40

Apply remaining changes

c301740

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Improve test quality for anomaly_test.go~~ Strengthen anomaly detector edge coverage and remove order-dependent test behavior Jun 27, 2026

Copilot finished work on behalf of pelikhan June 27, 2026 18:49

Copilot AI requested a review from pelikhan June 27, 2026 18:49

pelikhan marked this pull request as ready for review June 27, 2026 18:50

Copilot AI review requested due to automatic review settings June 27, 2026 18:50

Copilot started reviewing on behalf of pelikhan June 27, 2026 18:50

Copilot AI reviewed Jun 27, 2026

github-actions Bot reviewed Jun 27, 2026

github-actions Bot approved these changes Jun 27, 2026

Merge branch 'main' into copilot/testify-expert-improve-test-quality

45e3b4c

Copilot started work on behalf of gh-aw-bot June 27, 2026 19:20

Copilot AI and others added 2 commits June 27, 2026 19:26

Merge remote-tracking branch 'refs/remotes/origin/main' into copilot/…

787e7df

…testify-expert-improve-test-quality Co-authored-by: gh-aw-bot <259018956+gh-aw-bot@users.noreply.github.com>

Tighten LSP duplicate-key coverage

3713e68

Co-authored-by: gh-aw-bot <259018956+gh-aw-bot@users.noreply.github.com>

Copilot finished work on behalf of gh-aw-bot June 27, 2026 19:42

Copilot AI requested a review from gh-aw-bot June 27, 2026 19:42

pelikhan merged commit 019da65 into main Jun 27, 2026

pelikhan deleted the copilot/testify-expert-improve-test-quality branch June 27, 2026 20:35

github-actions Bot mentioned this pull request Jun 27, 2026

[safeoutputs] submit_pull_request_review: disambiguate event field against 'action' (recurring) — agents pass action: instead of [Content truncated due to length] #41947

Closed

Copilot AI mentioned this pull request Jun 27, 2026

safeoutputs: disambiguate submit_pull_request_review.event from action #41948

Merged

github-actions Bot mentioned this pull request Jun 28, 2026

[PR Triage Report] 🤖 PR Triage Report — 2026-06-28 (Run §28307424127) #41986

Closed

4 participants
