perf: optimize context extraction pipeline (~2x speedup) by KRRT7 · Pull Request #1920 · codeflash-ai/codeflash

KRRT7 · 2026-03-27T19:32:11Z

Summary

- Pre-compute base_defs once per file in the second loop of extract_all_contexts_from_files and pass them to remove_unused_definitions_by_function_names to skip redundant CST traversals (1.83s → 0.01s)
- Skip FutureAliasedImportTransformer via a fast _has_aliased_future_imports check when no aliased __future__ imports exist (140ms → 0.02ms/call)
- Replace MetadataWrapper + ParentNodeProvider in DependencyCollector with lightweight visit_Attribute/leave_Attribute id-tracking, eliminating expensive metadata computation (5.7x per-call speedup)

Details

extract_all_contexts_from_files runs two loops over helper files. The first loop already pre-computes base_defs and passes defs_with_usages to avoid re-traversal, but the second loop (HoH-only files) called remove_unused_definitions_by_function_names without them, forcing 5 redundant MetadataWrapper + DependencyCollector traversals.
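The caching pattern described above can be sketched in plain Python. This is a minimal illustration, not the PR's code: `collect_defs`, `remove_unused`, and the `calls` counter are hypothetical stand-ins for the expensive traversal and its caller.

```python
from typing import Optional

calls = {"traversals": 0}  # hypothetical counter for the expensive pass


def collect_defs(source: str) -> dict[str, set[str]]:
    """Stand-in for the expensive CST traversal: finds top-level def names."""
    calls["traversals"] += 1
    defs: dict[str, set[str]] = {}
    for line in source.splitlines():
        if line.startswith("def "):
            defs[line[4:line.index("(")]] = set()
    return defs


def remove_unused(
    source: str,
    keep: set[str],
    base_defs: Optional[dict[str, set[str]]] = None,
) -> list[str]:
    # Traverse only when the caller did not supply pre-computed definitions.
    defs = base_defs if base_defs is not None else collect_defs(source)
    return sorted(name for name in defs if name in keep)


source = "def a():\n    pass\ndef b():\n    pass\n"
targets = [{"a"}, {"b"}, {"a", "b"}]

# Without the cache: one traversal per call.
for keep in targets:
    remove_unused(source, keep)
uncached = calls["traversals"]

# With the cache: a single traversal shared by every call for this file.
calls["traversals"] = 0
base = collect_defs(source)
results = [remove_unused(source, keep, base_defs=base) for keep in targets]
cached = calls["traversals"]
```

In this toy setup the traversal count drops from one per call to one per file, which is the same shape of saving the PR reports for the second loop.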

DependencyCollector required MetadataWrapper solely for ParentNodeProvider in one place: checking if a Name is the .attr part of an Attribute inside class bodies. An id-based set populated by visit_Attribute/leave_Attribute replaces this with zero metadata overhead.

FutureAliasedImportTransformer traversed the full CST on every call even when no aliased __future__ imports existed (the common case). A fast O(imports) check short-circuits the traversal.

Benchmark

Profiled via cProfile on test_benchmark_extract (Python 3.13.7, macOS):

| Function | Before | After |
| --- | --- | --- |
| `remove_unused_definitions_by_function_names` | 1.83s (14.7%) | 0.01s |
| `collect_top_level_defs_with_dependencies` | 38.9% cumtime | 10.3% cumtime |
| `gather_source_imports` | 10.7% cumtime | 2.5% cumtime |

The remaining cost is dominated by libcst transform_module (43%) and Jedi inference (22%), both external library internals.
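For readers who want to collect comparable cumulative-time numbers, here is a small standard-library sketch. `profile_cumulative` and `busy_work` are hypothetical helpers; the PR's actual benchmark harness (test_benchmark_extract) is not shown in this thread.

```python
import cProfile
import io
import pstats


def profile_cumulative(func, *args, top: int = 10) -> str:
    """Run func under cProfile and return a report sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


def busy_work(n: int) -> int:
    # Toy workload standing in for the function being benchmarked.
    return sum(i * i for i in range(n))


report = profile_cumulative(busy_work, 10_000)
```

The `cumtime` column in the report is the figure the table above quotes as a percentage of the total run.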

Test plan

- 226 context extraction + static analysis tests pass
- Micro-benchmark verified DependencyCollector produces identical results across 6 project files (352 definitions)
- uv run prek clean

Eliminate redundant CST traversals in code context extraction by caching dependency data, skipping unnecessary transforms, and removing MetadataWrapper.

claude · 2026-03-27T19:32:42Z

Claude finished @KRRT7's task in 3m 27s

PR Review Summary

Prek Checks

Clean — uv run prek run --from-ref origin/main passes with no issues.

Mypy errors in unused_definition_remover.py and code_extractor.py are pre-existing (outside the changed lines and present on origin/main).

Code Review

Size: SMALL (44 additions / 19 deletions across 3 files)

No bugs, security issues, or breaking API changes found. The three optimizations are all correct:

1. DependencyCollector — id-based attribute tracking (unused_definition_remover.py:284-312)

The visit_Attribute/leave_Attribute id-tracking replacement for ParentNodeProvider is sound. libcst traverses depth-first, so visit_Attribute fires before children are visited, which means id(node.attr) is in attr_name_ids by the time visit_Name is called for the .attr child. leave_Attribute then correctly discards it after the subtree is done.

One minor note: leave_Attribute receives original_node, and since DependencyCollector is a CSTVisitor (not a transformer), original_node is the same object that was visited, so id(original_node.attr) == id(node.attr) holds. No issue here.

2. _has_aliased_future_imports fast-path (code_extractor.py:429-447)

The check correctly short-circuits FutureAliasedImportTransformer when no aliased __future__ imports exist. The condition aligns with what the transformer actually modifies.

Minor: isinstance(s.names, (list, tuple)) works in practice since libcst uses tuples, but not isinstance(s.names, cst.ImportStar) would be more semantically precise. Not a bug.

3. Pre-compute base_defs in the second loop (code_context_extractor.py:399-401)

Consistent with the pattern already used in the first loop (lines 273–274). mark_defs_for_functions creates a fresh copy so base_defs is never mutated between calls. Correct.
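The "fresh copy" property is what makes sharing base_defs across calls safe. A minimal sketch of that invariant, using a hypothetical `mark_defs` stand-in (the real mark_defs_for_functions may copy differently):

```python
import copy


def mark_defs(base_defs: dict[str, dict], targets: set[str]) -> dict[str, dict]:
    # Hypothetical stand-in: work on a deep copy so the shared cache stays pristine.
    defs = copy.deepcopy(base_defs)
    for name in targets:
        if name in defs:
            defs[name]["used"] = True
    return defs


base = {"a": {"used": False}, "b": {"used": False}}
first = mark_defs(base, {"a"})
second = mark_defs(base, {"b"})
```

Because each call marks its own copy, the second call does not see marks left by the first, and `base` can be reused for every target set in the loop.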

Duplicate Detection

No duplicates detected. The new helper functions (_has_aliased_future_imports, _strip_future_aliases) are module-private and have no counterparts elsewhere.

Test Coverage

Skipped for SMALL PRs. PR author reports 226 context extraction + static analysis tests pass.

Open codeflash-ai[bot] PRs

Three open bot PRs (#1890, #1891, #1895) all target PR #1887's branch, not main. All have the same broad CI failures (unit-tests across all Python versions, js/tracer tests). This pattern indicates pre-existing failures on the base branch, not regressions from the bot PRs themselves. Leaving open.

Last updated: 2026-03-27T21:00 UTC

perf: optimize context extraction pipeline (~2x speedup)

8a07c5e

Eliminate redundant CST traversals in code context extraction by caching dependency data, skipping unnecessary transforms, and removing MetadataWrapper.

KRRT7 added 2 commits March 27, 2026 15:56

fix: use tuple syntax for isinstance check (Python 3.9 compat)

f180c3f

refactor: simplify _has_aliased_future_imports check

03e2fff

cst.Attribute branch was dead code since __future__ imports always use a plain Name node.

KRRT7 added a commit that referenced this pull request Mar 27, 2026

docs: add benchmark comparison image for PRs #1920/#1921

85eafaa

KRRT7 mentioned this pull request Mar 27, 2026

perf: optimize context extraction with caching and import pre-filtering #1921

Merged

2 tasks

KRRT7 merged commit fb1381e into main Mar 27, 2026
26 of 27 checks passed

KRRT7 deleted the cf-cpu-context-extraction branch March 27, 2026 22:37

KRRT7 mentioned this pull request Mar 27, 2026

feat: add codeflash compare CLI command #1922

Merged

5 tasks

KRRT7 merged 3 commits into main from cf-cpu-context-extraction