Speeding up several PyDough tests #376
Conversation
JOIN(condition=t0.supplier_key == t1.key, type=INNER, cardinality=SINGULAR_FILTER, columns={'part_key': t0.part_key, 'ps_supply_cost': t0.ps_supply_cost, 'supplier_key': t0.supplier_key})
  SCAN(table=tpch.PARTSUPP, columns={'part_key': ps_partkey, 'ps_supply_cost': ps_supplycost, 'supplier_key': ps_suppkey})
  FILTER(condition=name == 'Supplier#000009450':string, columns={'key': key})
    SCAN(table=tpch.SUPPLIER, columns={'key': s_suppkey, 'name': s_name})
This is faster because the join against the filtered version of suppliers now aggressively filters all of the data before the aggregation.
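As a rough illustration of the idea (a minimal sketch using Python's built-in sqlite3 with made-up rows, not the actual PyDough test or data), joining against a pre-filtered dimension table means the aggregation only ever sees the matching rows:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE supplier (s_suppkey INTEGER, s_name TEXT);
CREATE TABLE partsupp (ps_partkey INTEGER, ps_suppkey INTEGER, ps_supplycost REAL);
INSERT INTO supplier VALUES (1, 'Supplier#000009450'), (2, 'Supplier#000000001');
INSERT INTO partsupp VALUES (10, 1, 5.0), (11, 1, 7.0), (10, 2, 9.0);
""")

# The name filter runs before the join, so the GROUP BY never touches
# partsupp rows belonging to other suppliers.
for row in conn.execute("""
SELECT ps.ps_partkey, AVG(ps.ps_supplycost)
FROM partsupp ps
JOIN (SELECT s_suppkey FROM supplier WHERE s_name = 'Supplier#000009450') s
  ON ps.ps_suppkey = s.s_suppkey
GROUP BY ps.ps_partkey
"""):
    print(row)  # (10, 5.0) and (11, 7.0)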
FILTER(condition=order_priority == '1-URGENT':string, columns={'customer_key': customer_key, 'order_date': order_date})
  SCAN(table=tpch.ORDERS, columns={'customer_key': o_custkey, 'order_date': o_orderdate, 'order_priority': o_orderpriority})
These extra filters change the answer slightly, but also make it run much faster since we avoid massive scans feeding into aggregates/joins/window functions.
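For instance (another hedged sqlite3 sketch with invented rows, not the real TPC-H data), pushing the priority filter below the aggregate keeps the scan's output small:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (o_custkey INT, o_orderdate TEXT, o_orderpriority TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, '1994-01-05', '1-URGENT'),
    (1, '1994-03-10', '3-MEDIUM'),
    (2, '1995-07-21', '1-URGENT'),
])
# Only the urgent orders flow into the aggregation.
for row in conn.execute("""
SELECT o_custkey, COUNT(*)
FROM orders
WHERE o_orderpriority = '1-URGENT'
GROUP BY o_custkey
"""):
    print(row)  # (1, 1) and (2, 1)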
JOIN(condition=t0.key == t1.order_key, type=LEFT, cardinality=SINGULAR_ACCESS, columns={'agg_0': t1.agg_0, 'customer_key': t0.customer_key, 'order_date': t0.order_date})
  FILTER(condition=YEAR(order_date) == 1994:numeric, columns={'customer_key': customer_key, 'key': key, 'order_date': order_date})
    SCAN(table=tpch.ORDERS, columns={'customer_key': o_custkey, 'key': o_orderkey, 'order_date': o_orderdate})
Same here: massively reducing the number of rows fed into a window function.
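The same effect can be sketched with sqlite3 (made-up rows; assumes an SQLite build with window-function support, 3.25+): the year filter runs before RANK(), so the window only partitions 1994 orders instead of the whole table:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (o_orderkey INT, o_custkey INT, o_orderdate TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 1, '1994-02-01'),
    (2, 1, '1995-06-15'),
    (3, 2, '1994-09-30'),
])
# SQLite has no YEAR(); strftime('%Y', ...) stands in for it here.
for row in conn.execute("""
SELECT o_orderkey,
       RANK() OVER (PARTITION BY o_custkey ORDER BY o_orderdate) AS rnk
FROM orders
WHERE CAST(strftime('%Y', o_orderdate) AS INTEGER) = 1994
"""):
    print(row)  # orders 1 and 3, each rank 1 in its partition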
FILTER(condition=brand == 'Brand#13':string, columns={'key': key, 'name': name})
  SCAN(table=tpch.PART, columns={'brand': p_brand, 'key': p_partkey, 'name': p_name})
Also reducing the number of rows fed into the window function.
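Same pattern sketched for the brand filter (invented PART rows, again assuming SQLite 3.25+ for window functions): the Brand#42 row never reaches the window function:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE part (p_partkey INT, p_name TEXT, p_brand TEXT)")
conn.executemany("INSERT INTO part VALUES (?, ?, ?)", [
    (1, 'ivory azure', 'Brand#13'),
    (2, 'misty rose', 'Brand#13'),
    (3, 'navy linen', 'Brand#42'),
])
# Only Brand#13 rows are numbered; the rest are dropped before the window.
for row in conn.execute("""
SELECT p_partkey, ROW_NUMBER() OVER (ORDER BY p_name) AS rn
FROM part
WHERE p_brand = 'Brand#13'
"""):
    print(row)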
Looks good, just a question below
],
-  "avg_diff": [2195.0, 1998.0, 1995.0, 1863.0, 1787.0],
+  "avg_diff": [2361.0, 2260.0, 2309.0, 2269.0, 2309.0],
Why did the expected values in this file change?
He updated the query itself, which now says "Only consider Japanese customers and urgent orders.", so the expected results changed.
Because the change to this test specifically also changed the answer: to make the test run faster, it now computes the result from a smaller set of rows.
Looks good to me.
Thanks, Kian.
To add minor incremental speedups to the PyDough e2e tests, this PR rewrites several of the PyDough testing functions (particularly the ones used by the TPC-H custom pipeline tests) to run faster without really changing the core of what each question is doing, usually by adding filters that shrink scans of huge tables before aggregations/joins/window functions. The modified tests were identified by running
pdunit -m "execute" --durations=100
to list the 100 tests with the longest runtimes, among which some of the worst offenders (outside the main TPC-H queries) were shortened. Most of the affected tests dropped from 2-12 seconds to 1-4 seconds when run locally. Accounting for the longer times in GitHub Actions, and for the duration being doubled by running on both Python 3.10 and 3.11, this adds up to shortening a CI run by about a minute (12.5 minutes to 11.5 minutes).
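To make the pattern concrete, here is a self-contained sketch (synthetic data and a made-up table, not one of the actual tests) showing how a selective filter shrinks the input to an aggregation and cuts its runtime:

import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lineitem (l_partkey INT, l_quantity REAL, l_shipdate TEXT)")
conn.executemany("INSERT INTO lineitem VALUES (?, ?, ?)",
                 ((i % 1000, float(i % 50), f"199{i % 5}-01-01") for i in range(200_000)))

def timed(sql):
    # Run a query and return (number of result groups, elapsed seconds).
    start = time.perf_counter()
    n_groups = len(conn.execute(sql).fetchall())
    return n_groups, time.perf_counter() - start

# Before: the aggregate consumes the entire scan.
before = timed("SELECT l_partkey, AVG(l_quantity) FROM lineitem GROUP BY l_partkey")
# After: the filter drops ~80% of the rows before they reach the aggregate.
after = timed("""SELECT l_partkey, AVG(l_quantity) FROM lineitem
                 WHERE l_shipdate LIKE '1994%' GROUP BY l_partkey""")
print(before, after)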