Speeding up several PyDough tests #376

Merged
merged 254 commits into main from kian/test_speedup on Jun 20, 2025

Conversation

@knassre-bodo (Contributor) commented on Jun 19, 2025:

To add minor incremental speedups to the PyDough e2e tests, this PR rewrites several of the PyDough testing functions (particularly the ones used by the TPC-H custom pipeline tests) to run faster without really changing the core of what each question is doing, usually by adding filters that shrink scans of huge tables before aggregations/joins/window functions. The modified tests were identified by running `pdunit -m "execute" --durations=100` to list the 100 tests with the longest runtimes, among which some of the worst offenders (outside the main TPC-H queries) were shortened. Most of the affected tests dropped from 2-12 seconds to 1-4 seconds when run locally. Accounting for the longer times in GitHub Actions, and the doubled durations for Python 3.10/3.11, this adds up to shortening a CI run by about a minute (12.5 minutes to 11.5 minutes).
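
To illustrate the general pattern, here is a hypothetical PyDough-style sketch; the collection and column names (`customers`, `orders`, `order_priority`) are illustrative assumptions, not taken from the actual tests:

```python
# Hypothetical PyDough-style pseudocode; all names are assumed for illustration.

# Before: the COUNT aggregation consumes every row of the large orders collection.
slow = customers.CALCULATE(name, n_orders=COUNT(orders))

# After: a selective filter shrinks the input before the aggregation, so the
# expensive operator sees far fewer rows while the shape of the question
# stays essentially the same.
fast = customers.CALCULATE(
    name,
    n_urgent=COUNT(orders.WHERE(order_priority == "1-URGENT")),
)
```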

@knassre-bodo added the `testing` label (Alters the testing/CI process for PyDough) on Jun 19, 2025
Comment on lines +16 to +19
JOIN(condition=t0.supplier_key == t1.key, type=INNER, cardinality=SINGULAR_FILTER, columns={'part_key': t0.part_key, 'ps_supply_cost': t0.ps_supply_cost, 'supplier_key': t0.supplier_key})
SCAN(table=tpch.PARTSUPP, columns={'part_key': ps_partkey, 'ps_supply_cost': ps_supplycost, 'supplier_key': ps_suppkey})
FILTER(condition=name == 'Supplier#000009450':string, columns={'key': key})
SCAN(table=tpch.SUPPLIER, columns={'key': s_suppkey, 'name': s_name})
knassre-bodo (Contributor Author):

This is faster because the join with the filtered version of SUPPLIER now aggressively prunes the data before the aggregation.
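
As a hypothetical sketch of the idea behind this plan (names like `supply_records` and `ps_supply_cost` are assumptions, not the real test code):

```python
# Hypothetical PyDough-style pseudocode with assumed names. Filtering
# SUPPLIER down to a single supplier first lets the INNER join discard
# almost all PARTSUPP rows before any aggregation runs, matching the
# FILTER + JOIN shape in the plan above.
selected = supply_records.WHERE(supplier.name == "Supplier#000009450")
result = TPCH.CALCULATE(total_cost=SUM(selected.ps_supply_cost))
```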

Comment on lines +10 to +11
FILTER(condition=order_priority == '1-URGENT':string, columns={'customer_key': customer_key, 'order_date': order_date})
SCAN(table=tpch.ORDERS, columns={'customer_key': o_custkey, 'order_date': o_orderdate, 'order_priority': o_orderpriority})
knassre-bodo (Contributor Author):

These extra filters change the answer slightly, but they also make the test run much faster, since we avoid massive scans feeding into aggregations/joins/window functions.
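
A hypothetical sketch of this kind of change (names assumed for illustration):

```python
# Hypothetical PyDough-style pseudocode with assumed names. Restricting
# the question to urgent orders up front changes the expected answer,
# but the scan of ORDERS that feeds the downstream joins/aggregations is
# now a small slice of the table.
urgent = orders.WHERE(order_priority == "1-URGENT")
result = customers.CALCULATE(
    name,
    first_urgent=MIN(urgent.order_date),
)
```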

Comment on lines +10 to +12
JOIN(condition=t0.key == t1.order_key, type=LEFT, cardinality=SINGULAR_ACCESS, columns={'agg_0': t1.agg_0, 'customer_key': t0.customer_key, 'order_date': t0.order_date})
FILTER(condition=YEAR(order_date) == 1994:numeric, columns={'customer_key': customer_key, 'key': key, 'order_date': order_date})
SCAN(table=tpch.ORDERS, columns={'customer_key': o_custkey, 'key': o_orderkey, 'order_date': o_orderdate})
knassre-bodo (Contributor Author):

Same here: massively reducing the number of rows fed into a window function.
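
A hypothetical sketch of the window-function case (names assumed, not the actual test):

```python
# Hypothetical PyDough-style pseudocode with assumed names. Filtering
# ORDERS to a single year before applying RANKING means the window
# function only has to order that year's rows rather than the whole table.
orders_1994 = orders.WHERE(YEAR(order_date) == 1994)
result = orders_1994.CALCULATE(
    key,
    order_date,
    rank=RANKING(by=order_date.ASC()),
)
```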

Comment on lines +12 to +13
FILTER(condition=brand == 'Brand#13':string, columns={'key': key, 'name': name})
SCAN(table=tpch.PART, columns={'brand': p_brand, 'key': p_partkey, 'name': p_name})
knassre-bodo (Contributor Author):

This also reduces the number of rows fed into the window function.

@john-sanchez31 (Contributor) left a comment:

Looks good, just a question below

],
"avg_diff": [2195.0, 1998.0, 1995.0, 1863.0, 1787.0],
"avg_diff": [2361.0, 2260.0, 2309.0, 2269.0, 2309.0],
john-sanchez31 (Contributor):

Why did the expected values in this file change?

Contributor:

He updated the actual query so it now says "Only consider Japanese customers and urgent orders," which changes the expected results.

knassre-bodo (Contributor Author):

Because the change to this specific test also changed the answer: computing the result from a smaller set of rows is what makes the test run faster.

@hadia206 (Contributor) left a comment:

Looks good to me.
Thanks Kian

@knassre-bodo merged commit ae147ce into main on Jun 20, 2025
5 checks passed
@knassre-bodo deleted the kian/test_speedup branch on June 20, 2025 at 17:04