Speeding up several PyDough tests #376
Conversation
JOIN(condition=t0.supplier_key == t1.key, type=INNER, cardinality=SINGULAR_FILTER, columns={'part_key': t0.part_key, 'ps_supply_cost': t0.ps_supply_cost, 'supplier_key': t0.supplier_key})
  SCAN(table=tpch.PARTSUPP, columns={'part_key': ps_partkey, 'ps_supply_cost': ps_supplycost, 'supplier_key': ps_suppkey})
  FILTER(condition=name == 'Supplier#000009450':string, columns={'key': key})
    SCAN(table=tpch.SUPPLIER, columns={'key': s_suppkey, 'name': s_name})
This is faster because the join against the filtered version of suppliers now aggressively filters all of the data before the aggregation.
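As a rough illustration of the idea (a minimal sketch using Python's built-in sqlite3 with made-up rows, not the actual PyDough test or data), joining against a pre-filtered dimension table means the aggregation only ever sees the matching rows:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE supplier (s_suppkey INTEGER, s_name TEXT);
CREATE TABLE partsupp (ps_partkey INTEGER, ps_suppkey INTEGER, ps_supplycost REAL);
INSERT INTO supplier VALUES (1, 'Supplier#000009450'), (2, 'Supplier#000000001');
INSERT INTO partsupp VALUES (10, 1, 5.0), (11, 1, 7.0), (10, 2, 9.0);
""")

# The name filter runs before the join, so the GROUP BY never touches
# partsupp rows belonging to other suppliers.
for row in conn.execute("""
SELECT ps.ps_partkey, AVG(ps.ps_supplycost)
FROM partsupp ps
JOIN (SELECT s_suppkey FROM supplier WHERE s_name = 'Supplier#000009450') s
  ON ps.ps_suppkey = s.s_suppkey
GROUP BY ps.ps_partkey
"""):
    print(row)  # (10, 5.0) and (11, 7.0)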
FILTER(condition=order_priority == '1-URGENT':string, columns={'customer_key': customer_key, 'order_date': order_date})
  SCAN(table=tpch.ORDERS, columns={'customer_key': o_custkey, 'order_date': o_orderdate, 'order_priority': o_orderpriority})
These extra filters change the answer slightly, but also make it run much faster since we avoid massive scans feeding into aggregates/joins/window functions.
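For instance (another hedged sqlite3 sketch with invented rows, not the real TPC-H data), pushing the priority filter below the aggregate keeps the scan's output small:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (o_custkey INT, o_orderdate TEXT, o_orderpriority TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, '1994-01-05', '1-URGENT'),
    (1, '1994-03-10', '3-MEDIUM'),
    (2, '1995-07-21', '1-URGENT'),
])
# Only the urgent orders flow into the aggregation.
for row in conn.execute("""
SELECT o_custkey, COUNT(*)
FROM orders
WHERE o_orderpriority = '1-URGENT'
GROUP BY o_custkey
"""):
    print(row)  # (1, 1) and (2, 1)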
JOIN(condition=t0.key == t1.order_key, type=LEFT, cardinality=SINGULAR_ACCESS, columns={'agg_0': t1.agg_0, 'customer_key': t0.customer_key, 'order_date': t0.order_date})
  FILTER(condition=YEAR(order_date) == 1994:numeric, columns={'customer_key': customer_key, 'key': key, 'order_date': order_date})
    SCAN(table=tpch.ORDERS, columns={'customer_key': o_custkey, 'key': o_orderkey, 'order_date': o_orderdate})
Same here: massively reducing the number of rows fed into a window function.
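The same effect can be sketched with sqlite3 (made-up rows; assumes an SQLite build with window-function support, 3.25+): the year filter runs before RANK(), so the window only partitions 1994 orders instead of the whole table:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (o_orderkey INT, o_custkey INT, o_orderdate TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 1, '1994-02-01'),
    (2, 1, '1995-06-15'),
    (3, 2, '1994-09-30'),
])
# SQLite has no YEAR(); strftime('%Y', ...) stands in for it here.
for row in conn.execute("""
SELECT o_orderkey,
       RANK() OVER (PARTITION BY o_custkey ORDER BY o_orderdate) AS rnk
FROM orders
WHERE CAST(strftime('%Y', o_orderdate) AS INTEGER) = 1994
"""):
    print(row)  # orders 1 and 3, each rank 1 in its partition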
FILTER(condition=brand == 'Brand#13':string, columns={'key': key, 'name': name})
  SCAN(table=tpch.PART, columns={'brand': p_brand, 'key': p_partkey, 'name': p_name})
Also reducing the number of rows fed into the window function.
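Same pattern sketched for the brand filter (invented PART rows, again assuming SQLite 3.25+ for window functions): the Brand#42 row never reaches the window function:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE part (p_partkey INT, p_name TEXT, p_brand TEXT)")
conn.executemany("INSERT INTO part VALUES (?, ?, ?)", [
    (1, 'ivory azure', 'Brand#13'),
    (2, 'misty rose', 'Brand#13'),
    (3, 'navy linen', 'Brand#42'),
])
# Only Brand#13 rows are numbered; the rest are dropped before the window.
for row in conn.execute("""
SELECT p_partkey, ROW_NUMBER() OVER (ORDER BY p_name) AS rn
FROM part
WHERE p_brand = 'Brand#13'
"""):
    print(row)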
Looks good, just a question below
],
-  "avg_diff": [2195.0, 1998.0, 1995.0, 1863.0, 1787.0],
+  "avg_diff": [2361.0, 2260.0, 2309.0, 2269.0, 2309.0],
Why did the expected values in this file change?
He updated the query itself, which now says "Only consider Japanese customers and urgent orders.", so the expected results changed.
Because the change to this test specifically also changed the answer: to make the test run faster, it now computes the result from a smaller set of rows.
Looks good to me.
Thanks, Kian.
To add minor incremental speedups to the PyDough e2e tests, this PR rewrites several of the PyDough testing functions (particularly the ones used by the TPC-H custom pipeline tests) to run faster without really changing the core of what each question is doing, usually by adding filters that shrink scans of huge tables before aggregations/joins/window functions. The modified tests were identified by running
pdunit -m "execute" --durations=100
to list the 100 tests with the longest runtimes, among which some of the worst offenders (outside the main TPC-H queries) were shortened. Most of the affected tests dropped from 2-12 seconds to 1-4 seconds when run locally. Accounting for the longer times in GitHub Actions, and for the duration being doubled by running on both Python 3.10 and 3.11, this adds up to shortening a CI run by about a minute (12.5 minutes to 11.5 minutes).
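To make the pattern concrete, here is a self-contained sketch (synthetic data and a made-up table, not one of the actual tests) showing how a selective filter shrinks the input to an aggregation and cuts its runtime:

import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lineitem (l_partkey INT, l_quantity REAL, l_shipdate TEXT)")
conn.executemany("INSERT INTO lineitem VALUES (?, ?, ?)",
                 ((i % 1000, float(i % 50), f"199{i % 5}-01-01") for i in range(200_000)))

def timed(sql):
    # Run a query and return (number of result groups, elapsed seconds).
    start = time.perf_counter()
    n_groups = len(conn.execute(sql).fetchall())
    return n_groups, time.perf_counter() - start

# Before: the aggregate consumes the entire scan.
before = timed("SELECT l_partkey, AVG(l_quantity) FROM lineitem GROUP BY l_partkey")
# After: the filter drops ~80% of the rows before they reach the aggregate.
after = timed("""SELECT l_partkey, AVG(l_quantity) FROM lineitem
                 WHERE l_shipdate LIKE '1994%' GROUP BY l_partkey""")
print(before, after)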