-
Notifications
You must be signed in to change notification settings - Fork 3
[BSE-4318] Support for converting TopK + Order By #99
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few notes. Will approve after reconciled with aggregations, and preferably after I see what the followup with orderings looks like.
tests/test_ast_conversion.py
Outdated
pytest.param( | ||
TableCollectionInfo("Regions") | ||
** SubCollectionInfo("nations") | ||
** TopKInfo([], 10) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically this isn't allowed in PyDough. It is only reachable in the DAG via artificial construction (the qualifier should raise a fit), and only because I don't think I ever bothered to check for it in the node builder for TopK
(though now I think I should have).
Bare-limit is perfectly viable in SQL (albeit, nondeterministic) but is not part of the PyDough semantics as we outlined them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason to omit this test though? Seems reasonable that it could become legal or that some other operation could eventually map to this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤷 I guess we can keep, though I maintain that for now it is not possible to create except via artificial creation of AST nodes.
out_columns: dict[HybridExpr, ColumnReference] = { | ||
expr: context.expressions[expr].with_input(None) | ||
for expr in context.expressions | ||
} | ||
return TranslationOutput(out_rel, out_columns, context.join_keys) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is unnecessary. No columns are adjusted at all, and nothing in out_columns
has an input anyways (even for joins) since out_columns
represents how subsequent rel nodes should access terms from this node. Can just pass in context.expressions
).
All new terms related to the ordering, etc. should have already been defined by handle_children
, whose output is the input to this method.
""", | ||
id="simple_topk", | ||
), | ||
pytest.param( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An interesting thing worth testing, once aggregations are reconciled, is this:
Nations(name, n_top_suppliers=COUNT(suppliers.TOPK(100, by=acctbal).key))
This should find the global top 100k, and count how many of those belong to each nation.
tests/test_ast_conversion.py
Outdated
@@ -382,6 +383,34 @@ | |||
""", | |||
id="filter_back", | |||
), | |||
pytest.param( | |||
TableCollectionInfo("Regions") ** TopKInfo([], 2), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer if we replaced these with legit orderbys once that is added, since this should be banned. Also, once we do so, can we add a check to the AST node builder for OrderBy and TopK (or just their constructors) to explicitly ban when the collations list is empty?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can update these tests + this check once the order by code is done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in the followup PR, but since with_collation
is separate I have to do it in the caller. I can refactor this as needed later.
pydough/conversion/hybrid_tree.py
Outdated
case TopK(): | ||
hybrid = make_hybrid_tree(node.preceding_context) | ||
hybrid.populate_children(node, child_ref_mapping) | ||
# TODO: support collation. Requires order by handling. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated in the next PR.
tests/test_ast_conversion.py
Outdated
pytest.param( | ||
TableCollectionInfo("Regions") | ||
** SubCollectionInfo("nations") | ||
** TopKInfo([], 10) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason to omit this test though? Seems reasonable that it could become legal or that some other operation could eventually map to this.
tests/test_ast_conversion.py
Outdated
@@ -382,6 +383,34 @@ | |||
""", | |||
id="filter_back", | |||
), | |||
pytest.param( | |||
TableCollectionInfo("Regions") ** TopKInfo([], 2), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can update these tests + this check once the order by code is done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prefer if we merge the child into this before merging into main, but LGTM!
Supports TopK conversion into a limit without having ordering support.