Skip to content

[BSE-4318] Support for converting TopK + Order By #99

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 31 commits into from
Dec 4, 2024
Merged

Conversation

njriasan
Copy link
Contributor

@njriasan njriasan commented Dec 2, 2024

Supports TopK conversion into a limit without having ordering support.

@njriasan njriasan requested a review from knassre-bodo December 2, 2024 19:52
Copy link
Contributor

@knassre-bodo knassre-bodo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few notes. Will approve after reconciled with aggregations, and preferably after I see what the followup with orderings looks like.

pytest.param(
TableCollectionInfo("Regions")
** SubCollectionInfo("nations")
** TopKInfo([], 10)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically this isn't allowed in PyDough. It is only reachable in the DAG via artificial construction (the qualifier should raise a fit), and only because I don't think I ever bothered to check for it in the node builder for TopK (though now I think I should have).

Bare-limit is perfectly viable in SQL (albeit, nondeterministic) but is not part of the PyDough semantics as we outlined them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason to omit this test though? Seems reasonable that it could become legal or that some other operation could eventually map to this.

Copy link
Contributor

@knassre-bodo knassre-bodo Dec 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤷 I guess we can keep, though I maintain that for now it is not possible to create except via artificial creation of AST nodes.

Comment on lines 475 to 479
out_columns: dict[HybridExpr, ColumnReference] = {
expr: context.expressions[expr].with_input(None)
for expr in context.expressions
}
return TranslationOutput(out_rel, out_columns, context.join_keys)
Copy link
Contributor

@knassre-bodo knassre-bodo Dec 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unnecessary. No columns are adjusted at all, and nothing in out_columns has an input anyways (even for joins) since out_columns represents how subsequent rel nodes should access terms from this node. Can just pass in context.expressions).

All new terms related to the ordering, etc. should have already been defined by handle_children, whose output is the input to this method.

""",
id="simple_topk",
),
pytest.param(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An interesting thing worth testing, once aggregations are reconciled, is this:

Nations(name, n_top_suppliers=COUNT(suppliers.TOPK(100, by=acctbal).key))

This should find the global top 100k, and count how many of those belong to each nation.

@@ -382,6 +383,34 @@
""",
id="filter_back",
),
pytest.param(
TableCollectionInfo("Regions") ** TopKInfo([], 2),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer if we replaced these with legit orderbys once that is added, since this should be banned. Also, once we do so, can we add a check to the AST node builder for OrderBy and TopK (or just their constructors) to explicitly ban when the collations list is empty?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can update these tests + this check once the order by code is done.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in the followup PR, but since with_collation is separate I have to do it in the caller. I can refactor this as needed later.

case TopK():
hybrid = make_hybrid_tree(node.preceding_context)
hybrid.populate_children(node, child_ref_mapping)
# TODO: support collation. Requires order by handling.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in the next PR.

pytest.param(
TableCollectionInfo("Regions")
** SubCollectionInfo("nations")
** TopKInfo([], 10)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason to omit this test though? Seems reasonable that it could become legal or that some other operation could eventually map to this.

@@ -382,6 +383,34 @@
""",
id="filter_back",
),
pytest.param(
TableCollectionInfo("Regions") ** TopKInfo([], 2),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can update these tests + this check once the order by code is done.

Base automatically changed from nick/support_basic_where to main December 3, 2024 16:10
@njriasan njriasan requested a review from knassre-bodo December 3, 2024 21:03
Copy link
Contributor

@knassre-bodo knassre-bodo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer if we merge the child into this before merging into main, but LGTM!

@njriasan njriasan changed the title [BSE-4318] Support for converting TopK to limit without ordering [BSE-4318] Support for converting TopK + Order By Dec 4, 2024
@njriasan njriasan requested a review from knassre-bodo December 4, 2024 20:24
@njriasan njriasan merged commit 211e97d into main Dec 4, 2024
4 checks passed
@njriasan njriasan deleted the nick/support_top_k branch December 4, 2024 21:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants