[query] Rewrite relational index to no longer have metadata alongside index files themselves #14950


Draft
wants to merge 4 commits into base: main

chrisvittal
Collaborator

Currently, every partition's index file is written with its own metadata file alongside it, producing a structure like:

9.hmt/index
|-- part-0-30bcc9ce-4797-427a-a9ba-bd0c688bddce.idx
|   |-- index
|   `-- metadata.json.gz
`-- part-1-e0447c57-8794-4e14-8e4f-c996fb431427.idx
    |-- index
    `-- metadata.json.gz

This produces a large number of files that are slow to list and operate on. It is also unnecessary, since most metadata fields are duplicated across every partition.

This change instead returns each partition's variable index metadata as part of the partition results, then serializes all of it into a single file.
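To illustrate the idea, here is a minimal Python sketch of collecting per-partition metadata and writing it once. It is not the code in this PR: the function names, the `shared`/`partitions` layout, the field names (`branching_factor`, `n_keys`), and the placement of a single `metadata.json.gz` at the top of the index directory are all assumptions made for the example.

```python
import gzip
import json
import os
import tempfile


def write_index_metadata(index_dir, shared, per_partition):
    """Write a single gzipped JSON file for the whole index.

    `shared` holds fields that are identical for every partition and is
    stored once; `per_partition` is a list with one entry per partition,
    holding only the fields that vary. Both layouts are hypothetical.
    """
    doc = {"shared": shared, "partitions": per_partition}
    path = os.path.join(index_dir, "metadata.json.gz")
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(doc, f)
    return path


def read_index_metadata(index_dir):
    """Read the consolidated metadata back as a dict."""
    path = os.path.join(index_dir, "metadata.json.gz")
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)


if __name__ == "__main__":
    # Hypothetical values standing in for what each partition would return.
    shared = {"branching_factor": 4096}
    per_partition = [
        {"file": "part-0.idx", "n_keys": 10},
        {"file": "part-1.idx", "n_keys": 7},
    ]
    with tempfile.TemporaryDirectory() as index_dir:
        write_index_metadata(index_dir, shared, per_partition)
        print(read_index_metadata(index_dir))
```

With this shape, listing the index directory touches one metadata file regardless of the partition count, and the duplicated fields are stored once.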

Security Assessment

Delete all except the correct answer:

  • This change cannot impact the Hail Batch instance as deployed by Broad Institute in GCP

@chrisvittal chrisvittal marked this pull request as draft July 15, 2025 21:13
@chrisvittal chrisvittal force-pushed the query/flat-index-md branch from 70c3539 to 103c42b Compare July 17, 2025 15:19