Skip to content

Edawson/scdl schema #1030

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 30 commits into
base: main
Choose a base branch
from
Open

Edawson/scdl schema #1030

wants to merge 30 commits into from

Conversation

edawson
Copy link
Collaborator

@edawson edawson commented Aug 8, 2025

Description

This MR implements a strict schema-defined header for SCDL archives. This header stores metadata about the archive and its composite arrays, including a version, the array lengths and data types, and information about the RowFeatureIndexes. This adds the features necessary to fix #999 as well as implement simple bit-packing of the rowptr, colptr, and data arrays. It also should make SCDL more secure, enable strict compatibility checking, and open the door to more performance improvements.

Note: I am still wiring up the header to the archive. I will make a note here when the MR is ready.

Type of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Refactor
  • Documentation update
  • Other (please describe):

CI Pipeline Configuration

Configure CI behavior by applying the relevant labels:

Note

By default, the notebooks validation tests are skipped unless explicitly enabled.

Authorizing CI Runs

We use copy-pr-bot to manage authorization of CI
runs on NVIDIA's compute resources.

  • If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will
    automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
  • If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an
    /ok to test comment on the pull request to trigger CI. This will need to be done for each new commit.

Usage

This change is opaque to the user - the headers are not human-readable on disk. For a full description of the format and how to interact with it, see the schema directory in SCDL's source directory.

Pre-submit Checklist

  • I have tested these changes locally
  • I have updated the documentation accordingly
  • I have added/updated tests as needed
  • All existing tests pass successfully

Copy link

copy-pr-bot bot commented Aug 8, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@yzhang123
Copy link
Collaborator

/ok to test 93b78e3

Copy link

copy-pr-bot bot commented Aug 15, 2025

/ok to test 93b78e3

@yzhang123, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

@yzhang123
Copy link
Collaborator

\ok to test 93b78e3

@yzhang123
Copy link
Collaborator

/ok to test 93b78e3

Copy link

copy-pr-bot bot commented Aug 15, 2025

/ok to test 93b78e3

@yzhang123, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

@yzhang123
Copy link
Collaborator

/ok to test e768f60

Signed-off-by: Eric T. Dawson <[email protected]>
@yzhang123
Copy link
Collaborator

/ok to test cb53ac7

@codecov-commenter
Copy link

❌ 45 Tests Failed:

Tests completed Failed Passed Skipped
1180 45 1135 52
View the top 3 failed test(s) by shortest run time
sub-packages/bionemo-scdl/tests/bionemo/scdl/schema/test_header.py::TestSchemaCompliance::test_core_header_layout
Stack Traces | 0.001s run time
self = <schema.test_header.TestSchemaCompliance object at 0x7ed9243d5700>

    def test_core_header_layout(self):
        """Test core header layout matches schema specification."""
        header = SCDLHeader()
        serialized = header.serialize()
    
        # Schema specifies 16-byte core header
        assert len(serialized) >= 16
    
        # Magic number at offset 0x00 (4 bytes)
        assert serialized[0:4] == SCDL_MAGIC_NUMBER
    
        # Version at offsets 0x04, 0x05, 0x06 (3 bytes)
        assert serialized[4] == 0  # major
        assert serialized[5] == 0  # minor
>       assert serialized[6] == 2  # point
E       assert 9 == 2

.../scdl/schema/test_header.py:681: AssertionError
sub-packages/bionemo-scdl/tests/bionemo/scdl/schema/test_header.py::TestSchemaCompliance::test_current_version_matches_schema
Stack Traces | 0.001s run time
self = <schema.test_header.TestSchemaCompliance object at 0x7ed9243d5250>

    def test_current_version_matches_schema(self):
        """Test current version matches schema documentation."""
        # Schema documents version 0.0.2
        current = CurrentSCDLVersion()
        assert current.major == 0
        assert current.minor == 0
>       assert current.point == 2
E       assert 9 == 2
E        +  where 9 = SCDLVersion(major=0, minor=0, point=9).point

.../scdl/schema/test_header.py:654: AssertionError
sub-packages/bionemo-scdl/tests/bionemo/scdl/schema/test_header.py::TestSCDLHeader::test_json_output
Stack Traces | 0.002s run time
self = <schema.test_header.TestSCDLHeader object at 0x7ed9243d45f0>

    def test_json_output(self):
        """Test JSON representation."""
        header = SCDLHeader()
        array = ArrayInfo("test.dat", 1000, ArrayDType.FLOAT32_ARRAY, (100, 10))
        header.add_array(array)
    
        json_str = header.to_json()
        json_data = json.loads(json_str)
    
        assert json_data["version"]["major"] == 0
        assert json_data["version"]["minor"] == 0
>       assert json_data["version"]["point"] == 2
E       assert 9 == 2

.../scdl/schema/test_header.py:632: AssertionError

To view more test analytics, go to the Test Analytics Dashboard
📋 Got 3 mins? Take this short survey to help us improve Test Analytics.

…tests are skipped for legacy test data.

Signed-off-by: Eric T. Dawson <[email protected]>
@edawson
Copy link
Collaborator Author

edawson commented Aug 16, 2025

/ok to test: 9ecc929

@edawson
Copy link
Collaborator Author

edawson commented Aug 16, 2025

/ok to test 73982f1

Signed-off-by: Eric T. Dawson <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Duplicate parquet files when concatenating 10 or more h5ads
5 participants