
Update LLM.generate output to include statistics #1034


Merged
plaguss merged 46 commits into develop from llm-generate-upgrade on Nov 19, 2024

Conversation


@plaguss plaguss commented Oct 11, 2024

Description

This PR updates the output of llm.generate to make it more feature-rich.

Previously we only returned the generated text:

GenerateOutput = List[Union[str, None]]

It is now updated to also return statistics related to the generation:

from typing import Any, Dict, List, TypedDict, Union

LLMOutput = List[Union[str, None]]

class TokenCount(TypedDict):
    input_tokens: List[int]
    output_tokens: List[int]

LLMStatistics = Union[TokenCount, Dict[str, Any]]
"""Initially LLMStatistics will only contain the token count, but it can hold more variables.
They can be added once we have them defined for every LLM.
"""

class GenerateOutput(TypedDict):
    generations: LLMOutput
    statistics: LLMStatistics

This PR only includes input_tokens and output_tokens as statistics, but more can be added as needed in the future.
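
For illustration, a single value matching the new GenerateOutput type could look like the following (the concrete numbers and text are made up):

# Illustrative only: one GenerateOutput value under the new typing.
output: GenerateOutput = {
    "generations": ["Hello Magpie"],
    "statistics": {
        "input_tokens": [12],
        "output_tokens": [12],
    },
}

# Callers that previously iterated over a plain list of strings now read
# the text from the "generations" key instead.
for generation in output["generations"]:
    print(generation)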

This information is moved to distilabel_metadata in the following way, to avoid collisions between statistics of different steps:

{
    "generations": ["Hello Magpie"],
    f"statistics_{step_name}": {
        "input_tokens": [12],
        "output_tokens": [12],
    },
}

NOTE:
Most Task subclasses reuse the same Task.process method to process the generations, so nothing else has to be done for them; however, for tasks like Magpie, where the process method is overridden, the method has to be updated accordingly (see the sketch below).
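
To make the note concrete, here is a minimal sketch of a hypothetical helper (format_output is not the actual distilabel API) that turns the new GenerateOutput into the structure shown above; it is an assumption for illustration only:

# Minimal sketch (not the real distilabel implementation): mirror the
# structure shown above, keeping statistics under a step-specific key so
# that statistics of different steps do not collide.
def format_output(step_name: str, output: GenerateOutput) -> dict:
    return {
        "generations": output["generations"],
        f"statistics_{step_name}": output["statistics"],
    }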

Closes #738

@plaguss plaguss added this to the 1.5.0 milestone Oct 11, 2024
@plaguss plaguss self-assigned this Oct 11, 2024

Documentation for this PR has been built. You can view it at: https://distilabel.argilla.io/pr-1034/


codspeed-hq bot commented Oct 11, 2024

CodSpeed Performance Report

Merging #1034 will not alter performance

Comparing llm-generate-upgrade (7c6e18f) with develop (e830e25)

Summary

✅ 1 untouched benchmarks

@plaguss plaguss added the enhancement New feature or request label Oct 14, 2024
@plaguss plaguss marked this pull request as ready for review October 25, 2024 07:17
@plaguss plaguss requested a review from gabrielmbmb October 25, 2024 07:58
@plaguss plaguss mentioned this pull request Nov 8, 2024
@gabrielmbmb gabrielmbmb changed the title Llm generate upgrade Update LLM.generate output to include statistics Nov 15, 2024
@plaguss plaguss merged commit 2469407 into develop Nov 19, 2024
8 checks passed
@plaguss plaguss deleted the llm-generate-upgrade branch November 19, 2024 08:41
@plaguss plaguss mentioned this pull request Nov 28, 2024
Labels
enhancement New feature or request

Development
Successfully merging this pull request may close these issues:
[FEATURE] Update LLM.generate interface to allow returning arbitrary/extra stuff related to the generation