
Implement framework to validate backwards compatibility of metrics #6278

@yurishkuro

Description

Jaeger binaries produce various metrics that can be used to monitor Jaeger itself in production, such as throughput, queue saturation, and error rates. We have historically treated those metrics as a stable public API (we even provide a Grafana dashboard mixin), but we have never had proper integration tests that validate that code changes do not break the metrics.

Proposal

We can enhance our existing integration tests to also compare the metrics. The full set of all possible metrics is never available from a single run, because if some components are not utilized (e.g. adaptive sampling or a particular storage type) then their metrics are not registered. However, since our integration tests cover most of the components, we can scrape the metrics endpoint at the end of each test and combine the results into a full picture.
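A minimal sketch of such a scraper is below, assuming the binary exposes a Prometheus-style /metrics endpoint; the localhost:8888 address and the metrics_snapshot.txt output file are placeholders for illustration, not the actual CIT configuration:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"sort"
	"strings"
)

// scrapeMetricNames fetches a Prometheus-format metrics page and returns the
// distinct "name{labels}" series it exposes, sorted for stable diffs.
func scrapeMetricNames(endpoint string) ([]string, error) {
	resp, err := http.Get(endpoint)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	seen := map[string]bool{}
	for _, line := range strings.Split(string(body), "\n") {
		// Skip comments (# HELP / # TYPE) and blank lines.
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// Keep the series identity (name plus labels) and drop the sample value.
		name := line
		if idx := strings.LastIndex(line, "}"); idx >= 0 {
			name = line[:idx+1]
		} else if idx := strings.Index(line, " "); idx >= 0 {
			name = line[:idx]
		}
		seen[name] = true
	}
	names := make([]string, 0, len(seen))
	for n := range seen {
		names = append(names, n)
	}
	sort.Strings(names)
	return names, nil
}

func main() {
	names, err := scrapeMetricNames("http://localhost:8888/metrics")
	if err != nil {
		fmt.Fprintln(os.Stderr, "scrape failed:", err)
		os.Exit(1)
	}
	// One series per line; this file becomes the per-test workflow artifact.
	if err := os.WriteFile("metrics_snapshot.txt", []byte(strings.Join(names, "\n")+"\n"), 0o644); err != nil {
		fmt.Fprintln(os.Stderr, "write failed:", err)
		os.Exit(1)
	}
}
```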

Caveat: the exact shape of the metrics depends on all the nested namespaces applied to the metrics.Factory, so it is sensitive to the exact code in the main functions, which is where metrics.Factory always originates. Our integration tests for Jaeger v2 usually exercise the actual binary, so the resulting metrics reflect how that binary behaves in production. But all integration tests for v1 run from a unit testing framework, and their metrics.Factory initialization may not match how it is done in the main functions. So we may only be able to solve this for Jaeger v2, which is fine.

Approach

  • At the end of each integration test (CIT) workflow we scrape the metrics collected by the binary and upload them as a GitHub artifact.
  • Then a final workflow can gather all those artifacts and compare them with similar reports from the latest release (see the comparison sketch after this list). If differences are found, it can upload them as another artifact and link to the PR so that maintainers can inspect the changes and decide whether they are acceptable.
  • The artifacts uploaded for the official release can also be referenced from the documentation website as a way of documenting the current collection of metrics.
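A hedged sketch of the comparison step, assuming each artifact is a plain-text file with one metric series per line (as produced by a scraping helper); the file names metrics_release.txt and metrics_pr.txt are placeholders:

```go
package main

import (
	"fmt"
	"os"
	"sort"
	"strings"
)

// readSet loads a snapshot file with one metric series per line into a set.
func readSet(path string) (map[string]bool, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	set := map[string]bool{}
	for _, line := range strings.Split(string(data), "\n") {
		line = strings.TrimSpace(line)
		if line != "" {
			set[line] = true
		}
	}
	return set, nil
}

// missingFrom returns entries present in a but absent from b, sorted.
func missingFrom(a, b map[string]bool) []string {
	var out []string
	for k := range a {
		if !b[k] {
			out = append(out, k)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	release, err := readSet("metrics_release.txt") // snapshot from the latest release
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	current, err := readSet("metrics_pr.txt") // combined snapshot from the current PR
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, m := range missingFrom(release, current) {
		fmt.Println("REMOVED:", m) // candidate breaking change
	}
	for _, m := range missingFrom(current, release) {
		fmt.Println("ADDED:  ", m) // usually benign, but should be documented
	}
}
```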

Help Wanted

We seek community help to implement this functionality. This is not a 30-minute fix, but we still marked it as a good-first-issue because it can be done incrementally.

Tasks

  • Currently all CIT workflows are independent. To run a final job once all CIT jobs are finished, we may need to combine them into a single CIT workflow with multiple jobs.
  • The ability to scrape and compare metrics was implemented in [V2] Add Script for metrics markdown table #5941. We need to integrate the scraping into each CIT workflow (using a helper script) and upload the output as workflow artifacts.
  • Implement the validation job in a workflow that compares artifacts from the current PR with those from the latest release and generates a diff (also uploaded as a separate artifact).
  • Make the validation job post some form of summary as a comment on the PR (or as the output of the Check); see the sketch after this list.
  • Implement a way to incorporate the metrics report into the documentation website.
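For the summary task, one possible approach (a sketch under stated assumptions, not the actual implementation) is to render the diff as markdown and either append it to the file named by the GITHUB_STEP_SUMMARY environment variable, which GitHub Actions provides to jobs, or hand it to a step that posts a PR comment. The diff inputs below are placeholders:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// renderSummary formats the metric diff as markdown suitable for a PR comment
// or a GitHub Actions job summary.
func renderSummary(added, removed []string) string {
	var b strings.Builder
	b.WriteString("### Metrics compatibility report\n\n")
	if len(added) == 0 && len(removed) == 0 {
		b.WriteString("No changes to emitted metrics compared to the latest release.\n")
		return b.String()
	}
	if len(removed) > 0 {
		fmt.Fprintf(&b, "**Removed (%d)**, potential breaking changes:\n\n", len(removed))
		for _, m := range removed {
			b.WriteString("- `" + m + "`\n")
		}
		b.WriteString("\n")
	}
	if len(added) > 0 {
		fmt.Fprintf(&b, "**Added (%d)**:\n\n", len(added))
		for _, m := range added {
			b.WriteString("- `" + m + "`\n")
		}
	}
	return b.String()
}

func main() {
	// Placeholder inputs; in the real job these would come from the diff step.
	md := renderSummary([]string{"jaeger_new_metric_example_total"}, nil)

	// GITHUB_STEP_SUMMARY is a file path provided by GitHub Actions; appending
	// markdown to it shows the report on the workflow run page.
	if path := os.Getenv("GITHUB_STEP_SUMMARY"); path != "" {
		f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
		if err == nil {
			defer f.Close()
			_, _ = f.WriteString(md)
			return
		}
	}
	fmt.Print(md)
}
```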


Labels

enhancement · good first issue · help wanted · v2
