Method Leaderboard | i18n-rosetta

Trust Levels

Self-benchmarked: Active

GDS Verified: Coming soon

Community Validated: Coming soon

⚠️ LLM outputs are non-deterministic. Scores represent point-in-time measurements under specific model versions and API configurations. Model providers may update weights, decoding strategies, or safety filters at any time, which can cause score drift between runs.

How It Works

1. Fingerprinted Pipelines: Each submission is tied to a specific Git commit and pipeline configuration, so results can be traced back to the exact code that produced them.
2. Versioned Datasets: Evaluation datasets are content-hashed and versioned. Scores are comparable only within the same dataset version, preventing silent data contamination.
3. Standardised Harness: All metrics are computed by the shared i18n-rosetta evaluation harness, eliminating implementation differences between submissions.
4. Open Submission: Anyone can submit results by opening a pull request with their method's JSON entry and pipeline fingerprint. The GDS Verified and Community Validated trust tiers will be available soon.
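
The fingerprinting and content-hashing steps above could be sketched roughly as follows. This is a minimal illustration only, not the i18n-rosetta implementation: the function names (`content_hash`, `pipeline_fingerprint`, `comparable`), the entry fields, and the choice of SHA-256 over canonical JSON are all assumptions made for the example.

```python
import hashlib
import json


def content_hash(records):
    """Hash dataset records so that the same content always gives the same hash.

    Hypothetical helper: each record is serialised with sorted keys, and the
    serialised lines are sorted so that row order does not affect the result.
    """
    lines = sorted(
        json.dumps(record, sort_keys=True, ensure_ascii=False) for record in records
    )
    digest = hashlib.sha256()
    for line in lines:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def pipeline_fingerprint(git_commit, config):
    """Combine a Git commit ID and a pipeline configuration into one fingerprint.

    Hypothetical helper: the config is serialised canonically so that key
    order does not change the fingerprint.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    payload = f"{git_commit}\n{canonical}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def comparable(entry_a, entry_b):
    """Two entries are comparable only if they used the same dataset version."""
    return entry_a["dataset_hash"] == entry_b["dataset_hash"]


# Illustrative data only; the field names below are assumptions, not the
# leaderboard's actual JSON schema.
dataset = [
    {"id": 1, "source": "Hello", "target_lang": "de"},
    {"id": 2, "source": "Goodbye", "target_lang": "fr"},
]
entry = {
    "method": "example-method",
    "dataset_hash": content_hash(dataset),
    "pipeline_fingerprint": pipeline_fingerprint(
        "0000000", {"model": "example-model", "temperature": 0.0}
    ),
}
```

Under these assumptions, any change to a dataset row, the commit, or a configuration value yields a different hash, which is what lets a score be traced to its exact inputs and keeps scores from different dataset versions from being compared directly.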