Method Leaderboard

Benchmarking translation methods for Indigenous and low‑resource languages with reproducible, fingerprinted evaluation.

Have a method to submit? Build a plugin and submit your scores →

Trust Levels
Self-benchmarked: Active
GDS Verified: Coming soon
Community Validated: Coming soon

⚠️ LLM outputs are non-deterministic. Scores represent point-in-time measurements under specific model versions and API configurations. Model providers may update weights, decoding strategies, or safety filters at any time, which can cause score drift between runs.

How It Works

  1. Fingerprinted Pipelines: Each submission is tied to a specific Git commit and pipeline configuration, ensuring results can be traced back to the exact code that produced them.
  2. Versioned Datasets: Evaluation datasets are content-hashed and versioned. Scores are only comparable within the same dataset version, preventing silent data contamination.
  3. Standardised Harness: All metrics are computed by the shared i18n-rosetta evaluation harness, eliminating implementation differences between submissions.
  4. Open Submission: Anyone can submit results by opening a pull request with their method's JSON entry and pipeline fingerprint. Verified and Community trust tiers will be available soon.
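
The fingerprinting and dataset-versioning ideas above can be illustrated with a minimal Python sketch. It is not the i18n-rosetta implementation: the function names, the choice of SHA-256, and the `dataset_hash` field are all assumptions made for illustration, and the real harness and JSON schema may differ.

```python
import hashlib
import json


def content_hash(data: bytes) -> str:
    """Return the SHA-256 hex digest of the raw dataset bytes."""
    return hashlib.sha256(data).hexdigest()


def pipeline_fingerprint(git_commit: str, config: dict) -> str:
    """Hash a Git commit ID together with a canonical form of the config.

    Serialising with sorted keys means two configs with the same content
    produce the same fingerprint regardless of key order.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    payload = f"{git_commit}\n{canonical}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def comparable(entry_a: dict, entry_b: dict) -> bool:
    """Treat two entries as comparable only if their dataset hashes match.

    The "dataset_hash" key is a hypothetical field name.
    """
    return entry_a["dataset_hash"] == entry_b["dataset_hash"]
```

Under these assumptions, changing a single byte of the dataset or a single config value produces a different hash, which is what lets a leaderboard refuse to compare scores across dataset versions.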