MT Evaluation

The MT Eval Harness is a language-pair-agnostic recording surface for machine translation experiments. It doesn't judge how you translate — it measures what comes out.

Run any translation approach against a standardized dataset. The harness produces a run card — a self-contained JSON document recording every input, output, metric, and environment detail. Run cards are the unit of comparison: deterministic, reproducible, and auditable.
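
To make the idea concrete, here is a minimal sketch of how a run card might be assembled and serialized. The `build_run_card` helper and the field names (`dataset`, `entries`, `metrics`, `environment`) are hypothetical, chosen for illustration only; the authoritative layout is the one defined in the Run Card Specification.

```python
import json
import platform
import sys


def build_run_card(dataset_name, entries, metrics):
    """Assemble a run-card-like dict (hypothetical layout, not the v2.0 schema)."""
    return {
        "dataset": dataset_name,
        # One record per evaluated entry: input, reference, and model output.
        "entries": entries,
        "metrics": metrics,
        # Environment details help others reproduce the run.
        "environment": {
            "python": sys.version.split()[0],
            "platform": platform.system(),
        },
    }


card = build_run_card(
    dataset_name="example-dataset",  # placeholder name
    entries=[{"source": "hello", "reference": "bonjour", "output": "bonjour"}],
    metrics={"exact_match_rate": 1.0},
)

# sort_keys gives a stable key order, so identical runs serialize identically.
card_json = json.dumps(card, indent=2, sort_keys=True, ensure_ascii=False)
print(card_json)
```

Serializing with a fixed key order is one simple way to keep two cards from the same run byte-for-byte comparable.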

How It Works

1. Pick a dataset: a curated set of source→target pairs with gold-standard references.
2. Run the harness: it sends each entry through your model, collects outputs, and scores them.
3. Get a run card: chrF++, exact match rate, FST acceptance, latency, and cost, all in one JSON file.
4. Submit to the leaderboard: compare your method against others on the same dataset.
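
The steps above can be sketched as a short loop. This is an illustration under stated assumptions, not the harness's actual code: `run_harness`, the `translate` callable, and the `source`/`reference` keys are hypothetical names.

```python
import time
from typing import Callable


def run_harness(dataset: list[dict], translate: Callable[[str], str]) -> dict:
    """Send each entry through `translate`, then score the collected outputs."""
    results = []
    for entry in dataset:
        start = time.perf_counter()
        output = translate(entry["source"])
        latency_s = time.perf_counter() - start
        results.append({
            "source": entry["source"],
            "reference": entry["reference"],
            "output": output,
            "latency_s": latency_s,
        })
    matches = sum(r["output"] == r["reference"] for r in results)
    return {
        "results": results,
        # Percentage of outputs identical to the reference.
        "exact_match_rate": 100.0 * matches / len(results) if results else 0.0,
    }


# Toy dataset and a stand-in "model" (a lookup table), for demonstration only.
toy_dataset = [
    {"source": "one", "reference": "uno"},
    {"source": "two", "reference": "dos"},
]
lookup = {"one": "uno", "two": "due"}
summary = run_harness(toy_dataset, lambda text: lookup.get(text, ""))
print(summary["exact_match_rate"])  # 50.0
```

Because the model is passed in as a plain callable, the same loop works for an API client, a local model, or a rule-based system.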

Links

Leaderboard — Live rankings across datasets and language pairs
Eval Harness on GitHub — Source code, installation, and contribution guide
Harness Documentation — Installation, CLI flags, and run card schema
Evaluation Datasets — Available datasets and the dataset format specification
Run Card Specification — Full v2.0 schema reference

:::danger DO NOT TRAIN on evaluation data

These datasets exist to measure translation quality, not to improve it. Methods trained, fine-tuned, or few-shot-prompted with evaluation data will produce artificially inflated scores and will be disqualified from the leaderboard.

If you need training data, use separate corpora. The evaluation sets must remain unseen by your model.

:::
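
One way to guard against accidental leakage is to check a training corpus for overlap with the evaluation set before training. The sketch below is a hypothetical helper, not part of the harness. It only catches exact matches after light normalization, so paraphrased or partially overlapping entries would go undetected.

```python
import unicodedata


def normalize(text: str) -> str:
    """Normalize Unicode form, case, and whitespace before comparing."""
    return " ".join(unicodedata.normalize("NFC", text).casefold().split())


def find_leaks(training_sources: list[str], eval_sources: list[str]) -> list[str]:
    """Return evaluation sources that also appear in the training corpus."""
    seen = {normalize(s) for s in training_sources}
    return [s for s in eval_sources if normalize(s) in seen]


# Made-up sentences for illustration.
training = ["The cat sleeps.", "It is  raining."]
evaluation = ["it is raining.", "The dog barks."]
leaks = find_leaks(training, evaluation)
print(leaks)  # ['it is raining.']
```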

What Gets Measured

The harness computes several complementary metrics per run:

| Metric | What it tells you |
| --- | --- |
| chrF++ | Character n-gram F-score against the gold standard, extended with word unigrams and bigrams. Language-agnostic, no language-specific tokenizer needed. |
| Exact match rate | Percentage of outputs identical to the gold standard. A high bar, useful for morphologically precise targets. |
| FST acceptance rate | Percentage of outputs accepted by a finite-state transducer analyzer (when available). Measures morphological validity. |
| Latency (avg / median / p95) | How fast your method responds. |
| Cost (total / per-entry) | Token usage and USD cost for API-based methods. |
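
As a rough illustration of how the simpler metrics can be derived from per-entry results, the sketch below computes an exact match rate and a latency summary using only the standard library. The function names and the nearest-rank definition of p95 are assumptions; the harness may compute percentiles differently. chrF++ and FST acceptance are left out because they depend on external tooling.

```python
import math
import statistics


def exact_match_rate(outputs, references):
    """Percentage of outputs that are identical to their reference."""
    if not outputs:
        return 0.0
    matches = sum(o == r for o, r in zip(outputs, references))
    return 100.0 * matches / len(outputs)


def latency_summary(latencies_ms):
    """Average, median, and nearest-rank 95th percentile of latencies."""
    ordered = sorted(latencies_ms)
    # Nearest rank: smallest value with at least 95% of samples at or below it.
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return {
        "avg": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
    }


# Made-up values for illustration.
outputs = ["uno", "due", "tres", "cuatro"]
references = ["uno", "dos", "tres", "cuatro"]
print(exact_match_rate(outputs, references))  # 75.0

latencies_ms = [100, 120, 110, 400, 130]
latency = latency_summary(latencies_ms)
print(latency)
```

With only five samples, the p95 is simply the slowest request (400 ms here), which shows why the median and p95 are reported alongside the average: a single slow call pulls the average to 172 ms while the median stays at 120 ms.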

Next Steps

Install the harness — Python 3.10+, two pip packages
Run your first experiment — One command, one dataset, one run card
Browse datasets — See what's available for benchmarking
