Eval Harness v2.0

The harness runs translation experiments and produces run cards. It handles prompt construction, API calls, scoring, and result serialization — you supply the dataset and the model.

Installation

Requirements: Python 3.10+

pip install sacrebleu aiohttp

Clone the harness repository:

git clone https://github.com/gamedaysuits/gds-mt-eval-harness.git
cd gds-mt-eval-harness

Usage

python eval/baseline_experiment.py --dataset path/to/dataset.json

This runs every entry in the dataset through the configured model, scores the outputs, and writes a run card JSON file to the results/ directory.

CLI Flags

Flag	Required	Default	Description
`--dataset`	✅	—	Path to the evaluation dataset JSON file
`--model`	—	`openai/gpt-4o`	OpenRouter model slug (e.g., `google/gemini-2.5-pro`)
`--condition`	—	`baseline`	Experiment label. Use to distinguish prompt strategies (e.g., `coached`, `few-shot`, `dictionary-augmented`)
`--temperature`	—	`0.3`	Sampling temperature. Lower = more deterministic
`--batch-size`	—	`5`	Number of entries per concurrent API batch
`--fst-analyzer`	—	`null`	Path to an FST analyzer binary. When provided, each output is tested for morphological acceptance
`--submit`	—	`false`	Submit the run card to the leaderboard API after the run completes

Examples

# Run with defaults (GPT-4o, baseline condition)
python eval/baseline_experiment.py --dataset data/edtekla-dev-v1.json

# Coached experiment with Gemini, lower temperature
python eval/baseline_experiment.py \
  --dataset data/edtekla-dev-v1.json \
  --model google/gemini-2.5-pro \
  --condition coached-v3 \
  --temperature 0.1

# Run with FST validation and auto-submit
python eval/baseline_experiment.py \
  --dataset data/edtekla-dev-v1.json \
  --fst-analyzer ./bin/crk-analyzer \
  --submit

Run Card Schema

Every experiment produces a run card — a self-contained JSON document. The top-level structure:

{
  "run_id": "uuid-v4",
  "harness_version": "2.0",
  "model_slug": "openai/gpt-4o",
  "model_id": "gpt-4o-2024-08-06",
  "condition": "baseline",
  "timestamp": "2025-05-20T03:22:41Z",
  "elapsed_seconds": 142.7,
  "dataset": { ... },
  "config": { ... },
  "system_prompt_sha256": "abc123...",
  "system_prompt_used": "You are a translator...",
  "fingerprint": { ... },
  "scores": { ... },
  "totals": { ... },
  "environment": { ... },
  "results": [ ... ],
  "run_card_hash": "sha256-of-entire-card"
}

See the Run Card Specification for the full schema with every field documented.

Key Blocks

dataset — Identifies which dataset was used, including its content hash so results are tied to a specific version:

{
  "id": "edtekla-dev-v1",
  "version": "1.0",
  "language_pair": "EN→CRK",
  "sha256": "...",
  "entry_count": 124
}

scores — Aggregate metrics for the run:

{
  "total": 124,
  "exact_matches": 12,
  "exact_match_rate": 0.0968,
  "fst_accepted": 87,
  "fst_acceptance_rate": 0.7016,
  "chrf_plus_plus": 42.31,
  "errors": 0,
  "avg_latency_seconds": 1.15,
  "median_latency_seconds": 1.02,
  "p95_latency_seconds": 2.34,
  "by_difficulty": { ... },
  "by_provenance": { ... }
}

totals — Token usage and cost tracking:

{
  "prompt_tokens": 48200,
  "completion_tokens": 3100,
  "reasoning_tokens": 0,
  "cached_tokens": 12000,
  "total_cost_usd": 0.42,
  "cost_per_entry_usd": 0.0034,
  "reasoning_ratio": 0.0
}

Fingerprint vs Run Card Hash

The harness produces two distinct hashes. They serve different purposes:

Fingerprint

The fingerprint answers: "Could this run be reproduced?"

It hashes the combination of inputs that define the experiment configuration — not the outputs:

Dataset SHA-256
Model slug
Condition label
System prompt SHA-256
Temperature
Harness version

Two runs with identical fingerprints used the same setup. Their results should be comparable (modulo API non-determinism).

Run Card Hash

The run card hash answers: "Has this specific result file been tampered with?"

It's the SHA-256 of the entire run card JSON (excluding the run_card_hash field itself). If any field changes — a score, a timestamp, a single output — the hash breaks.

:::info When to use which Use the fingerprint to group comparable runs (same experiment, different executions). Use the run card hash to verify integrity of a specific result file. :::

Submitting to the Leaderboard

Automatic submission

Pass --submit to upload the run card on completion:

python eval/baseline_experiment.py \
  --dataset data/edtekla-dev-v1.json \
  --submit

Manual submission

Run cards are saved as JSON files in results/. You can submit any run card file via the leaderboard UI at /leaderboard, or through the API:

curl -X POST https://i18n-rosetta.com/api/leaderboard/submit \
  -H "Content-Type: application/json" \
  -d @results/your-run-card.json

:::warning Leaderboard validation The leaderboard validates submitted run cards against the dataset registry. Submissions referencing unknown datasets, or with a broken run_card_hash, are rejected. :::

Installation​

Usage​

CLI Flags​

Examples​

Run Card Schema​

Key Blocks​

Fingerprint vs Run Card Hash​

Fingerprint​

Run Card Hash​

Submitting to the Leaderboard​

Automatic submission​

Manual submission​