
Eval Harness v2.0

The harness runs translation experiments and produces run cards. It handles prompt construction, API calls, scoring, and result serialization — you supply the dataset and the model.

Installation

Requirements: Python 3.10+

pip install sacrebleu aiohttp

Clone the harness repository:

git clone https://github.com/gamedaysuits/gds-mt-eval-harness.git
cd gds-mt-eval-harness

Usage

python eval/baseline_experiment.py --dataset path/to/dataset.json

This runs every entry in the dataset through the configured model, scores the outputs, and writes a run card JSON file to the results/ directory.

CLI Flags

| Flag | Required | Default | Description |
| --- | --- | --- | --- |
| --dataset | Yes | (none) | Path to the evaluation dataset JSON file |
| --model | No | openai/gpt-4o | OpenRouter model slug (e.g., google/gemini-2.5-pro) |
| --condition | No | baseline | Experiment label. Use to distinguish prompt strategies (e.g., coached, few-shot, dictionary-augmented) |
| --temperature | No | 0.3 | Sampling temperature. Lower = more deterministic |
| --batch-size | No | 5 | Number of entries per concurrent API batch |
| --fst-analyzer | No | null | Path to an FST analyzer binary. When provided, each output is tested for morphological acceptance |
| --submit | No | false | Submit the run card to the leaderboard API after the run completes |

Examples

# Run with defaults (GPT-4o, baseline condition)
python eval/baseline_experiment.py --dataset data/edtekla-dev-v1.json

# Coached experiment with Gemini, lower temperature
python eval/baseline_experiment.py \
--dataset data/edtekla-dev-v1.json \
--model google/gemini-2.5-pro \
--condition coached-v3 \
--temperature 0.1

# Run with FST validation and auto-submit
python eval/baseline_experiment.py \
--dataset data/edtekla-dev-v1.json \
--fst-analyzer ./bin/crk-analyzer \
--submit

Run Card Schema

Every experiment produces a run card — a self-contained JSON document. The top-level structure:

{
"run_id": "uuid-v4",
"harness_version": "2.0",
"model_slug": "openai/gpt-4o",
"model_id": "gpt-4o-2024-08-06",
"condition": "baseline",
"timestamp": "2025-05-20T03:22:41Z",
"elapsed_seconds": 142.7,
"dataset": { ... },
"config": { ... },
"system_prompt_sha256": "abc123...",
"system_prompt_used": "You are a translator...",
"fingerprint": { ... },
"scores": { ... },
"totals": { ... },
"environment": { ... },
"results": [ ... ],
"run_card_hash": "sha256-of-entire-card"
}

See the Run Card Specification for the full schema with every field documented.

Key Blocks

dataset — Identifies which dataset was used, including its content hash so results are tied to a specific version:

{
"id": "edtekla-dev-v1",
"version": "1.0",
"language_pair": "EN→CRK",
"sha256": "...",
"entry_count": 124
}

scores — Aggregate metrics for the run:

{
"total": 124,
"exact_matches": 12,
"exact_match_rate": 0.0968,
"fst_accepted": 87,
"fst_acceptance_rate": 0.7016,
"chrf_plus_plus": 42.31,
"errors": 0,
"avg_latency_seconds": 1.15,
"median_latency_seconds": 1.02,
"p95_latency_seconds": 2.34,
"by_difficulty": { ... },
"by_provenance": { ... }
}

totals — Token usage and cost tracking:

{
"prompt_tokens": 48200,
"completion_tokens": 3100,
"reasoning_tokens": 0,
"cached_tokens": 12000,
"total_cost_usd": 0.42,
"cost_per_entry_usd": 0.0034,
"reasoning_ratio": 0.0
}
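
The rate and per-entry fields in the scores and totals blocks above look like plain ratios of the raw counts. The following is a minimal sketch that recomputes them from the example values. It assumes four-decimal rounding and that cost_per_entry_usd divides by scores.total; neither is confirmed by the harness source, and `check_derived_fields` is a hypothetical helper, not part of the harness.

```python
import math


def check_derived_fields(scores: dict, totals: dict, places: int = 4) -> dict:
    """Return {field: (recorded, recomputed)} for derived fields that disagree.

    Assumption: derived values are rounded to `places` decimals and the
    per-entry cost uses scores["total"] as its denominator.
    """
    n = scores["total"]
    recomputed = {
        "exact_match_rate": round(scores["exact_matches"] / n, places),
        "fst_acceptance_rate": round(scores["fst_accepted"] / n, places),
        "cost_per_entry_usd": round(totals["total_cost_usd"] / n, places),
    }
    recorded = {**scores, **totals}
    return {
        name: (recorded[name], value)
        for name, value in recomputed.items()
        if not math.isclose(recorded[name], value, abs_tol=1e-9)
    }


# Values copied from the example blocks above.
scores = {
    "total": 124,
    "exact_matches": 12,
    "exact_match_rate": 0.0968,
    "fst_accepted": 87,
    "fst_acceptance_rate": 0.7016,
}
totals = {"total_cost_usd": 0.42, "cost_per_entry_usd": 0.0034}

# An empty dict means every derived field matches its raw counts.
mismatches = check_derived_fields(scores, totals)
print(mismatches)
```

A check like this is mostly useful as a sanity pass before comparing run cards by hand; it says nothing about whether the raw counts themselves are right.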

Fingerprint vs Run Card Hash

The harness produces two distinct hashes. They serve different purposes:

Fingerprint

The fingerprint answers: "Could this run be reproduced?"

It hashes the combination of inputs that define the experiment configuration — not the outputs:

  • Dataset SHA-256
  • Model slug
  • Condition label
  • System prompt SHA-256
  • Temperature
  • Harness version

Two runs with identical fingerprints used the same setup. Their results should be comparable (modulo API non-determinism).
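
To make the idea concrete, here is a minimal sketch of hashing the six inputs listed above into one digest. The harness's actual serialization and field names are not documented here, so `compute_fingerprint` and its canonical-JSON scheme are assumptions; digests from this sketch should not be expected to match the harness's own fingerprints.

```python
import hashlib
import json


def compute_fingerprint(
    dataset_sha256: str,
    model_slug: str,
    condition: str,
    system_prompt_sha256: str,
    temperature: float,
    harness_version: str,
) -> str:
    """Hash the experiment inputs (never the outputs) into one hex digest.

    Assumption: inputs are serialized as compact JSON with sorted keys.
    """
    payload = {
        "dataset_sha256": dataset_sha256,
        "model_slug": model_slug,
        "condition": condition,
        "system_prompt_sha256": system_prompt_sha256,
        "temperature": temperature,
        "harness_version": harness_version,
    }
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Placeholder hashes for illustration only.
run_a = compute_fingerprint("d" * 64, "openai/gpt-4o", "baseline", "p" * 64, 0.3, "2.0")
run_b = compute_fingerprint("d" * 64, "openai/gpt-4o", "baseline", "p" * 64, 0.3, "2.0")
run_c = compute_fingerprint("d" * 64, "openai/gpt-4o", "baseline", "p" * 64, 0.1, "2.0")

print(run_a == run_b)  # same setup, same fingerprint
print(run_a == run_c)  # temperature differs, so the fingerprint differs
```

Sorting keys before hashing keeps the digest independent of the order in which the inputs were assembled, which is what lets two separate executions of the same setup land on the same value.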

Run Card Hash

The run card hash answers: "Has this specific result file been tampered with?"

It's the SHA-256 of the entire run card JSON (excluding the run_card_hash field itself). If any field changes — a score, a timestamp, a single output — the hash breaks.
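
A rough sketch of that integrity check follows. The text above does not say how the harness canonicalizes the JSON before hashing, so this version assumes compact JSON with sorted keys encoded as UTF-8; `compute_run_card_hash` and `verify_run_card` are hypothetical names, and a hash produced here will only match a real run card if the harness happens to serialize the same way.

```python
import hashlib
import json


def compute_run_card_hash(card: dict) -> str:
    """SHA-256 over the card with the run_card_hash field left out.

    Assumption: canonical form is sorted-key, compact, UTF-8 JSON.
    """
    body = {key: value for key, value in card.items() if key != "run_card_hash"}
    canonical = json.dumps(
        body, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_run_card(card: dict) -> bool:
    """True when the stored hash matches the recomputed one."""
    return card.get("run_card_hash") == compute_run_card_hash(card)


# A trimmed-down card for illustration; real cards carry many more fields.
card = {
    "run_id": "uuid-v4",
    "condition": "baseline",
    "scores": {"total": 124, "exact_matches": 12},
}
card["run_card_hash"] = compute_run_card_hash(card)
print(verify_run_card(card))

# Changing any field after hashing invalidates the stored hash.
card["scores"]["exact_matches"] = 13
print(verify_run_card(card))
```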

:::info When to use which
Use the fingerprint to group comparable runs (same experiment, different executions). Use the run card hash to verify the integrity of a specific result file.
:::
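
As an illustration of grouping by fingerprint, the sketch below buckets already-loaded run cards by their fingerprint block. The contents of that block are not shown in this page, so the code treats it as opaque and uses its sorted-key JSON form as the grouping key; `group_by_fingerprint` and the sample fingerprint values are hypothetical.

```python
import json
from collections import defaultdict


def fingerprint_key(card: dict) -> str:
    """Stable string key for a card's fingerprint block, whatever it contains."""
    return json.dumps(card["fingerprint"], sort_keys=True, separators=(",", ":"))


def group_by_fingerprint(cards: list) -> dict:
    """Map each distinct fingerprint key to the run_ids that share it."""
    groups = defaultdict(list)
    for card in cards:
        groups[fingerprint_key(card)].append(card["run_id"])
    return dict(groups)


# Hypothetical cards: the "hash" field inside fingerprint is made up.
cards = [
    {"run_id": "run-1", "fingerprint": {"hash": "aaa"}},
    {"run_id": "run-2", "fingerprint": {"hash": "bbb"}},
    {"run_id": "run-3", "fingerprint": {"hash": "aaa"}},
]

for key, run_ids in group_by_fingerprint(cards).items():
    print(key, run_ids)
```

Runs that land in the same bucket are the ones whose scores can be averaged or compared directly.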


Submitting to the Leaderboard

Automatic submission

Pass --submit to upload the run card on completion:

python eval/baseline_experiment.py \
--dataset data/edtekla-dev-v1.json \
--submit

Manual submission

Run cards are saved as JSON files in results/. You can submit any run card file via the leaderboard UI at /leaderboard, or through the API:

curl -X POST https://i18n-rosetta.com/api/leaderboard/submit \
-H "Content-Type: application/json" \
-d @results/your-run-card.json
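
The same request can be made from Python with only the standard library. This is a sketch that mirrors the curl command above; the helper names are hypothetical, and the shape of the API's response is not documented here, so the code simply returns the status code and raw body.

```python
import json
import urllib.request

SUBMIT_URL = "https://i18n-rosetta.com/api/leaderboard/submit"


def build_submit_request(run_card_path: str) -> urllib.request.Request:
    """Build a POST request carrying the run card file as its JSON body."""
    with open(run_card_path, "rb") as handle:
        body = handle.read()
    json.loads(body)  # fail early if the file is not valid JSON
    return urllib.request.Request(
        SUBMIT_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_run_card(run_card_path: str, timeout: float = 30.0):
    """Send the request; returns (status_code, response_text).

    urlopen raises urllib.error.HTTPError on 4xx/5xx responses, for
    example when the leaderboard rejects the submission.
    """
    request = build_submit_request(run_card_path)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8")
```

Calling `submit_run_card("results/your-run-card.json")` performs the upload. Keeping request construction separate from sending makes it possible to inspect the request without touching the network.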

:::warning Leaderboard validation
The leaderboard validates submitted run cards against the dataset registry. Submissions referencing unknown datasets, or with a broken run_card_hash, are rejected.
:::