Run Card Specification

The run card is the complete record of a single evaluation run. It contains everything needed to understand, reproduce, and verify the experiment: configuration, scores, individual results, token usage, and environment metadata.

Schema version: 2.0

Top-Level Fields

Field	Type	Description
`run_id`	`string`	UUID v4 generated at the start of the run
`harness_version`	`string`	Semantic version of the harness that produced this card (e.g., `2.0`)
`model_slug`	`string`	OpenRouter model slug used for the run (e.g., `openai/gpt-4o`)
`model_id`	`string`	Resolved model identifier returned by the API (e.g., `gpt-4o-2024-08-06`)
`condition`	`string`	Experiment label (e.g., `baseline`, `coached-v3`, `few-shot`)
`timestamp`	`string`	ISO 8601 UTC timestamp when the run started
`elapsed_seconds`	`number`	Wall-clock duration of the entire run

{
  "run_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "harness_version": "2.0",
  "model_slug": "openai/gpt-4o",
  "model_id": "gpt-4o-2024-08-06",
  "condition": "baseline",
  "timestamp": "2025-05-20T03:22:41Z",
  "elapsed_seconds": 142.7
}

`dataset`

Identifies the evaluation dataset and pins it to a specific content version via SHA-256.

Field	Type	Description
`id`	`string`	Dataset identifier (e.g., `edtekla-dev-v1`)
`version`	`string`	Dataset version string
`language_pair`	`string`	Display label (e.g., `EN→CRK`)
`sha256`	`string`	SHA-256 hash of the dataset file contents. Guarantees the exact data used
`entry_count`	`number`	Number of entries in the dataset

{
  "dataset": {
    "id": "edtekla-dev-v1",
    "version": "1.0",
    "language_pair": "EN→CRK",
    "sha256": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
    "entry_count": 124
  }
}

`config`

The API and batching configuration used for this run.

Field	Type	Description
`api_provider`	`string`	API provider name (e.g., `openrouter`)
`temperature`	`number`	Sampling temperature
`max_tokens`	`number`	Maximum tokens per completion
`batch_size`	`number`	Entries per concurrent batch
`concurrency`	`number`	Maximum parallel API requests

{
  "config": {
    "api_provider": "openrouter",
    "temperature": 0.3,
    "max_tokens": 1024,
    "batch_size": 5,
    "concurrency": 3
  }
}

`system_prompt_sha256` / `system_prompt_used`

Field	Type	Description
`system_prompt_sha256`	`string`	SHA-256 hash of the system prompt. Included in the fingerprint
`system_prompt_used`	`string`	The full system prompt text sent to the model

The prompt hash is part of the fingerprint — two runs with different prompts will have different fingerprints even if all other settings match.

`fingerprint`

A reproducibility identifier. Two runs with identical fingerprints used the same experimental setup.

Field	Type	Description
`hash`	`string`	SHA-256 hash of the sorted components
`components`	`object`	The input values that were hashed

Fingerprint Components

Component	Description
`dataset_sha256`	Hash of the dataset file
`model_slug`	Model used
`condition`	Experiment condition label
`system_prompt_sha256`	Hash of the system prompt
`temperature`	Sampling temperature
`harness_version`	Harness version

{
  "fingerprint": {
    "hash": "7f83b1657ff1fc53b92dc18148a1d65dfc2d4b1fa3d677284addd200126d9069",
    "components": {
      "dataset_sha256": "e3b0c44298fc1c14...",
      "model_slug": "openai/gpt-4o",
      "condition": "baseline",
      "system_prompt_sha256": "abc123...",
      "temperature": 0.3,
      "harness_version": "2.0"
    }
  }
}

:::info Fingerprint ≠ Run Card Hash The fingerprint identifies the experiment configuration. The run_card_hash verifies the result file integrity. See Fingerprint vs Run Card Hash for details. :::

`scores`

Aggregate metrics for the entire run.

Top-Level Scores

Field	Type	Description
`total`	`number`	Total entries evaluated
`exact_matches`	`number`	Entries where output exactly matched the gold standard
`exact_match_rate`	`number`	`exact_matches / total` (0.0–1.0)
`fst_accepted`	`number`	Entries where the FST analyzer accepted the output
`fst_acceptance_rate`	`number`	`fst_accepted / total` (0.0–1.0). `null` if no FST analyzer was used
`chrf_plus_plus`	`number`	Corpus-level chrF++ score (0–100)
`errors`	`number`	Entries that failed (API error, timeout, etc.)
`avg_latency_seconds`	`number`	Mean response time across all entries
`median_latency_seconds`	`number`	Median response time
`p95_latency_seconds`	`number`	95th percentile response time

`by_difficulty`

Scores broken down by difficulty tier. Each key (easy, medium, hard) contains the same metric fields as the top-level scores.

{
  "by_difficulty": {
    "easy": {
      "total": 42,
      "exact_matches": 8,
      "exact_match_rate": 0.1905,
      "chrf_plus_plus": 51.2,
      "fst_accepted": 35,
      "fst_acceptance_rate": 0.8333
    },
    "medium": { ... },
    "hard": { ... }
  }
}

`by_provenance`

Scores broken down by entry provenance. Each key (e.g., gold_standard, textbook) contains the same metric fields.

{
  "by_provenance": {
    "gold_standard": {
      "total": 80,
      "exact_matches": 10,
      "exact_match_rate": 0.125,
      "chrf_plus_plus": 44.8
    },
    "textbook": { ... }
  }
}

`totals`

Token usage and cost tracking for the entire run.

Field	Type	Description
`prompt_tokens`	`number`	Total input tokens across all API calls
`completion_tokens`	`number`	Total output tokens
`reasoning_tokens`	`number`	Tokens used for chain-of-thought reasoning (model-dependent, 0 for most models)
`cached_tokens`	`number`	Tokens served from the provider's prompt cache
`total_cost_usd`	`number`	Total cost in USD (as reported by the API)
`cost_per_entry_usd`	`number`	`total_cost_usd / entry_count`
`reasoning_ratio`	`number`	`reasoning_tokens / completion_tokens` (0.0–1.0)

{
  "totals": {
    "prompt_tokens": 48200,
    "completion_tokens": 3100,
    "reasoning_tokens": 0,
    "cached_tokens": 12000,
    "total_cost_usd": 0.42,
    "cost_per_entry_usd": 0.0034,
    "reasoning_ratio": 0.0
  }
}

`environment`

Runtime environment metadata for reproducibility.

Field	Type	Description
`harness_version`	`string`	Harness version (mirrors top-level `harness_version`)
`harness_git_commit`	`string`	Git commit SHA of the harness at run time
`python_version`	`string`	Python interpreter version
`sacrebleu_version`	`string`	sacrebleu library version (used for chrF++ scoring)
`os`	`string`	Operating system identifier

{
  "environment": {
    "harness_version": "2.0",
    "harness_git_commit": "a1b2c3d",
    "python_version": "3.11.9",
    "sacrebleu_version": "2.4.0",
    "os": "macOS-14.5-arm64"
  }
}

`results[]`

The per-entry results array. One object per dataset entry, in index order.

Field	Type	Description
`entry_index`	`number`	Index of this entry in the dataset (matches `entries[].index`)
`source_text`	`string`	The source text that was translated
`target_expected`	`string`	The gold-standard reference from the dataset
`target_output`	`string`	The model's actual output
`exact_match`	`boolean`	Whether `target_output === target_expected`
`entry_chrf`	`number`	Sentence-level chrF++ score for this entry (0–100)
`fst_accepted`	`boolean \| null`	Whether the FST analyzer accepted the output. `null` if no analyzer was configured
`fst_analysis`	`string[]`	FST analysis strings for the output (empty array if not analyzed or rejected)
`difficulty`	`string`	Difficulty tier from the dataset (`easy`, `medium`, `hard`)
`provenance`	`string`	Provenance tag from the dataset
`latency_seconds`	`number`	Response time for this individual entry
`usage`	`object`	Per-entry token usage: `{ prompt_tokens, completion_tokens, reasoning_tokens }`
`error`	`string \| null`	Error message if this entry failed. `null` on success

{
  "results": [
    {
      "entry_index": 0,
      "source_text": "Hello",
      "target_expected": "tânisi",
      "target_output": "tânisi",
      "exact_match": true,
      "entry_chrf": 100.0,
      "fst_accepted": true,
      "fst_analysis": ["tânisi+V+AI+Ind+2Sg"],
      "difficulty": "easy",
      "provenance": "gold_standard",
      "latency_seconds": 0.82,
      "usage": {
        "prompt_tokens": 385,
        "completion_tokens": 12,
        "reasoning_tokens": 0
      },
      "error": null
    }
  ]
}

`run_card_hash`

Field	Type	Description
`run_card_hash`	`string`	SHA-256 hash of the entire run card JSON, with the `run_card_hash` field itself set to `""` during hashing

This is the tamper-detection seal. The leaderboard re-computes this hash on submission and rejects cards where it doesn't match.

Computing the hash:

Serialize the run card to JSON with run_card_hash set to ""
Compute SHA-256 of the serialized string
Set run_card_hash to the resulting hex digest

import hashlib, json

card["run_card_hash"] = ""
card_json = json.dumps(card, sort_keys=True, ensure_ascii=False)
card["run_card_hash"] = hashlib.sha256(card_json.encode()).hexdigest()

Top-Level Fields​

dataset​

config​

system_prompt_sha256 / system_prompt_used​

fingerprint​

Fingerprint Components​

scores​

Top-Level Scores​

by_difficulty​

by_provenance​

totals​

environment​

results[]​

run_card_hash​