Skip to main content

Conlangs, Scripts & Orthography

rosetta has first-class support for constructed languages via LLM registers and deterministic script converters. This guide covers how conlang support works, what fonts you need, and how to add your own.

:::tip Why conlangs matter Conlangs aren't just novelty — they exercise the exact same infrastructure used for real underserved languages. The quality gate, coaching system, and script conversion pipeline work identically for Klingon and Plains Cree. If your conlang pipeline works, your low-resource language pipeline will too. :::


Supported Constructed Languages

LanguageCodeScript ConverterFont Required
Klingontlh✅ Romanization → pIqaDPUA font (e.g., pIqaD qolqoS)
Sindarin (Tolkien Elvish)x-elvish-s✅ Latin → TengwarCSUR PUA font
Kryptonianx-kryptonian✅ Latin → KryptonianPUA font
Pirate Englishx-pirate❌ register onlyNone
Shakespearean Englishx-shakespeare❌ register onlyNone
Yoda-speakx-yoda❌ register onlyNone

Conlang codes use the x- prefix per BCP-47 private-use convention, except Klingon (tlh) which has an ISO 639-3 code assigned by SIL International.


Unicode, PUA, and Font Requirements

The Private Use Area

Klingon (pIqaD), Sindarin (Tengwar), and Kryptonian use Unicode Private Use Area (PUA) characters. PUA is the range U+E000–U+F8FF — these codepoints have no standard assignment. The ConScript Unicode Registry (CSUR) maintains community-agreed mappings for fictional scripts, but these are not part of the Unicode standard.

What this means in practice:

  • PUA text renders as empty boxes (□□□) without the correct font loaded
  • Different fonts may map different glyphs to the same PUA codepoints
  • rosetta does NOT bundle PUA fonts — you must load them yourself
  • System fonts will never render these characters

PUA Ranges by Script

ScriptPUA RangeCSUR Reference
Klingon (pIqaD)U+F8D0–U+F8FFCSUR Klingon
Tengwar (Elvish)U+E000–U+E07FCSUR Tengwar
KryptonianVaries by fontNo CSUR standard

Loading PUA Web Fonts

rosetta includes a built-in command to download and manage PUA web fonts:

# See which fonts are needed for your configured languages
i18n-rosetta fonts list

# Download all needed fonts (auto-detects project type for output directory)
i18n-rosetta fonts install

# Also generate a CSS snippet with @font-face declarations
i18n-rosetta fonts install --css

The fonts install command downloads from verified open-source repositories:

FontScriptLicenseSource
pIqaD qolqoSKlingonSIL Open Font License 1.1GitHub
FreeMonoTengwarTengwarGNU GPL v3 (with font exception)SourceForge
(user-provided)KryptonianVariesNo open-source PUA font available

The output directory is auto-detected from your project structure (Docusaurus → static/fonts/, Hugo → static/fonts/, default → public/fonts/). Override with --dir.

If you prefer to manage fonts manually, add @font-face rules in your CSS:

@font-face {
font-family: 'pIqaD';
src: url('/fonts/pIqaDqolqoS.ttf') format('truetype');
font-display: swap;
unicode-range: U+F8D0-F8FF;
}

/* Apply to Klingon text elements */
[lang="tlh"], [data-script="piqad"] {
font-family: 'pIqaD', sans-serif;
}

:::warning Unicode support is NOT guaranteed The Unicode Consortium has explicitly declined to encode fictional scripts in the standard. PUA assignments are community-maintained and may conflict between font implementations. Always specify the exact font your project uses, and test rendering across browsers. :::


Script Converters

How They Work

rosetta's script conversion is a post-translation hook:

  1. The LLM translates text into a working script (usually Latin or SRO)
  2. The quality gate validates the output
  3. The deterministic converter transforms the validated text into the display script
  4. The converted text is written to disk

This two-step approach works because LLMs produce better output when working in Latin-based scripts. The deterministic converter guarantees correct script output without relying on the model's (often unreliable) script knowledge.

All Five Converters

rosetta ships with five built-in script converters:

Plains Cree: SRO → Syllabics (crk)

Standard Roman Orthography to Canadian Aboriginal Syllabics.

Input: "tawâw"
Output: "ᑕᐚᐤ"

Long vowels use macron/circumflex: ê, î, ô, â. The converter handles all SRO diacritics and maps them to the correct syllabic characters. See Support a Low-Resource Language for the full Cree pipeline.

Serbian: Latin → Cyrillic (sr)

Deterministic Latin-to-Cyrillic conversion for Serbian.

Input: "zdravo"
Output: "здраво"

This handles the full Serbian alphabet mapping including digraphs (lj → љ, nj → њ, dž → џ).

Klingon: Romanization → pIqaD (tlh)

Marc Okrand's romanization system to pIqaD PUA characters.

Input: "Qapla'" (romanized Klingon)
Output: [pIqaD PUA] (requires pIqaD font to render)

Sindarin: Latin → Tengwar (x-elvish-s)

Tolkien's Sindarin mode Tengwar mapping.

Input: "elen síla" (Latin Sindarin)
Output: [Tengwar PUA] (requires Tengwar font to render)

Kryptonian: Latin → Kryptonian (x-kryptonian)

Fan-lexicon Kryptonian script mapping.

Input: "Kal-El"
Output: [Kryptonian PUA] (requires Kryptonian font to render)

Triggering a Converter

Set the scripts field in your language config. For built-in converters, this is auto-detected from the language code:

{
"languages": {
"sr": { "scripts": "sr" },
"crk": {}
}
}

Plains Cree (crk) auto-detects — you don't need to set scripts explicitly.


Multi-Script Languages

Some real languages use multiple active scripts:

LanguageScriptsrosetta Approach
SerbianLatin + CyrillicScript converter (sr) — translate in Latin, convert to Cyrillic
ChineseSimplified + TraditionalSeparate locale codes (zh vs zh-TW) with distinct registers

For languages where both scripts serve the same audience (Serbian), use a script converter. For languages where the scripts serve different audiences (Chinese Simplified for mainland China, Traditional for Taiwan/HK), use separate locale codes.


Orthography Notes

Registers aren't just tone — they carry orthographic instructions that steer the LLM toward correct writing conventions.

Formal Address Forms

rosetta's built-in registers include the culturally appropriate formal address for each language:

LanguageFormal FormRegister Instruction
GermanSieUse Sie-form for formal address
FrenchvousUse vous-form
RussianвыProfessional register with вы-form
TurkishsizProfessional register with siz-form
Korean합쇼체Formal Korean (합쇼체)
Japaneseです/ますPolite professional register (です/ます form)
PolishPan/PaniProfessional register with Pan/Pani form

Gender-Inclusive Writing

Each language card has a gender.inclusiveGuidance field with language-specific advice. This is injected into the LLM translation prompt separately from the register preset, so it applies consistently regardless of which formality preset the user chooses:

  • French: Écriture inclusive with interpunct notation (e.g., "Connecté·e")
  • German: Doppelpunkt notation (e.g., "Benutzer:innen")
  • Spanish: Gender-neutral restructuring preferred; slash notation (e.g., "usuario/a") as fallback

For languages without specific guidance in their card (e.g., Korean, conlangs), the system falls back to a generic rule: "prefer gender-neutral forms or the most inclusive option available."

RTL Script Requirements

Arabic, Hebrew, Persian, and Urdu registers all note right-to-left requirements: Ensure text reads naturally in RTL layout contexts.

Overriding Any Register

Every register is a config value — override it to match your project's voice:

{
"languages": {
"fr": {
"register": "Casual French. Use tu-form. Conversational blog tone. Gender-neutral when possible."
},
"de": {
"register": "Informal German. Use du-form. Tech startup voice."
}
}
}

See Configuration for the full config reference.


Adding a New Conlang

Step-by-step

  1. Choose a BCP-47 private-use code: Use the x- prefix (e.g., x-dothraki, x-valyrian).

  2. Add to your config:

{
"languages": {
"x-dothraki": {
"register": "Dothraki language. Use David J. Peterson's vocabulary from the Living Language Dothraki textbook. Harsh, direct tone. No articles, no verb 'to be'."
}
}
}
  1. (Optional) Add a script converter: If your conlang uses a non-Latin display script, add a converter in lib/scripts.js and register it in SCRIPT_CONVERTERS.

  2. Test: Run i18n-rosetta sync --dry to preview translations without writing files.

  3. Check the quality gate: The quality gate may need tuning for your conlang — particularly the requireNonLatin check if your conlang uses PUA characters.

:::note Conlang quality depends on LLM knowledge The LLM can only translate into a conlang it has seen in training data. Well-documented conlangs (Klingon, Sindarin, Dothraki) work well. Obscure or newly invented conlangs may produce inconsistent results. Use coaching data to improve quality. :::


See Also