A shared space for
Dzongkha and English.
DzoSEM is a bilingual sentence-embedding model that places Dzongkha and English meaning side by side, opening Bhutan's written and oral heritage to anyone who can phrase the question.
A language at the edge of digital legibility.
Six pressures shape the moment Dzongkha is in - a language carrying centuries of Himalayan knowledge, yet still lacking searchability.
A low-resource language
Unlike English or Mandarin, Dzongkha lacks large digitised corpora, translation tooling, and core NLP infrastructure.
Roughly 600,000 speakers
A uniquely Bhutanese problem - too small for global model coverage to solve, too vital to leave unattended.
Written meets spoken
A wide gap separates classical written orthography from the modern spoken form, complicating computational modelling.
Economic gravity
The difficulty of Dzongkha and the pull of English in high-paying work nudge the younger generation away from their mother tongue.
Heritage at risk
Most manuscripts, court records, and Himalayan medical knowledge survive in Dzongkha, but remain difficult to access and understand.
Digitised, yet out of reach
His Majesty's digital initiative has scanned vast archives, but they remain hard to search and harder to read.
To graduate Dzongkha from a low-resource language into a stable one.
Four places this becomes useful.
A shared embedding space is infrastructure - the actual value shows up where people meet the archive.
Historical archives & land records
Sifting through thousands of pages of monastic intelligence, old property records, and legal texts in milliseconds using everyday English phrasing.
Sowa Rigpa & medicine
Mapping topics and surfacing patterns inside centuries of Himalayan medical knowledge currently locked in complex traditional script.
A bridge for the next generation
Giving a generation that increasingly thinks and texts in English a frictionless way to explore their mother tongue, keeping the language alive and relevant.
Limitless downstream potential
By creating a shared embedding space, we unlock the potential for advanced AI applications. It's the foundation needed to build reliable AI infrastructure for Bhutanese languages.
Search in English. Read in Dzongkha.
Type an English idea - the model returns the closest Dzongkha documents by meaning, not by exact keyword matching.
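For readers curious how "closest by meaning" can work in practice, here is a minimal sketch of ranking documents by cosine similarity in a shared embedding space. It is an illustration only, not DzoSEM's actual implementation: the function names, document ids, and three-dimensional vectors are all hypothetical stand-ins for what a real embedding model would produce.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_meaning(query_vec, doc_vecs, top_k=3):
    """Return the top_k (doc_id, score) pairs, highest similarity first."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical, hand-made vectors standing in for embeddings of Dzongkha documents.
doc_vecs = {
    "dz_land_record": [0.9, 0.1, 0.0],
    "dz_medical_text": [0.1, 0.9, 0.2],
    "dz_court_record": [0.7, 0.0, 0.6],
}

# Hypothetical embedding of an English query about land ownership.
query_vec = [0.8, 0.2, 0.1]

results = rank_by_meaning(query_vec, doc_vecs, top_k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

Because the English query and the Dzongkha documents live in the same vector space, no keyword in the query has to appear in any document: ranking depends only on how close the vectors are.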
Building Enterprise AI For
Himalayan Languages.
We develop Gen AI solutions designed for domain-specific and multilingual landscapes.
Semantic Search & Infra
Building the core retrieval engines and robust managed infrastructure necessary to deploy large-scale semantic search for specialized Bhutanese datasets.
Multilingual Voice AI
Expanding our embedding space beyond text. We are mapping out semantic search for spoken Tshangla, Bumthangkha, and Sharchopkha to capture oral traditions.
Vision and OCR AI
Developing a highly accurate, Dzongkha-optimized vision model capable of reading and digitizing documents from any era, from historical manuscripts and pechas to modern documents.
Build the future of Bhutanese Semantic Spaces together.
If you are interested in contributing parallel data, evaluating models, or exploring custom search pipelines, let's connect.