Question 1

What is Dataline?

Accepted Answer

Dataline is an open-source master data matching engine purpose-built for Chinese, Japanese, and Korean (CJK) scripts. It resolves mixed-script customer records (for example 陳大文, Chan Tai Man, and 陈大文) that existing master data management (MDM) tools handle poorly because they transliterate CJK to Latin and match phonetically. Dataline scores three independent signals per character pair: phonetic distance (pinyin/jyutping), visual similarity (stroke sequences), and simplified ↔ traditional normalization. Built in Rust, it runs at 597,000 multi-signal comparisons per second and matches 10 million records in roughly 63 seconds on a 16-core commodity server. It is released under the Apache 2.0 license by Digital Rain Tech.
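To make the mixed-script example concrete, here is a minimal sketch of the normalization signal alone. It is not Dataline's API: the function names and the four-entry conversion table are hypothetical, and a real engine would load a full simplified/traditional mapping.

```rust
/// Toy traditional-to-simplified table (hypothetical, four entries only).
/// Characters without an entry are returned unchanged.
fn to_simplified(c: char) -> char {
    match c {
        '陳' => '陈',
        '張' => '张',
        '劉' => '刘',
        '強' => '强',
        other => other,
    }
}

/// Normalize every character of a name to its simplified form.
fn normalize(name: &str) -> String {
    name.chars().map(to_simplified).collect()
}

/// Two names match on this signal if they are identical after normalization.
fn same_after_normalization(a: &str, b: &str) -> bool {
    normalize(a) == normalize(b)
}

fn main() {
    // Traditional vs simplified spelling of the same name.
    println!("{}", same_after_normalization("陳大文", "陈大文")); // true
    // A romanized form cannot be linked by normalization alone.
    println!("{}", same_after_normalization("陳大文", "Chan Tai Man")); // false
}
```

The second comparison fails on purpose: linking 陳大文 to Chan Tai Man needs the phonetic signal, which is why the three signals are scored separately.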

Question 2

Why not just transliterate CJK names and match phonetically?

Accepted Answer

Transliterating CJK to Latin loses information at every stage. Pinyin is many-to-one (different characters share the same romanization), NYSIIS collapses Chinese consonant distinctions like zh/z/j and ch/c/q into a single bucket, tones are discarded entirely, and OCR errors that swap visually similar characters become invisible to phonetic-only matchers. Dataline scores phonetic, visual, and normalization signals independently, so a strong match in one dimension is not diluted by a weak score in another.
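The many-to-one problem can be shown with a small sketch. The lookup table below is a hypothetical toy covering a handful of characters (toneless pinyin), not a real romanization library; it only demonstrates that distinct characters become indistinguishable once romanized.

```rust
use std::collections::HashMap;

/// Toy toneless-pinyin lookup (hypothetical; a real matcher would use a
/// full dictionary). Returns None for characters outside the table.
fn toy_pinyin(c: char) -> Option<&'static str> {
    match c {
        '陳' | '陈' | '晨' | '臣' => Some("chen"),
        '張' | '张' | '章' => Some("zhang"),
        _ => None,
    }
}

/// Group characters by romanization so collisions become visible.
fn collisions(chars: &[char]) -> HashMap<&'static str, Vec<char>> {
    let mut groups: HashMap<&'static str, Vec<char>> = HashMap::new();
    for &c in chars {
        if let Some(p) = toy_pinyin(c) {
            groups.entry(p).or_default().push(c);
        }
    }
    groups
}

fn main() {
    let chars = ['陳', '晨', '臣', '張', '章'];
    for (pinyin, group) in &collisions(&chars) {
        // Every character in a group is identical to a phonetic-only matcher.
        println!("{}: {:?}", pinyin, group);
    }
}
```

Five distinct characters collapse into two romanized keys, so a matcher that sees only "chen" cannot tell 陳 from 晨 or 臣.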

Question 3

How fast is Dataline?

Accepted Answer

Criterion benchmarks measure 597,000 multi-signal CJK comparisons per second and 8.3 million simplified↔traditional normalizations per second. With first-character surname blocking, Dataline matches 10 million records in approximately 17 minutes single-threaded, 63 seconds on a 16-core commodity server, and 16 seconds on a 64-core cloud instance. All numbers are reproducible with cargo bench.
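As a rough illustration of how first-character blocking and the throughput figure relate, the sketch below counts candidate pairs within blocks and converts a comparison count into wall-clock time. The function names are hypothetical, the sample names are made up, and the time estimate assumes perfectly linear scaling across cores, which the source does not state.

```rust
use std::collections::HashMap;

/// Count candidate pairs when records are compared only within a block
/// keyed by the first character of the name (the surname in CJK names).
fn pairs_with_first_char_blocking(names: &[&str]) -> u64 {
    let mut block_sizes: HashMap<char, u64> = HashMap::new();
    for name in names {
        if let Some(first) = name.chars().next() {
            *block_sizes.entry(first).or_insert(0) += 1;
        }
    }
    // A block of n records yields n * (n - 1) / 2 unordered pairs.
    block_sizes.values().map(|&n| n * (n - 1) / 2).sum()
}

/// Wall-clock estimate, assuming throughput scales linearly with cores.
fn estimated_seconds(comparisons: u64, per_core_rate: f64, cores: u32) -> f64 {
    comparisons as f64 / (per_core_rate * cores as f64)
}

fn main() {
    let names = ["陳大文", "陳小明", "陳志強", "張偉", "張敏", "李娜"];
    let all_pairs = (names.len() * (names.len() - 1) / 2) as u64;
    let blocked = pairs_with_first_char_blocking(&names);
    println!("all pairs: {}, after blocking: {}", all_pairs, blocked);

    // Comparison count implied by ~17 minutes single-threaded at the
    // benchmarked rate of 597,000 comparisons per second.
    let rate = 597_000.0;
    let implied = (17.0 * 60.0 * rate) as u64;
    for cores in [1, 16, 64] {
        println!("{} cores: {:.2} s", cores, estimated_seconds(implied, rate, cores));
    }
}
```

Under the linear-scaling assumption, 17 minutes (1,020 s) divided by 16 and 64 cores gives about 64 s and 16 s, which is consistent with the published 63-second and 16-second figures.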

Question 4

What signals does Dataline use to match characters?

Accepted Answer

Dataline scores three independent signals per character pair: phonetic distance using pinyin and jyutping (preserving consonant distinctions that NYSIIS collapses), visual similarity using stroke sequence comparison (catches OCR and handwriting errors), and normalization across simplified and traditional forms (recognizes 陳 and 陈 as the same character without going through phonetics). The signals are combined after scoring rather than before, which prevents one weak dimension from masking a strong match.
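The effect of combining after scoring can be sketched as follows. This is an assumption-laden illustration, not Dataline's implementation: the struct, its field names, and the choice of "take the strongest signal" as the combination rule are all hypothetical, and the scores are invented.

```rust
/// Per-character-pair similarity scores in [0.0, 1.0].
/// Hypothetical type; not taken from Dataline's code.
#[derive(Debug, Clone, Copy)]
struct Signals {
    phonetic: f64,
    visual: f64,
    normalization: f64,
}

impl Signals {
    /// Late combination (assumed rule): keep the strongest signal.
    fn combine_max(self) -> f64 {
        self.phonetic.max(self.visual).max(self.normalization)
    }

    /// Naive averaging, shown only for contrast.
    fn combine_mean(self) -> f64 {
        (self.phonetic + self.visual + self.normalization) / 3.0
    }
}

fn main() {
    // An OCR-style confusion such as 己 vs 已: visually close,
    // phonetically distant. Scores are invented for illustration.
    let ocr_pair = Signals { phonetic: 0.1, visual: 0.9, normalization: 0.0 };

    println!("max:  {:.2}", ocr_pair.combine_max());
    println!("mean: {:.2}", ocr_pair.combine_mean());
}
```

With averaging, the strong visual score is pulled down by the two weak signals and the pair would likely fall below a match threshold; keeping the signals separate until the end preserves it.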

Question 5

How does Dataline compare to Splink, Dedupe, and Zingg?

Accepted Answer

Splink, Dedupe, and Zingg are mature probabilistic record linkage frameworks for Latin-script data. None of them include CJK-specific matchers — no visual or stroke-based similarity, no simplified↔traditional normalization. Dataline is the only open-source engine purpose-built for CJK scripts, but it currently lacks the probabilistic models and active learning that the Latin-focused tools have. See dataline.dev/compare for a full feature matrix.

Cores	Time	Hardware
1	~17 min	Single-threaded
16	~63 sec	Commodity server
64	~16 sec	Cloud instance

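Match Chinese, Japanese, and Korean names where other tools cannot

Why not just transliterate to Latin script and match phonetically?

Multi-signal matching

Phonetic distance

Visual similarity

Simplified ↔ traditional conversion

Smart blocking

Complete MDM pipeline

Benchmarked, not just claimed

Built in Rust; up and running in five commands.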
