Question 1

What is Dataline?

Accepted Answer

Dataline is an open-source master data matching engine purpose-built for Chinese, Japanese, and Korean (CJK) scripts. It resolves mixed-script customer records (for example 陳大文, Chan Tai Man, and 陈大文) that existing master data management (MDM) tools handle poorly because they transliterate CJK to Latin and match phonetically. Dataline scores three independent signals per character pair: phonetic distance (pinyin/jyutping), visual similarity (stroke sequences), and simplified ↔ traditional normalization. Built in Rust, it runs at 597,000 multi-signal comparisons per second and matches 10 million records in roughly 63 seconds on a 16-core commodity server. It is released under Apache 2.0 by Digital Rain Tech.
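
To make the simplified ↔ traditional signal concrete, here is a minimal sketch of how 陳大文 and 陈大文 can be recognized as the same name without any romanization. The function name to_simplified and the three-entry table are hypothetical and written by hand for illustration; they are not Dataline's API or its mapping data.

```rust
use std::collections::HashMap;

/// Fold traditional characters to their simplified forms.
/// The table is a tiny hand-written sample (an assumption for this sketch);
/// a real engine would ship a complete mapping and build it only once.
fn to_simplified(name: &str) -> String {
    let table: HashMap<char, char> =
        HashMap::from([('陳', '陈'), ('張', '张'), ('劉', '刘')]);
    name.chars()
        .map(|c| *table.get(&c).unwrap_or(&c)) // unmapped characters pass through
        .collect()
}

fn main() {
    let a = to_simplified("陳大文");
    let b = to_simplified("陈大文");
    // Both spellings fold to the same string, so they compare equal.
    println!("{} == {}: {}", a, b, a == b);
}
```

The Latin spelling Chan Tai Man is a Cantonese romanization, so it would have to be caught by the phonetic (jyutping) signal rather than by this one.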

Question 2

Why not just transliterate CJK names and match phonetically?

Accepted Answer

Transliterating CJK to Latin loses information at every stage. Pinyin is many-to-one (different characters share the same romanization), NYSIIS collapses Chinese consonant distinctions like zh/z/j and ch/c/q into a single bucket, tones are discarded entirely, and OCR errors that swap visually similar characters become invisible to phonetic-only matchers. Dataline scores phonetic, visual, and normalization signals independently, so a strong match in one dimension is not diluted by a weak score in another.
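
The many-to-one problem can be shown with a few characters. The sketch below uses a hand-written lookup (the function names toneless_pinyin and collisions are hypothetical, not part of Dataline) and groups characters by their romanization. Three unrelated characters end up in the same bucket, so a matcher that only sees the Latin string cannot tell them apart.

```rust
use std::collections::HashMap;

/// Toneless pinyin for a handful of characters.
/// Hand-written for illustration; a real system would use a full dictionary.
fn toneless_pinyin(c: char) -> Option<&'static str> {
    match c {
        '陈' | '陳' | '晨' | '臣' => Some("chen"),
        '程' | '成' => Some("cheng"),
        _ => None,
    }
}

/// Group characters by romanization to show which ones become indistinguishable.
fn collisions(chars: &[char]) -> HashMap<&'static str, Vec<char>> {
    let mut groups: HashMap<&'static str, Vec<char>> = HashMap::new();
    for &c in chars {
        if let Some(p) = toneless_pinyin(c) {
            groups.entry(p).or_default().push(c);
        }
    }
    groups
}

fn main() {
    let groups = collisions(&['陈', '晨', '臣', '程']);
    for (pinyin, chars) in &groups {
        // Every character listed under one key looks identical after transliteration.
        println!("{}: {:?}", pinyin, chars);
    }
}
```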

Question 3

How fast is Dataline?

Accepted Answer

Criterion benchmarks measure 597,000 multi-signal CJK comparisons per second and 8.3 million simplified↔traditional normalizations per second. With first-character surname blocking, Dataline matches 10 million records in approximately 17 minutes single-threaded, 63 seconds on a 16-core commodity server, and 16 seconds on a 64-core cloud instance. All numbers are reproducible with cargo bench.
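
First-character surname blocking can be sketched as follows. This is an illustration of the general technique, not Dataline's implementation: the function names are hypothetical, and it keys on the raw first character, whereas the source does not say whether Dataline normalizes characters before blocking.

```rust
use std::collections::HashMap;

/// Group record indices by the first character of each name,
/// which is assumed here to be the surname.
fn block_by_first_char(names: &[&str]) -> HashMap<char, Vec<usize>> {
    let mut blocks: HashMap<char, Vec<usize>> = HashMap::new();
    for (i, name) in names.iter().enumerate() {
        if let Some(first) = name.chars().next() {
            blocks.entry(first).or_default().push(i);
        }
    }
    blocks
}

/// Count candidate pairs: n * (n - 1) / 2 within each block.
fn candidate_pairs(blocks: &HashMap<char, Vec<usize>>) -> usize {
    blocks
        .values()
        .map(|b| b.len() * b.len().saturating_sub(1) / 2)
        .sum()
}

fn main() {
    let names = ["陈大文", "陈小明", "李娜", "李伟", "王芳"];
    let blocks = block_by_first_char(&names);
    let all_pairs = names.len() * (names.len() - 1) / 2;
    println!("pairs without blocking: {}", all_pairs);
    println!("pairs with blocking:    {}", candidate_pairs(&blocks));
}
```

Only records that share a block are compared, which is what keeps a 10-million-record run from degenerating into an all-pairs comparison.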

Question 4

What signals does Dataline use to match characters?

Accepted Answer

Dataline scores three independent signals per character pair: phonetic distance using pinyin and jyutping (preserving consonant distinctions that NYSIIS collapses), visual similarity using stroke sequence comparison (catches OCR and handwriting errors), and normalization across simplified and traditional forms (recognizes 陳 and 陈 as the same entity without going through phonetics). Signals are combined after scoring rather than before, preventing one weak dimension from masking a strong match.
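
A small sketch shows why combining after scoring matters. The struct, field names, scores, and the "take the maximum" rule are all assumptions made for illustration; the source does not specify how Dataline combines its signals. The example contrasts a rule that keeps the strongest signal with a plain average that dilutes it.

```rust
/// Per-character-pair scores in [0.0, 1.0]. Field names are hypothetical.
#[derive(Debug, Clone, Copy)]
struct Signals {
    phonetic: f64,
    visual: f64,
    normalized: f64,
}

/// Late fusion by taking the strongest signal (an assumed rule for this sketch).
fn combine_max(s: Signals) -> f64 {
    s.phonetic.max(s.visual).max(s.normalized)
}

/// For contrast: averaging lets weak signals drag down a strong one.
fn combine_mean(s: Signals) -> f64 {
    (s.phonetic + s.visual + s.normalized) / 3.0
}

fn main() {
    // Made-up scores for an OCR-style confusion such as 己 vs 已:
    // the glyphs are visually close but the readings differ.
    let s = Signals { phonetic: 0.1, visual: 0.95, normalized: 0.0 };
    println!("max  = {:.2}", combine_max(s));
    println!("mean = {:.2}", combine_mean(s));
}
```

With the averaging rule the visually obvious match falls well below a typical acceptance threshold, while the maximum keeps it.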

Question 5

How does Dataline compare to Splink, Dedupe, and Zingg?

Accepted Answer

Splink, Dedupe, and Zingg are mature probabilistic record linkage frameworks for Latin-script data. None of them includes CJK-specific matchers: there is no visual or stroke-based similarity and no simplified↔traditional normalization. Dataline is the only open-source engine purpose-built for CJK scripts, but it currently lacks the probabilistic models and active learning that the Latin-focused tools have. See dataline.dev/compare for a full feature matrix.

Cores	Time	Hardware
1	~17 min	Single-threaded
16	~63 sec	Commodity server
64	~16 sec	Cloud instance
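
The three timings are roughly consistent with linear scaling from the single-threaded figure. The sketch below is a back-of-the-envelope check under that assumption (perfect scaling is an idealization, and the function name is hypothetical), not a measurement.

```rust
/// Estimated wall time in seconds, assuming perfectly linear scaling across cores.
fn estimated_seconds(single_thread_secs: f64, cores: u32) -> f64 {
    single_thread_secs / f64::from(cores)
}

fn main() {
    // Approximately 17 minutes single-threaded, taken from the published figures.
    let single = 17.0 * 60.0;
    for cores in [1u32, 16, 64] {
        println!("{:>2} cores: ~{:.1} s", cores, estimated_seconds(single, cores));
    }
}
```

Dividing 1,020 seconds by 16 and by 64 gives about 63.8 and 15.9 seconds, close to the reported 63 and 16 seconds.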

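Matches Chinese, Japanese, and Korean names where other tools cannot

Why not just transliterate to Latin script and match phonetically?

Multi-signal matching

Phonetic distance

Visual similarity

Simplified ↔ traditional conversion

Smart blocking

Complete MDM pipeline

Benchmarked, not just claimed

Built in Rust; try it with five commands.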
