Question 1

What is Dataline?

Accepted Answer

Dataline is an open-source master data matching engine purpose-built for Chinese, Japanese, and Korean (CJK) scripts. It resolves mixed-script customer records (for example 陳大文, Chan Tai Man, and 陈大文) that existing master data management (MDM) tools handle poorly because they transliterate CJK to Latin and match phonetically. Dataline scores three independent signals per character pair: phonetic distance (pinyin/jyutping), visual similarity (stroke sequences), and simplified ↔ traditional normalization. Built in Rust, it runs at 597,000 multi-signal comparisons per second and matches 10 million records in roughly 63 seconds on a 16-core commodity server. It is released under the Apache 2.0 license by Digital Rain Tech.
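To make the normalization signal concrete, here is a minimal sketch of how 陳大文 and 陈大文 can be recognized as the same name without any phonetic step. The function names and the three-entry table are assumptions made for illustration; they are not Dataline's API or data.

```rust
/// Map a traditional character to its simplified form.
/// Toy table for illustration only (hypothetical, not Dataline's data);
/// a real table would cover thousands of characters.
fn to_simplified(c: char) -> char {
    match c {
        '陳' => '陈',
        '張' => '张',
        '劉' => '刘',
        _ => c, // characters without an entry pass through unchanged
    }
}

/// Normalize a whole name character by character.
fn normalize(name: &str) -> String {
    name.chars().map(to_simplified).collect()
}

/// Two names match on this signal if they are identical after normalization.
fn same_after_normalization(a: &str, b: &str) -> bool {
    normalize(a) == normalize(b)
}
```

With this table, same_after_normalization("陳大文", "陈大文") returns true, while the romanized "Chan Tai Man" is left to the phonetic signal. A per-character table like this is a simplification: some simplified characters correspond to more than one traditional form (发 covers both 發 and 髮), so a production mapping needs more care.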

Question 2

Why not just transliterate CJK names and match phonetically?

Accepted Answer

Transliterating CJK to Latin loses information at every stage. Pinyin is many-to-one: different characters share the same romanization. NYSIIS collapses Chinese consonant distinctions like zh/z/j and ch/c/q into a single bucket. Tones are discarded entirely. OCR errors that swap visually similar characters go unrecognized by phonetic-only matchers, because look-alike characters often sound unrelated. Dataline scores phonetic, visual, and normalization signals independently, so a strong match in one dimension is not diluted by a weak score in another.
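The many-to-one problem can be seen in a small sketch. The lookup table below is hand-written and hypothetical (it is not Dataline's pinyin data), and the romanizations are toneless, as they would be after a typical transliteration step.

```rust
use std::collections::HashMap;

/// Toy toneless-pinyin table, hand-written for illustration only.
fn toneless_pinyin(c: char) -> Option<&'static str> {
    match c {
        '张' | '章' => Some("zhang"), // two different surnames, one romanization
        '李' | '黎' => Some("li"),    // lǐ and lí differ only by tone
        '陈' | '陳' => Some("chen"),
        _ => None,
    }
}

/// Group characters by romanization to expose collisions.
fn collisions(chars: &[char]) -> HashMap<&'static str, Vec<char>> {
    let mut groups: HashMap<&'static str, Vec<char>> = HashMap::new();
    for &c in chars {
        if let Some(p) = toneless_pinyin(c) {
            groups.entry(p).or_default().push(c);
        }
    }
    groups
}
```

Calling collisions(&['张', '章', '李', '黎']) puts 张 and 章 under "zhang" and 李 and 黎 under "li". Once the text has been romanized, a phonetic-only matcher has no way to tell the members of each group apart.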

Question 3

How fast is Dataline?

Accepted Answer

Criterion benchmarks measure 597,000 multi-signal CJK comparisons per second and 8.3 million simplified ↔ traditional normalizations per second. With first-character surname blocking, Dataline matches 10 million records in approximately 17 minutes single-threaded, 63 seconds on a 16-core commodity server, and 16 seconds on a 64-core cloud instance. All numbers are reproducible with cargo bench.
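Blocking is what keeps the comparison count manageable. The following is a rough sketch of first-character blocking, assuming names are stored surname-first; the function names are hypothetical and the code does not reflect Dataline's actual implementation.

```rust
use std::collections::HashMap;

/// Group record indices by the first character of each name
/// (the surname, assuming surname-first order).
fn block_by_first_char(names: &[&str]) -> HashMap<char, Vec<usize>> {
    let mut blocks: HashMap<char, Vec<usize>> = HashMap::new();
    for (i, name) in names.iter().enumerate() {
        if let Some(first) = name.chars().next() {
            blocks.entry(first).or_default().push(i);
        }
    }
    blocks
}

/// Count the pairwise comparisons needed when only records that share
/// a block are compared with each other: n * (n - 1) / 2 per block.
fn candidate_pairs(blocks: &HashMap<char, Vec<usize>>) -> usize {
    blocks
        .values()
        .map(|b| b.len() * b.len().saturating_sub(1) / 2)
        .sum()
}
```

For the five names 陈大文, 陈小明, 李雷, 陈伟, and 李娜, blocking yields two blocks and 4 candidate pairs instead of the 10 pairs an all-against-all comparison would need. In this sketch 陳 and 陈 would land in different blocks, so the blocking key would presumably be taken after simplified ↔ traditional normalization; the source does not say how Dataline handles this.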

Question 4

What signals does Dataline use to match characters?

Accepted Answer

Dataline scores three independent signals per character pair: phonetic distance using pinyin and jyutping (preserving consonant distinctions that NYSIIS collapses), visual similarity using stroke sequence comparison (catches OCR and handwriting errors), and normalization across simplified and traditional forms (recognizes 陳 and 陈 as the same character without going through phonetics). Signals are combined after scoring rather than before, preventing one weak dimension from masking a strong match.
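One way to picture "combined after scoring" is the sketch below. The struct, the score values, and the choice of taking the maximum are all assumptions made for illustration; the source does not state which combination rule Dataline uses.

```rust
/// Per-character-pair scores in the range 0.0..=1.0 (hypothetical layout).
#[derive(Debug, Clone, Copy)]
struct SignalScores {
    phonetic: f64,
    visual: f64,
    normalization: f64,
}

/// Assumed rule: keep the strongest signal, so one strong dimension
/// is not masked by weak scores in the others.
fn combine_max(s: SignalScores) -> f64 {
    s.phonetic.max(s.visual).max(s.normalization)
}

/// For contrast: a plain average, which lets weak signals dilute a strong one.
fn combine_mean(s: SignalScores) -> f64 {
    (s.phonetic + s.visual + s.normalization) / 3.0
}
```

Take an OCR confusion such as 未 (wèi) read as 末 (mò): the characters look nearly identical but sound different. With made-up scores of phonetic 0.1, visual 0.9, and normalization 0.0, combine_max keeps the strong visual evidence, while combine_mean drags the pair down to about 0.33.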

Question 5

How does Dataline compare to Splink, Dedupe, and Zingg?

Accepted Answer

Splink, Dedupe, and Zingg are mature probabilistic record linkage frameworks for Latin-script data. None of them includes CJK-specific matchers: they offer no visual or stroke-based similarity and no simplified ↔ traditional normalization. Dataline is the only open-source engine purpose-built for CJK scripts, but it currently lacks the probabilistic models and active learning that the Latin-focused tools have. See dataline.dev/compare for a full feature matrix.

Cores	Time (10 million records)	Hardware
1	~17 min	Single-threaded
16	~63 sec	Commodity server
64	~16 sec	Cloud instance
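The figures in the table are consistent with near-linear scaling across cores. As a quick check, the sketch below divides the single-threaded time by the core count. It assumes perfectly even work division, which real workloads only approximate, and it treats the rounded "~17 min" as exactly 1,020 seconds.

```rust
/// Ideal wall-clock time if the work divides evenly across cores.
/// This is an idealized model, not a measurement.
fn ideal_seconds(single_thread_seconds: f64, cores: u32) -> f64 {
    single_thread_seconds / f64::from(cores)
}
```

ideal_seconds(1020.0, 16) gives 63.75 and ideal_seconds(1020.0, 64) gives 15.9375, which line up with the reported ~63 sec and ~16 sec.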

Match CJK names that other tools miss

Why not just transliterate and match phonetically?

Multi-Signal Matching

Phonetic Distance

Visual Similarity

Simplified ↔ Traditional

Smart Blocking

Full MDM Pipeline

Benchmarked, not just claimed

Built in Rust. Five commands to try it.