Match CJK names that other tools miss
陳大文, Chan Tai Man, and 陈大文 are the same person. Phonetic-only engines can't see that. Dataline matches across scripts, romanizations, and character forms using three independent signals — phonetic, visual, and normalization — so real matches don't slip through.
Why not just transliterate and match phonetically?
Because collapsing CJK to Latin loses information at every stage. Pinyin is many-to-one: many distinct characters share one romanized syllable. NYSIIS merges distinct consonants. Tones disappear. OCR errors become invisible, because characters that look alike rarely sound alike. Dataline scores three signals independently, so one weak dimension can't mask a strong match.
Multi-Signal Matching
Three independent signals per character pair — phonetic, visual, and normalization — combined after scoring. A high visual match isn't diluted by a low phonetic score.
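To illustrate why combining after scoring matters, here is a minimal sketch. The `SignalScores` struct, its field names, and both combination rules are assumptions made for this example and are not taken from Dataline's API. It contrasts averaging with taking the strongest signal.

```rust
/// Hypothetical per-character-pair scores, each in the range 0.0..=1.0.
/// Names and combination rules are illustrative, not Dataline's actual API.
#[derive(Debug, Clone, Copy)]
struct SignalScores {
    phonetic: f64,
    visual: f64,
    normalization: f64,
}

impl SignalScores {
    /// Plain average: one weak signal drags a strong one down.
    fn averaged(&self) -> f64 {
        (self.phonetic + self.visual + self.normalization) / 3.0
    }

    /// Strongest signal wins: a strong match on any one dimension stays visible.
    fn strongest(&self) -> f64 {
        self.phonetic.max(self.visual).max(self.normalization)
    }
}

fn main() {
    // A look-alike pair: visually close, phonetically far apart.
    let s = SignalScores { phonetic: 0.1, visual: 0.95, normalization: 0.0 };
    println!("averaged = {:.2}, strongest = {:.2}", s.averaged(), s.strongest());
}
```

With these made-up scores the average falls well below a typical match threshold, while the strongest-signal rule keeps the visual match intact.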
Phonetic Distance
Pinyin and Jyutping distance scoring that preserves consonant distinctions (zh/z/j, ch/c/q) that NYSIIS collapses into a single bucket.
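As a rough sketch of the idea, the code below scores the distance between the initials of two toneless, lowercase Pinyin syllables. The function names and the 0.0 / 0.5 / 1.0 weights are assumptions for illustration only. The point is that zh, z, and j remain distinct but close, instead of collapsing into one bucket.

```rust
/// Split a toneless, lowercase Pinyin syllable into (initial, final).
/// Two-letter initials are checked first so "zh" is not read as "z".
fn split_initial(syllable: &str) -> (&str, &str) {
    for two in ["zh", "ch", "sh"] {
        if syllable.starts_with(two) {
            return syllable.split_at(2);
        }
    }
    match syllable.chars().next() {
        Some(c) if "bpmfdtnlgkhjqxrzcsyw".contains(c) => syllable.split_at(1),
        _ => ("", syllable), // zero-initial syllable such as "an"
    }
}

/// Group commonly confused initials. 0 means "no family".
fn family(initial: &str) -> u8 {
    match initial {
        "zh" | "z" | "j" => 1,
        "ch" | "c" | "q" => 2,
        _ => 0,
    }
}

/// Illustrative weights: 0.0 identical, 0.5 same family, 1.0 unrelated.
fn initial_distance(a: &str, b: &str) -> f64 {
    let (ia, _) = split_initial(a);
    let (ib, _) = split_initial(b);
    if ia == ib {
        0.0
    } else if family(ia) != 0 && family(ia) == family(ib) {
        0.5
    } else {
        1.0
    }
}

fn main() {
    for (a, b) in [("zhang", "zang"), ("zhang", "chang"), ("chen", "chen")] {
        println!("{} vs {}: {}", a, b, initial_distance(a, b));
    }
}
```

A full scorer would also compare finals and tones; this sketch covers initials only.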
Visual Similarity
Stroke sequence comparison catches OCR and handwriting errors — characters that look nearly identical but sound completely different.
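One simple way to compare stroke sequences is edit distance over stroke types. The sketch below assumes a small hypothetical `Stroke` enum and hand-entered stroke orders for 日/目 and 大/太. It is an illustration of the technique, not Dataline's implementation or stroke data.

```rust
/// A few basic stroke types (hypothetical, simplified set).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Stroke {
    Heng,    // horizontal
    Shu,     // vertical
    Pie,     // left-falling
    Na,      // right-falling
    Dian,    // dot
    HengZhe, // horizontal with a turn
}

/// Levenshtein distance over any slice of comparable items.
fn edit_distance<T: PartialEq>(a: &[T], b: &[T]) -> usize {
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, x) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, y) in b.iter().enumerate() {
            let substitute = prev[j] + if x == y { 0 } else { 1 };
            let delete = prev[j + 1] + 1;
            let insert = cur[j] + 1;
            cur.push(substitute.min(delete).min(insert));
        }
        prev = cur;
    }
    prev[b.len()]
}

/// Similarity in 0.0..=1.0: one minus distance over the longer length.
fn stroke_similarity(a: &[Stroke], b: &[Stroke]) -> f64 {
    let longest = a.len().max(b.len());
    if longest == 0 {
        return 1.0;
    }
    1.0 - edit_distance(a, b) as f64 / longest as f64
}

fn main() {
    // Hand-entered stroke orders for illustration.
    let ri = [Stroke::Shu, Stroke::HengZhe, Stroke::Heng, Stroke::Heng];
    let mu = [Stroke::Shu, Stroke::HengZhe, Stroke::Heng, Stroke::Heng, Stroke::Heng];
    let da = [Stroke::Heng, Stroke::Pie, Stroke::Na];
    let tai = [Stroke::Heng, Stroke::Pie, Stroke::Na, Stroke::Dian];
    println!("日 vs 目: {:.2}", stroke_similarity(&ri, &mu));
    println!("大 vs 太: {:.2}", stroke_similarity(&da, &tai));
}
```

日 (rì) and 目 (mù) differ by a single stroke, so they score high here even though a phonetic comparison would treat them as unrelated.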
Simplified ↔ Traditional
Automatic normalization across character forms. 陳 and 陈 are recognized as the same entity without converting to phonetics first.
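A minimal sketch of character-form normalization, assuming a tiny hand-written mapping of three surname characters. A real table would cover thousands of characters and handle one-to-many cases; the function names here are hypothetical and not Dataline's API.

```rust
/// Map a Traditional character to its Simplified form.
/// Illustrative three-entry table only; unknown characters pass through.
fn to_simplified(c: char) -> char {
    match c {
        '陳' => '陈',
        '張' => '张',
        '劉' => '刘',
        other => other,
    }
}

/// Normalize every character of a name to Simplified form.
fn normalize(name: &str) -> String {
    name.chars().map(to_simplified).collect()
}

/// Two names match on this signal if they are equal after normalization.
fn same_after_normalization(a: &str, b: &str) -> bool {
    normalize(a) == normalize(b)
}

fn main() {
    println!("{}", normalize("陳大文"));
    println!("{}", same_after_normalization("陳大文", "陈大文"));
}
```

No romanization is involved at any point, so the comparison cannot be confused by characters that merely share a pronunciation.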
Smart Blocking
Blocking keys (first character, phonetic key, address district) restrict comparisons to records that share a key, cutting the O(n²) all-pairs comparison to roughly linear work when blocks stay small. Scales to 10M+ records on a single machine.
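The sketch below shows blocking on one key, the first character of a name, using a standard `HashMap`. The function names and sample names are assumptions for illustration. Only records that land in the same block become candidate pairs.

```rust
use std::collections::HashMap;

/// Group record indices by the first character of the name.
/// Records with an empty name are skipped.
fn block_by_first_char(names: &[&str]) -> HashMap<char, Vec<usize>> {
    let mut blocks: HashMap<char, Vec<usize>> = HashMap::new();
    for (i, name) in names.iter().enumerate() {
        if let Some(first) = name.chars().next() {
            blocks.entry(first).or_insert_with(Vec::new).push(i);
        }
    }
    blocks
}

/// Emit every pair of indices that share a block, sorted for stable output.
fn candidate_pairs(blocks: &HashMap<char, Vec<usize>>) -> Vec<(usize, usize)> {
    let mut pairs = Vec::new();
    for members in blocks.values() {
        for (k, &a) in members.iter().enumerate() {
            for &b in &members[k + 1..] {
                pairs.push((a, b));
            }
        }
    }
    pairs.sort();
    pairs
}

fn main() {
    let names = ["陈大文", "陈小明", "李雷", "李梅", "王芳"];
    let blocks = block_by_first_char(&names);
    let pairs = candidate_pairs(&blocks);
    let all_pairs = names.len() * (names.len() - 1) / 2;
    println!("{} candidate pairs instead of {}", pairs.len(), all_pairs);
}
```

Five records would need 10 comparisons without blocking; with this key only the two same-surname pairs are compared. In practice several keys are usually combined so that a typo in one key does not hide a true match.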
Full MDM Pipeline
Tokenize → Block → Compare → Cluster → Survive. Declarative per-field survivorship rules build golden records from matched groups.
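As an illustration of the final Survive step, the sketch below applies a per-field rule to a matched cluster to build one golden record. The `Record`, `Rule`, and `Rules` types, the two rules shown, and the sample data are all hypothetical; Dataline's actual rule syntax is not shown in this text.

```rust
/// A matched record (hypothetical shape). `updated` is a YYYYMMDD date.
#[derive(Debug, Clone, PartialEq)]
struct Record {
    name: String,
    phone: String,
    updated: u32,
}

/// Which value survives for a field.
#[derive(Clone, Copy)]
enum Rule {
    MostRecent, // take the value from the most recently updated record
    Longest,    // take the value with the most characters
}

/// One rule per field: the "declarative" part.
struct Rules {
    name: Rule,
    phone: Rule,
}

fn name_of(r: &Record) -> &str { &r.name }
fn phone_of(r: &Record) -> &str { &r.phone }

/// Pick the surviving value of one field from a cluster.
fn pick<'a>(cluster: &'a [Record], rule: Rule, field: fn(&Record) -> &str) -> &'a str {
    let winner = match rule {
        Rule::MostRecent => cluster.iter().max_by_key(|r| r.updated),
        Rule::Longest => cluster.iter().max_by_key(|r| field(*r).chars().count()),
    };
    match winner {
        Some(r) => field(r),
        None => "", // empty cluster
    }
}

/// Build the golden record by applying each field's rule.
fn golden_record(cluster: &[Record], rules: &Rules) -> Record {
    Record {
        name: pick(cluster, rules.name, name_of).to_string(),
        phone: pick(cluster, rules.phone, phone_of).to_string(),
        updated: cluster.iter().map(|r| r.updated).max().unwrap_or(0),
    }
}

fn main() {
    // Made-up sample data.
    let cluster = vec![
        Record { name: "陈大文".into(), phone: "2345 6789".into(), updated: 20210105 },
        Record { name: "陈文".into(), phone: "9876 5432".into(), updated: 20230310 },
    ];
    let rules = Rules { name: Rule::Longest, phone: Rule::MostRecent };
    println!("{:?}", golden_record(&cluster, &rules));
}
```

Here the fuller name survives from the older record while the phone number comes from the newer one, which is the kind of mixed outcome per-field rules are meant to allow.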
Built in Rust. Five commands to try it.
Free and open source under Apache 2.0. No accounts, no API keys, no dependencies beyond Cargo.
git clone https://github.com/digital-rain-tech/dataline.git
cd dataline
cargo build
cargo test
cargo bench