Open Source Entity Resolution

Match CJK names that other tools miss

陳大文, Chan Tai Man, and 陈大文 are the same person. Phonetic-only engines can't see that. Dataline matches across scripts, romanizations, and character forms using three independent signals — phonetic, visual, and normalization — so real matches don't slip through.


Why not just transliterate and match phonetically?

Because collapsing CJK to Latin loses information at every stage. Pinyin is many-to-one. NYSIIS merges distinct consonants. Tones disappear. OCR errors become invisible. Dataline scores three signals independently, so one weak dimension can't mask a strong match.
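The many-to-one problem is easy to see concretely. 张 and 章 are distinct surnames that share the romanization "zhang": once transliterated, they can no longer be told apart. A minimal sketch (the lookup table here is a toy for illustration, not a real pinyin engine):

```rust
// Toy romanization table, for illustration only: two distinct
// surnames collapse to the same pinyin string.
fn pinyin(c: char) -> &'static str {
    match c {
        '张' | '章' => "zhang",
        '陈' => "chen",
        _ => "?",
    }
}

fn main() {
    // Different characters, identical romanization — the distinction
    // is gone before matching even starts.
    println!("{}", pinyin('张') == pinyin('章')); // true
}
```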

Multi-Signal Matching

Three independent signals per character pair — phonetic, visual, and normalization — combined after scoring. A high visual match isn't diluted by a low phonetic score.
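One way to realize "not diluted" is to let the strongest signal win rather than averaging. This is a sketch of that idea, not Dataline's actual scoring function:

```rust
// Illustrative combination rule (an assumption, not Dataline's API):
// take the maximum of the three per-signal scores, so a weak phonetic
// score cannot drag down a strong visual match.
fn combine(phonetic: f64, visual: f64, normalization: f64) -> f64 {
    phonetic.max(visual).max(normalization)
}

fn main() {
    // An OCR confusion pair: near-identical glyphs, unrelated sounds.
    println!("{}", combine(0.10, 0.95, 0.00)); // the visual signal carries the match
}
```

An averaging rule would score this pair at 0.35 and likely reject it; keeping the signals independent preserves the strong visual evidence.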

Phonetic Distance

Pinyin and Jyutping distance scoring that preserves consonant distinctions (zh/z/j, ch/c/q) that NYSIIS collapses into a single bucket.
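The key property is that near pairs stay near but never identical. A sketch of an initial-distance table with hypothetical weights (not Dataline's real tables):

```rust
// Hypothetical distance between Mandarin initials: the retroflex /
// dental / palatal sibilant pairs are close but distinct, where
// NYSIIS would collapse all of them into one code. Weights are
// illustrative assumptions.
fn initial_distance(a: &str, b: &str) -> f64 {
    if a == b {
        return 0.0;
    }
    const NEAR: [(&str, &str); 6] = [
        ("zh", "z"), ("zh", "j"), ("z", "j"),
        ("ch", "c"), ("ch", "q"), ("c", "q"),
    ];
    if NEAR.iter().any(|&(x, y)| (a == x && b == y) || (a == y && b == x)) {
        0.3 // near, but never free: the distinction survives scoring
    } else {
        1.0
    }
}

fn main() {
    println!("{}", initial_distance("zh", "z")); // close pair
    println!("{}", initial_distance("zh", "m")); // unrelated initials
}
```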

Visual Similarity

Stroke sequence comparison catches OCR and handwriting errors — characters that look nearly identical but sound completely different.
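One plausible way to score this is normalized edit distance over stroke sequences. The stroke codes below are made-up bytes for illustration; a real system uses per-character stroke data:

```rust
// Standard Levenshtein distance over stroke codes.
fn levenshtein(a: &[u8], b: &[u8]) -> usize {
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, &cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            let v = (prev[j] + cost).min(prev[j + 1] + 1).min(cur[j] + 1);
            cur.push(v);
        }
        prev = cur;
    }
    prev[b.len()]
}

// Similarity in [0, 1]: 1.0 means identical stroke sequences.
fn visual_similarity(a: &[u8], b: &[u8]) -> f64 {
    let longest = a.len().max(b.len());
    if longest == 0 {
        return 1.0;
    }
    1.0 - levenshtein(a, b) as f64 / longest as f64
}

fn main() {
    // One differing stroke out of five: visually very close,
    // regardless of how the two characters are pronounced.
    println!("{}", visual_similarity(&[1, 2, 3, 4, 5], &[1, 2, 3, 4, 6])); // 0.8
}
```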

Simplified ↔ Traditional

Automatic normalization across character forms. 陳 and 陈 are recognized as the same entity without converting to phonetics first.
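Mechanically, this is a variant-character map applied before comparison. A three-entry map for illustration; a real table covers thousands of pairs:

```rust
use std::collections::HashMap;

// Tiny illustrative traditional→simplified map; a real system ships
// a full variant table.
fn variant_map() -> HashMap<char, char> {
    HashMap::from([('陳', '陈'), ('趙', '赵'), ('馬', '马')])
}

// Map every character to its simplified form, leaving the rest alone.
fn normalize(name: &str, map: &HashMap<char, char>) -> String {
    name.chars().map(|c| *map.get(&c).unwrap_or(&c)).collect()
}

fn main() {
    let map = variant_map();
    // 陳大文 and 陈大文 normalize to the same form — no phonetic
    // round-trip required.
    println!("{}", normalize("陳大文", &map) == normalize("陈大文", &map)); // true
}
```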

Smart Blocking

Blocking keys — first character, phonetic key, address district — cut the O(n²) pair space down to near-linear work. Scales to 10M+ records on a single machine.
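The mechanic is simple: candidate pairs are generated only within a block, so most of the pair space is never touched. A first-character-only sketch (Dataline's real keys also include phonetic and address-district blocking):

```rust
use std::collections::HashMap;

// Group names by their first character; only names sharing a block
// will ever be compared.
fn block_by_first_char<'a>(names: &[&'a str]) -> HashMap<char, Vec<&'a str>> {
    let mut blocks: HashMap<char, Vec<&'a str>> = HashMap::new();
    for name in names {
        if let Some(first) = name.chars().next() {
            blocks.entry(first).or_default().push(name);
        }
    }
    blocks
}

fn main() {
    let names = ["陈大文", "陈小明", "王芳", "王伟"];
    let blocks = block_by_first_char(&names);
    // Two blocks of two → 2 candidate pairs instead of the full 6.
    println!("{}", blocks.len()); // 2
}
```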

Full MDM Pipeline

Tokenize → Block → Compare → Cluster → Survive. Declarative per-field survivorship rules build golden records from matched groups.
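The survivorship step can be sketched as follows: each field of the golden record is taken from whichever group member wins under that field's rule. The rule names and record fields here are hypothetical, not Dataline's actual configuration:

```rust
// Hypothetical record shape for illustration.
#[derive(Clone, Debug)]
struct Record {
    name: String,
    phone: String,
    updated_at: u64,
}

// Per-field survivorship: name by "longest value", phone by "most
// recently updated". Each field is resolved independently.
fn golden(group: &[Record]) -> Record {
    let name = group
        .iter()
        .max_by_key(|r| r.name.chars().count())
        .expect("non-empty group")
        .name
        .clone();
    let freshest = group.iter().max_by_key(|r| r.updated_at).unwrap();
    Record {
        name,
        phone: freshest.phone.clone(),
        updated_at: freshest.updated_at,
    }
}

fn main() {
    let group = vec![
        Record { name: "陈大文".into(), phone: "9123".into(), updated_at: 2 },
        Record { name: "Chan Tai Man".into(), phone: "9999".into(), updated_at: 5 },
    ];
    let g = golden(&group);
    // The golden record mixes fields from different source records.
    println!("{} {}", g.name, g.phone);
}
```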

Built in Rust. Five commands to try it.

Free and open source under Apache 2.0. No accounts, no API keys, no dependencies beyond Cargo.

Quick Start
git clone https://github.com/digital-rain-tech/dataline.git
cd dataline
cargo build
cargo test
cargo bench