Open Source Entity Resolution

Match CJK names that other tools miss

陳大文, Chan Tai Man, and 陈大文 are the same person. Phonetic-only engines can't see that. Dataline matches across scripts, romanizations, and character forms using three independent signals — phonetic, visual, and normalization — so real matches don't slip through.

Why not just transliterate and match phonetically?

Because collapsing CJK to Latin loses information at every stage. Pinyin is many-to-one. NYSIIS merges distinct consonants. Tones disappear. OCR errors become invisible. Dataline scores three signals independently, so one weak dimension can't mask a strong match.

Multi-Signal Matching

Three independent signals per character pair — phonetic, visual, and normalization — combined after scoring. A high visual match isn't diluted by a low phonetic score.
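As a rough illustration of why combining after scoring matters, here is a minimal sketch. The `SignalScores` type and both combination functions are hypothetical, not Dataline's actual API, and taking the maximum is only one plausible way to keep a strong signal from being diluted.

```rust
/// Per-pair scores, each assumed to lie in 0.0..=1.0 (1.0 = identical).
/// Hypothetical type for illustration; not Dataline's real API.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct SignalScores {
    pub phonetic: f64,
    pub visual: f64,
    pub normalization: f64,
}

impl SignalScores {
    /// Keep the strongest signal, so a weak dimension cannot mask a strong one.
    pub fn combined_max(&self) -> f64 {
        self.phonetic.max(self.visual).max(self.normalization)
    }

    /// Plain average, shown only for contrast: a single high score gets diluted.
    pub fn combined_mean(&self) -> f64 {
        (self.phonetic + self.visual + self.normalization) / 3.0
    }
}
```

For a pair that looks alike but sounds different (say phonetic 0.1, visual 0.95, normalization 0.0), `combined_max` keeps the visual evidence intact, while `combined_mean` drags it below any reasonable match threshold.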

Phonetic Distance

Pinyin and Jyutping distance scoring that preserves consonant distinctions (zh/z/j, ch/c/q) that NYSIIS collapses into a single bucket.
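To make "preserves consonant distinctions" concrete, the sketch below scores Pinyin initials on a graded scale instead of hashing them into one bucket. The groupings come from the examples above; the function names and the 0.3 weight are assumptions chosen for illustration.

```rust
/// Map an initial to a confusion group. The groups mirror the examples in the
/// text (zh/z/j and ch/c/q); anything else is treated as ungrouped.
fn confusion_group(initial: &str) -> Option<u8> {
    match initial {
        "zh" | "z" | "j" => Some(0),
        "ch" | "c" | "q" => Some(1),
        _ => None,
    }
}

/// Graded distance between two Pinyin initials:
/// 0.0 = identical, 0.3 = same confusion group (illustrative weight),
/// 1.0 = unrelated.
pub fn initial_distance(a: &str, b: &str) -> f64 {
    if a == b {
        0.0
    } else if confusion_group(a).is_some() && confusion_group(a) == confusion_group(b) {
        0.3
    } else {
        1.0
    }
}
```

The point of the graded scale is that `zh` and `z` come out close but not equal, so the distinction survives into the final score rather than being erased up front.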

Visual Similarity

Stroke sequence comparison catches OCR and handwriting errors — characters that look nearly identical but sound completely different.
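One common way to compare stroke sequences is edit distance. The sketch below assumes each character has already been decomposed into a sequence of numeric stroke codes; the coding scheme, the function names, and the normalization to a 0..1 score are all illustrative assumptions rather than Dataline's documented method.

```rust
/// Levenshtein distance between two stroke-code sequences.
pub fn stroke_edit_distance(a: &[u8], b: &[u8]) -> usize {
    // prev[j] holds the distance between the processed prefix of `a` and b[..j].
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, &stroke_a) in a.iter().enumerate() {
        let mut curr = vec![i + 1];
        for (j, &stroke_b) in b.iter().enumerate() {
            let substitution = prev[j] + usize::from(stroke_a != stroke_b);
            let deletion = prev[j + 1] + 1;
            let insertion = curr[j] + 1;
            curr.push(substitution.min(deletion).min(insertion));
        }
        prev = curr;
    }
    prev[b.len()]
}

/// Similarity in 0.0..=1.0, where 1.0 means identical stroke sequences.
pub fn visual_similarity(a: &[u8], b: &[u8]) -> f64 {
    let longest = a.len().max(b.len());
    if longest == 0 {
        return 1.0;
    }
    1.0 - stroke_edit_distance(a, b) as f64 / longest as f64
}
```

Under an illustrative five-class stroke coding, 日 might be `[2, 5, 1, 1]` and 目 `[2, 5, 1, 1, 1]`: one stroke apart, so they score as visually close even though they are pronounced differently.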

Simplified ↔ Traditional

Automatic normalization across character forms. 陳 and 陈 are recognized as the same entity without converting to phonetics first.
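A minimal sketch of character-form normalization, assuming a lookup table from traditional to simplified forms. The three-entry table and the function names are placeholders; a real table would cover thousands of characters.

```rust
use std::collections::HashMap;

/// Tiny illustrative table of traditional -> simplified forms.
pub fn traditional_to_simplified() -> HashMap<char, char> {
    HashMap::from([('陳', '陈'), ('張', '张'), ('劉', '刘')])
}

/// Replace every character that has a table entry; leave the rest unchanged.
pub fn normalize(name: &str, table: &HashMap<char, char>) -> String {
    name.chars()
        .map(|c| table.get(&c).copied().unwrap_or(c))
        .collect()
}
```

With this table, 陳大文 and 陈大文 normalize to the same string, so they compare as equal without any phonetic step. Normalizing toward simplified forms is a deliberate choice in this sketch: the reverse direction is one-to-many for some characters.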

Smart Blocking

Blocking keys (first character, phonetic key, address district) restrict comparisons to records that share a key, cutting the O(n²) all-pairs comparison to near-linear work when blocks stay small. Scales to 10M+ records on a single machine.
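The idea can be sketched with the simplest key, the first character of the name. Function names here are hypothetical, and the sketch assumes names have already been normalized to one character form so that variants of the same surname land in the same block.

```rust
use std::collections::HashMap;

/// Group record indices by the first character of each name.
pub fn block_by_first_char(names: &[&str]) -> HashMap<char, Vec<usize>> {
    let mut blocks: HashMap<char, Vec<usize>> = HashMap::new();
    for (idx, name) in names.iter().enumerate() {
        if let Some(first) = name.chars().next() {
            blocks.entry(first).or_default().push(idx);
        }
    }
    blocks
}

/// Generate candidate pairs within each block only, sorted for stable output.
pub fn candidate_pairs(blocks: &HashMap<char, Vec<usize>>) -> Vec<(usize, usize)> {
    let mut pairs = Vec::new();
    for members in blocks.values() {
        for (i, &a) in members.iter().enumerate() {
            for &b in &members[i + 1..] {
                pairs.push((a, b));
            }
        }
    }
    pairs.sort_unstable();
    pairs
}
```

For five names with surnames 陈, 陈, 李, 王, 李, this yields two candidate pairs instead of the ten that an all-pairs comparison would need. The savings grow with the number of distinct keys, and shrink when one key dominates.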

Full MDM Pipeline

Tokenize → Block → Compare → Cluster → Survive. Declarative per-field survivorship rules build golden records from matched groups.
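To show what declarative per-field survivorship could look like, here is a minimal sketch of the final "Survive" step. The `Record` fields, the `Rule` variants, and the function names are all hypothetical and are not taken from Dataline's configuration format.

```rust
/// Hypothetical record shape for illustration.
#[derive(Debug, Clone, PartialEq)]
pub struct Record {
    pub name: String,
    pub phone: Option<String>,
    pub updated_at: u64,
}

/// Hypothetical per-field survivorship rules.
#[derive(Debug, Clone, Copy)]
pub enum Rule {
    /// Take the value from the most recently updated record that has one.
    MostRecent,
    /// Take the longest value (by character count).
    LongestValue,
}

/// One rule per field: the declarative part of the sketch.
pub struct Survivorship {
    pub name: Rule,
    pub phone: Rule,
}

/// Apply one rule to one field across a cluster of matched records.
fn pick<'a>(
    cluster: &'a [Record],
    rule: Rule,
    field: impl Fn(&'a Record) -> Option<&'a str>,
) -> Option<&'a str> {
    let candidates = cluster.iter().filter_map(|r| field(r).map(|v| (r, v)));
    match rule {
        Rule::MostRecent => candidates.max_by_key(|(r, _)| r.updated_at).map(|(_, v)| v),
        Rule::LongestValue => candidates.max_by_key(|(_, v)| v.chars().count()).map(|(_, v)| v),
    }
}

/// Build a golden record from a cluster; returns None for an empty cluster.
pub fn golden_record(cluster: &[Record], rules: &Survivorship) -> Option<Record> {
    let name = pick(cluster, rules.name, |r| Some(r.name.as_str()))?.to_string();
    let phone = pick(cluster, rules.phone, |r| r.phone.as_deref()).map(str::to_string);
    let updated_at = cluster.iter().map(|r| r.updated_at).max()?;
    Some(Record { name, phone, updated_at })
}
```

Keeping the rules in a plain data structure means the merge logic stays generic: changing which value wins for a field is a one-line change to `Survivorship`, not a change to the merge code.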

Benchmarked, not just claimed

Criterion benchmarks on real CJK data. All numbers reproducible with cargo bench.

597K multi-signal comparisons per second

8.3M simplified↔traditional normalizations per second

~63s to match 10M records on 16 cores

10M record scaling (first-character surname blocking)
Cores   Time      Hardware
1       ~17 min   Single-threaded
16      ~63 sec   Commodity server
64      ~16 sec   Cloud instance

Multi-signal CJK comparison runs at ~1.7µs per pair, parallelized with Rayon work-stealing. These timings assume an even surname distribution. Real CJK data is skewed (common surnames like 陳 create larger blocks), so wall-clock times may vary. Composite blocking keys address this.
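The published figures can be sanity-checked with two lines of arithmetic. The sketch below assumes ideal linear scaling across cores, which the table's numbers appear consistent with; the function names are hypothetical helpers, not part of Dataline.

```rust
/// Per-pair latency in microseconds implied by a throughput figure.
pub fn micros_per_pair(pairs_per_sec: f64) -> f64 {
    1_000_000.0 / pairs_per_sec
}

/// Ideal wall-clock seconds if the work divides evenly across cores.
/// Real runs will be slower when block sizes are skewed.
pub fn ideal_parallel_secs(single_thread_secs: f64, cores: u32) -> f64 {
    single_thread_secs / f64::from(cores)
}
```

At 597K comparisons per second, `micros_per_pair` gives roughly 1.7µs. Dividing the ~17 minute single-threaded run by 16 and by 64 cores gives about 64 and 16 seconds, in line with the table.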

Built in Rust. Five commands to try it.

Free and open source under Apache 2.0. No accounts, no API keys, no dependencies beyond Cargo.

Quick Start
git clone https://github.com/digital-rain-tech/dataline.git
cd dataline
cargo build
cargo test
cargo bench