April 2026

Why Dataline is 10-100x faster than traditional MDM engines

Here is what a traditional master data management engine does when you ask it whether 陳大文 and Chan Tai Man are the same person: it converts the characters to pinyin ("chen"), runs NYSIIS to get a phonetic code, then string-compares the result against a phonetic code derived from the Latin name. Each step allocates a new string. Each step applies a rule engine. Each step throws away information.

Here is what Dataline does: it looks up stroke sequences in a HashMap, computes Euclidean distance on pre-indexed phoneme coordinates, checks a normalization dictionary, and returns three scores. Total time: 1.7 microseconds.

The 10-100x claim is not the result of some breakthrough algorithm or exotic hardware. It comes from four boring, structural decisions that compound on each other in the hot path. Here they are.

1. Direct character operations, no transliteration

The conventional pipeline is: character → romanization → phonetic code → string comparison. Three transformations, three allocations, three opportunities to lose information.

Dataline skips all intermediate string representations:

  • Visual signal: HashMap lookup for stroke sequences, then Levenshtein distance directly on the stroke arrays. The stroke sequences are borrowed from the dictionary (not copied). No strings allocated.
  • Phonetic signal: Array index for pinyin, then Euclidean distance on 2D phoneme coordinates. This is the DimSim approach: each syllable maps to a point in a learned 2D space where phonetically similar syllables are geometrically close. The comparison is pure arithmetic. No string comparison at all.
  • Normalization: Single HashMap lookup. Returns a boolean.
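As a minimal sketch of the visual signal (the stroke sequences below are made up for illustration; real ones come from the stroke dictionary), the comparison is just Levenshtein over borrowed byte slices — no string ever gets built:

```rust
use std::collections::HashMap;

// Levenshtein distance computed directly on borrowed stroke slices:
// no intermediate strings, no copies of the stroke sequences.
fn levenshtein(a: &[u8], b: &[u8]) -> usize {
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr = vec![0usize; b.len() + 1];
    for (i, &ca) in a.iter().enumerate() {
        curr[0] = i + 1;
        for (j, &cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            curr[j + 1] = (prev[j] + cost).min(prev[j + 1] + 1).min(curr[j] + 1);
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}

fn main() {
    // Hypothetical stroke dictionary: char -> stroke-type sequence.
    // These sequences are illustrative, not real StrokeDict entries.
    let mut strokes: HashMap<char, Vec<u8>> = HashMap::new();
    strokes.insert('陳', vec![5, 2, 1, 1, 2, 5, 1, 1, 2, 3, 4]);
    strokes.insert('陣', vec![5, 2, 1, 2, 5, 1, 1, 2, 1, 2]);

    // Borrow the sequences from the map; the distance runs on &[u8].
    let a = strokes.get(&'陳').unwrap();
    let b = strokes.get(&'陣').unwrap();
    println!("visual distance: {}", levenshtein(a, b));
}
```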

2. Compiled lookup tables, not rule engines

MDM engines love rule engines. Character transliteration involves cultural rules, special cases, context-dependent transformations. Each rule is a branch the CPU has to evaluate. Branch mispredictions are expensive. Rule engines are, by construction, branch-misprediction machines.

Dataline replaces rule chains with pre-compiled data:

  Operation              Traditional                  Dataline
  Char → pronunciation   Rule-based transliteration   rust-pinyin compiled lookup
  Pronunciation → key    NYSIIS / Soundex rules       DimSim 2D coords (const arrays)
  Char → visual form     Radical decomposition        StrokeDict (20,901 entries)
  S↔T normalization      OpenCC rule engine           NormDict (8K entries)

HashMap lookups are O(1). Rule engines are O(rules). When your character set has 20,901 entries, the difference is not theoretical.
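A hedged sketch of what "compiled data instead of rules" looks like for the phonetic key. The syllable coordinates below are invented for illustration, and the linear scan stands in for whatever index the real build uses — the point is that the comparison is pure arithmetic, with no rule branches:

```rust
// Hypothetical DimSim-style table: each syllable maps to a point in a
// learned 2D space. The coordinates here are illustrative only.
const SYLLABLES: [(&str, [f32; 2]); 3] = [
    ("chen", [0.42, 0.61]),
    ("chan", [0.45, 0.63]),
    ("wang", [0.91, 0.12]),
];

// Linear scan stands in for an array index / compiled lookup.
fn coords(syllable: &str) -> Option<[f32; 2]> {
    SYLLABLES.iter().find(|(s, _)| *s == syllable).map(|(_, c)| *c)
}

// Pure arithmetic: no rule engine, no string comparison.
fn phonetic_distance(a: &str, b: &str) -> Option<f32> {
    let (ca, cb) = (coords(a)?, coords(b)?);
    Some(((ca[0] - cb[0]).powi(2) + (ca[1] - cb[1]).powi(2)).sqrt())
}

fn main() {
    println!("{:?}", phonetic_distance("chen", "chan"));
}
```

Phonetically close syllables sit close in the plane, so "chen" vs "chan" yields a small distance and "chen" vs "wang" a large one — no branches evaluated along the way.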

3. Zero-allocation scoring

When you are doing 500 million pairwise comparisons, the cost of the comparison function dominates everything. And the single most expensive thing a comparison function can do in a tight loop is allocate heap memory.

The traditional pipeline allocates at every stage: the original character becomes a pinyin string (allocation), which becomes an NYSIIS code (allocation), which gets compared in a buffer (allocation). Three allocations per character per comparison. For a 3-character name, that is 9 allocations per pair. At 500 million pairs, that is 4.5 billion allocations.

Dataline's scoring path allocates almost nothing:

  • Stroke sequences are borrowed from the HashMap. No copy.
  • DimSim coordinates live in const arrays compiled into the binary. No allocation.
  • Pinyin strings stay on the stack.
  • The only remaining heap allocation (the pinyin String itself) goes away when you pre-compute it during record ingestion, which is what we plan to do next.
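To make the shape of the hot path concrete, here is an illustrative sketch: everything derived is computed once at ingestion, and the comparison only borrows. The field names, weights, and the stroke-overlap proxy are assumptions for the sketch, not Dataline's actual scoring.

```rust
// Illustrative pre-indexed record: every derived value is computed once
// at ingestion, so the hot loop never allocates.
struct Indexed {
    strokes: Vec<u8>,  // from a StrokeDict-style lookup, done once
    phoneme: [f32; 2], // DimSim-style coordinate, done once
}

// Hot-path comparison: borrows both records, pure arithmetic, zero heap.
fn score(a: &Indexed, b: &Indexed) -> f32 {
    let dx = a.phoneme[0] - b.phoneme[0];
    let dy = a.phoneme[1] - b.phoneme[1];
    let phonetic = (dx * dx + dy * dy).sqrt();
    // Cheap positional stroke overlap as a visual proxy (illustrative).
    let same = a.strokes.iter().zip(&b.strokes).filter(|(x, y)| x == y).count();
    let visual = same as f32 / a.strokes.len().max(b.strokes.len()).max(1) as f32;
    0.5 * visual + 0.5 * (1.0 - phonetic.min(1.0))
}

fn main() {
    let a = Indexed { strokes: vec![1, 2, 5, 1], phoneme: [0.42, 0.61] };
    let b = Indexed { strokes: vec![1, 2, 5, 2], phoneme: [0.45, 0.63] };
    println!("score: {:.3}", score(&a, &b));
}
```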

Saving 100 nanoseconds per pair across 500 million pairs is 50 seconds of wall-clock time. That is not a micro-optimization. It is the difference between "run it in the nightly batch window" and "run it whenever you want."

4. Focused scope

Commercial MDM engines handle 200+ cultural naming conventions. Patronymics, compound surnames, generational suffixes, honorifics, transliterations across dozens of scripts. Each convention adds branching logic to the hot path, and the hot path is where you live when you are doing 500 million comparisons.

Dataline handles two script families: CJK and Latin. That is it. The script detector routes each pair to the right matcher:

  • Both CJK → Multi-signal matcher (1.7 µs/pair)
  • Both Latin → Jaro-Winkler (503 ns/pair)
  • Mixed → Romanization comparison (future work)
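The routing step can be sketched as follows. The detector here checks only the CJK Unified Ideographs block, a simplification of whatever the real detector covers:

```rust
#[derive(PartialEq)]
enum Script {
    Cjk,
    Latin,
}

// Minimal classifier: CJK Unified Ideographs only. A real detector
// would also cover the CJK extension blocks, kana, hangul, etc.
fn script_of(name: &str) -> Script {
    if name.chars().any(|c| ('\u{4E00}'..='\u{9FFF}').contains(&c)) {
        Script::Cjk
    } else {
        Script::Latin
    }
}

// Route each pair to the matcher named in the list above.
fn route(a: &str, b: &str) -> &'static str {
    match (script_of(a), script_of(b)) {
        (Script::Cjk, Script::Cjk) => "multi-signal",
        (Script::Latin, Script::Latin) => "jaro-winkler",
        _ => "romanization (future)",
    }
}

fn main() {
    println!("{}", route("陳大文", "Chan Tai Man"));
}
```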

For a typical Hong Kong dataset (roughly 60% CJK, 40% Latin), the blended rate on 16 cores works out to matching 10 million records in about 40 seconds.

This is a product decision, not a technical limitation. We are not building Informatica. If you need to match Icelandic patronymics against Arabic transliterations, we are not your tool — and pretending otherwise would make us slower at the thing we are actually good at.

Scaling: Rayon, not Spark (with a caveat)

Dataline uses Rayon's work-stealing thread pool. No cluster coordination. No serialization overhead. Just fork-join on a single machine. Here are the projections assuming uniform surname distribution:

  Records   1 core    16 cores   64 cores
  100K      10s       625ms      156ms
  1M        100s      6s         1.5s
  10M       17min     63s        16s
  100M+     170min    10min      2.5min

But there is a caveat, and it matters enough to be honest about: CJK surname distribution is extremely skewed.

First-character blocking on surnames means every 陳 (Chan) gets compared against every other 陳. In a Hong Kong dataset, Chan is one of the most common surnames. The 陳 block alone can contain a disproportionate share of all records, producing a single partition that dwarfs everything else. While the other 15 cores finish their smaller blocks and go idle, core 0 is still grinding through the Chan comparisons. Rayon's work-stealing helps at the sub-block level, but the fundamental imbalance remains: parallelism does not scale linearly when one block is 50x larger than the average.

This is not a theoretical concern. It is something we have seen in practice. The table above assumes even distribution. Real wall-clock times on skewed CJK data will be worse than the projections, dominated by the largest surname block.

There are two complementary fixes. The first is composite blocking keys — surname initial + district, or surname initial + phone prefix — which splits the 陳 block into many smaller, parallelizable partitions. This is one of the planned optimizations (see below), and it is as much about fixing the parallelism skew as it is about reducing total comparisons.

The second is a technique that Augustin Chan developed while building MDM engines at Informatica: time-boxed chunk splitting. Instead of letting a single thread grind through the entire 陳-vs-陳 block (which could take orders of magnitude longer than the average block), you estimate how many comparisons can be completed in a fixed time window — say, 5 minutes — and chop the N×N block into chunks of M rows × N columns, where M is sized so each chunk fits that window. Each chunk goes back onto the work queue. The other cores, which would otherwise be idle waiting for the Chan block to finish, pick up these chunks and process them in parallel.

The result is that your wall-clock time is bounded by the chunk size, not by the largest block. A 陳 partition that would take 30 minutes on one thread becomes six 5-minute chunks spread across available cores. This is not something Dataline implements today, but it is a well-proven pattern for exactly this class of skew problem, and it composes nicely with Rayon's existing work-stealing.
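A sketch of the chunking arithmetic, under assumed throughput numbers: estimate how many comparisons fit the time budget, then emit row ranges of the N×N block.

```rust
// Sketch of time-boxed chunk splitting. Given an n-row block where each
// row costs ~n comparisons, emit half-open row ranges sized so each
// chunk fits the time budget. Throughput numbers are illustrative.
fn chunk_rows(n: usize, comparisons_per_sec: u64, budget_secs: u64) -> Vec<(usize, usize)> {
    let per_chunk = (comparisons_per_sec * budget_secs) as usize; // comparisons a chunk may spend
    let rows_per_chunk = (per_chunk / n.max(1)).max(1);           // rows that fit the budget
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < n {
        let end = (start + rows_per_chunk).min(n);
        chunks.push((start, end)); // each range becomes an independent task
        start = end;
    }
    chunks
}

fn main() {
    // A 1M-row block at an assumed 500k comparisons/sec with a 300s budget.
    let chunks = chunk_rows(1_000_000, 500_000, 300);
    println!("{} chunks", chunks.len());
}
```

Each (start, end) range goes back on the work queue as its own task, so idle cores can steal chunks of the oversized block instead of waiting for one thread to finish it.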

The median large enterprise maintains 1-2 million customer records. Even with surname skew, Dataline handles that range on a single machine. The whole genre of "scaling entity resolution to billions with Spark" is solving a problem most MDM projects do not have.

Three optimizations we haven't done yet

The numbers above are on an unoptimized hot path. Three changes are planned that should compound nicely:

  • Pre-computed phonetic keys during record ingestion, instead of recomputing pinyin on every comparison. Expected: 2-3x faster on the phonetic signal.
  • Composite blocking keys (surname initial + district, or surname initial + phone prefix) instead of first-character-only blocking. Expected: 10-50x fewer candidate pairs, and critically, more even partition sizes for parallel scaling.
  • Time-boxed chunk splitting for oversized blocks. Estimate comparison throughput, chop blocks that exceed a time budget into fixed-size row-range chunks, return them to the work queue. Eliminates the long-tail parallelism problem on skewed surname distributions.
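As an illustration of the composite-key idea (the district attribute here is hypothetical, standing in for whatever second field a real dataset provides):

```rust
use std::collections::HashMap;

// Composite blocking key: surname initial + district. The `district`
// argument is a hypothetical attribute for illustration; a phone
// prefix would work the same way.
fn block_key(name: &str, district: &str) -> (char, String) {
    let initial = name.chars().next().unwrap_or('?');
    (initial, district.to_string())
}

fn main() {
    let records = [("陳大文", "Kowloon"), ("陳小明", "Sha Tin"), ("陳志強", "Kowloon")];
    let mut blocks: HashMap<(char, String), Vec<&str>> = HashMap::new();
    for (name, district) in records {
        blocks.entry(block_key(name, district)).or_default().push(name);
    }
    // First-character-only blocking puts all three 陳 records in one
    // block; the composite key splits them into two smaller partitions.
    println!("{} blocks", blocks.len());
}
```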

Combined, these could bring the 10M record time on 16 cores from 63 seconds to single-digit seconds — even on surname-skewed CJK datasets. At which point the bottleneck shifts to I/O, and the matching engine is no longer the interesting part of the pipeline.

Which is exactly where you want the matching engine to be: fast enough that you stop thinking about it.

All numbers are reproducible. Clone it, bench it, argue with us on GitHub.

git clone https://github.com/digital-rain-tech/dataline.git
cd dataline
cargo bench

Apache 2.0. Free forever.