Data Cleaning Benchmark: Dedup and Format 1M Rows
The first real benchmark for data cleaning tools. 1 million rows, 5 operations, 153 milliseconds. Full methodology and results.
Nobody Benchmarks Data Cleaning
Every data tool vendor claims to be "fast." Visit any landing page in the data cleaning space and you will find words like "lightning-fast processing," "blazing speed," and "instant results." What you will not find is a single number. No millisecond timings. No rows-per-second throughput. No methodology. No reproducible test.
This is strange. In every other performance-sensitive domain, benchmarks are the norm. Database vendors publish TPC benchmarks. Web frameworks have the TechEmpower benchmarks. Programming languages have the Computer Language Benchmarks Game. But data cleaning? The industry runs on vibes.
We decided to change that. We ran a comprehensive benchmark of NoSheet's data cleaning pipeline with a real dataset at real scale, measured every step, and published the results. No asterisks, no "up to" qualifiers, no best-case scenarios. These are median timings from production hardware under realistic conditions.
Methodology
The benchmark dataset consists of 1 million rows with 5 columns: name, phone, email, date, and score. The data is intentionally messy — it was designed to represent the kind of real-world data quality issues that businesses encounter daily.
Names appear in random casing (ALL CAPS, all lowercase, mIxEd CaSe) with extra whitespace. Phone numbers use twelve different formats including parentheses, dots, dashes, spaces, and various country code prefixes. Email addresses include intentional domain typos (gmial.com, yaho.com, hotmal.com, outloo.com). Dates mix MM/DD/YYYY, DD/MM/YYYY, YYYY-MM-DD, and several other regional formats. Approximately 15% of rows are intentional duplicates with slight variations in formatting.
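To make the data description concrete, here is a minimal sketch of a generator that produces rows with the same kinds of defects. The format lists, typo domains, and the `messy_row` helper are illustrative assumptions, not NoSheet's actual test-data generator:

```python
import random

# Illustrative defect pools; NoSheet's real dataset uses twelve phone
# formats and more date variants than shown here.
CASINGS = [str.upper, str.lower, str.title, lambda s: s.swapcase()]
PHONE_FORMATS = ["(555) 123-4567", "555.123.4567", "555-123-4567",
                 "+1 555 123 4567", "15551234567"]
TYPO_DOMAINS = ["gmial.com", "yaho.com", "hotmal.com", "outloo.com", "gmail.com"]
DATE_FORMATS = ["03/14/2024", "14/03/2024", "2024-03-14"]

def messy_row(rng: random.Random) -> dict:
    casing = rng.choice(CASINGS)
    return {
        "name": "  " + casing("jane mcdonald") + " ",  # random casing, stray whitespace
        "phone": rng.choice(PHONE_FORMATS),
        "email": "jane@" + rng.choice(TYPO_DOMAINS),
        "date": rng.choice(DATE_FORMATS),
        "score": rng.randint(0, 100),
    }

rows = [messy_row(random.Random(i)) for i in range(1000)]
```

Scaling this to 1 million rows and injecting ~15% near-duplicate copies gives a dataset with the same defect profile as the benchmark input.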
Five cleaning operations are applied in a single pipeline:
- Name normalization — Convert to proper title case, trim whitespace, handle special cases (McDonald, O'Brien, de la Cruz)
- Phone formatting — Convert all variations to E.164 international standard (+15551234567)
- Email typo correction — Fix common domain misspellings, validate syntax, flag invalid addresses
- Deduplication — Remove duplicate rows based on normalized key columns with global state tracking
- Date standardization — Convert all date formats to ISO 8601 (YYYY-MM-DD)
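The five operations above can be sketched as a single pass over the rows. This is a simplified illustration of the pipeline shape, not NoSheet's implementation; the special-case handling, typo table, and dedup key choice are assumptions for the example:

```python
import re
from datetime import datetime

# Assumed typo table; the real pipeline matches against a larger database.
TYPO_DOMAINS = {"gmial.com": "gmail.com", "yaho.com": "yahoo.com",
                "hotmal.com": "hotmail.com", "outloo.com": "outlook.com"}

def normalize_name(name: str) -> str:
    # Title-case with crude handling for Mc-/O'-style surnames.
    out = []
    for p in name.strip().split():
        w = p.capitalize()
        if w.startswith("Mc") and len(w) > 2:
            w = "Mc" + w[2:].capitalize()
        if "'" in w:
            head, _, tail = w.partition("'")
            w = head + "'" + tail.capitalize()
        out.append(w)
    return " ".join(out)

def format_phone_e164(phone: str) -> str:
    digits = re.sub(r"\D", "", phone)
    if len(digits) == 10:          # assume US number without country code
        digits = "1" + digits
    return "+" + digits

def fix_email(email: str) -> str:
    local, _, domain = email.lower().partition("@")
    return local + "@" + TYPO_DOMAINS.get(domain, domain)

def standardize_date(s: str) -> str:
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%d/%m/%Y"):
        try:
            return datetime.strptime(s, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return s  # leave unparseable dates untouched for flagging

def clean(rows):
    seen, out = set(), []
    for r in rows:
        c = {"name": normalize_name(r["name"]),
             "phone": format_phone_e164(r["phone"]),
             "email": fix_email(r["email"]),
             "date": standardize_date(r["date"]),
             "score": r["score"]}
        key = (c["name"], c["email"])  # dedup on normalized key columns
        if key not in seen:
            seen.add(key)
            out.append(c)
    return out
```

Note that deduplication runs on the *normalized* values, which is why two rows like `"  JANE MCDONALD "` and `"jane mcdonald"` collapse into one.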
Timings were measured on production hardware with 16 CPU cores available. Each benchmark was run multiple times and the median was reported. Warm-up runs were excluded. Garbage collection pauses, if any, were included in the reported times. These numbers represent what users actually experience, not theoretical throughput.
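A harness following this methodology (discarded warm-up runs, median of the remaining samples, GC pauses left in) looks roughly like the sketch below. This is an illustrative harness, not NoSheet's internal benchmark code:

```python
import statistics
import time

def bench(fn, *args, runs: int = 10, warmup: int = 3) -> float:
    """Return the median wall-clock time of fn(*args) in milliseconds.

    Warm-up runs execute but are discarded; nothing pauses the garbage
    collector, so GC time lands in the samples, as in the methodology above.
    """
    for _ in range(warmup):
        fn(*args)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)
```

Reporting the median rather than the minimum or mean keeps one unlucky scheduler preemption from skewing the headline number in either direction.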
Results: Per-Step Timings
First, the per-step breakdown at 10,000 rows. This shows where the computational work actually happens in the cleaning pipeline:
| Operation | Time (10K rows) | Per Row |
|---|---|---|
| Email typo correction | 2.28ms | 228ns |
| Name normalization | 2.23ms | 223ns |
| Phone formatting (E.164) | 1.13ms | 113ns |
| Deduplication | 1.07ms | 107ns |
| Date standardization | 0.08ms | 8ns |
Email correction and name normalization are the most expensive operations because they involve the most complex transformations — pattern matching against known typo databases, unicode-aware casing rules, and multi-word special case handling. Date standardization is the cheapest because date parsing, while tricky for humans, is computationally straightforward once you have a robust parser.
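The pattern matching that makes email correction the most expensive step can be sketched with fuzzy matching against a list of known providers. The domain list and similarity cutoff here are assumptions for illustration; the production pipeline uses a larger typo database:

```python
import difflib

# Assumed provider list; real matching covers far more domains.
KNOWN_DOMAINS = ["gmail.com", "yahoo.com", "hotmail.com", "outlook.com"]

def correct_domain(email: str, cutoff: float = 0.8) -> str:
    """Fix a likely-misspelled domain by fuzzy-matching known providers."""
    local, sep, domain = email.lower().partition("@")
    if not sep:
        return email  # no '@': leave for syntax validation to flag
    match = difflib.get_close_matches(domain, KNOWN_DOMAINS, n=1, cutoff=cutoff)
    return local + "@" + (match[0] if match else domain)
```

A per-row similarity scan like this is why email correction costs ~228ns per row while a single-format date reserialization costs ~8ns.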
Results: Scale Timings
Now the numbers that matter most: how the full 5-operation pipeline performs as dataset size increases.
| Rows | Cleaning Time | With DB Writes | Throughput |
|---|---|---|---|
| 10,000 | ~6.8ms | <1s | ~1.47M rows/sec |
| 100,000 | ~15ms | ~2s | ~6.7M rows/sec |
| 1,000,000 | 153ms | ~20s | ~6.5M rows/sec |
| 10,000,000 | ~1.5s | ~3.5 min | ~6.7M rows/sec |
Two things stand out. First, the cleaning operation itself scales nearly linearly: once past the fixed per-job overhead visible at 10,000 rows, throughput holds at roughly 6.5 to 6.7 million rows per second from 100,000 rows through 10 million. Second, the database write step dominates at larger scales. At 1 million rows, cleaning takes 153ms, while persisting the results accounts for nearly all of the remaining ~20 seconds. This is a deliberate design choice: we persist cleaned data immediately so it is available for querying, collaboration, and downstream integrations without requiring the user to manage file exports.
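The throughput column follows directly from rows divided by cleaning time, which is worth checking at the headline data point:

```python
rows = 1_000_000
cleaning_seconds = 0.153

# 1,000,000 rows / 0.153 s is just over 6.5 million rows per second,
# matching the throughput column in the table above.
throughput = rows / cleaning_seconds
```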
The Parallelism Factor
NoSheet's cleaning engine achieves a 9.1x speedup from parallel processing across 16 CPU cores. This is a measured number, not a theoretical maximum. The speedup is less than the ideal 16x because some operations (particularly deduplication) require coordination across the dataset to maintain global correctness. A tool that achieves 16x speedup on deduplication is almost certainly missing cross-boundary duplicates.
The 9.1x figure represents the sweet spot between parallelism and correctness. Every duplicate is caught, every email typo is fixed, and the work is distributed across all available cores. No configuration is required — the system automatically adapts to the available hardware.
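The parallelism/correctness trade-off can be sketched as follows: per-row normalization fans out across workers, but the duplicate check funnels through one global set of seen keys. This is an illustrative sketch, not NoSheet's engine; it uses a thread pool for brevity, whereas real CPU parallelism in CPython would need a process pool:

```python
from concurrent.futures import ThreadPoolExecutor

def normalize_chunk(chunk):
    # Per-row work (trimming, casing, formatting) is embarrassingly parallel.
    return [{"email": r["email"].strip().lower()} for r in chunk]

def parallel_dedupe(rows, workers: int = 4, chunk_size: int = 10_000):
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        normalized = pool.map(normalize_chunk, chunks)
    # Deduplication needs one global view of the keys seen so far; this
    # sequential merge is the coordination that caps speedup below the
    # core count, but it catches duplicates that span chunk boundaries.
    seen, out = set(), []
    for chunk in normalized:
        for r in chunk:
            if r["email"] not in seen:
                seen.add(r["email"])
                out.append(r)
    return out
```

Deduplicating each chunk independently and skipping the global merge would scale perfectly with core count, and would silently keep every duplicate pair that straddles a chunk boundary.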
How Does This Compare to Other Approaches?
Since no other data cleaning vendor publishes benchmarks, we compared against the approaches that data teams actually use in practice. These are estimates based on our testing and publicly available performance data for each tool category. Your mileage will vary based on specific configurations and data characteristics.
| Approach | Estimated Time (1M rows) | Notes |
|---|---|---|
| Manual Excel formulas | 30+ minutes | Assumes file even opens; crashes likely |
| Python (pandas) | 8-15 seconds | Basic ops; custom dedup adds complexity |
| Cloud ETL (Glue, Dataflow) | Minutes to provision + seconds | Cold start overhead dominates |
| NoSheet | 153ms | All 5 operations, single pipeline |
The Python comparison deserves context. A skilled data engineer can write a pandas script that processes 1 million rows in 8 to 15 seconds for basic operations like trimming whitespace and converting case. But that script does not include robust E.164 phone formatting with international number detection, typo-aware email domain correction, or globally-consistent deduplication. Adding those features requires additional libraries, custom logic, and testing. The 8-15 second estimate is for a basic pipeline. A production-quality pipeline with all five of our benchmark operations would take significantly longer to write, debug, and maintain.
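For reference, the kind of basic pandas pipeline that lands in the 8-15 second range looks something like this sketch (column names assumed to match the benchmark schema):

```python
import pandas as pd

def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Baseline pandas cleaning: trim, case-normalize, exact-match dedup."""
    out = df.copy()
    out["name"] = out["name"].str.strip().str.title()
    out["email"] = out["email"].str.strip().str.lower()
    # drop_duplicates covers exact-match dedup only; E.164 phone formatting,
    # typo-aware email correction, and surname special cases (McDonald,
    # O'Brien) all require extra libraries and custom code on top of this.
    return out.drop_duplicates(subset=["name", "email"]).reset_index(drop=True)
```

The gap between this baseline and a production pipeline is exactly the custom logic described above, which is where the real engineering time goes.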
Cloud ETL platforms like AWS Glue or Google Dataflow can process large datasets efficiently once they are running, but the provisioning overhead — spinning up workers, loading data into the cluster, initializing the job — adds minutes before any data is actually processed. For recurring batch jobs this overhead is acceptable. For interactive data cleaning where a user is waiting for results, it is not.
What 153 Milliseconds Actually Feels Like
To put 153 milliseconds in perspective: a human eye blink takes 100 to 400 milliseconds. A browser fully rendering a complex webpage takes 200 to 500 milliseconds. A Google search query takes about 200 milliseconds. Typing a single character on a keyboard and seeing it appear on screen takes about 50 to 100 milliseconds.
In other words, NoSheet cleans 1 million rows in less time than it takes your browser to render the results page. The data is clean before your eyes finish scanning the progress indicator. At this speed, the user experience shifts from "submit a job and wait for results" to "instant feedback." You click a button and the data is simply done.
This matters more than most people realize. When processing takes seconds or minutes, users submit one cleaning job and accept whatever comes out. They do not iterate. They do not experiment with different dedup strategies or try alternative normalization rules. The latency tax discourages exploration.
When processing is instantaneous, behavior changes. Users try three different dedup key combinations to see which one catches the most duplicates. They toggle email typo correction on and off to compare results. They experiment with date format detection settings. The result is not just faster cleaning — it is more thorough cleaning, because the cost of experimentation is zero.
Reproducibility and Transparency
We believe benchmark transparency matters. The numbers in this article are not cherry-picked best-case results. They are median timings from multiple runs on production hardware with realistic data. We ran each benchmark a minimum of ten times and reported the median to account for variance from system-level scheduling and memory management.
If you want to verify these numbers yourself, sign up and upload a million-row dataset. The cleaning time is displayed in the interface after each operation. We do not hide it or round it. You see exactly how long your specific data took to process.
For a walkthrough of using NoSheet's cleaning pipeline on your own data, start with our guide to cleaning 1M rows in under a second, or jump straight into the CSV Cleaner and deduplication tool.
Run the Benchmark On Your Data
Upload any CSV and see exactly how fast NoSheet processes it. Real numbers, your data, no surprises.
Try It Free