Performance
Clean and Deduplicate 1 Million Rows in Under a Second
NoSheet processes 1M rows through 5 cleaning steps — dedup, phone formatting, email fixing, name normalization, and date standardization — in 153 milliseconds. Here is the proof.
The Export From Hell
You exported 1 million contacts from Salesforce. You already know it is going to be ugly before you even open the file. Half the records have duplicate entries because your sales team entered the same lead from three different trade shows. Phone numbers appear in at least twelve different formats: (555) 123-4567, 555.123.4567, +1-555-123-4567, 5551234567, and several creative variations that look like someone was testing their keyboard. Email addresses have typos that would be funny if they were not costing you money — "john@gmial.com", "sarah@yaho.com", "mike@outloo.com". Names swing wildly between ALL CAPS, all lowercase, and mixed case. Date columns are a war zone of MM/DD/YYYY and DD/MM/YYYY fighting for dominance.
This is the reality of production data. It is never clean. It is never consistent. And the bigger the dataset, the worse it gets, because more sources and more humans touched it along the way.
Most tools tell you to split the file into manageable pieces, clean each batch separately, and stitch the results back together. Some suggest writing a Python script and running it overnight. Others recommend hiring a data analyst to spend a week on it.
We process the whole thing — all 5 operations, all 1 million rows — in 153 milliseconds.
What "Cleaning" Actually Means at 1 Million Rows
When we say "cleaning," we are not talking about one simple find-and-replace operation. We mean five distinct, computationally intensive transformations applied to every single record in the dataset. Each operation solves a different class of data quality problem, and together they cover the vast majority of issues found in real-world CRM exports, marketing lists, and transactional databases.
Name normalization converts inconsistent casing to proper title case while correctly handling edge cases like "McDonald", "O'Brien", "de la Cruz", and "van der Berg". It trims leading and trailing whitespace, collapses multiple spaces, and ensures every name in your dataset follows the same format. At 10,000 rows, this operation takes 2.23 milliseconds.
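The logic is easy to underestimate until you try writing it. A naive `str.title()` call turns "McDonald" into "Mcdonald" and "O'Brien" into "O'Brien" only by luck. A minimal sketch of the kind of rules involved (the particle list and helper name are illustrative, not NoSheet's actual implementation):

```python
import re

# Particles kept lowercase when they appear mid-name
# (illustrative list, not NoSheet's actual rule set)
LOWERCASE_PARTICLES = {"de", "la", "van", "der", "von", "da"}

def normalize_name(raw: str) -> str:
    """Trim, collapse whitespace, and title-case with common edge cases."""
    words = re.sub(r"\s+", " ", raw.strip()).split(" ")
    out = []
    for i, w in enumerate(words):
        lower = w.lower()
        if i > 0 and lower in LOWERCASE_PARTICLES:
            out.append(lower)                               # "de la Cruz"
        elif lower.startswith("mc") and len(lower) > 2:
            out.append("Mc" + lower[2:].capitalize())       # "McDonald"
        elif "'" in lower and len(lower) > 2:
            head, _, tail = lower.partition("'")
            out.append(head.capitalize() + "'" + tail.capitalize())  # "O'Brien"
        else:
            out.append(lower.capitalize())
    return " ".join(out)

normalize_name("  JOHN   SMITH ")   # "John Smith"
normalize_name("maria de la cruz")  # "Maria de la Cruz"
```

Even this toy version shows why the operation is not a simple find-and-replace: every word needs case-aware, position-aware handling.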
Phone number formatting converts every variation to E.164 international standard, which is the format required by Twilio, Facebook Ads, Google Ads, Mailchimp, and virtually every modern platform that accepts phone numbers. Parentheses, dots, dashes, spaces, and country code variations are all handled. At 10,000 rows, this takes 1.13 milliseconds.
Email validation and typo correction catches the common domain misspellings that plague every contact database: gmial.com becomes gmail.com, yaho.com becomes yahoo.com, hotmal.com becomes hotmail.com, outloo.com becomes outlook.com. It also validates syntax and flags truly invalid addresses. At 10,000 rows, this takes 2.28 milliseconds.
Deduplication identifies and removes duplicate records based on configurable key columns. At 10,000 rows, this takes 1.07 milliseconds.

Date standardization resolves ambiguous date formats into a single, consistent standard — ISO 8601 or whatever your downstream system requires. At 10,000 rows, this takes 0.08 milliseconds.
All five operations run in a single pass. You do not clean names first, then come back for phones, then run a separate dedup step. One pipeline, one execution, one result.
The Numbers
These are real benchmarks from production hardware, not theoretical estimates or best-case marketing numbers. Every timing was measured under realistic conditions with representative data.
| Dataset Size | Cleaning Time | Including DB Writes |
|---|---|---|
| 10,000 rows | ~6.8ms | <1 second |
| 100,000 rows | ~15ms | ~2 seconds |
| 1,000,000 rows | 153ms | ~20 seconds |
| 10,000,000 rows | ~1.5s | ~3.5 minutes |
The per-step breakdown at 10,000 rows shows where the time actually goes:
| Operation | Time (10K rows) |
|---|---|
| Normalize names | 2.23ms |
| Format phone numbers (E.164) | 1.13ms |
| Fix email typos | 2.28ms |
| Remove duplicates | 1.07ms |
| Standardize dates | 0.08ms |
NoSheet also achieves a 9.1x speedup from parallel processing across 16 CPU cores. The cleaning engine distributes work automatically — you do not need to configure anything. Upload your file, select your cleaning operations, and the system figures out the fastest way to process it.
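Per-row transformations — names, phones, emails, dates — are independent of each other, which is what makes them parallelize so well. A rough sketch of the general pattern (not NoSheet's actual engine, which is not described here) using Python's standard library:

```python
from multiprocessing import Pool
import os

def clean_row(row):
    """Per-row transforms are independent, so they parallelize
    across cores with no coordination between workers."""
    return {k: v.strip() for k, v in row.items()}  # stand-in for real cleaning

def clean_parallel(rows, workers=None):
    """Fan rows out across CPU cores in chunks."""
    workers = workers or os.cpu_count()
    chunk = max(1, len(rows) // (workers * 4))  # amortize per-task overhead
    with Pool(workers) as pool:
        return pool.map(clean_row, rows, chunksize=chunk)
```

Deduplication is the exception: it needs a view of the whole dataset, so it runs as a separate global pass rather than inside the per-row fan-out.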
Why Speed Matters More Than You Think
"153 milliseconds versus 20 seconds — who cares? I can wait 20 seconds." If data cleaning were a one-time event, you would be right. But it is not. Data cleaning is an iterative process. You clean, you inspect the results, you adjust your parameters, and you clean again. Every round-trip through the pipeline matters.
When cleaning takes minutes, people run it once and accept whatever comes out. When cleaning takes milliseconds, people experiment. They try different dedup strategies, adjust the email typo thresholds, check the edge cases in name normalization. The result is not just faster cleaning — it is better cleaning.
Speed also matters for workflow completion rates. UX research consistently shows that latency kills engagement. Every second of delay increases the probability that a user abandons the task. For marketing teams cleaning a list before a campaign launch, there is a real window: the campaign goes out at 9 AM tomorrow, and the list needs to be clean tonight. A tool that processes the list in under a second removes the anxiety entirely.
What About Google Sheets and Excel?
Google Sheets has a hard limit of 10 million cells. At 60 columns, that is effectively 167,000 rows. You cannot even open a million-row dataset. And within the limits that Google Sheets does support, performance degrades rapidly. A VLOOKUP across 100,000 rows takes 15 to 30 seconds to execute. Formula recalculation at 500,000 rows takes 30 to 90 seconds, during which the entire interface is frozen and unresponsive.
Excel raises the row limit to 1,048,576, but that is still a hard cap — you cannot add a single row beyond it. And in practice, Excel becomes unusable well before you hit the limit. Any workbook with formulas starts freezing above 500,000 rows. Sorting and filtering become multi-minute operations. Auto-save can take over a minute, during which you cannot interact with the file.
NoSheet has no row limit. We have tested and benchmarked 10 million rows (3.5 minutes including database writes), and enterprise customers process even larger datasets. The architecture was designed from the ground up for datasets that spreadsheets cannot touch.
But Does It Actually Deduplicate Correctly?
Speed means nothing if the deduplication misses duplicates. The most common failure mode for dedup tools is what we call "boundary blindness" — when a tool processes data in segments and fails to catch duplicates that appear in different segments. Row 50,000 is a duplicate of row 950,000, but because they were processed separately, both survive.
NoSheet's dedup engine maintains global state across the entire dataset. It does not matter whether a duplicate appears in the first thousand rows or the last thousand rows. Every row is checked against a complete, continuously updated registry of seen values. This is how we catch duplicates that simpler tools miss.
Deduplication also accounts for variations that look different but represent the same record. "John Smith" and "john smith" and "JOHN SMITH" and " John Smith " are all treated as the same entry when name normalization runs before the dedup step. This is why running all five cleaning operations in a single pipeline is critical — normalization feeds deduplication, and the order matters.
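The ordering principle can be shown in a few lines. A minimal sketch (the whitespace-collapse-and-lowercase step stands in for the full normalizer, and the function name is illustrative): dedup keys on the *normalized* value, so casing and whitespace variants collapse into one record.

```python
def dedupe_normalized(rows, key_column="name"):
    """Dedup keyed on the normalized value, so 'JOHN SMITH' and
    '  John   Smith ' count as the same record. Uses one global
    `seen` set, so distance between duplicates does not matter."""
    seen, out = set(), []
    for row in rows:
        key = " ".join(row[key_column].split()).lower()  # stand-in normalizer
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out
```

Run dedup on the raw values instead and all four "John Smith" variants survive — which is exactly the failure mode a single ordered pipeline avoids.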
From Export to Campaign-Ready in Under a Second
The traditional workflow for cleaning a million-row export involves multiple tools, multiple passes, and multiple hours. Export from your CRM. Open in Excel (wait 3 minutes for it to load). Try to deduplicate (Excel's Remove Duplicates only catches exact matches, so "JOHN SMITH" and "John Smith" both survive). Export to a Python script. Debug the script. Run it. Wait. Import the results back. Spot-check. Find issues. Repeat.
With NoSheet, the workflow is: upload your CSV, select your cleaning operations, click run, and download the result. The entire process — from messy export to campaign-ready dataset — takes less time than it took you to read this sentence.
If your data needs cleaning before an email campaign, SMS blast, or ad audience upload, check out our CSV Cleaner, deduplication tool, and our guide on removing duplicates from CSV files.
See 153ms For Yourself
Upload your million-row CSV and watch it get cleaned before your eyes. No code, no formulas, no waiting.
Try It Now