How to Find PHI in Spreadsheets Automatically
Protected Health Information hides in places you would never expect. Notes fields, file name columns, free-text entries, and custom columns all harbor PHI that manual review misses. Here is how to detect it automatically before it becomes a breach.
What Counts as PHI: The 18 HIPAA Identifiers
Before you can detect PHI, you need to know exactly what qualifies. The HIPAA Privacy Rule's Safe Harbor de-identification method lists 18 categories of identifiers that, when associated with health information, constitute Protected Health Information. Here is the complete list:
- Names (full or partial)
- Geographic data smaller than a state (street address, city, ZIP code)
- Dates more specific than the year that relate to an individual (birth date, admission date, discharge date, death date), and all ages over 89
- Phone numbers
- Fax numbers
- Email addresses
- Social Security numbers
- Medical record numbers (MRN)
- Health plan beneficiary numbers
- Account numbers
- Certificate/license numbers
- Vehicle identifiers and serial numbers
- Device identifiers and serial numbers
- Web URLs
- IP addresses
- Biometric identifiers (fingerprints, voiceprints)
- Full-face photographs and comparable images
- Any other unique identifying number, characteristic, or code
The critical nuance is that these become PHI only when they are associated with health information. A standalone list of phone numbers is not PHI. But a spreadsheet that contains phone numbers alongside appointment dates, provider names, or any health-related data is PHI, and every identifier in that spreadsheet is protected.
The 18th category, "any other unique identifying number," is a catch-all that covers internal patient IDs, custom reference numbers, and any code that could be used to re-identify an individual. This makes de-identification particularly challenging, because even a seemingly innocuous internal ID can constitute PHI if it maps back to a patient record.
Why PHI Hides in Unexpected Places
The obvious columns are easy to identify. A column labeled "Patient Name" or "SSN" clearly contains PHI. The danger comes from the columns that do not advertise their contents.
Free-Text Notes Fields
Notes fields are the single most dangerous column type in any healthcare spreadsheet. A staff member might write "Called patient John Smith at 555-0123 about his diabetes follow-up" in a notes field. That single cell now contains a name, a phone number, and a diagnosis. Notes fields bypass every structured data control because their content is unpredictable and varies from record to record. One row might contain "No issues" while the next contains a paragraph of clinical information.
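To make the notes-field risk concrete, here is a minimal Python sketch that scans a single free-text cell for two kinds of signal: a phone-like number and a health-related term. The `HEALTH_TERMS` list and the `scan_note` function are hypothetical and exist only for illustration; a real vocabulary would be far larger, and the name in the example cell is not detected at all.

```python
import re

# Hypothetical keyword list for illustration only.
HEALTH_TERMS = ["diabetes", "follow-up", "diagnosis", "prescription"]

# Matches 7-digit local numbers (555-0123) and 10-digit numbers (555-555-0123).
PHONE = re.compile(r"(?<!\d)(?:\(?\d{3}\)?[\s.-]?)?\d{3}[\s.-]\d{4}(?!\d)")

def scan_note(text: str) -> dict[str, list[str]]:
    """Return the phone-like strings and health terms found in one cell."""
    findings: dict[str, list[str]] = {}
    phones = PHONE.findall(text)  # no capture groups, so full matches are returned
    if phones:
        findings["phone"] = phones
    lowered = text.lower()
    terms = [t for t in HEALTH_TERMS
             if re.search(r"\b" + re.escape(t) + r"\b", lowered)]
    if terms:
        findings["health_term"] = terms
    return findings

note = "Called patient John Smith at 555-0123 about his diabetes follow-up"
print(scan_note(note))
```

A cell such as "No issues" returns an empty dictionary, which is why column-level sampling can miss the one row that matters.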
File Name and Reference Columns
Many organizations use naming conventions that embed identifiers in file names or reference codes. A column might contain values like "Smith_John_123-45-6789_MRI_Report.pdf" where the Social Security number is part of the file naming convention. Or a reference number like "MRN-2024-0015847" that directly encodes the medical record number. These patterns are invisible if you only examine column headers.
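File names of this kind also defeat a common regex habit. In most regex engines the underscore counts as a word character, so an SSN pattern anchored with `\b` finds nothing in the example above. The sketch below, written in Python under that assumption, swaps the word boundaries for digit lookarounds; the function name `find_embedded_ssns` is illustrative.

```python
import re

# Word-boundary version: fails when the number is wrapped in underscores.
WORD_BOUNDARY_SSN = re.compile(r"\b\d{3}-?\d{2}-?\d{4}\b")

# Lookaround version: only requires that no digit touches either end.
DIGIT_BOUNDARY_SSN = re.compile(r"(?<!\d)\d{3}-?\d{2}-?\d{4}(?!\d)")

def find_embedded_ssns(value: str) -> list[str]:
    """Return SSN-shaped substrings, even inside underscore-delimited names."""
    return DIGIT_BOUNDARY_SSN.findall(value)

name = "Smith_John_123-45-6789_MRI_Report.pdf"
print(WORD_BOUNDARY_SSN.findall(name))  # []
print(find_embedded_ssns(name))         # ['123-45-6789']
```

The trade-off is that the lookaround version also matches digits glued to letters, so it tends to produce more false positives on ordinary reference codes.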
Custom and Calculated Columns
When analysts create custom columns for reports, they sometimes concatenate identifiers in ways that create new PHI. A column labeled "Patient Key" might contain a combination of first initial, last name, and date of birth (JSmith19850315). A "Sort Order" column might encode ZIP code and age. These derived identifiers are still PHI because they can be used to re-identify individuals.
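One way to surface derived keys like `JSmith19850315` is to look for letters followed by eight digits that parse as a real calendar date. The following Python sketch assumes the `YYYYMMDD` layout used in the example; the function name and the pattern are illustrative, and other layouts would need their own checks.

```python
import re
from datetime import datetime

# Letters followed by an 8-digit block starting with 19 or 20.
KEY_WITH_DATE = re.compile(r"^[A-Za-z]{2,}((?:19|20)\d{6})$")

def looks_like_name_dob_key(value: str) -> bool:
    """True when a value looks like name letters plus a valid YYYYMMDD date."""
    match = KEY_WITH_DATE.match(value.strip())
    if not match:
        return False
    try:
        # Rejects impossible dates such as month 13 or day 99.
        datetime.strptime(match.group(1), "%Y%m%d")
    except ValueError:
        return False
    return True

print(looks_like_name_dob_key("JSmith19850315"))  # True
print(looks_like_name_dob_key("Batch20241399"))   # False
```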
Exported Metadata
When data is exported from EHR systems, the export often includes metadata columns that the requesting user did not explicitly ask for. Audit timestamps, user IDs of treating providers, system-generated reference numbers, and transaction IDs can all contain or lead to PHI. These columns tend to have cryptic names like "SYS_REF_ID" or "AUDIT_TS" that are easy to overlook.
Why Manual Review Falls Short
Many organizations rely on manual review to detect PHI in spreadsheets. A compliance officer opens the file, scans the column headers, spot-checks a few rows, and decides whether the data contains PHI. This approach fails for several reasons.
Volume. A spreadsheet with 50 columns and 10,000 rows contains 500,000 individual cells. No human can meaningfully review that volume of data. Even a diligent reviewer will sample a fraction of the cells, missing PHI that appears only in specific rows.
Pattern blindness. Humans are poor at recognizing patterns across large datasets. A Social Security number formatted as 123456789 (no dashes) in a column labeled "Reference ID" will not register as an SSN to most reviewers. A date of birth embedded in a longer string like "DOB03151985-REF" requires pattern matching that humans do not perform reliably at scale.
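Software can judge a column by its contents instead of its label. As a rough illustration, the Python sketch below computes the share of non-empty values in a column that have the shape of an SSN, so a "Reference ID" column full of nine-digit numbers stands out. The function name and any threshold you apply to the result are assumptions, not an established standard.

```python
import re

# An entire cell that is nine digits, with or without dashes.
NINE_DIGITS = re.compile(r"^\d{3}-?\d{2}-?\d{4}$")

def column_match_rate(values, pattern=NINE_DIGITS) -> float:
    """Fraction of non-empty cells in a column that match the pattern."""
    non_empty = [str(v).strip() for v in values if str(v).strip()]
    if not non_empty:
        return 0.0
    hits = sum(1 for v in non_empty if pattern.match(v))
    return hits / len(non_empty)

reference_ids = ["123456789", "987-65-4321", "", "n/a"]
print(round(column_match_rate(reference_ids), 2))  # 0.67
```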
Inconsistency. Different reviewers apply different standards. One reviewer might flag email addresses as PHI while another does not. One might catch SSNs without dashes while another only recognizes the dashed format. Manual review produces inconsistent results that depend on who performs it and how much time they spend.
No audit trail. Manual review typically produces no documentation of what was checked, what was found, and what was cleared. When an auditor asks how you determined that a dataset was free of PHI, "someone looked at it" is not a defensible answer.
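An automated check can leave the record that manual review lacks. The Python sketch below shows one possible shape for such a record: a timestamp, a hash of the file that was scanned, the checks that ran, and the counts they produced. The field names are hypothetical, and what an auditor actually requires will depend on your organization's policies.

```python
import hashlib
import json
from datetime import datetime, timezone

def build_audit_record(file_bytes: bytes, checks: list[str],
                       findings: dict[str, int]) -> str:
    """Serialize what was scanned, which checks ran, and what was found."""
    record = {
        "scanned_at": datetime.now(timezone.utc).isoformat(),
        "file_sha256": hashlib.sha256(file_bytes).hexdigest(),
        "checks_run": sorted(checks),
        "findings_by_type": findings,
        # Cleared only when every check reported zero hits.
        "cleared": not any(findings.values()),
    }
    return json.dumps(record, indent=2, sort_keys=True)

print(build_audit_record(b"id,notes\n1,No issues\n",
                         ["ssn", "email"], {"ssn": 0, "email": 0}))
```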
Regex Patterns for Common PHI Types
For organizations that want to build basic PHI detection into their processes, regular expressions can catch the most common patterns. Here are the standard patterns for the highest-risk identifier types:
// Social Security Number (with or without dashes)
\b\d{3}-?\d{2}-?\d{4}\b
// Medical Record Number (common formats)
\b(MRN|MR#|MedRec)[:\s-]?\d{5,10}\b
// Phone Number (multiple formats)
\b(\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4})\b
// Email Address
\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b
// Date of Birth (MM/DD/YYYY or MM-DD-YYYY)
\b(0[1-9]|1[0-2])[/-](0[1-9]|[12]\d|3[01])[/-](19|20)\d{2}\b
// Credit Card Number
\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b
// IP Address
\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b
// ZIP Code (5-digit and ZIP+4)
\b\d{5}(-\d{4})?\b
These patterns provide a starting point, but they have significant limitations. The SSN pattern will also match other 9-digit numbers. The phone pattern will match some non-phone numeric sequences. ZIP code detection generates false positives on any 5-digit number. And none of these patterns can detect names, which are one of the 18 HIPAA identifiers and arguably the hardest to detect reliably without natural language processing.
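Some of the SSN false positives can be filtered with structural rules: area numbers 000, 666, and 900 through 999 are never issued as SSNs, and neither are group 00 or serial 0000. The Python sketch below applies those rules after the regex match. The function name is illustrative, and passing the check only means a value is plausible, not that it is a real SSN.

```python
import re

# Same shape as the SSN pattern above, with the three parts captured.
SSN_CANDIDATE = re.compile(r"\b(\d{3})-?(\d{2})-?(\d{4})\b")

def is_plausible_ssn(candidate: str) -> bool:
    """Apply structural rules that rule out never-issued SSN ranges."""
    match = SSN_CANDIDATE.fullmatch(candidate.strip())
    if not match:
        return False
    area, group, serial = match.groups()
    if area in {"000", "666"} or area.startswith("9"):
        return False
    return group != "00" and serial != "0000"

print(is_plausible_ssn("123-45-6789"))  # True
print(is_plausible_ssn("900-12-3456"))  # False
```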
Building a comprehensive regex-based scanner requires dozens of patterns, careful tuning of false positive rates, and ongoing maintenance as new identifier formats emerge. For most organizations, this is not a practical approach.
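For a sense of what even a minimal scanner involves, here is a Python sketch that applies three of the patterns above to every cell of a CSV and reports where each hit occurred. The `scan_csv` function and the `PATTERNS` dictionary are illustrative names, and the sketch ignores everything a production scanner would need: names, tuning, validation, and the remaining identifier types.

```python
import csv
import io
import re

PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-?\d{2}-?\d{4}\b"),
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "mrn": re.compile(r"\b(?:MRN|MR#|MedRec)[:\s-]?\d{5,10}\b"),
}

def scan_csv(text: str) -> list[tuple[int, str, str]]:
    """Return (row_number, column_name, identifier_type) for every hit."""
    findings = []
    reader = csv.DictReader(io.StringIO(text))
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        for column, cell in row.items():
            if not isinstance(cell, str):
                continue  # skip missing or overflow fields
            for kind, pattern in PATTERNS.items():
                if pattern.search(cell):
                    findings.append((row_number, column, kind))
    return findings

sample = "id,notes\n1,No issues\n2,Sent results to jane@example.com re MRN:0015847\n"
print(scan_csv(sample))
# [(3, 'notes', 'email'), (3, 'notes', 'mrn')]
```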
How NoSheet Detects PHI Automatically
NoSheet's automatic PII and PHI detection goes beyond simple regex matching. When you upload a spreadsheet, every column is scanned using a combination of pattern matching, contextual analysis, and column-header heuristics to identify sensitive data with high accuracy and low false positive rates.
Pattern detection identifies SSNs, credit card numbers, phone numbers, email addresses, medical record numbers, and dates of birth regardless of formatting. The scanner recognizes these identifiers with dashes, spaces, dots, or no separators at all. It also detects identifiers embedded within longer strings, catching the file-name and reference-code patterns described above.
Contextual analysis examines the relationship between columns to determine whether data constitutes PHI. A column of dates next to a column of names is flagged as potential PHI (dates of birth associated with individuals), while a standalone column of dates labeled "Invoice Date" is treated differently. This context-aware approach dramatically reduces false positives compared to pure pattern matching.
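The idea of context can be illustrated with a toy rule in Python. This is not NoSheet's logic, which the article does not describe in detail; the keyword sets and the `classify_date_column` function are assumptions made up for this sketch.

```python
# Hypothetical keyword sets, for illustration only.
BUSINESS_WORDS = {"invoice", "order", "ship", "due", "payment"}
PERSON_WORDS = {"name", "patient", "member"}

def _words(header: str) -> set[str]:
    return set(header.lower().replace("_", " ").split())

def classify_date_column(header: str, other_headers: list[str]) -> str:
    """Guess whether a date column is a business date or potential PHI."""
    if _words(header) & BUSINESS_WORDS:
        return "business_date"
    neighbors = set()
    for other in other_headers:
        neighbors |= _words(other)
    # A date column sitting beside person-identifying columns is suspicious.
    if neighbors & PERSON_WORDS:
        return "possible_phi"
    return "unknown"

print(classify_date_column("Invoice Date", ["Patient Name"]))  # business_date
print(classify_date_column("Date", ["Patient Name", "Provider"]))  # possible_phi
```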
Column-header heuristics recognize common naming conventions for sensitive fields. Headers like "DOB," "SSN," "MRN," "Patient ID," "Member ID," and their many variations are recognized automatically, even when abbreviated or misspelled.
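A simple approximation of header matching, sketched here in Python, normalizes each header and then falls back to fuzzy matching for misspellings. The `KNOWN_HEADERS` table, the `classify_header` function, and the 0.8 cutoff are assumptions for illustration and do not describe NoSheet's implementation; `difflib.get_close_matches` comes from the Python standard library.

```python
import re
from difflib import get_close_matches

# Hypothetical lookup table: normalized header -> identifier type.
KNOWN_HEADERS = {
    "dob": "date_of_birth", "dateofbirth": "date_of_birth",
    "ssn": "ssn", "socialsecuritynumber": "ssn",
    "mrn": "mrn", "medicalrecordnumber": "mrn",
    "patientid": "patient_id", "memberid": "member_id",
}

def classify_header(header: str, cutoff: float = 0.8):
    """Map a column header to an identifier type, tolerating typos."""
    key = re.sub(r"[^a-z0-9]", "", header.lower())  # "D.O.B." -> "dob"
    if key in KNOWN_HEADERS:
        return KNOWN_HEADERS[key]
    close = get_close_matches(key, list(KNOWN_HEADERS), n=1, cutoff=cutoff)
    return KNOWN_HEADERS[close[0]] if close else None

print(classify_header("D.O.B."))        # date_of_birth
print(classify_header("Pateint ID"))    # patient_id
print(classify_header("Invoice Date"))  # None
```

A lower cutoff catches more misspellings but starts to confuse short headers with one another, so the value needs tuning against real column names.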
What to Do When PHI Is Found
Detecting PHI is only the first step. What you do next depends on your use case and your organization's policies. There are three standard responses:
Encrypt
If you need to keep the PHI but want to protect it during processing, encryption is the answer. NoSheet applies cell-level fully homomorphic encryption to flagged fields, allowing cleaning operations to proceed on the encrypted data without exposing the plaintext values. This is the right approach when you need to clean the data but cannot afford to expose it during the process. For a deeper explanation of this technology, read our guide on what encrypted data cleaning is and how it works.
Redact
If the PHI is not needed for your intended use, redaction removes it permanently. SSNs in a marketing outreach list should be redacted because there is no legitimate reason for marketing to have Social Security numbers. Redaction replaces the sensitive value with a placeholder like "[REDACTED]" or removes the column entirely. This is the right approach for datasets being shared with teams that do not need access to the underlying identifiers.
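Both forms of redaction are straightforward to sketch in Python: replacing matched values inside a cell, and dropping a whole column. The function names are illustrative, and the single SSN pattern stands in for whatever set of detectors you actually use.

```python
import re

# SSN-shaped digits with no digit touching either end.
SSN = re.compile(r"(?<!\d)\d{3}-?\d{2}-?\d{4}(?!\d)")

def redact_value(text: str, placeholder: str = "[REDACTED]") -> str:
    """Replace every SSN-shaped substring with a placeholder."""
    return SSN.sub(placeholder, text)

def drop_columns(rows: list[dict], columns: set[str]) -> list[dict]:
    """Return copies of the rows without the named columns."""
    return [{k: v for k, v in row.items() if k not in columns} for row in rows]

print(redact_value("SSN 123-45-6789 on file"))
# SSN [REDACTED] on file
print(drop_columns([{"email": "a@b.co", "ssn": "123456789"}], {"ssn"}))
# [{'email': 'a@b.co'}]
```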
Quarantine
When PHI is found in a dataset that should not contain it, quarantine is the safest response. The flagged records are separated into a restricted file that requires elevated permissions to access, while the remaining records proceed through normal processing. Quarantine is appropriate when PHI appears in unexpected places (notes fields, reference columns) and needs to be reviewed by a compliance officer before any further action is taken.
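At its simplest, quarantine is a partition of the rows. The Python sketch below separates rows whose free-text columns contain an SSN-shaped or phone-shaped value from the rest. Writing the quarantined rows to a restricted location and enforcing permissions are outside the sketch, and the names used here are illustrative.

```python
import re

PHI_PATTERNS = [
    re.compile(r"(?<!\d)\d{3}-?\d{2}-?\d{4}(?!\d)"),                      # SSN
    re.compile(r"(?<!\d)\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}(?!\d)"),      # phone
]

def partition_rows(rows: list[dict], free_text_columns: list[str]):
    """Split rows into (clean, quarantined) based on free-text contents."""
    clean, quarantined = [], []
    for row in rows:
        flagged = any(
            pattern.search(str(row.get(column, "")))
            for column in free_text_columns
            for pattern in PHI_PATTERNS
        )
        (quarantined if flagged else clean).append(row)
    return clean, quarantined

rows = [
    {"id": "1", "notes": "No issues"},
    {"id": "2", "notes": "Callback 555-867-5309"},
]
clean, quarantined = partition_rows(rows, ["notes"])
print(len(clean), len(quarantined))  # 1 1
```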
Building PHI Detection Into Your Workflow
PHI detection should not be a one-time audit. It should be an automated checkpoint in every data workflow that handles healthcare information. Every time data is exported from an EHR, imported into an analytics tool, shared with a vendor, or used for outreach, it should pass through PHI detection first.
The simplest implementation is to make PHI scanning the first step in any data cleaning workflow. Upload your file to NoSheet, review the PHI detection results, decide on encrypt/redact/quarantine for each flagged field, and then proceed with your cleaning operations. This adds seconds to your workflow and sharply reduces the risk of accidentally processing or sharing unprotected PHI.
For organizations building automated data pipelines, NoSheet's detection capabilities can be integrated as a validation gate. Data that passes PHI screening proceeds automatically. Data that triggers PHI flags is routed for review. This approach ensures that no dataset slips through without screening, even when processed by automated systems.
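The gate pattern itself is independent of any particular detector. The Python sketch below shows the control flow with two stand-in regex detectors; it does not call NoSheet, whose integration interface is not described here, and `phi_gate` and `GateResult` are hypothetical names.

```python
import re
from dataclasses import dataclass, field

# Stand-in detectors; a real gate would call a full PHI scanner.
DETECTORS = [
    re.compile(r"(?<!\d)\d{3}-?\d{2}-?\d{4}(?!\d)"),                  # SSN
    re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),    # email
]

@dataclass
class GateResult:
    decision: str  # "proceed" or "review"
    flagged_columns: list[str] = field(default_factory=list)

def phi_gate(rows: list[dict[str, str]]) -> GateResult:
    """Route a dataset to review if any cell trips a detector."""
    flagged = sorted({
        column
        for row in rows
        for column, cell in row.items()
        if any(d.search(str(cell)) for d in DETECTORS)
    })
    return GateResult("review" if flagged else "proceed", flagged)

print(phi_gate([{"sku": "A1", "qty": "3"}]).decision)                # proceed
print(phi_gate([{"id": "1", "contact": "jane@example.com"}]).decision)  # review
```

Failing closed is the safer default: if the scan itself errors out, the pipeline should treat the dataset as flagged instead of letting it proceed.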
For a complete overview of compliant data cleaning tools and practices, see our guide on HIPAA compliant data cleaning tools in 2026.
Scan Your Spreadsheet for PHI in Seconds
Upload any spreadsheet and let NoSheet automatically detect SSNs, medical record numbers, phone numbers, emails, and other PHI across every column.