PII Detection for SaaS: A Practical Field Guide
Your application accepts text from users. That text contains personally identifying information whether you wanted it to or not. This guide walks through what counts as PII (it’s broader than you think), where automated detection works and where it fails, the relevant compliance triggers in GDPR and CCPA, and a small architecture pattern that lets you redact PII before it lands in long-term storage.
What actually counts as PII
The category of "personally identifying information" is wider than the obvious things (names, email addresses, phone numbers). Both GDPR and CCPA treat the following as personal data:
- Names and email addresses
- Phone numbers
- Postal addresses, including partial addresses (e.g. "the apartment above the bakery on Elm")
- Government IDs — SSN, passport numbers, driver’s license numbers
- Financial identifiers — credit card numbers, bank account numbers, IBAN
- IP addresses (yes, even dynamic ones, under GDPR)
- Device fingerprints and persistent cookie IDs
- Biometric data — fingerprints, face descriptors, voice prints
- Health and genetic data (a separate sensitive category)
- Geolocation precise enough to identify a residence (typically < 100m)
- Combinations of non-PII data that become identifying together (the classic: zip code + birth date + gender uniquely identifies most US residents)
That last one is the trap. A user’s zip code alone isn’t PII. Their birth date alone isn’t PII. Together with their gender, they become uniquely identifying for ~87% of the US population (Sweeney, 2000). So a database that holds "harmless" demographic fields can become a re-identification risk in aggregate.
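This trap can be checked mechanically. A minimal sketch, assuming an in-memory list of hypothetical records: count how many rows share each (zip, birth date, gender) combination and flag any combination that maps to exactly one row — in effect a k-anonymity check with k = 2 over those quasi-identifier fields.

```python
from collections import Counter

# Hypothetical records: each field looks harmless on its own.
records = [
    {"zip": "02139", "birth_date": "1985-03-14", "gender": "F"},
    {"zip": "02139", "birth_date": "1985-03-14", "gender": "F"},
    {"zip": "94105", "birth_date": "1990-07-01", "gender": "M"},
]

def unique_quasi_identifiers(rows, fields=("zip", "birth_date", "gender")):
    """Return the field combinations that identify exactly one record,
    i.e. rows that fail k-anonymity (k=2) over these fields."""
    counts = Counter(tuple(r[f] for f in fields) for r in rows)
    return [combo for combo, n in counts.items() if n == 1]

print(unique_quasi_identifiers(records))
# The third record's combination appears only once, so it is
# uniquely identifying even though no single field is PII.
```

Running this sort of audit against "harmless" demographic tables is how you find the aggregate risk before an attacker does.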
Where automated detection works
Modern PII detection uses a combination of regex (for structured patterns like SSNs and credit cards), named-entity recognition (for names and locations), and small classifiers for context. The state of the art works well for:
- Email addresses — near 100% precision and recall
- Phone numbers in well-known formats — 95%+ when formats are documented per locale
- Credit card numbers — near 100%, often with Luhn validation
- SSN-style government IDs — high precision in the US, varies by country
- Common given names in the language of training data — 85-95%
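To make the structured-pattern half of that stack concrete, here is one way a regex-plus-Luhn credit card detector might look. The regex and function names are illustrative, not taken from any particular library; the Luhn checksum is what filters out random digit runs that merely look like card numbers.

```python
import re

# 13-19 digits, optionally separated by spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,19}\b")

def luhn_valid(number: str) -> bool:
    """Luhn checksum: double every second digit from the right,
    subtract 9 from results over 9, and require sum % 10 == 0."""
    digits = [int(d) for d in number if d.isdigit()]
    checksum = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

def find_card_numbers(text: str):
    """Return regex hits that also pass the Luhn check."""
    return [m.group().strip() for m in CARD_RE.finditer(text)
            if luhn_valid(m.group())]

print(find_card_numbers("card: 4111 1111 1111 1111, ref: 1234 5678 9012"))
# Only the Luhn-valid 16-digit number is reported; the 12-digit
# reference number never matches the length requirement.
```

The Luhn step is why credit card detection sits at "near 100%": it turns a noisy pattern match into a verifiable one.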
Where automated detection fails
The same detectors that get email addresses right will quietly fail on:
- Uncommon names. Standard NER models are trained on Western media corpora. Names from other cultures get missed at much higher rates.
- Names that look like common words. "Will", "May", "Hope", "Reed" — all common English given names that NER models often miss because they’re also common nouns.
- Locations described in natural language. "The corner store next to the elementary school" can be enough to identify a residence in a small town. No regex catches this.
- Re-identifiable combinations. No off-the-shelf detector catches the "zip + birth date + gender" problem because each field looks innocuous on its own.
- PII embedded in images, voice, or video. If your application accepts uploads, pure-text detectors don’t help. You need OCR + image NER, or speech-to-text + text NER.
- Adversarial obfuscation. A user who writes "john dot smith at gmail dot com" will defeat most regex-based email detectors.
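One partial mitigation for spelled-out obfuscation is to normalize common substitutions before running the standard detectors. A minimal sketch — the substitution list and regexes here are illustrative, real lists are much longer, and aggressive normalization carries its own false-positive risk (e.g. "meet at noon" becomes "meet@noon"):

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

# Undo the most common spelled-out obfuscations first.
DEOBFUSCATIONS = [
    (re.compile(r"\s+at\s+", re.IGNORECASE), "@"),
    (re.compile(r"\s+dot\s+", re.IGNORECASE), "."),
    (re.compile(r"\s*\(at\)\s*", re.IGNORECASE), "@"),
    (re.compile(r"\s*\[dot\]\s*", re.IGNORECASE), "."),
]

def find_emails(text: str):
    """Normalize spelled-out separators, then run the plain regex."""
    normalized = text
    for pattern, replacement in DEOBFUSCATIONS:
        normalized = pattern.sub(replacement, normalized)
    return EMAIL_RE.findall(normalized)

print(find_emails("contact: john dot smith at gmail dot com"))
```

This catches the lazy obfuscators; a determined adversary will still win, which is the point of the list above.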
The honest framing: automated PII detection is good enough to remove the obvious 80%, valuable enough to deploy, and not good enough to be your only defense. Treat it as one layer of defense in depth, not a compliance checkbox.
Compliance triggers
GDPR
If you process personal data of EU residents, GDPR applies. Key obligations triggered by storing PII:
- Lawful basis for processing (consent, contract, legitimate interest, etc.)
- Data Subject Access Requests — users can ask what you have on them
- Right to erasure — users can ask you to delete their data
- Breach notification — 72 hours to notify the supervisory authority of a personal data breach
- Data Protection Impact Assessment for large-scale or sensitive processing
CCPA / CPRA
If you do business in California and meet the size thresholds, CCPA/CPRA applies. Similar shape to GDPR with key differences: a clear opt-out from "sale" or "sharing" of personal information, distinct treatment of "sensitive personal information," and the right to limit use of sensitive PI.
The architecture pattern: redact before storage
The cleanest pattern is to detect and redact PII at the entry point, before the text reaches any long-term storage. Implementation:
User submits text
|
v
+----------------------+
| API gateway |
| - rate limit |
| - auth |
+----------------------+
|
v
+----------------------+
| PII detector |
| - returns: redacted |
| text + token map |
+----------------------+
| \
v v
to storage to ephemeral cache
(redacted) (token -> original,
15min TTL)
The token map is held only long enough for the immediate response cycle. After that, the original PII is gone — not recoverable from your storage. If a user later requests the unredacted version (and is authorized to receive it), you ask them to resubmit; you don’t store the original waiting for them.
This pattern has two big benefits: you radically shrink your breach blast radius, and you simplify GDPR "right to erasure" compliance because there’s less to erase.
When this pattern is wrong
Some applications genuinely need the original text. Customer support tickets, for example — redacting the customer’s name from the ticket makes it useless to the support agent. In these cases, the pattern flips: store the original, encrypt at rest, restrict access by role, log every read, and have a clear retention policy that triggers automatic deletion after a set time.
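A sketch of the read path for that flipped pattern, with hypothetical role names and a 90-day retention window: the role check gates access, every read is audit-logged, and records past retention are deleted on touch (a real system would also sweep expired records on a schedule, not only on access).

```python
import logging
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=90)            # hypothetical policy window
audit = logging.getLogger("pii.audit")

def read_ticket(store, ticket_id: str, agent_role: str):
    """Role-gated, audit-logged read of an unredacted ticket.
    Returns None (and deletes) if the ticket has aged past retention."""
    if agent_role not in {"support", "admin"}:
        audit.warning("denied read of %s by role %s", ticket_id, agent_role)
        raise PermissionError(f"role {agent_role!r} may not read tickets")
    record = store[ticket_id]
    if datetime.now(timezone.utc) - record["created_at"] > RETENTION:
        del store[ticket_id]              # retention expiry triggers deletion
        return None
    audit.info("read of %s by role %s", ticket_id, agent_role)
    return record["body"]
```

Encryption at rest happens below this layer (disk or database level); what this layer contributes is the access control, the audit trail, and the retention trigger.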
Quick reference
- Best for: defense-in-depth on text endpoints, log scrubbing, anonymizing analytics
- Avoid: relying on it as your only defense. Automated detection misses 5-20% depending on the input; pair it with access controls and retention limits.
- Watch for: re-identifiable combinations, non-Western names, embedded media, adversarial users
- Compliance triggers: GDPR (EU residents), CCPA (California, size-thresholded), HIPAA (US health data, separate regime)