PII Detection for SaaS: A Practical Field Guide
Your application accepts text from users. That text contains personally identifying information whether you wanted it to or not. This guide walks through what counts as PII (it’s broader than you think), where automated detection works and where it fails, the relevant compliance triggers in GDPR and CCPA, and a small architecture pattern that lets you redact PII before it lands in long-term storage.
What actually counts as PII
The category of "personally identifying information" is wider than the obvious things (names, email addresses, phone numbers). Both GDPR and CCPA treat the following as personal data:
- Names and email addresses
- Phone numbers
- Postal addresses, including partial addresses (e.g. "the apartment above the bakery on Elm")
- Government IDs — SSN, passport numbers, driver’s license numbers
- Financial identifiers — credit card numbers, bank account numbers, IBAN
- IP addresses (yes, even dynamic ones, under GDPR)
- Device fingerprints and persistent cookie IDs
- Biometric data — fingerprints, face descriptors, voice prints
- Health and genetic data (a separate sensitive category)
- Geolocation precise enough to identify a residence (typically < 100m)
- Combinations of non-PII data that become identifying together (the classic: zip code + birth date + gender uniquely identifies most US residents)
That last one is the trap. A user’s zip code alone isn’t PII. Their birth date alone isn’t PII. Together with their gender, they become uniquely identifying for ~87% of the US population (Sweeney, 2000). So a database that holds "harmless" demographic fields can become a re-identification risk in aggregate.
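This trap can be checked mechanically. A minimal sketch, assuming an in-memory list of hypothetical records: count how many rows share each (zip, birth date, gender) combination and flag any combination that maps to exactly one row — in effect a k-anonymity check with k = 2 over those quasi-identifier fields.

```python
from collections import Counter

# Hypothetical records: each field looks harmless on its own.
records = [
    {"zip": "02139", "birth_date": "1985-03-14", "gender": "F"},
    {"zip": "02139", "birth_date": "1985-03-14", "gender": "F"},
    {"zip": "94105", "birth_date": "1990-07-01", "gender": "M"},
]

def unique_quasi_identifiers(rows, fields=("zip", "birth_date", "gender")):
    """Return the field combinations that identify exactly one record,
    i.e. rows that fail k-anonymity (k=2) over these fields."""
    counts = Counter(tuple(r[f] for f in fields) for r in rows)
    return [combo for combo, n in counts.items() if n == 1]

print(unique_quasi_identifiers(records))
# The third record's combination appears only once, so it is
# uniquely identifying even though no single field is PII.
```

Running this sort of audit against "harmless" demographic tables is how you find the aggregate risk before an attacker does.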
Where automated detection works
Modern PII detection uses a combination of regex (for structured patterns like SSNs and credit cards), named-entity recognition (for names and locations), and small classifiers for context. The state of the art works well for:
- Email addresses — near 100% precision and recall
- Phone numbers in well-known formats — 95%+ when formats are documented per locale
- Credit card numbers — near 100%, often with Luhn validation
- SSN-style government IDs — high precision in the US, varies by country
- Common given names in the language of training data — 85-95%
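To make the structured-pattern half of that stack concrete, here is one way a regex-plus-Luhn credit card detector might look. The regex and function names are illustrative, not taken from any particular library; the Luhn checksum is what filters out random digit runs that merely look like card numbers.

```python
import re

# 13-19 digits, optionally separated by spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,19}\b")

def luhn_valid(number: str) -> bool:
    """Luhn checksum: double every second digit from the right,
    subtract 9 from results over 9, and require sum % 10 == 0."""
    digits = [int(d) for d in number if d.isdigit()]
    checksum = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

def find_card_numbers(text: str):
    """Return regex hits that also pass the Luhn check."""
    return [m.group().strip() for m in CARD_RE.finditer(text)
            if luhn_valid(m.group())]

print(find_card_numbers("card: 4111 1111 1111 1111, ref: 1234 5678 9012"))
# Only the Luhn-valid 16-digit number is reported; the 12-digit
# reference number never matches the length requirement.
```

The Luhn step is why credit card detection sits at "near 100%": it turns a noisy pattern match into a verifiable one.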
Where automated detection fails
The same detectors that get email addresses right will quietly fail on:
- Uncommon names. Standard NER models are trained on Western media corpora. Names from other cultures get missed at much higher rates.
- Names that look like common words. "Will", "May", "Hope", "Reed" — all common English given names that NER models often miss because they’re also common nouns.
- Locations described in natural language. "The corner store next to the elementary school" can be enough to identify a residence in a small town. No regex catches this.
- Re-identifiable combinations. No off-the-shelf detector catches the "zip + birth date + gender" problem because each field looks innocuous on its own.
- PII embedded in images, voice, or video. If your application accepts uploads, pure-text detectors don’t help. You need OCR + image NER, or speech-to-text + text NER.
- Adversarial obfuscation. A user who writes "john dot smith at gmail dot com" will defeat most regex-based email detectors.
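One partial mitigation for spelled-out obfuscation is to normalize common substitutions before running the standard detectors. A minimal sketch — the substitution list and regexes here are illustrative, real lists are much longer, and aggressive normalization carries its own false-positive risk (e.g. "meet at noon" becomes "meet@noon"):

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

# Undo the most common spelled-out obfuscations first.
DEOBFUSCATIONS = [
    (re.compile(r"\s+at\s+", re.IGNORECASE), "@"),
    (re.compile(r"\s+dot\s+", re.IGNORECASE), "."),
    (re.compile(r"\s*\(at\)\s*", re.IGNORECASE), "@"),
    (re.compile(r"\s*\[dot\]\s*", re.IGNORECASE), "."),
]

def find_emails(text: str):
    """Normalize spelled-out separators, then run the plain regex."""
    normalized = text
    for pattern, replacement in DEOBFUSCATIONS:
        normalized = pattern.sub(replacement, normalized)
    return EMAIL_RE.findall(normalized)

print(find_emails("contact: john dot smith at gmail dot com"))
```

This catches the lazy obfuscators; a determined adversary will still win, which is the point of the list above.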
The honest framing: automated PII detection is good enough to remove the obvious 80%, valuable enough to deploy, and not good enough to be your only defense. Treat it as one layer of defense in depth, not a compliance checkbox.
Compliance triggers
GDPR
If you process personal data of EU residents, GDPR applies. Key obligations triggered by storing PII:
- Lawful basis for processing (consent, contract, legitimate interest, etc.)
- Data Subject Access Requests — users can ask what you have on them
- Right to erasure — users can ask you to delete their data
- Breach notification — 72 hours to notify the supervisory authority of a personal data breach
- Data Protection Impact Assessment for large-scale or sensitive processing
CCPA / CPRA
If you do business in California and meet the size thresholds, CCPA/CPRA applies. Similar shape to GDPR with key differences: a clear opt-out from "sale" or "sharing" of personal information, distinct treatment of "sensitive personal information," and the right to limit use of sensitive PI.
The architecture pattern: redact before storage
The cleanest pattern is to detect and redact PII at the entry point, before the text reaches any long-term storage. Implementation:
User submits text
|
v
+----------------------+
| API gateway |
| - rate limit |
| - auth |
+----------------------+
|
v
+----------------------+
| PII detector |
| - returns: redacted |
| text + token map |
+----------------------+
| \
v v
to storage to ephemeral cache
(redacted) (token -> original,
15min TTL)
The token map is held only long enough for the immediate response cycle. After that, the original PII is gone — not recoverable from your storage. If a user later requests the unredacted version (and is authorized to receive it), you ask them to resubmit; you don’t store the original waiting for them.
This pattern has two big benefits: you radically shrink your breach blast radius, and you simplify GDPR "right to erasure" compliance because there’s less to erase.
When this pattern is wrong
Some applications genuinely need the original text. Customer support tickets, for example — redacting the customer’s name from the ticket makes it useless to the support agent. In these cases, the pattern flips: store the original, encrypt at rest, restrict access by role, log every read, and have a clear retention policy that triggers automatic deletion after a set time.
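A sketch of the read path for that flipped pattern, with hypothetical role names and a 90-day retention window: the role check gates access, every read is audit-logged, and records past retention are deleted on touch (a real system would also sweep expired records on a schedule, not only on access).

```python
import logging
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=90)            # hypothetical policy window
audit = logging.getLogger("pii.audit")

def read_ticket(store, ticket_id: str, agent_role: str):
    """Role-gated, audit-logged read of an unredacted ticket.
    Returns None (and deletes) if the ticket has aged past retention."""
    if agent_role not in {"support", "admin"}:
        audit.warning("denied read of %s by role %s", ticket_id, agent_role)
        raise PermissionError(f"role {agent_role!r} may not read tickets")
    record = store[ticket_id]
    if datetime.now(timezone.utc) - record["created_at"] > RETENTION:
        del store[ticket_id]              # retention expiry triggers deletion
        return None
    audit.info("read of %s by role %s", ticket_id, agent_role)
    return record["body"]
```

Encryption at rest happens below this layer (disk or database level); what this layer contributes is the access control, the audit trail, and the retention trigger.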
Quick reference
- Best for: defense-in-depth on text endpoints, log scrubbing, anonymizing analytics
- Avoid: relying on it as your only defense. Automated detection misses 5-20% depending on the input; pair it with access controls and retention limits.
- Watch for: re-identifiable combinations, non-Western names, embedded media, adversarial users
- Compliance triggers: GDPR (EU residents), CCPA (California, size-thresholded), HIPAA (US health data, separate regime)