PII Detection for SaaS: A Practical Field Guide

Updated May 2, 2026 · ~10 minute read

Your application accepts text from users. That text contains personally identifying information whether you wanted it to or not. This guide walks through what counts as PII (it’s broader than you think), where automated detection works and where it fails, the relevant compliance triggers in GDPR and CCPA, and a small architecture pattern that lets you redact PII before it lands in long-term storage.

What actually counts as PII

The category of "personally identifying information" is wider than the obvious things (names, email addresses, phone numbers). Both GDPR and CCPA treat the following as personal data:

That last one is the trap. A user’s ZIP code alone isn’t PII. Their birth date alone isn’t PII. Combined with gender, though, the two uniquely identify roughly 87% of the US population (Sweeney, 2000). So a database that holds "harmless" demographic fields can become a re-identification risk once those fields are joined.
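One way to see the effect on your own data is to measure how many rows become unique once fields are combined. The following is a minimal sketch on made-up toy records; the function name `unique_fraction` and the field names are assumptions for illustration, not part of any library.

```python
from collections import Counter


def unique_fraction(records, quasi_identifiers):
    """Return the fraction of records whose combination of
    quasi-identifier values appears exactly once in the dataset."""
    if not records:
        return 0.0
    keys = [tuple(r[f] for f in quasi_identifiers) for r in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(records)


# Hypothetical toy data: no single field singles anyone out.
records = [
    {"zip": "02139", "birth_date": "1985-03-14", "gender": "F"},
    {"zip": "02139", "birth_date": "1985-03-14", "gender": "M"},
    {"zip": "02139", "birth_date": "1990-07-01", "gender": "F"},
    {"zip": "94110", "birth_date": "1990-07-01", "gender": "F"},
    {"zip": "94110", "birth_date": "1972-11-30", "gender": "M"},
    {"zip": "94110", "birth_date": "1972-11-30", "gender": "M"},
]

for fields in (["zip"], ["birth_date"], ["gender"],
               ["zip", "birth_date", "gender"]):
    print(fields, round(unique_fraction(records, fields), 2))
```

On this toy set each field alone identifies nobody, while the three fields together single out four of the six rows. Running the same check against a real table is a cheap way to spot field combinations that deserve the same handling as direct identifiers.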

Where automated detection works

Modern PII detection uses a combination of regex (for structured patterns like SSNs and credit cards), named-entity recognition (for names and locations), and small classifiers for context. The state of the art works well for:

Where automated detection fails

The same detectors that get email addresses right will quietly fail on:

The honest framing: automated PII detection is good enough to remove the obvious 80%, valuable enough to deploy, and not good enough to be your only defense. Treat it as one layer of defense in depth, not a compliance checkbox.
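To make the regex layer concrete, here is a rough sketch of structured-pattern detection for emails, US SSNs, and card numbers. The patterns are deliberately simplified assumptions (the SSN pattern, for instance, will also match other numbers written in the same shape), the names `find_pii` and `luhn_valid` are hypothetical, and the NER and classifier layers are not shown.

```python
import re

# Simplified illustrative patterns, not production-grade.
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(number: str) -> bool:
    """Luhn checksum: filters out digit runs that cannot be card numbers."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_pii(text: str):
    """Return (label, matched_text) pairs for structured PII patterns."""
    findings = [("EMAIL", m.group()) for m in EMAIL_RE.finditer(text)]
    findings += [("SSN", m.group()) for m in SSN_RE.finditer(text)]
    findings += [("CARD", m.group()) for m in CARD_RE.finditer(text)
                 if luhn_valid(m.group())]
    return findings


print(find_pii("Reach me at jane@example.com, card 4111 1111 1111 1111"))
# An obfuscated address slips straight past the pattern layer:
print(find_pii("Reach me at jane at example dot com"))
```

The second call returns an empty list, which is the point of the framing above: pattern matching catches well-formed identifiers and says nothing about anything a user has reworded.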

Compliance triggers

GDPR

If you process the personal data of people in the EU (the test is where the person is, not their citizenship), GDPR generally applies. Key obligations triggered by storing PII:

CCPA / CPRA

If you do business in California and meet the size thresholds, CCPA/CPRA applies. Similar shape to GDPR with key differences: a clear opt-out from "sale" or "sharing" of personal information, distinct treatment of "sensitive personal information," and the right to limit use of sensitive PI.

The architecture pattern: redact before storage

The cleanest pattern is to detect and redact PII at the entry point, before the text reaches any long-term storage. Implementation:

User submits text
       |
       v
+----------------------+
| API gateway          |
| - rate limit         |
| - auth               |
+----------------------+
       |
       v
+----------------------+
| PII detector         |
| - returns: redacted  |
|   text + token map   |
+----------------------+
       |          \
       v           v
  to storage   to ephemeral cache
  (redacted)   (token -> original,
                15min TTL)

The token map is held only for the immediate response cycle, capped by the 15-minute TTL shown in the diagram. After that, the original PII is gone and cannot be recovered from your storage. If a user later requests the unredacted version (and is authorized to receive it), you ask them to resubmit; you don’t store the original waiting for them.

This pattern has two big benefits: you radically shrink your breach blast radius, and you simplify GDPR "right to erasure" compliance because there’s less to erase.
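The flow in the diagram can be sketched in a few lines. This is an illustration under stated assumptions, not a reference implementation: only emails are detected, `EphemeralCache` is a hypothetical in-memory stand-in for whatever expiring cache you run, `storage` is a plain list standing in for your database, and the token format is made up.

```python
import re
import time
import uuid

EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")


class EphemeralCache:
    """Hypothetical in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested
        self._items = {}

    def put(self, key, value):
        self._items[key] = (value, self.clock() + self.ttl)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._items[key]  # expired: drop the original for good
            return None
        return value


def redact(text):
    """Replace each detected email with a token; return (redacted, token_map)."""
    token_map = {}

    def replace(match):
        token = f"[EMAIL_{uuid.uuid4().hex[:8]}]"
        token_map[token] = match.group()
        return token

    return EMAIL_RE.sub(replace, text), token_map


def handle_submission(text, storage, cache):
    """Store only redacted text; park originals in the short-lived cache."""
    redacted, token_map = redact(text)
    storage.append(redacted)
    for token, original in token_map.items():
        cache.put(token, original)
    return redacted


storage = []
cache = EphemeralCache(ttl_seconds=15 * 60)  # 15-minute TTL, as in the diagram
print(handle_submission("Contact jane@example.com please", storage, cache))
```

The key design choice is that `storage` never sees the original string at all. Anything that needs the original must go through the cache, and once the TTL passes there is nothing left to leak or to erase.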

When this pattern is wrong

Some applications genuinely need the original text. Customer support tickets, for example — redacting the customer’s name from the ticket makes it useless to the support agent. In these cases, the pattern flips: store the original, encrypt at rest, restrict access by role, log every read, and have a clear retention policy that triggers automatic deletion after a set time.
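For the flipped pattern, the controls that can be shown in a few lines are role-restricted reads, a log entry for every read attempt, and a retention purge. The sketch below is hypothetical throughout: the role names, the 90-day retention period, and the `TicketStore` class are assumptions for illustration, and encryption at rest is left out because it belongs to your storage layer or key-management setup.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

ALLOWED_ROLES = {"support_agent", "support_lead"}  # assumed role names
RETENTION = timedelta(days=90)                     # assumed retention period


@dataclass
class Ticket:
    ticket_id: str
    body: str            # original, unredacted text
    created_at: datetime


@dataclass
class TicketStore:
    tickets: dict = field(default_factory=dict)
    read_log: list = field(default_factory=list)

    def add(self, ticket):
        self.tickets[ticket.ticket_id] = ticket

    def read(self, ticket_id, user, role, now=None):
        """Log every read attempt, then enforce the role check."""
        now = now or datetime.now(timezone.utc)
        allowed = role in ALLOWED_ROLES
        self.read_log.append((now, user, role, ticket_id, allowed))
        if not allowed:
            raise PermissionError(f"role {role!r} may not read tickets")
        return self.tickets[ticket_id].body

    def purge_expired(self, now=None):
        """Delete tickets older than RETENTION; return the deleted ids."""
        now = now or datetime.now(timezone.utc)
        expired = [tid for tid, t in self.tickets.items()
                   if now - t.created_at > RETENTION]
        for tid in expired:
            del self.tickets[tid]
        return expired


store = TicketStore()
store.add(Ticket("T-1", "Jane Doe cannot log in", datetime.now(timezone.utc)))
print(store.read("T-1", user="alice", role="support_agent"))
```

Denied attempts are logged as well as successful ones, since a pattern of refused reads is often the more interesting signal. In practice `purge_expired` would run on a schedule rather than on demand.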

Quick reference