Scanning S3 for PHI With Macie

October 23, 2028 · 18 min read

The situation

The SaaS estate:

30 AWS accounts in a single AWS Organization, with a delegated security account (555555555555) the platform security team owns.
4,000 S3 general-purpose buckets between them. Some are designated PHI stores (phi-*, KMS-encrypted with a customer-managed key); most are application logs, build artefacts, pipeline scratch, analytics exports, backup staging, user-uploaded files.
Designated PHI datasets hold patient records, lab results, appointment notes; the fields that matter are SSNs, medical record numbers, dates of birth, insurance claim numbers, and payment-card numbers retained for historical billing.
Compliance: HIPAA, SOC 2 Type II, and a contractual review with a major payer in ninety days.

Security wants continuous scanning for PHI, PII, and payment patterns, automated alerting on unexpected matches, a risk-ranked dashboard for triage, built-in detectors for the common types plus a way to extend them, and one enablement across all thirty accounts with one place to triage from.

What actually matters

Before mapping services to the ask, it’s worth naming the deeper properties of an answer.

The first thing to notice is that the problem is a content problem, not a metadata problem. A bucket might be named build-artefacts, tagged pii=false, and still contain a CSV with 1,200 rows of patient data that someone’s pipeline dumped last Tuesday. Tools that classify by tag or by bucket name miss this by design: they trust the claim. Tools that classify by content don’t: they read the bytes and tell you what they see. For a HIPAA and SOC 2 estate, “trust the claim” is not an answer an auditor accepts. The architecture has to look at content.

Scale changes what “looking at content” can mean. Four thousand buckets containing an indefinite number of objects is not something a single scan job runs over in a weekend. The feasible shape is continuous sampling: inspect a statistically meaningful slice of each bucket on a daily cadence, roll the results into a per-bucket sensitivity score, and produce a heat map that calls attention to the buckets trending hot. Once the heat map names a target, a deep scan job can inspect that one bucket exhaustively. Breadth is the always-on mode; depth is the on-demand tool once breadth has found something.

Detector coverage matters more than it sounds. SSN detection is table stakes; so are credit-card numbers (with and without surrounding keywords, Luhn-validated). The harder ones are the healthcare identifiers: Medicare Beneficiary Identifier, UK NHS number, Canadian and French health numbers, medical device UDIs, and the scenario-specific identifiers like MRN formats, which vary by provider. Built-in coverage handles the common types; an extension mechanism (regex plus keyword proximity) handles the proprietary ones. A tool that requires writing custom detectors for all of it is a project; a tool with managed identifiers plus a path to extend is a service.

Signal-to-noise ratio is the property that determines whether the service gets used. At 4,000 buckets, thousands of matches a week is plausible, and most of them will be the designated PHI buckets doing exactly what they should. Alerting on every finding drowns the team. The workable pattern is: designated PHI buckets expected to score high, findings flowing via Security Hub for posture evidence; non-designated buckets scoring anything non-zero are the alert, routed through EventBridge with severity filtering and optional allow-lists for known test fixtures. What stays in the unarchived queue is only what’s surprising.

Finally, cost shape has to scale linearly with the estate. A per-bucket fee plus a per-object monitoring fee plus a per-GB-inspected fee is transparent and estimable from the numbers the team already has. A per-seat licence or a tiered “enterprise” deal breaks the linearity. For an engineering team that’s going to grow the bucket count, linear-in-buckets is the shape that stays predictable.

What we’ll filter on

PHI and PII pattern detection out of the box: SSN, DOB, credit card, health identifiers, plus clean extension for custom regex.
Continuous scanning across the whole S3 estate, not a one-off job.
Cross-account roll-up: thirty member accounts enrolled once, visible from the security account.
Operational fit: findings into Security Hub and EventBridge, suppression org-wide, auto-enrol for new accounts.
Cost shape: a billing model estimable from bucket and object counts.

The data-classification landscape

Four places this can sit.

1. Amazon Macie. AWS’s managed data security service for S3. Two scanning modes: automated sensitive data discovery runs continuously across the bucket inventory using sampling; sensitive data discovery jobs are targeted deep scans, one-off or scheduled. Detection uses managed data identifiers (built-in patterns for SSN, DOB, credit card, Medicare Beneficiary Identifier, UK NHS number, Canadian and French health numbers, medical device UDIs, and many more) plus custom data identifiers (regex with optional keyword proximity and character-count constraints). Findings land in Security Hub as ASFF and EventBridge as events. Delegated administrator via AWS Organizations enrols every member account on join. Pricing: $0.10 per bucket per month, $0.01 per 100,000 objects monitored for automated discovery, $1 per GB inspected.

2. DIY with Athena plus custom-regex Lambdas. S3 Inventory to manifest buckets, Athena to sample, a Lambda to pull each object and regex the bytes, findings to DynamoDB or SNS. Works for text. Works less well for office documents, PDFs, parquet, gzipped JSON, and images with embedded text. Encodings, Unicode normalisation, keyword proximity, and false-positive suppression turn into a project of their own, with operational burden landing on the security team.

3. Third-party DLP and data-classification platforms. BigID, Varonis, Cyera, and similar. Vendors integrate via cross-account IAM and scan against their own classifier libraries. Often strong on detectors and combined discovery-plus-access-governance dashboards. Separate commercial relationship, separate identity trust, separate place to look when the first alert fires.

4. S3 Object Tagging plus AWS Config rules. Tag objects or buckets pii=true / phi=true at creation; Config rules raise findings when tags are missing. Coarse: it classifies by claim (the tag) rather than content. A file with PHI accidentally uploaded to a pii=false bucket is invisible by design. A useful metadata control on top of real classification, not a replacement for it.

Side by side

Option	PHI/PII detection	Continuous scanning	Cross-account roll-up	Operational fit	Cost shape
Amazon Macie	✓	✓	✓	✓	✓
DIY Athena + Lambda	~	✗	✗	✗	✗
Third-party DLP	✓	✓	✓	~	~
Object Tagging + Config	✗	✗	✓	✓	✓

(✓ meets the filter; ~ partial; ✗ does not.)

Macie is the only row with all ticks. DIY gets partial credit on detection but collapses on continuous scanning and operational fit. Third-party DLP solves the problem too; the tradeoff is trust boundary and vendor dependency. Object Tagging doesn’t look at content, so “unexpected PHI” is precisely what it misses.

Macie across the estate

Thirty member accounts with S3 buckets; Macie per-account; findings roll up to the delegated administrator and onward to Security Hub as ASFF and EventBridge as events.

What Macie actually detects

Managed data identifiers are AWS-maintained patterns combining regex, keyword-proximity rules, and in many cases ML classifiers. For this estate: USA_SOCIAL_SECURITY_NUMBER (keyword-required), DATE_OF_BIRTH, USA_PASSPORT_NUMBER, driving-license variants; USA_HEALTH_INSURANCE_CLAIM_NUMBER, USA_MEDICARE_BENEFICIARY_IDENTIFIER, UK_NHS_NUMBER, CANADA_HEALTH_NUMBER, FRANCE_HEALTH_INSURANCE_NUMBER, MEDICAL_DEVICE_UDI; CREDIT_CARD_NUMBER (keyword-proximity) and CREDIT_CARD_NUMBER_(NO_KEYWORD) (Luhn-validated without a keyword); AWS secret access keys and private keys.
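For a feel of what “Luhn-validated” buys on the no-keyword card detector, here is a minimal sketch of the generic Luhn checksum. It is the textbook algorithm, not Macie’s implementation (which isn’t public), and the function name is my own.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum.

    Non-digit characters (spaces, dashes) are ignored.
    """
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0
```

A random 16-digit run passes this check about one time in ten, which is why the checksum cuts noise on bare digit strings but doesn’t remove the need for keyword proximity or allow lists.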

“Medical record number” is not a single managed detector. MRN formats vary by provider, so that’s where a custom data identifier earns its place. A CDI is a regex with optional refinements: a required keyword list to match near the regex hit (cutting false positives where naked digits are common), a maximum match distance, and character-count windows:

{
  "name": "patient-mrn",
  "regex": "\\b[A-Z]{2}-\\d{7}\\b",
  "keywords": ["MRN", "medical record", "patient number", "chart"],
  "maximumMatchDistance": 50
}

This matches AA-1234567-style identifiers, but only within 50 characters of a keyword like “MRN” or “chart”. The same pattern serves payer member numbers and patient external IDs. CDIs are defined at the delegated-administrator level and apply organisation-wide.
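Before registering a custom data identifier it helps to try the regex and keyword list against sample text locally. The sketch below emulates keyword proximity as I read it from the Macie documentation: a hit counts only if some keyword ends no more than maximumMatchDistance characters before the end of the regex match, case-insensitively. That reading is an assumption, and the helper name is hypothetical; Macie’s real matcher may differ at the edges.

```python
import re

# Same values as the custom data identifier definition above.
MRN_REGEX = re.compile(r"\b[A-Z]{2}-\d{7}\b")
KEYWORDS = ["MRN", "medical record", "patient number", "chart"]
MAX_MATCH_DISTANCE = 50


def find_mrn_candidates(text: str, max_distance: int = MAX_MATCH_DISTANCE) -> list:
    """Return regex hits that have a keyword close enough before them."""
    # Collect the end offset of every keyword occurrence (case-insensitive).
    keyword_ends = [
        hit.end()
        for keyword in KEYWORDS
        for hit in re.finditer(re.escape(keyword), text, re.IGNORECASE)
    ]
    results = []
    for match in MRN_REGEX.finditer(text):
        # Distance is measured from the end of the keyword to the end of the match.
        if any(0 <= match.end() - end <= max_distance for end in keyword_ends):
            results.append(match.group())
    return results
```

Running it over a handful of real export headers and a few build logs is a cheap way to see whether the keyword list is too narrow or the regex too greedy before the identifier goes org-wide.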

Allow lists are the other half. A test dataset in sandbox-fixtures full of fake SSNs from a known test provider would otherwise fire every scan. An allow list (literal strings or a regex) tells Macie to ignore those matches. Compliance-safe when documented; dangerous when used to hide real findings.

Automated discovery versus discovery jobs

Automated sensitive data discovery is the continuous mode. It evaluates the bucket inventory daily, picks representative objects by sampling (weighted toward larger buckets and types not recently sampled), runs the managed-plus-custom identifier set, and rolls results into sensitivity scores per bucket. The headline output is an interactive heat map: one tile per bucket, coloured by score, grouped by account. A tile trending hot when it was cool yesterday is the signal. Sampling is what makes 4,000-bucket continuous coverage tractable rather than inspecting every byte every day.

Sensitive data discovery jobs are the targeted mode. Scope: one or more buckets, explicit. Depth: configurable sampling up to 100%, so a job can be exhaustive in a way automated discovery never is. Schedule: one-off or recurring. Use a job when the heat map highlights a bucket and the question shifts from “is there something in here?” to “how much, and where exactly?”, or when compliance asks for a point-in-time deep scan of the designated PHI stores before an audit.
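A one-off exhaustive job can be sketched as a request for the Macie CreateClassificationJob API. The builder below only assembles the parameters, so it runs without AWS credentials; the account ID, bucket name, and job-name format are placeholders, and the commented boto3 call is how I would expect to submit it from the delegated-administrator account.

```python
import uuid


def build_deep_scan_request(account_id, bucket_names,
                            custom_identifier_ids=(), sampling_percentage=100):
    """Assemble keyword arguments for macie2 create_classification_job."""
    if not bucket_names:
        raise ValueError("at least one bucket is required")
    if not 1 <= sampling_percentage <= 100:
        raise ValueError("sampling_percentage must be between 1 and 100")
    return {
        "clientToken": str(uuid.uuid4()),          # idempotency token
        "name": f"deep-scan-{bucket_names[0]}",    # hypothetical naming scheme
        "jobType": "ONE_TIME",                     # "SCHEDULED" for recurring jobs
        "samplingPercentage": sampling_percentage, # 100 = inspect every eligible object
        "managedDataIdentifierSelector": "ALL",
        "customDataIdentifierIds": list(custom_identifier_ids),
        "s3JobDefinition": {
            "bucketDefinitions": [
                {"accountId": account_id, "buckets": list(bucket_names)}
            ]
        },
    }


request = build_deep_scan_request("111122223333", ["analytics-exports"])
# To submit (needs boto3 and credentials for the administrator account):
#   import boto3
#   boto3.client("macie2").create_classification_job(**request)
```

A recurring job would also need a scheduleFrequency block, which I’ve left out here to keep the sketch to the one-off case.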

Organisation deployment

From the management account, the security account is designated delegated administrator for Macie. From then on, every Macie setting (managed identifiers, custom data identifiers, allow lists, suppression rules, finding exports, automated discovery scope) lives in the security account and applies org-wide. Auto-enable enrols new member accounts on join. Findings in every member account are visible from the security account’s Macie console without cross-account IAM.
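The two calls behind that paragraph can be sketched as follows. The function takes two already-built boto3 macie2 clients (one with management-account credentials, one with security-account credentials); how those clients are created is left out, and the function name is my own. It assumes Macie is already enabled in the security account.

```python
def enable_macie_for_org(management_macie, admin_macie, admin_account_id):
    """Designate the delegated administrator, then turn on auto-enable.

    management_macie: macie2 client using management-account credentials.
    admin_macie:      macie2 client using the security account's credentials.
    """
    # Step 1, from the management account: name the delegated administrator.
    management_macie.enable_organization_admin_account(
        adminAccountId=admin_account_id
    )
    # Step 2, from the security account: enrol accounts that join later.
    admin_macie.update_organization_configuration(autoEnable=True)
```

As far as I know, auto-enable covers accounts that join the organisation afterwards; the thirty accounts already in the organisation would still need to be added as members once from the administrator account.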

Findings flow onward two ways. AWS Security Hub ingests Macie findings as ASFF alongside Inspector, GuardDuty, Config, and Access Analyzer. Amazon EventBridge receives an event for each new finding (and for subsequent occurrences of existing policy findings); a rule filtering detail.severity.description = "High" can page via PagerDuty, create a Jira ticket, or invoke a remediation Lambda.
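The severity filter translates into an EventBridge event pattern along these lines. The source and detail-type values are what I believe Macie events carry ("aws.macie" and "Macie Finding"); check them against a real event from your account before relying on the rule. The optional prefix exclusion uses EventBridge’s anything-but prefix matching, and the function name is hypothetical.

```python
import json


def macie_high_severity_pattern(exclude_bucket_prefix=None):
    """Build an EventBridge event pattern for High-severity Macie findings."""
    pattern = {
        "source": ["aws.macie"],           # assumed event source
        "detail-type": ["Macie Finding"],  # assumed detail-type
        "detail": {"severity": {"description": ["High"]}},
    }
    if exclude_bucket_prefix:
        # Skip buckets whose names start with the given prefix, e.g. "phi-".
        pattern["detail"]["resourcesAffected"] = {
            "s3Bucket": {
                "name": [{"anything-but": {"prefix": exclude_bucket_prefix}}]
            }
        }
    return pattern


print(json.dumps(macie_high_severity_pattern("phi-"), indent=2))
```

The printed JSON can be pasted into the rule’s event pattern field or passed as the EventPattern string when creating the rule programmatically.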

The heat map and the triage flow

Four thousand rows of findings is a spreadsheet nobody reads. Four thousand coloured tiles on a heat map is a picture the on-call can read in thirty seconds.

Each tile is one bucket. Colour encodes the sensitivity score, a 0-to-100 number Macie derives from the types and counts of sensitive data it’s found, weighted by managed-identifier severity (health and financial PII rank high; business data lower). Grouping is by account by default. Key readings:

Designated PHI buckets (phi-*) scoring hot is expected; the signal is change: a newly hot bucket outside the designated set is the alert.
Non-PHI buckets scoring anything non-zero is the finding that matters. A build-artefacts bucket trending orange means something wrote sensitive content into it.
Coverage gaps show as distinct tile states: not-yet-sampled (new), excluded (opt-out documented), errored (investigate).
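Those three readings reduce to a small decision function. The sketch below is illustrative only: BucketTile is a hypothetical record, not a Macie API type, and the “anything above zero on a non-designated bucket” rule is taken straight from the readings above, so tune the threshold to how scores behave in your estate.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BucketTile:
    name: str
    score: Optional[int] = None  # None means not yet sampled
    excluded: bool = False
    errored: bool = False


def read_tile(tile: BucketTile, designated_prefix: str = "phi-") -> str:
    """Classify one heat-map tile into a triage reading."""
    # Coverage gaps come first: a score on these tiles can't be trusted.
    if tile.errored:
        return "gap-errored"
    if tile.excluded:
        return "gap-excluded"
    if tile.score is None:
        return "gap-not-sampled"
    # Designated PHI stores are expected to score hot.
    if tile.name.startswith(designated_prefix):
        return "expected"
    # Anything sensitive outside the designated set is the alert.
    return "alert" if tile.score > 0 else "quiet"
```

For example, read_tile(BucketTile("build-artefacts", 40)) returns "alert", while the same score on phi-claims returns "expected".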

A sensitive data finding carries type (SensitiveData:S3Object/Personal, /Financial, /Credentials, /CustomIdentifier, /Multiple), severity.description (Low, Medium, High), the affected s3Object (bucket, key, size, KMS config), and classificationDetails naming the identifiers that matched with occurrence counts and line offsets.

The triage flow: a Macie finding fires on a non-designated bucket, say a CSV in analytics-exports with 1,200 matches of USA_SOCIAL_SECURITY_NUMBER and 1,200 of DATE_OF_BIRTH, severity High. EventBridge matches on source: aws.macie, detail.severity.description: High, and a bucket name not in the designated-PHI allow-list. A remediation Lambda tags the object macie:phi-detected=true, attaches a deny-all bucket policy scoped to non-privileged principals until a human reviews, raises a high-priority ticket, and pages the on-call. Security Hub holds the finding as ASFF for quarterly SOC 2 evidence. Engineering fixes the pipeline, moves or destroys the object, and closes the finding.
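The decision part of that remediation Lambda can be sketched as a pure function that turns an event into a list of planned actions, leaving the AWS calls to whatever executes the plan. The event field paths follow the Macie finding layout as I understand it (detail.resourcesAffected.s3Bucket.name and s3Object.key), and the "aws.macie" source is an assumption to verify against a real event. The action names are hypothetical.

```python
DESIGNATED_PREFIX = "phi-"  # designated PHI stores, per the naming convention


def plan_remediation(event: dict) -> list:
    """Return the actions to take for a Macie finding event, or [] to ignore it."""
    if event.get("source") != "aws.macie":  # assumed event source
        return []
    detail = event.get("detail", {})
    severity = detail.get("severity", {}).get("description")
    affected = detail.get("resourcesAffected", {})
    bucket = affected.get("s3Bucket", {}).get("name", "")
    key = affected.get("s3Object", {}).get("key", "")
    # Only High findings on non-designated buckets are surprising.
    if severity != "High" or not bucket or bucket.startswith(DESIGNATED_PREFIX):
        return []
    return [
        {"action": "tag_object", "bucket": bucket, "key": key,
         "tags": {"macie:phi-detected": "true"}},
        {"action": "restrict_bucket", "bucket": bucket},
        {"action": "open_ticket", "priority": "high",
         "summary": f"Unexpected sensitive data in s3://{bucket}/{key}"},
        {"action": "page_on_call", "bucket": bucket},
    ]
```

Keeping the plan separate from its execution makes the routing rules testable without touching S3, and gives the ticket a record of exactly what the automation intended to do.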

Cost shape at 4,000 buckets

The numbers the team can estimate today.

Bucket monitoring: 4,000 × $0.10 = $400/month.
Object monitoring for automated discovery: ~500M objects = 500,000,000 / 100,000 × $0.01 = $50/month.
Automated discovery (bytes inspected): sampling means Macie doesn’t inspect every byte. Conservative estimate at this scale is 2-5 TB/month inspected = $2,000-$5,000/month.
Scheduled jobs on the phi-* buckets (say 400 GB monthly) = $400/month.

That totals roughly $2,850 to $5,850 a month for thirty-account, four-thousand-bucket continuous classification. The bill is linear in the estate: doubling the buckets, objects, and bytes inspected doubles it. Exclusions (logging buckets, known-low-risk staging) bring it down without reducing coverage of what matters.
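The arithmetic above fits in a few lines, which makes it easy to rerun as the estate grows. The rates are the ones quoted in this post, 1 TB is treated as 1,000 GB as in the estimate above, and the function name is my own; check current AWS pricing for your region before budgeting on it.

```python
def estimate_monthly_cost(buckets, objects, automated_gb, job_gb):
    """Estimate a monthly Macie bill in USD from the rates quoted in this post."""
    bucket_fee = buckets * 0.10                    # $0.10 per bucket
    object_fee = objects / 100_000 * 0.01          # $0.01 per 100,000 objects
    inspection_fee = (automated_gb + job_gb) * 1.00  # $1 per GB inspected
    return round(bucket_fee + object_fee + inspection_fee, 2)


low = estimate_monthly_cost(4_000, 500_000_000, automated_gb=2_000, job_gb=400)
high = estimate_monthly_cost(4_000, 500_000_000, automated_gb=5_000, job_gb=400)
print(f"${low:,.0f} to ${high:,.0f} per month")
```

The bytes-inspected term dominates in both cases, so the lever that matters most is what automated discovery is allowed to sample, not the bucket count.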

What’s worth remembering

Amazon Macie is AWS’s managed data-classification service for S3: bucket inventory, bucket security controls, and sensitive-data content scanning in one service with cross-account roll-up via AWS Organizations.
Managed data identifiers cover SSN, DOB, credit card (with and without keyword proximity), health insurance identifiers (HICN, Medicare Beneficiary Identifier, NHS, Canadian and French health numbers), medical device UDIs, passports, driving licences, AWS secrets.
Custom data identifiers extend detection with regex plus keyword proximity plus character-count constraints: the mechanism for MRN formats, payer member numbers, proprietary patient IDs.
Allow lists suppress known-safe matches (test fixtures, public reference data); suppression rules suppress findings for documented operational reasons.
Automated sensitive data discovery is continuous and sampling-based, producing sensitivity scores and an interactive heat map for estate-scale triage.
Sensitive data discovery jobs are targeted and configurable from sampling to exhaustive, scheduled one-off or recurring: the deep-scan tool once the heat map names a target.
Delegated administrator via AWS Organizations is the deployment pattern: one enablement in the security account, auto-enable for new members, findings visible org-wide from one console.
Findings flow to Security Hub as ASFF and to EventBridge as events. Security Hub for posture across services, EventBridge for severity-based routing.
Pricing has three dimensions: buckets ($0.10/bucket/month), objects monitored for automated discovery ($0.01/100k objects/month), and bytes inspected ($1/GB). Linear in the estate.
Macie is S3-only: it does not scan EC2, RDS, DynamoDB, or file systems. For EC2/Lambda/ECR vulnerability scanning the service is Inspector; for threat detection across CloudTrail and VPC flow logs it’s GuardDuty.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.