Decision & comparison Ho Bae

PHI Masking vs Synthetic Data: Which Does Your Healthcare AI Actually Need?

PHI masking means altering or removing the identifiers inside protected health information, such as names, dates, and record numbers, so the data can no longer be readily traced to a specific patient.

By the CUBIG Research team, CUBIG Corp. · Updated June 2026.

That is the definition most teams arrive with. It is also where the trouble starts. Strip the obvious identifiers and you would expect the patient to vanish from the dataset. The research disagrees. A 2019 study in Nature Communications estimated that 99.98% of Americans could be correctly re-identified in any dataset using just 15 demographic attributes. Two decades earlier, Latanya Sweeney had already shown that 87% of the U.S. population was likely unique on nothing more than five-digit ZIP code, gender, and date of birth. Masking removes the name. It does not remove the person.

So the question in the title is the right one to ask before you spend a quarter building a pipeline. If you came in certain you needed PHI masking, it is worth knowing what masking actually buys you, where it runs out, and whether synthetic data solves a different problem than the one you have.

What PHI masking actually does (and where it stops)

In U.S. healthcare, “masking” usually means one of the two de-identification methods written into the HIPAA Privacy Rule. The governing regulation, 45 CFR 164.514, sets out exactly two ways to satisfy the standard, and the HHS Office for Civil Rights names them plainly: Expert Determination and Safe Harbor.

Safe Harbor. You remove 18 specified identifiers, labeled (A) through (R) in the rule: names, all geographic units smaller than a state, every date element finer than a year, phone and fax numbers, email, Social Security and medical record numbers, account and license numbers, device and vehicle IDs, URLs and IP addresses, biometrics, full-face photos, and any other unique code. Clear the list and the data is treated as de-identified.

Expert Determination. A qualified statistician analyzes the dataset and certifies that the risk of re-identification is “very small.” Read that threshold again. The regulation does not say zero. It says very small, “alone or in combination with other reasonably available information.” The law itself assumes someone might still link a record back to a person; it just asks that the odds be remote.
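To make the Safe Harbor idea concrete, here is a minimal sketch of field-level masking in Python. Everything in it is an assumption for illustration: the field names (`name`, `mrn`, `zip`, and so on), the `_date` suffix convention, and the ISO date format are hypothetical, and the function covers only a handful of the 18 identifier categories. It is not a compliant implementation of the rule.

```python
# Hypothetical field names; a real schema will differ and needs all 18 categories.
DIRECT_IDENTIFIERS = {"name", "phone", "email", "ssn", "mrn", "ip_address"}


def safe_harbor_sketch(record: dict) -> dict:
    """Drop direct identifiers, coarsen dates to the year, and remove ZIP codes.

    Illustrative only: assumes dates are ISO strings ("YYYY-MM-DD") stored in
    fields whose names end in "_date".
    """
    masked = {}
    for field, value in record.items():
        if field in DIRECT_IDENTIFIERS:
            continue  # remove the identifier outright
        if field == "zip":
            continue  # geography below the state level is removed
        if field.endswith("_date") and value:
            # keep only the year, and rename the field to say so
            masked[field[: -len("_date")] + "_year"] = str(value)[:4]
        else:
            masked[field] = value
    return masked


example = {
    "name": "Jane Doe",
    "mrn": "12345",
    "zip": "02139",
    "admit_date": "2024-03-15",
    "diagnosis": "J45",
}
print(safe_harbor_sketch(example))
```

Note what survives: the diagnosis and the admission year still describe a real encounter by a real patient, which is the point the next paragraphs turn on.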

This is the part most teams miss. Masked data is real patient data with the labels filed off. The records still describe actual people, which is the whole reason the data is useful and also the reason the link never fully disappears. NIST puts it cleanly in NISTIR 8053: “As long as any utility remains in the data derived from personal information, there also exists the possibility, however remote, that some information might be linked back to the original individuals.” The same report calls the goals of de-identification and data utility “antagonistic.” Push harder on privacy and you lose analytic value; keep the value and you keep some residual risk.

None of this means masking is broken. Done to a real standard, it works well. A systematic review of re-identification attacks on health data by El Emam and colleagues found that across studies an average of 26% of records were re-identified, and yet for the one attack on data de-identified to an established standard, the rate fell to 0.013%. Standards-based de-identification dramatically lowers exposure. What it cannot do is change the nature of the asset: a masked record is still a token that, under the wrong conditions, points at someone.
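One rough way to see how much of that link remains in a given table is to count the records that are unique on a set of quasi-identifiers. The sketch below is an illustration under stated assumptions: the column names and the toy rows are invented, and a real Expert Determination involves far more than this single statistic.

```python
from collections import Counter


def unique_fraction(records, quasi_identifiers):
    """Share of records whose quasi-identifier combination appears exactly once."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    if not keys:
        return 0.0
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(keys)


# Toy rows, invented for illustration.
rows = [
    {"zip": "02139", "gender": "F", "dob": "1980-01-02"},
    {"zip": "02139", "gender": "F", "dob": "1980-01-02"},
    {"zip": "02139", "gender": "M", "dob": "1975-06-30"},
    {"zip": "10001", "gender": "F", "dob": "1990-11-11"},
]

print(unique_fraction(rows, ["zip", "gender", "dob"]))  # 0.5
print(unique_fraction(rows, ["zip"]))                   # 0.25
```

Adding attributes to the key can only split groups further, so the unique share never goes down as more columns are combined. That is the mechanism behind the re-identification figures quoted earlier.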

What synthetic data is, and why it answers a different question

Synthetic data is data generated by a model to reproduce the statistical structure of a real dataset without copying any real record. NIST catalogs the technique in its CSRC glossary and in SP 800-188. The distinction that matters for healthcare is not “fake versus real.” It is the mapping. A masked record corresponds one-to-one to a patient who walked into a clinic. A well-made synthetic record corresponds to no one. It carries the distributions, correlations, and edge cases of the source population, but there is no individual at the other end of the row to re-identify.

That shift is what makes synthetic data a structural answer rather than a stronger lock. You are not reducing the chance that a real person is exposed; you are generating records where no specific real person is present to begin with. When the generation is governed by differential privacy, the guarantee becomes mathematical. NIST defines it in SP 800-226 (2025) as “a mathematical framework that quantifies privacy loss to entities when their data appears in a dataset,” building on the formal definition introduced by Dwork, McSherry, Nissim, and Smith in 2006. Instead of arguing about whether 18 fields were enough, you can bound, with a tunable parameter, how much any single individual could have influenced the output.
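The "tunable parameter" is usually written as epsilon. The sketch below shows the simplest textbook case, the Laplace mechanism applied to a single count, to illustrate how epsilon trades privacy for accuracy. It is not a synthetic data generator and not a description of any particular product; the function name `dp_count` and the sample values are assumptions for illustration.

```python
import random


def dp_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count with Laplace noise of scale sensitivity / epsilon.

    A counting query changes by at most 1 when one person is added or
    removed, so the sensitivity is 1.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    scale = 1.0 / epsilon
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


rng = random.Random(42)  # seeded only so the demo is repeatable
for eps in (0.1, 1.0, 10.0):
    print(eps, round(dp_count(1000, eps, rng), 2))
```

A smaller epsilon means more noise and a tighter bound on what any one individual can change in the output; a larger epsilon means the reverse. Differentially private data generation applies the same accounting to the training of the generative model rather than to a single query.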

The fair objection is utility. If the records are invented, do the analyses still hold? In a peer-reviewed comparison published in JMIR Medical Informatics, synthetic patient data reproduced the results of five real observational studies; across 1,000 synthetic iterations the estimate biases stayed small, on the order of −1.3% to 1.9%, and within the 95% confidence limits of the real-data results. Synthetic data is not automatically faithful. Built and validated properly, it can be faithful enough to stand in for the original.
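A validation of that kind can be reduced to two checks: how far the synthetic estimate drifts from the real one, and whether it stays inside the real-data confidence interval. The sketch below uses invented numbers and hypothetical function names; it shows the shape of the check, not the method of the cited study.

```python
def relative_bias(real_estimate: float, synthetic_estimates: list[float]) -> float:
    """Mean synthetic estimate minus the real estimate, as a fraction of the real one."""
    if real_estimate == 0 or not synthetic_estimates:
        raise ValueError("need a non-zero real estimate and at least one synthetic run")
    mean_synthetic = sum(synthetic_estimates) / len(synthetic_estimates)
    return (mean_synthetic - real_estimate) / real_estimate


def within_ci(synthetic_estimates: list[float], ci_low: float, ci_high: float) -> bool:
    """True if the mean synthetic estimate falls inside the real-data interval."""
    mean_synthetic = sum(synthetic_estimates) / len(synthetic_estimates)
    return ci_low <= mean_synthetic <= ci_high


# Invented values: a real-data effect estimate with its 95% CI,
# and the same estimate recomputed on three synthetic datasets.
real, ci = 2.0, (1.8, 2.2)
synthetic_runs = [1.9, 2.1, 2.06]

print(f"relative bias: {relative_bias(real, synthetic_runs):+.1%}")
print("inside real-data CI:", within_ci(synthetic_runs, *ci))
```

Running the check per analysis, rather than once per dataset, matters: a synthetic set can be faithful for one estimate and off for another.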

PHI masking vs synthetic data: a side-by-side

| Dimension | PHI masking / de-identification | Synthetic data |
| --- | --- | --- |
| What it produces | Real patient records with identifiers removed or altered | Model-generated records with no one-to-one link to a real patient |
| Residual re-identification risk | “Very small,” never zero; rises as auxiliary data grows (45 CFR 164.514; NISTIR 8053) | No source individual in the record; bounded mathematically under differential privacy |
| Regulatory footing | Mature: explicit HIPAA Safe Harbor and Expert Determination methods | Emerging but real: used by the U.S. Census Bureau, NHS England, and accepted in FDA real-world-evidence guidance |
| Best for | Audits, billing, operational reporting where the actual individuals must be preserved | Model training, sharing across boundaries, augmenting rare cases, dev/test environments |
| Main limitation | Utility falls as privacy rises; link to real people persists | Quality depends entirely on generation and validation; can miss what it was not built to preserve |

So do you actually need synthetic data?

Sometimes masking is the correct and sufficient choice. If your use case requires the real individuals to remain in the data (a clinical audit, a billing reconciliation, a regulatory report tied to actual patients), then you are not looking for synthetic records at all. You need de-identification done to a defensible standard, with the method you relied on, Safe Harbor or Expert Determination, documented.

Synthetic data earns its place when the value is in the patterns, not the people. Four signals tend to point that way:

  • You are training or testing models and need volume, including rare events that masked data is too thin to cover.
  • The data has to cross a boundary (to a vendor, a research partner, or another cloud region), and moving real records is the bottleneck.
  • You want developers building before they ever touch live PHI. NHS England released SynAE, a synthetic Accident & Emergency dataset, for exactly this: let teams build and test against realistic data first.
  • Your re-identification exposure is already keeping a deal or a launch stuck, and “very small risk” is not clearing legal review.

The regulators have started to move with this logic. The U.S. Census Bureau protected the 2020 Census with a differentially private system, its TopDown Algorithm, the largest deployment of differential privacy in a national statistical product. In December 2025 the FDA removed a barrier to using de-identified real-world data from electronic health records, claims, and registries in certain device submissions. And a 2025 comment in The Lancet Digital Health argued for accelerating synthetic-data privacy frameworks for medical research. The direction of travel is clear, even where the standards are still forming.

The question underneath the question: what state does your data need to be in?

Here is the reframe that saves teams a wasted quarter. Masking and synthesis are both techniques. Neither is the goal. The goal is data your AI workflow can actually run on, share, and answer for later. Picking “masking” or “synthetic data” up front is choosing a tool before you have named the job.

The job, in regulated work, has a second half that both techniques tend to ignore. Suppose you generate a synthetic cohort, train a model, and ship it. Six months later an auditor asks which dataset produced which result, and whether you can regenerate it. Can you point to the exact release, the parameters, the lineage back to the source? A synthetic file with no record of how it was made is not audit-ready, however private it is. Privacy without traceability is half an answer.
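What "a record of how it was made" might look like can be sketched in a few lines. The example below is a generic illustration, not a description of any product's release format: the manifest fields, the `build_manifest` name, and the parameter names (`epsilon`, `seed`, `generator`) are all assumptions. It hashes the source and the synthetic output, stores the generation parameters, and derives a release ID from the whole manifest.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    """Hex digest used to fingerprint a dataset or a manifest."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(source: bytes, synthetic: bytes, params: dict) -> dict:
    """Tie a synthetic output to its source fingerprint and generation parameters."""
    manifest = {
        "source_sha256": sha256_hex(source),
        "synthetic_sha256": sha256_hex(synthetic),
        "params": params,
    }
    # Canonical JSON (sorted keys, fixed separators) so identical inputs
    # always yield the same release ID.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    manifest["release_id"] = sha256_hex(canonical.encode("utf-8"))[:16]
    return manifest


# Hypothetical inputs: in practice these would be the dataset files' bytes.
manifest = build_manifest(
    source=b"patient_id,age\n1,54\n",
    synthetic=b"row,age\n1,53\n",
    params={"generator": "example-model", "epsilon": 1.0, "seed": 7},
)
print(json.dumps(manifest, indent=2))
```

With a manifest like this stored next to every model run, the auditor's question has an answer: the release ID names the dataset, the parameters say how to regenerate it, and the source hash shows what it was derived from without exposing the source itself.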

This is the layer CUBIG works on. AI-ready data is not just data that is safe to use; it is data that is usable, has its structure and context intact, and stays reproducible and traceable run to run. DTS, the AI-ready data transformation engine inside Syntitan, turns locked or restricted health data into that state. It preserves the statistical structure, patterns, and correlations of the source, uses differential privacy to control re-identification mathematically, and calibrates the output so models trained on it perform close to models trained on the original. The output can be synthetic; the point is the state, not the label. Syntitan then binds each result to a release you can diff and reproduce, so the privacy decision and the audit trail live in the same place.

Put differently: synthetic data may well be the right technique for your project. But “do I need synthetic data?” is the smaller question. “Can I get my restricted data into an AI-ready state I can run on, share, and reproduce on demand?” is the one that decides whether the project ships.

A 60-second self-diagnosis

Run your use case through these. The more of your answers point toward patterns, boundary crossings, blocked reviews, and missing volume, the stronger the fit for synthetic data, or a transformation that produces it.

  • Do you need the actual individuals preserved (audit, billing), or only the patterns (training, analytics)?
  • Does the data have to leave a trusted environment to be useful?
  • Is “very small re-identification risk” still failing legal or partner review?
  • Do you need rare cases or volume the real data cannot supply?
  • Will someone later ask you to reproduce exactly which data produced which result?

If the last box is checked, no masking or synthesis technique alone is enough. You need the data in a state that is private and reproducible at once.

Bring one restricted dataset and let Syntitan transform it into an AI-ready state: structure preserved, re-identification controlled with differential privacy, and every result bound to a release you can reproduce and audit later. Want proof before you commit? Run a sample and check the utility against your own benchmark before any original data leaves your environment.

FAQ

Is PHI masking the same as de-identification?

In practice, "PHI masking" usually refers to de-identification under the HIPAA Privacy Rule, achieved through either Safe Harbor (removing 18 identifiers) or Expert Determination (a statistician certifies very small re-identification risk), per 45 CFR 164.514. Masking is the everyday word; de-identification is the regulatory standard behind it.

Is synthetic data HIPAA compliant?

Synthetic data that contains no real patient records is generally outside the scope of PHI, because there is no individual to identify. Compliance still depends on how it is generated: if the model can memorize and leak real records, the privacy claim weakens. Generation under differential privacy gives a measurable bound, which is why standards bodies like NIST treat it as the rigorous path.

Does synthetic data keep the analysis accurate?

It can. A peer-reviewed comparison in JMIR Medical Informatics found synthetic patient data reproduced the findings of five real observational studies, with estimate biases within the real-data 95% confidence limits. Accuracy is not automatic; it depends on the generation method and on validating the synthetic set against the real one for your specific use case.

If masking is "very small" risk, why not just use it?

For many operational uses, you should. Masking becomes the weaker choice when the residual link to real people blocks data sharing, fails legal review, or limits model training, and when you need volume or rare cases the real data cannot provide. Those are the conditions where synthetic data, or a transformation that produces it, pulls ahead.