PHI masking means altering or removing the identifiers inside protected health information, such as names, dates, and record numbers, so the data can no longer be readily traced to a specific patient.
By the CUBIG Research team, CUBIG Corp. · Updated June 2026.
That is the definition most teams arrive with. It is also where the trouble starts. Strip the obvious identifiers and you would expect the patient to vanish from the dataset. The research disagrees. A 2019 study in Nature Communications estimated that 99.98% of Americans could be correctly re-identified in any dataset using just 15 demographic attributes. Two decades earlier, Latanya Sweeney had already shown that 87% of the U.S. population was likely unique on nothing more than five-digit ZIP code, gender, and date of birth. Masking removes the name. It does not remove the person.
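One way to see the effect on your own extract is to count how many rows are unique on a handful of leftover attributes. The sketch below is a minimal illustration, not the method used in either study; the field names (`zip`, `sex`, `dob`) and the toy rows are hypothetical.

```python
from collections import Counter

def unique_fraction(records, quasi_identifiers):
    """Share of records whose quasi-identifier combination appears exactly once."""
    if not records:
        return 0.0
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(records)

# Hypothetical rows with names already removed.
rows = [
    {"zip": "02139", "sex": "F", "dob": "1961-07-31"},
    {"zip": "02139", "sex": "F", "dob": "1961-07-31"},
    {"zip": "02139", "sex": "M", "dob": "1984-02-11"},
    {"zip": "94110", "sex": "F", "dob": "1990-12-05"},
]

print(unique_fraction(rows, ["zip", "sex", "dob"]))  # 0.5
print(unique_fraction(rows, ["sex"]))                # 0.25
```

Every row counted as unique is a row that an outside dataset holding the same attributes could, in principle, be joined against.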
So the question in the title is the right one to ask before you spend a quarter building a pipeline. If you came in certain you needed PHI masking, it is worth knowing what masking actually buys you, where it runs out, and whether synthetic data solves a different problem than the one you have.
What PHI masking actually does (and where it stops)
In U.S. healthcare, “masking” usually means one of the two de-identification methods written into the HIPAA Privacy Rule. The governing regulation, 45 CFR 164.514, sets out exactly two ways to satisfy the standard, and the HHS Office for Civil Rights names them plainly: Expert Determination and Safe Harbor.
Safe Harbor. You remove 18 specified identifiers, labeled (A) through (R) in the rule: names, all geographic units smaller than a state, every date element finer than a year, phone and fax numbers, email, Social Security and medical record numbers, account and license numbers, device and vehicle IDs, URLs and IP addresses, biometrics, full-face photos, and any other unique code. Clear the list and the data is treated as de-identified.
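In code, the Safe Harbor idea looks roughly like the sketch below. It is an illustration under stated assumptions, not a compliance implementation: the field names are hypothetical, the sets cover only part of the 18 categories, and the rule's finer points are ignored.

```python
from datetime import date

# Hypothetical field names; a real schema will differ and needs a full review
# against all 18 identifier categories.
DIRECT_IDENTIFIERS = {"name", "phone", "fax", "email", "ssn", "mrn", "ip_address"}
SUB_STATE_GEOGRAPHY = {"street", "city", "county", "zip"}
DATE_FIELDS = {"birth_date", "admit_date", "discharge_date"}

def mask_record(record):
    """Drop listed identifiers and reduce listed dates to the year only."""
    masked = {}
    for field, value in record.items():
        if field in DIRECT_IDENTIFIERS or field in SUB_STATE_GEOGRAPHY:
            continue  # removed outright
        if field in DATE_FIELDS:
            masked[field.replace("_date", "_year")] = value.year
        else:
            masked[field] = value
    return masked

record = {
    "name": "Jane Doe",
    "zip": "02139",
    "state": "MA",
    "admit_date": date(2024, 3, 14),
    "diagnosis": "E11.9",
}
print(mask_record(record))
# {'state': 'MA', 'admit_year': 2024, 'diagnosis': 'E11.9'}
```

Note what survives: the state, the year, and the diagnosis still describe one real encounter by one real person.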
Expert Determination. A person with appropriate statistical and scientific expertise analyzes the dataset, determines that the risk of re-identification is “very small,” and documents the methods and results behind that conclusion. Read that threshold again. The regulation does not say zero. It says very small, whether the information is used “alone or in combination with other reasonably available information.” The law itself assumes someone might still link a record back to a person; it just asks that the odds be remote.
This is the part most teams miss. Masked data is real patient data with the labels filed off. The records still describe actual people, which is the whole reason the data is useful and also the reason the link never fully disappears. NIST puts it cleanly in NISTIR 8053: “As long as any utility remains in the data derived from personal information, there also exists the possibility, however remote, that some information might be linked back to the original individuals.” The same report calls the goals of de-identification and data utility “antagonistic.” Push harder on privacy and you lose analytic value; keep the value and you keep some residual risk.
None of this means masking is broken. Done to a real standard, it works well. A systematic review of re-identification attacks on health data by El Emam and colleagues found that across studies an average of 26% of records were re-identified, and yet for the one attack on data de-identified to an established standard, the rate fell to 0.013%. Standards-based de-identification dramatically lowers exposure. What it cannot do is change the nature of the asset: a masked record is still a token that, under the wrong conditions, points at someone.
What synthetic data is, and why it answers a different question
Synthetic data is data generated by a model to reproduce the statistical structure of a real dataset without copying any real record. NIST catalogs the technique in its CSRC glossary and in SP 800-188. The distinction that matters for healthcare is not “fake versus real.” It is the mapping. A masked record corresponds one-to-one to a patient who walked into a clinic. A well-made synthetic record corresponds to no one. It carries the distributions, correlations, and edge cases of the source population, but there is no individual at the other end of the row to re-identify.
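To make the mapping point concrete, here is a deliberately tiny generator: fit a simple model to two columns, then draw new rows from the model instead of editing the originals. It is a sketch only. The bivariate Gaussian, the column choice, and the toy values are assumptions for illustration, and a model this simple carries no formal privacy guarantee of its own.

```python
import math
import random
import statistics

def fit_model(xs, ys):
    """Summarize two numeric columns by their means, spreads, and correlation."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return {"mean_x": mean_x, "mean_y": mean_y,
            "sd_x": sd_x, "sd_y": sd_y, "rho": cov / (sd_x * sd_y)}

def sample_rows(model, n, rng):
    """Draw n new (x, y) rows from the fitted model rather than from source rows."""
    tail = math.sqrt(max(0.0, 1.0 - model["rho"] ** 2))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        x = model["mean_x"] + model["sd_x"] * z1
        # Mixing z1 into y reproduces the fitted correlation between columns.
        y = model["mean_y"] + model["sd_y"] * (model["rho"] * z1 + tail * z2)
        rows.append((x, y))
    return rows

# Hypothetical source columns: age and systolic blood pressure.
ages = [34, 45, 52, 61, 67, 73, 48, 58]
systolic = [118, 124, 130, 138, 142, 150, 126, 135]

model = fit_model(ages, systolic)
synthetic = sample_rows(model, 1000, random.Random(7))
```

The synthetic rows follow the same means and correlation as the source, yet no row is a transformed copy of a particular patient's row. Production generators are far richer than this, but the structural point is the same.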
That shift is what makes synthetic data a structural answer rather than a stronger lock. You are not reducing the chance that a real person is exposed; you are generating records where no specific real person is present to begin with. When the generation is governed by differential privacy, the guarantee becomes mathematical. NIST defines it in SP 800-226 (2025) as “a mathematical framework that quantifies privacy loss to entities when their data appears in a dataset,” building on the formal definition introduced by Dwork, McSherry, Nissim, and Smith in 2006. Instead of arguing about whether 18 fields were enough, you can bound, with a tunable parameter, how much any single individual could have influenced the output.
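The tunable parameter is usually written as epsilon. The sketch below shows the simplest building block, a count released with Laplace noise, to make the knob visible. It is not a synthetic data generator and not any particular product's implementation; the function name, the toy records, and the epsilon values are assumptions for illustration.

```python
import random

def dp_count(records, predicate, epsilon, rng=None):
    """Release a count with Laplace noise scaled to sensitivity / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for r in records if predicate(r))
    # Adding or removing one person changes a count by at most 1,
    # so the sensitivity is 1 and the noise scale is 1 / epsilon.
    rate = epsilon
    # The difference of two exponential draws is Laplace-distributed.
    noise = rng.expovariate(rate) - rng.expovariate(rate)
    return true_count + noise

# Hypothetical records.
patients = [{"diabetic": True}, {"diabetic": False}, {"diabetic": True},
            {"diabetic": True}, {"diabetic": False}]

strict = dp_count(patients, lambda p: p["diabetic"], epsilon=0.1)
loose = dp_count(patients, lambda p: p["diabetic"], epsilon=5.0)
```

A smaller epsilon means more noise and a tighter bound on what any one person's presence can change; a larger epsilon means the reverse. Differentially private generators apply the same accounting to the statistics or model they learn from the source data.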
The fair objection is utility. If the records are invented, do the analyses still hold? In a peer-reviewed comparison published in JMIR Medical Informatics, synthetic patient data reproduced the results of five real observational studies; across 1,000 synthetic iterations the estimate biases stayed small, on the order of −1.3% to 1.9%, and within the 95% confidence limits of the real-data results. Synthetic data is not automatically faithful. Built and validated properly, it can be faithful enough to stand in for the original.
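If you want to run the same kind of check on your own data, the arithmetic is short. The sketch below is one plausible way to summarize it, not the study's procedure; the function name and every number in the example are hypothetical.

```python
import statistics

def summarize_replication(real_estimate, real_ci, synthetic_estimates):
    """Compare repeated synthetic-data estimates against one real-data result."""
    low, high = real_ci
    mean_synthetic = statistics.fmean(synthetic_estimates)
    bias_pct = 100.0 * (mean_synthetic - real_estimate) / real_estimate
    inside = sum(1 for s in synthetic_estimates if low <= s <= high)
    return {
        "relative_bias_pct": bias_pct,
        "share_within_ci": inside / len(synthetic_estimates),
    }

# Hypothetical effect estimate from real data, its 95% interval,
# and the same estimate recomputed on four synthetic datasets.
summary = summarize_replication(
    real_estimate=2.0,
    real_ci=(1.8, 2.2),
    synthetic_estimates=[1.98, 2.02, 2.03, 2.25],
)
```

A small relative bias with most synthetic estimates falling inside the real-data interval is the pattern to look for. A large bias or a low share is the signal that the generator missed something the analysis depends on.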
PHI masking vs synthetic data: a side-by-side

| Dimension | PHI masking / de-identification | Synthetic data |
|---|---|---|
| What it produces | Real patient records with identifiers removed or altered | Model-generated records with no one-to-one link to a real patient |
| Residual re-identification risk | “Very small,” never zero; rises as auxiliary data grows (45 CFR 164.514; NISTIR 8053) | No source individual in the record; bounded mathematically under differential privacy |
| Regulatory footing | Mature: explicit HIPAA Safe Harbor and Expert Determination methods | Emerging but real: used by the U.S. Census Bureau, NHS England, and accepted in FDA real-world-evidence guidance |
| Best for | Audits, billing, operational reporting where the actual individuals must be preserved | Model training, sharing across boundaries, augmenting rare cases, dev/test environments |
| Main limitation | Utility falls as privacy rises; link to real people persists | Quality depends entirely on generation and validation; can miss what it was not built to preserve |
So do you actually need synthetic data?

Sometimes masking is the correct and sufficient choice. If your use case requires the real individuals to remain in the data (a clinical audit, a billing reconciliation, a regulatory report tied to actual patients), then you are not looking for synthetic records at all. You need de-identification done to a defensible standard, with the method you relied on, Safe Harbor or Expert Determination, documented.
Synthetic data earns its place when the value is in the patterns, not the people. Four signals tend to point that way:
- You are training or testing models and need volume, including rare events that masked data is too thin to cover.
- The data has to cross a boundary (to a vendor, a research partner, or another cloud region), and moving real records is the bottleneck.
- You want developers building before they ever touch live PHI. NHS England released SynAE, a synthetic Accident & Emergency dataset, for exactly this: let teams build and test against realistic data first.
- Your re-identification exposure is already keeping a deal or a launch stuck, and “very small risk” is not clearing legal review.
The regulators have started to move with this logic. The U.S. Census Bureau protected the 2020 Census with a differentially private system, its TopDown Algorithm, the largest deployment of differential privacy in a national statistical product. In December 2025 the FDA removed a barrier to using de-identified real-world data from electronic health records, claims, and registries in certain device submissions. And a 2025 comment in The Lancet Digital Health argued for accelerating synthetic-data privacy frameworks for medical research. The direction of travel is clear, even where the standards are still forming.
The question underneath the question: what state does your data need to be in?

Here is the reframe that saves teams a wasted quarter. Masking and synthesis are both techniques. Neither is the goal. The goal is data your AI workflow can actually run on, share, and answer for later. Picking “masking” or “synthetic data” up front is choosing a tool before you have named the job.
The job, in regulated work, has a second half that both techniques tend to ignore. Suppose you generate a synthetic cohort, train a model, and ship it. Six months later an auditor asks which dataset produced which result, and whether you can regenerate it. Can you point to the exact release, the parameters, the lineage back to the source? A synthetic file with no record of how it was made is not audit-ready, however private it is. Privacy without traceability is half an answer.
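What "traceable" means in practice can be sketched in a few lines: record what went in, which parameters were used, and a hash of what came out. This is a generic illustration, not a description of any specific product's release format; the function names, field names, and parameter values are all hypothetical.

```python
import hashlib
import json

def build_release_manifest(output_bytes, source_fingerprint, params):
    """Record the inputs, settings, and output hash for one generated dataset."""
    return {
        "source_fingerprint": source_fingerprint,
        "params": params,
        "output_sha256": hashlib.sha256(output_bytes).hexdigest(),
    }

def release_id(manifest):
    """Derive a stable identifier from the manifest's canonical JSON form."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

def matches_release(output_bytes, manifest):
    """Check whether a file is the exact output this manifest describes."""
    return hashlib.sha256(output_bytes).hexdigest() == manifest["output_sha256"]

# Hypothetical generated file and settings.
synthetic_csv = b"age,systolic\n52,131\n47,126\n"
manifest = build_release_manifest(
    synthetic_csv,
    source_fingerprint="sha256-of-source-extract",
    params={"generator": "example-v1", "epsilon": 1.0, "seed": 7},
)
print(release_id(manifest), matches_release(synthetic_csv, manifest))
```

With a record like this stored next to each model run, the auditor's question has a concrete answer: this identifier, these parameters, this file.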
This is the layer CUBIG works on. AI-ready data is not just data that is safe to use; it is data that is usable, has its structure and context intact, and stays reproducible and traceable run to run. DTS, the AI-ready data transformation engine inside Syntitan, turns locked or restricted health data into that state. It preserves the statistical structure, patterns, and correlations of the source, uses differential privacy to control re-identification mathematically, and calibrates the output so models trained on it perform close to models trained on the original. The output can be synthetic; the point is the state, not the label. Syntitan then binds each result to a release you can diff and reproduce, so the privacy decision and the audit trail live in the same place.
Put differently: synthetic data may well be the right technique for your project. But “do I need synthetic data?” is the smaller question. “Can I get my restricted data into an AI-ready state I can run on, share, and reproduce on demand?” is the one that decides whether the project ships.
A 60-second self-diagnosis
Run your use case through these questions. The more of your answers point to patterns, movement across boundaries, and scale, the stronger the fit for synthetic data, or a transformation that produces it.
- Do you need the actual individuals preserved (audit, billing), or only the patterns (training, analytics)?
- Does the data have to leave a trusted environment to be useful?
- Is “very small re-identification risk” still failing legal or partner review?
- Do you need rare cases or volume the real data cannot supply?
- Will someone later ask you to reproduce exactly which data produced which result?
If the answer to the last question is yes, no masking or synthesis technique alone is enough. You need the data in a state that is private and reproducible at once.
Bring one restricted dataset and let Syntitan transform it into an AI-ready state: structure preserved, re-identification controlled with differential privacy, and every result bound to a release you can reproduce and audit later. Want proof before you commit? Run a sample and check the utility against your own benchmark before any original data leaves your environment.
