# QA Report

## Schema and Domain Validation

- Overall: PASS
- Schema: PASS
- Domain: PASS

### schema

- missing_columns: []
- type_mismatches: {}

### domain

- violations: {}

### summary

- schema_ok: True
- domain_ok: True

## Label Prevalence Checks

- Diagnosis (distribution):
  - Healthy: 0.2482
  - Common Cold: 0.121
  - Hypertension: 0.1004
  - Diabetes: 0.0872
  - Influenza: 0.0802
  - Depression: 0.0632
  - Heart Disease: 0.0612
  - Asthma: 0.0508
  - Pneumonia: 0.0448
  - Cancer: 0.0438
  - Migraine: 0.0316
  - Gastroenteritis: 0.0268
  - Arthritis: 0.0186
  - Bronchitis: 0.0112
  - Anxiety: 0.011
- Outcome (distribution):
  - Recovered: 0.5716
  - Ongoing Treatment: 0.4076
  - Deteriorated: 0.017
  - Deceased: 0.0038

## Data Quality Summary

- Overall completeness: 1.0
- Top categorical fields: Patient_ID, Gender, Race, Ethnicity, Marital_Status
- Numeric correlations present: yes

## PII Scan Summary

- Regex pattern totals: {'name': 6814, 'email': 0, 'phone': 0, 'address': 0, 'ssn': 1150}

## Artifacts

- [stats-summary.json](stats-summary.json)
- [schema-validation.json](schema-validation.json)
- [label-prevalence.json](label-prevalence.json)
- [generation_summary.json](generation_summary.json)
- [run-meta.json](run-meta.json)
- [reidentification-risk.md](reidentification-risk.md)
- [datacard.md](datacard.md)
- [pii-scan-report.json](pii-scan-report.json)
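The label prevalence values above read as class shares rounded to four decimals; for example, the data card's count of 1,241 Healthy records out of 5,000 gives 0.2482. A minimal sketch of that computation follows, assuming the labels are available as a plain list of strings. The function name `label_prevalence` and the toy labels are hypothetical and are not taken from the generation pipeline.

```python
from collections import Counter


def label_prevalence(labels, ndigits=4):
    """Return {label: share of records}, ordered by descending share."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        label: round(count / total, ndigits)
        for label, count in counts.most_common()
    }


# Toy input for illustration only (not the real dataset).
toy_outcomes = ["Recovered"] * 6 + ["Ongoing Treatment"] * 3 + ["Deteriorated"]
prevalence = label_prevalence(toy_outcomes)
print(prevalence)
```

Rounding to four decimals also explains why some shares print with fewer digits (0.121 rather than 0.1210).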
Research & Validation
Methods, validation, bias, leakage proxies, governance, and limitations.
[Figures: Age Distribution; Clinical Correlations; Visit Type Proportions; Demographics; PII Scan Counts; Leakage Proxy]
Re-identification Risk
# Re-identification Risk Assessment

This dataset is fully synthetic. The following checks are provided as a conservative governance measure.

## Quantitative Proxies

- Uniqueness proxy (lower is safer): 0.9912
- k-anonymity proxy (average equivalence class size): 1.01
- Quasi-identifiers considered: Age, Gender, Visit_Date, Provider_Type
- Total equivalence classes: 4956
- Minimum group size: 1
- Maximum group size: 2

## Quasi-Identifier Combination Analysis

High-risk combinations (uniqueness > 10%):

- Age_Gender: 12.4% unique groups (29 total)
- Age_Gender_Provider_Type: 18.5% unique groups (135 total)
- Age_Gender_Insurance_Type: 21.4% unique groups (179 total)

## Linkage Attack Resistance

- Institution concentration: Top institution represents 0.0% of records
- Rare age-gender combinations: 25.2% of combinations are rare (≤3 records)

## PII Redaction Analysis

- No text fields available for PII analysis

## Qualitative Assessment

- No direct identifiers are generated (names, emails, SSN).
- Text fields are scanned and redacted for potential PII patterns using regex.
- Demographic and clinical attributes are sampled from broad distributions to reduce linkage risk.
- Quasi-identifiers are generalized or binned to prevent re-identification.
- For deployment, consider additional aggregation or binning of quasi-identifiers if needed.
- Differential privacy or noise addition may be considered for high-risk deployments.
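The quantitative proxies above can be reproduced by grouping records on the quasi-identifiers and summarizing the group sizes. The sketch below assumes the uniqueness proxy is the number of distinct quasi-identifier combinations divided by the number of records; that definition is an assumption, although it agrees with the reported figures (4956 classes over the data card's 5,000 records gives 4956/5000 = 0.9912, and 5000/4956 ≈ 1.01). The function name `reidentification_proxies` and the toy records are hypothetical.

```python
from collections import Counter


def reidentification_proxies(records, quasi_identifiers):
    """Group records by quasi-identifier values and summarize class sizes."""
    if not records:
        raise ValueError("records must not be empty")
    # One equivalence class per distinct combination of quasi-identifier values.
    classes = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    sizes = list(classes.values())
    return {
        "equivalence_classes": len(sizes),
        # Assumed definition: distinct combinations / records (1.0 = all unique).
        "uniqueness_proxy": round(len(sizes) / len(records), 4),
        "avg_class_size": round(len(records) / len(sizes), 2),
        "min_group_size": min(sizes),
        "max_group_size": max(sizes),
    }


QUASI_IDENTIFIERS = ["Age", "Gender", "Visit_Date", "Provider_Type"]

# Toy records for illustration only (not the real dataset).
toy_records = [
    {"Age": 34, "Gender": "Female", "Visit_Date": "2024-01-02", "Provider_Type": "Primary Care"},
    {"Age": 34, "Gender": "Female", "Visit_Date": "2024-01-02", "Provider_Type": "Primary Care"},
    {"Age": 71, "Gender": "Male", "Visit_Date": "2024-03-10", "Provider_Type": "Emergency"},
]
proxies = reidentification_proxies(toy_records, QUASI_IDENTIFIERS)
print(proxies)
```

Under this reading, a minimum group size of 1 means at least one record is unique on the four quasi-identifiers, so the average class size alone should not be read as a k-anonymity guarantee.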
Data Card (Excerpt)
# Data Card: structured_health_records v1

## Overview

- Dataset: structured_health_records
- Version: v1
- Data type: tabular
- Synthetic method: rule-based + probabilistic sampling; fully synthetic
- Annotation schema: Diagnosis, Outcome
- Refresh frequency: on demand
- Governance notes: default PII scan-only policy; regen/redact only on real-risk hits; uniqueness and k-anonymity proxies; synthetic-only

## Generation Information

- Generated: 2025-09-29T13:27:01.560071
- Records: 5,000
- Generation time: 2.38 seconds
- Random seed: 4227293827
- Output format: parquet (snappy compression)

## Dataset Statistics

- Rows: 5,000
- Columns: 73
- Overall completeness: 100.0%
- Columns with nulls: 0
- Rows with nulls: 0
- Exact duplicates: 0
- Numeric columns summary: present

## Performance & Storage

- File size: 0.8 MB
- Compression ratio: 0.04x
- Memory usage: 20.5 MB
- Sample load time: 0.06 seconds
- Processing complexity: low

## Temporal Coverage

- Visit_Date: 2023-09-29 to 2025-09-28 (730 days, 731 unique dates)

## Demographic Analysis

- Gender distribution: Female: 49.9%, Male: 47.9%, Non-binary: 2.2%
- Age range: 0.0-85.0 years (median: 43.0)
- Age Distribution: count: 5000.0, mean: 43.6, std: 24.0
- Provider Type Distribution: Primary Care: 2938, Specialist: 995, Emergency: 803
- Insurance Type Distribution: Private: 2193, Medicare: 1285, Medicaid: 1021

## Age Distribution

min 0, p10 11, p25 23, p50 43, p75 64, p90 77, max 85

## Top Diagnoses (ICD-10-like)

Healthy: 1241, Common Cold: 605, Hypertension: 502, Diabetes: 436, Influenza: 401, Depression: 316, Heart Disease: 306, Asthma: 254, Pneumonia: 224, Cancer: 219

## Outcome Distribution

Recovered: 2858, Ongoing Treatment: 2038, Deteriorated: 85, Deceased: 19

## Visit/Provider Types

Primary Care: 2938, Specialist: 995, Emergency: 803, Urgent Care: 264

## Data Relationships

- Unique patients: 5,000
- Avg records per patient: 1.0
- Patients with multiple visits: 0 (0.0%)
- Strong clinical correlations:
  - Heart_Rate vs Temperature: 0.30
  - Blood_Pressure_Systolic vs Blood_Pressure_Diastolic: 0.34

## Known Limitations

- May not capture rare edge cases or longitudinal dependencies.
- Texts are synthetic and may exhibit templated phrasing.
- Imaging metadata excludes actual DICOM pixel data.

## Refresh & Versioning

- Versioning: Semantic where possible (e.g., 1.0.0); pilot snapshots tagged vN in delivery.
- Refresh cadence: On demand; upgrade path preserves prior versions for auditability.
- Backwards compatibility: New fields are appended; breaking schema changes increment major version.

## Sample Preview (first 5–10 rows)

| Patient_ID | Age | Gender | Race | Ethnicity | Marital_Status | Diagnosis | Symptoms | Heart_Rate | Blood_Pressure_Systolic | Blood_Pressure_Diastolic | Temperature | Respiratory_Rate | Oxygen_Saturation | Height_cm | Weight_kg | BMI | Smoking_Status | Alcohol_Use | Allergies | Pregnancy_Status | Glucose | Cholesterol | Hemoglobin | White_Blood_Cell_Count | Medications | Medications_Detailed | Procedures | Outcome | Visit_Date | Admit_DateTime | Discharge_DateTime | Provider_Type | Insurance_Type | Length_of_Stay | Readmission_Risk | zip_postal_code | county_region_state | country | Encounter_ID | Visit_Type | Department | Facility | Provider_Specialty | Provider_NPI | ICD10_Codes | ICD10_Descriptions | CPT_Codes | CPT_Descriptions | Regional_DX_Codes | Regional_PROC_Codes | Glucose_Unit | Glucose_Flag | Glucose_LOINC | Glucose_Specimen | Cholesterol_Unit | Cholesterol_Flag | Cholesterol_LOINC | Cholesterol_Specimen | Hemoglobin_Unit | Hemoglobin_Flag | Hemoglobin_LOINC | Hemoglobin_Specimen | WBC_Unit | WBC_Flag | WBC_LOINC | WBC_Specimen | HbA1c | CRP | D_Dimer | Platelets | Neutrophils_Abs | Lymphocytes_Abs |
|:---|---:|:---|:---|:---|:---|:---|:---|---:|---:|---:|---:|---:|---:|---:|---:|---:|:---|:---|:---|:---|---:|---:|---:|---:|:---|:---|:---|:---|:---|:---|:---|:---|:---|---:|---:|---:|:---|:---|:---|:---|:---|:---|:---|---:|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|---:|---:|---:|---:|---:|---:|
| PAT83292630 | 64 | Female | Black or African American | Not Hispanic or Latino | Single | Anxiety | fatigue;headache | 64 | 108 | 80 | 36.2 | 18 | 96 | 152.8 | 54 | 23.1 | Former | Occasional | None | Not Applicable | 81.8 | 128.6 | 15.9 | 5.7 | None | [] | [{"cpt": "99213", "cpt_description": "Office/outpatient visit, est", "snomed": "103693007", "laterality": "Right", "performer_npi": "8553599826", "device": ""}, {"cpt": "99213", "cpt_description": "Office/outpatient visit, est", "snomed": "26544005", "laterality": "Bilateral", "performer_npi": "7454015743", "device": ""}] | Recovered | 2024-01-02 | 2024-01-02 | 2024-01-02 | Primary Care | Private | 7 | 0.1 | 45898 | New Jersey | United States | ENC32684679 | Outpatient | Primary Care | Peterburgh Medical Center | Family Medicine | 2521290379 | F41.9 | | 99213 | Office/outpatient visit, est | | | mg/dL | N | 2345-7 | Serum | mg/dL | N | 2093-3 | Serum | g/dL | N | 718-7 | Blood | 10^3/uL | N | 6690-2 | Blood | 5.9 | 1.3 | 506 | 454 | 6.1 | 1.7 |
| PAT54653063 | 53 | Male | White | Not Hispanic or Latino | Single | Common Cold | runny_nose;sore_throat;cough;headache;fatigue | 98 | 109 | 90 | 36.5 | 14 | 95.4 | 179.2 | 88.2 | 27.5 | Never | None | Latex;NSAIDs | Not Applicable | 85.4 | 176.3 | 15.1 | 5 | None | [] | [] | Ongoing Treatment | 2023-10-14 | 2023-10-14 | 2023-10-16 | Emergency | Private | 3 | 0.1 | 02844 | New Mexico | United States | ENC55069425 | Inpatient | Emergency | Amberborough Medical Center | Internal Medicine | 9345586461 | J00 | Acute nasopharyngitis [common cold] | 99213 | Office/outpatient visit, est | | | mg/dL | N | 2345-7 | Serum | mg/dL | N | 2093-3 | Serum | g/dL | N | 718-7 | Blood | 10^3/uL | N | 6690-2 | Blood | 4.8 | 3.7 | 354 | 167 | 4.9 | 3.6 |
| PAT83695879 | 72 | Male | Black or African American | Hispanic or Latino | Married | Cancer | fatigue;unexplained_weight_loss;pain;fever;headache;chest_pain | 96 | 126 | 63 | 36.3 | 17 | 96.7 | 173.4 | 76.8 | 25.5 | Never | Moderate | Peanuts | Not Applicable | 70.5 | 195.4 | 12.5 | 7.3 | None | [] | [{"cpt": "99213", "cpt_description": "Office/outpatient visit, est", "snomed": "103693007", "laterality": "Bilateral", "performer_npi": "9321916348", "device": ""}, {"cpt": "99213", "cpt_description": "Office/outpatient visit, est", "snomed": "26544005", "laterality": "Left", "performer_npi": "2474053068", "device": ""}] | Ongoing Treatment | 2024-07-01 | 2024-07-01 | 2024-07-03 | Primary Care | Medicaid | 29 | 0.45 | 25762 | Illinois | United States | ENC16994786 | Outpatient | Primary Care | North James Medical Center | Dermatology | 0630243680 | C80.1 | Malignant (primary) neoplasm, unspecified | 99213 | Office/outpatient visit, est | | | mg/dL | N | 2345-7 | Serum | mg/dL | N | 2093-3 | Serum | g/dL | N | 718-7 | Blood | 10^3/uL | N | 6690-2 | Blood | 5.1 | 43.4 | 265 | 298 | 4 | 2.7 |
| PAT96743740 | 8 | Male | Unknown | Not Hispanic or Latino | Widowed | Healthy | None | 75 | 133 | 79 | 36.5 | 17 | 95.6 | 140 | 60.4 | 30.8 | Former | Heavy | None | Not Applicable | 123.3 | 133.2 | 12.4 | 8.6 | None | [] | [{"cpt": "99213", "cpt_description": "Office/outpatient visit, est", "snomed": "387713003", "laterality": "N/A", "performer_npi": "2911639839", "device": ""}] | Recovered | 2024-01-29 | 2024-01-29 | 2024-01-31 | Primary Care | Medicaid | 1 | 0.2 | 05164 | New Mexico | United States | ENC54937070 | Emergency | Primary Care | Priceborough Medical Center | Emergency Medicine | 8407358577 | Z00.00 | | 99213 | Office/outpatient visit, est | | | mg/dL | N | 2345-7 | Serum | mg/dL | N | 2093-3 | Serum | g/dL | N | 718-7 | Blood | 10^3/uL | N | 6690-2 | Blood | 10 | 3.1 | 400 | 468 | 7.7 | 4.9 |
| PAT82741722 | 82 | Female | White | Not Hispanic or Latino | Married | Healthy | None | 67 | 129 | 82 | 37 | 17 | 98.3 | 156.1 | 82.7 | 33.9 | Current every day | None | None | Not Applicable | 131.3 | 148.3 | 15.7 | 7.8 | None | [] | [{"cpt": "99213", "cpt_description": "Office/outpatient visit, est", "snomed": "387713003", "laterality": "Right", "performer_npi": "7292753889", "device": ""}, {"cpt": "99213", "cpt_description": "Office/outpatient visit, est", "snomed": "387713003", "laterality": "N/A", "performer_npi": "2849075953", "device": ""}] | Recovered | 2024-01-30 | 2024-01-30 | 2024-02-01 | Emergency | Medicare | 0 | 0.3 | 62149 | Arizona | United States | ENC62924975 | Outpatient | Emergency | Port Jennifer Medical Center | Internal Medicine | 6871606279 | Z00.00 | | 99213 | Office/outpatient visit, est | | | mg/dL | N | 2345-7 | Serum | mg/dL | N | 2093-3 | Serum | g/dL | N | 718-7 | Blood | 10^3/uL | N | 6690-2 | Blood | 11 | 3.2 | 141 | 501 | 8 | 3.8 |
| PAT34268492 | 59 | Female | Unknown | Not Hispanic or Latino | Single | Heart Disease | chest_pain;shortness_of_breath;fatigue;dizziness | 87 | 141 | 110 | 37.1 | 17 | 98.4 | 166.6 | 66.8 | 24.1 | Never | Occasional | None | Not Applicable | 119.5 | 158.7 | 15.7 | 6.5 | Atorvastatin;Aspirin | [{"name": "Atorvastatin", "rxnorm": "83367", "dose": "500 mg", "route": "PO", "frequency": "q6h", "intent": "maintenance", "prn": false, "indication": "Hypertension", "ndc": "12643-9428-44", "start_date": "2025-03-15", "end_date": "2025-05-26"}, {"name": "Aspirin", "rxnorm": "1191", "dose": "10 mg", "route": "SC", "frequency": "prn", "intent": "prophylaxis", "prn": true, "indication": "Diabetes", "ndc": "21023-5253-28", "start_date": "2025-01-31", "end_date": "2025-06-06"}] | [] | Recovered | 2025-05-27 | 2025-05-27 | 2025-05-28 | Emergency | Private | 5 | 0.25 | 29070 | Vermont | United States | ENC74399535 | Outpatient | Cardiology | North Brianville Medical Center | Radiology | 3102979720 | I25.10 | Atherosclerotic heart disease of native coronary artery without angina | 93000;99214 | Electrocardiogram, complete;Office/outpatient visit, est (moderate complexity) | | | mg/dL | N | 2345-7 | Serum | mg/dL | N | 2093-3 | Serum | g/dL | N | 718-7 | Blood | 10^3/uL | N | 6690-2 | Blood | 6 | 7.6 | 1360 | 245 | 6 | 2.8 |
| PAT72537467 | 18 | Female | Unknown | Hispanic or Latino | Married | Depression | fatigue;loss_of_interest;sleep_disturbances;mood_changes;nausea | 79 | 124 | 70 | 36.3 | 17 | 100 | 163.6 | 65.7 | 24.5 | Never | None | None | Not Pregnant | 118.5 | 140.3 | 13.1 | 5.4 | Bupropion;Sertraline | [{"name": "Bupropion", "rxnorm": "42347", "dose": "10 mg", "route": "Inhalation", "frequency": "bid", "intent": "prophylaxis", "prn": false, "indication": "Infection", "ndc": "40714-6911-19", "start_date": "2025-09-21", "end_date": "2025-09-29"}, {"name": "Sertraline", "rxnorm": "386209", "dose": "500 mg", "route": "SC", "frequency": "q12h", "intent": "maintenance", "prn": false, "indication": "Asthma", "ndc": "71336-7119-42", "start_date": "2024-12-04", "end_date": "2025-04-13"}] | [] | Recovered | 2025-07-11 | 2025-07-11 | 2025-07-12 | Primary Care | Medicaid | 13 | 0.1 | 15847 | Idaho | United States | ENC62528737 | Outpatient | Neurology | Josephstad Medical Center | Internal Medicine | 3942794429 | F32.9 | Major depressive disorder, single episode, unspecified | 90791;99213 | ;Office/outpatient visit, est | | | mg/dL | N | 2345-7 | Serum | mg/dL | N | 2093-3 | Serum | g/dL | N | 718-7 | Blood | 10^3/uL | N | 6690-2 | Blood | 6 | 9.2 | 293 | 289 | 9.5 | 2.3 |
| PAT53931288 | 35 | Male | White | Not Hispanic or Latino | Other | Healthy | None | 100 | 116 | 83 | 37 | 20 | 97.6 | 179.3 | 71.6 | 22.3 | Former | None | None | Not Applicable | 135.6 | 135.1 | 15.4 | 6 | None | [] | [] | Recovered | 2024-05-13 | 2024-05-13 | 2024-05-14 | Primary Care | Medicare | 1 | 0.1 | 42450 | Vermont | United States | ENC90564044 | Telehealth | Primary Care | Bryanview Medical Center | Orthopedic Surgery | 1924366319 | Z00.00 | | 99213 | Office/outpatient visit, est | | | mg/dL | N | 2345-7 | Serum | mg/dL | N | 2093-3 | Serum | g/dL | N | 718-7 | Blood | 10^3/uL | N | 6690-2 | Blood | 5.3 | 3.2 | 312 | 369 | 5.6 | 4 |
| PAT48781189 | 74 | Female | White | Not Hispanic or Latino | Married | Healthy | None | 60 | 140 | 86 | 36.7 | 14 | 97.5 | 152.1 | 70.4 | 30.4 | Never | Heavy | Sulfa drugs | Not Applicable | 130.5 | 143.5 | 15.6 | 6.5 | None | [] | [] | Recovered | 2024-08-26 | 2024-08-26 | 2024-08-27 | Emergency | Medicaid | 1 | 0.3 | 85237 | North Dakota | United States | ENC81528144 | Outpatient | Orthopedics | South Stevenside Medical Center | Pediatrics | 2241887742 | Z00.00 | | 99213 | Office/outpatient visit, est | | | mg/dL | N | 2345-7 | Serum | mg/dL | N | 2093-3 | Serum | g/dL | N | 718-7 | Blood | 10^3/uL | N | 6690-2 | Blood | 7.9 | 4.5 | 378 | 307 | 6.1 | 2.7 |
| PAT98519411 | 35 | Male | Other | Hispanic or Latino | Married | Asthma | shortness_of_breath;cough;wheezing;chest_tightness | 90 | 119 | 76 | 36.4 | 16 | 98.2 | 166.3 | 67.4 | 24.4 | Former | Moderate | Penicillin;NSAIDs | Not Applicable | 80.5 | 160.3 | 13.7 | 9.9 | Fluticasone | [{"name": "Fluticasone", "rxnorm": "205904", "dose": "500 mg", "route": "SC", "frequency": "tid", "intent": "treatment", "prn": false, "indication": "Infection", "ndc": "99135-2710-57", "start_date": "2025-05-15", "end_date": "2025-09-12"}] | [] | Ongoing Treatment | 2024-05-17 | 2024-05-17 | 2024-05-18 | Primary Care | Private | 4 | 0.1 | 16917 | Illinois | United States | ENC96964276 | Outpatient | Primary Care | Port Gracebury Medical Center | Psychiatry | 3699676188 | J45.909 | Unspecified asthma, uncomplicated | 94010;99213 | Spirometry;Office/outpatient visit, est | | | mg/dL | N | 2345-7 | Serum | mg/dL | N | 2093-3 | Serum | g/dL | N | 718-7 | Blood | 10^3/uL | N | 6690-2 | Blood | 6 | 0.2 | 406 | 586 | 9.3 | 3 |

## Governance and Safety

- Outputs are fully synthetic (de novo). PII scanning runs on every build.
- Default policy: scan-only; regenerate or redact only if a real-risk hit is detected.
- Uniqueness and k-anonymity proxies monitored.
- Intended for research, development, and demonstrations; not for clinical use.
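Two figures in the data card can be sanity-checked with simple arithmetic: the temporal coverage (a 730-day span that includes 731 calendar dates) and the BMI column in the sample preview, which appears to follow weight in kilograms divided by height in metres squared. The sketch below is illustrative only; the helper names `coverage_days` and `bmi` are hypothetical, and the BMI formula is an assumption about how the column was derived.

```python
from datetime import date


def coverage_days(start, end):
    """Return (span in days, number of calendar dates counting both ends)."""
    span = (end - start).days
    return span, span + 1


def bmi(weight_kg, height_cm):
    """Body mass index in kg/m^2, rounded to one decimal (assumed derivation)."""
    height_m = height_cm / 100
    return round(weight_kg / height_m ** 2, 1)


# Visit_Date range reported under Temporal Coverage.
span, n_dates = coverage_days(date(2023, 9, 29), date(2025, 9, 28))
print(span, n_dates)  # 730 731

# Weight_kg and Height_cm taken from the first two preview rows.
print(bmi(54, 152.8))    # 23.1, matching the BMI reported for PAT83292630
print(bmi(88.2, 179.2))  # 27.5, matching the BMI reported for PAT54653063
```

Checks of this kind only confirm internal consistency of the reported values; they say nothing about clinical plausibility.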