Methodology

Built for rigor. Designed for compliance. Transparent by default.

Data Generation

We generate fully synthetic datasets using probabilistic sampling, programmatic templates, and modality-specific rules. Structured records sample diagnoses, vitals, labs, medications, procedures, and encounters from realistic distributions; transcripts are template-driven with turn-level metadata; imaging metadata includes modality/body-region sampling, findings/impressions, and technical fields. Free-text fields are scanned for potential PII via regex by default, with an optional spaCy NER scanner. Default policy: scan-only; we regenerate or redact only on real-risk hits. Each run emits a PII scan report including a real_risk_summary, plus a schema.yaml and data-dictionary.csv. Structured records can optionally be exported as a FHIR-lite NDJSON bundle (Patient, Encounter, Condition, Observation).

  • Tabular: Probabilistic sampling with condition-dependent vitals, labs, meds, procedures
  • Text: Template-driven clinical dialogues with turn-level JSON; optional spaCy NER PII scan
  • Imaging metadata: Programmatic study metadata and radiology reports
  • Privacy layer: Regex-based PII scanning (default) or spaCy NER; default policy: scan-only; optional regen/redact on real-risk hits; `real_risk_summary` included in PII report; synthetic-only generation

Annotation Standards

Each dataset ships with rich, consistent metadata. We support ICD-10 and SNOMED coding for diagnoses and procedures, standardized visit types and care settings, and structured transcript annotations (speakers, segments, intents). Ontology alignment improves downstream interoperability and model evaluation.

  • ICD-10, SNOMED
  • Visit types & encounter metadata
  • Speaker/turn-level transcripts
  • Data cards and licensing info

Validation & Fidelity

We run internal statistical checks: distributions, numeric correlations, completeness, outliers, and label prevalence. We summarize demographics and patient linkage patterns. External benchmarking and leakage tests are not part of the current pipeline.

Age Distribution Summary
Age Distribution Summary
Clinical Correlations
Clinical Correlations
Visit Type Proportions
Visit Type Proportions
Demographic Summaries
Demographic Summaries

Governance & Compliance

Each run produces reproducible artifacts and governance docs: generation metadata (seed, timestamp, host), schema/domain validation, label prevalence, re-identification risk proxies, and Data Cards. Artifacts are versioned on disk and can be bundled for distribution.

  • Reproducible seeds and run metadata
  • Schema/domain and label prevalence checks
  • Re-identification risk proxies
  • Versioned artifacts (data/, samples/, artifacts/)

Artifacts & Evidence

The following charts are rendered from dataset artifacts. Source JSON is linked below.

Age Distribution Summary
Age Distribution Summary
Clinical Correlations (Top 10)
Clinical Correlations
Visit Type Proportions
Visit Type Proportions
Demographic Summaries
Demographic Summaries
PII Scan Counts
PII Scan Counts
Leakage Proxy Histogram
Leakage Proxy Histogram

Limitations

Synthetic data approximates real-world distributions but may under-represent rare conditions or extreme edge cases. We document known caveats and provide dataset-specific notes in our Data Cards to support responsible use.

See our datasets and Data Cards here.