T-SYNTH: A Knowledge-Based Dataset of Synthetic Breast Images | Center for Devices and Radiological Health

Catalog of Regulatory Science Tools to Help Assess New Medical Devices

T-SYNTH is a synthetic dataset of paired DM (2D imaging) and DBT (3D imaging) images derived from a Knowledge Based (KB) model, with lesion bounding boxes and pixel-level tissue segmentations of a variety of breast tissues.

Technical Description

T-SYNTH [1] is a dataset that consists of 9,000 synthetic digital breast tomosynthesis (DBT) examples, with 300 examples per cohort type (combination of breast density, mass radius, mass density), and is designed to extend the M-SYNTH synthetic DM dataset [2] with paired DBT images as well as pixel-level annotations of tissues. Each example consists of an image (in RAW and DICOM formats), image-level annotation, mass location, and a pixel-level segmentation of the mass and tissues.
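To illustrate how a mass location can relate to a pixel-level segmentation, the following minimal sketch derives a bounding box from a label map. The label value `MASS_LABEL`, the function name `mask_to_bbox`, and the toy array are assumptions made for illustration only; the actual label encoding and file layout are documented in the dataset's user manual.

```python
import numpy as np

MASS_LABEL = 1  # hypothetical label value; check the dataset documentation


def mask_to_bbox(segmentation, label=MASS_LABEL):
    """Return (row_min, col_min, row_max, col_max) enclosing `label`, or None."""
    rows, cols = np.nonzero(segmentation == label)
    if rows.size == 0:
        return None  # no pixels carry this label
    return int(rows.min()), int(cols.min()), int(rows.max()), int(cols.max())


# Toy 2D label map with a small rectangular "mass" region.
toy = np.zeros((6, 8), dtype=np.uint8)
toy[2:4, 3:6] = MASS_LABEL
print(mask_to_bbox(toy))
```

The same idea extends to a DBT volume by also taking the minimum and maximum over the slice axis.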

The dataset has the following cohort characteristics:

• Breast density: dense, heterogeneously dense, scattered, fatty

• Mass radius (mm): 5.00, 7.00, 9.00

• Mass density: 1.0, 1.06, 1.1 (ratio of radiodensity of the mass to that of fibroglandular tissue)
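A cohort type is one combination of the three characteristics listed above. The sketch below enumerates those combinations; the variable and key names are illustrative and are not taken from the dataset's metadata.

```python
from itertools import product

# Values copied from the cohort characteristics listed above.
BREAST_DENSITIES = ["dense", "heterogeneously dense", "scattered", "fatty"]
MASS_RADII_MM = [5.0, 7.0, 9.0]
MASS_DENSITIES = [1.0, 1.06, 1.1]

# One dictionary per cohort type (key names are hypothetical).
cohorts = [
    {"breast_density": bd, "mass_radius_mm": radius, "mass_density": md}
    for bd, radius, md in product(BREAST_DENSITIES, MASS_RADII_MM, MASS_DENSITIES)
]

for cohort in cohorts[:3]:
    print(cohort)
```

A grid like this can serve as the set of subgroup keys when reporting performance per cohort.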

Similar to M-SYNTH, T-SYNTH was simulated using the VICTRE pipeline [2] (see VICTRE Github Page and FDA Regulatory Science Tools (RST) Catalog for additional information) for generating breast models and their corresponding DM images.

Intended Purpose

T-SYNTH is a synthetic DBT dataset that can be used to develop (train or pre-train) or comparatively test AI algorithms for segmentation, detection, and/or classification of breast tissues. As a proof of concept, we experimented with T-SYNTH as a tool for augmenting development datasets and as a subgroup (breast density, mass size and density) performance benchmark for lesion detection AI. T-SYNTH cannot fully replace real patient data for the evaluation of breast imaging AI.

Testing

T-SYNTH synthetic images exhibit clinically relevant and expected trends with respect to lesion visibility and properties. For example, lesions are less distinct from the background fibroglandular tissue in higher breast density categories (i.e., dense breast tissue mimics and therefore obscures lesions) and become more prominent as their diameter increases. T-SYNTH DBT examples provide a clearer visualization of lesions compared to corresponding DM images.

Using T-SYNTH and M-SYNTH, we studied the effect of breast density, lesion size, and lesion density on the detection of lesions in DBT and DM images. We also evaluated how supplementing training data with T-SYNTH affects performance on patient data and compared T-SYNTH to a generative AI baseline (diffusion model).

We trained a Faster R-CNN [3] neural network on T-SYNTH and real patient data for the task of mass detection (i.e., localizing a lesion) to perform a task-based assessment of T-SYNTH. The experiment was performed on C-VIEW images derived from T-SYNTH so that performance could be compared with patient data (since the majority of public datasets are limited to C-VIEW images). Performance was then quantitatively evaluated using the free-response receiver operating characteristic (FROC) curve. We observed that performance improved with decreasing breast density, as well as with increasing lesion size and density. When T-SYNTH synthetic data was used to augment limited patient data during training, the combination of synthetic and patient data matched or exceeded the performance of patient data alone in some cases. When comparing T-SYNTH data with an SDXL diffusion model [4], we found that T-SYNTH was superior to the un-finetuned model but inferior to the finetuned model.
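For readers unfamiliar with FROC analysis, the following is a minimal sketch of how FROC operating points can be computed. It is not the evaluation code used in this work. It assumes that each detection has already been matched to a ground-truth lesion upstream (for example, by an overlap criterion), and the function name `froc_points` and the input format are hypothetical.

```python
def froc_points(detections, n_lesions, n_images):
    """Compute FROC operating points.

    detections: list of (score, lesion_id) pairs, where lesion_id is None
        for a false positive and a unique lesion identifier otherwise.
    Returns a list of (false_positives_per_image, sensitivity) pairs,
    one per distinct score threshold, from strictest to most lenient.
    """
    ordered = sorted(detections, key=lambda d: d[0], reverse=True)
    found = set()          # lesions hit by at least one detection so far
    false_positives = 0
    points = []
    for i, (score, lesion_id) in enumerate(ordered):
        if lesion_id is None:
            false_positives += 1
        else:
            found.add(lesion_id)  # repeated hits on a lesion count once
        # Emit a point only after all detections sharing this score.
        if i + 1 == len(ordered) or ordered[i + 1][0] != score:
            points.append((false_positives / n_images, len(found) / n_lesions))
    return points


# Toy input: 2 images, 3 lesions ("a", "b", and one that is never detected).
toy_detections = [(0.9, "a"), (0.8, None), (0.7, "b"), (0.6, "a"), (0.4, None)]
points = froc_points(toy_detections, n_lesions=3, n_images=2)
for fp_per_image, sensitivity in points:
    print(f"{fp_per_image:.2f} FP/image -> sensitivity {sensitivity:.2f}")
```

Computing such a curve separately for each breast density, lesion size, or lesion density subgroup gives the kind of subgroup comparison described above.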

Limitations

T-SYNTH and the object-based simulation methods used for image generation are constrained to the variability captured by the parameter space of the object models for anatomy, pathology, and the acquisition system. This complexity may need to be adjusted depending on the questions to be investigated. Synthetic data in general may not be fully representative of the true patient population. Potential risks of testing with such data include missing the variability observed in patient populations and misjudging model performance due to a domain gap between real patient data and synthetic examples.

Supporting Documentation

The dataset is hosted on Hugging Face (with version control enabled), with accompanying code available on GitHub.

Dataset: https://huggingface.co/datasets/didsr/tsynth

Github: https://github.com/DIDSR/tsynth-release

User Manual: https://github.com/DIDSR/tsynth-release/blob/main/README.md

License: Creative Commons 1.0 Universal License (CC0)
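As one possible way to obtain the files, the sketch below uses the `snapshot_download` function from the `huggingface_hub` Python package with the repository identifier taken from the dataset URL above. The wrapper name `download_tsynth` is hypothetical, and the repository's internal file layout is not described here, so consult the user manual before relying on specific paths.

```python
REPO_ID = "didsr/tsynth"  # taken from the dataset URL listed above


def download_tsynth(local_dir=None):
    """Download a snapshot of the dataset repository and return its local path."""
    # Imported lazily so this module loads even without the dependency installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",  # the repository is a dataset, not a model
        local_dir=local_dir,  # None uses the default Hugging Face cache
    )


# Example call (not executed here because it triggers a large download):
# path = download_tsynth("./tsynth")
```

Because the full dataset may be large, the `allow_patterns` argument of `snapshot_download` can be used to restrict the download to a subset of files once the layout is known.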

References

[1] Christopher Wiedeman*, Anastasiia Sarmakeeva*, Elena Sizikova, Daniil Filienko, Miguel Lago, Jana Delfino, Aldo Badano. T-SYNTH: A Knowledge-Based Dataset of Synthetic Breast Images. Medical Image Computing and Computer Assisted Intervention (MICCAI) Open Data. 2025. (* equal contribution)
[2] M-SYNTH: A Dataset for the Comparative Evaluation of Mammography AI. FDA RST Catalog.
[3] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems (NeurIPS) 2015.
[4] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. "SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis." International Conference on Learning Representations (ICLR) 2024.

Contact

RST_CDRH@fda.hhs.gov

Tool Reference

RST Reference Number: RST26AI04.01
Date of Publication: 05/04/2026
Recommended Citation: U.S. Food and Drug Administration. (2026). T-SYNTH: A Knowledge-Based Dataset of Synthetic Breast Images (RST26AI04.01). https://cdrh-rst.fda.gov/t-synth-knowledge-based-dataset-synthetic-breast-images