{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,27]],"date-time":"2026-01-27T13:10:31Z","timestamp":1769519431450,"version":"3.49.0"},"reference-count":43,"publisher":"Oxford University Press (OUP)","issue":"Supplement_1","license":[{"start":{"date-parts":[[2025,7,15]],"date-time":"2025-07-15T00:00:00Z","timestamp":1752537600000},"content-version":"vor","delay-in-days":14,"URL":"https:\/\/summer-heart-0930.chufeiyun1688.workers.dev:443\/https\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2025,7,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:sec>\n                  <jats:title>Summary<\/jats:title>\n                  <jats:p>In this paper, we introduce the first diffusion model designed to generate complete synthetic human genotypes, which, by standard protocols, one can straightforwardly expand into full-length, DNA-level genomes. As measured by approved metrics, the synthetic genotypes mimic real human genotypes without merely reproducing known ones. When training biomedically relevant classifiers with synthetic genotypes, accuracy is near-identical to the accuracy achieved when training classifiers with real data. We further demonstrate that augmenting small amounts of real genotypes with synthetically generated ones drastically improves performance rates. This addresses a significant challenge in translational human genetics: real human genotypes, although emerging in large volumes from genome-wide association studies, are sensitive private data, which limits their public availability. 
Therefore, the integration of additional, non-sensitive data when striving for rapid sharing of biomedical knowledge of public interest appears imperative.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Availability and implementation<\/jats:title>\n                  <jats:p>All non-proprietary data and the code to replicate the experiments are available on GitHub.<\/jats:p>\n               <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btaf209","type":"journal-article","created":{"date-parts":[[2025,7,15]],"date-time":"2025-07-15T13:03:05Z","timestamp":1752584585000},"page":"i484-i492","source":"Crossref","is-referenced-by-count":3,"title":["Generating synthetic genotypes using diffusion models"],"prefix":"10.1093","volume":"41","author":[{"given":"Philip","family":"Kenneweg","sequence":"first","affiliation":[{"name":"AG Machine Learning, Bielefeld University , Bielefeld, NRW 33615,","place":["Germany"]}]},{"given":"Raghuram","family":"Dandinasivara","sequence":"additional","affiliation":[{"name":"AG Genome Data Science, Bielefeld University , Bielefeld, NRW 33615,","place":["Germany"]}]},{"ORCID":"https:\/\/summer-heart-0930.chufeiyun1688.workers.dev:443\/https\/orcid.org\/0000-0002-3914-359X","authenticated-orcid":false,"given":"Xiao","family":"Luo","sequence":"additional","affiliation":[{"name":"College of Biology, Hunan University , Hunan, Hunan Province 410082,","place":["China"]}]},{"given":"Barbara","family":"Hammer","sequence":"additional","affiliation":[{"name":"AG Machine Learning, Bielefeld University , Bielefeld, NRW 33615,","place":["Germany"]}]},{"ORCID":"https:\/\/summer-heart-0930.chufeiyun1688.workers.dev:443\/https\/orcid.org\/0000-0003-3529-0856","authenticated-orcid":false,"given":"Alexander","family":"Sch\u00f6nhuth","sequence":"additional","affiliation":[{"name":"AG Genome Data Science, Bielefeld University , Bielefeld, NRW 
33615,","place":["Germany"]}]}],"member":"286","published-online":{"date-parts":[[2025,7,15]]},"reference":[{"key":"2025071509025851300_btaf209-B1","author":"Ahronoviz","year":"2024"},{"key":"2025071509025851300_btaf209-B2","doi-asserted-by":"crossref","first-page":"794","DOI":"10.1016\/j.ajhg.2012.08.031","article-title":"Imputation of exome sequence variants into population-based samples and blood-cell-trait-associated loci in African Americans: NHLBI go exome sequencing project","volume":"91","author":"Auer","year":"2012","journal-title":"Am J Hum Genet"},{"key":"2025071509025851300_btaf209-B3","doi-asserted-by":"crossref","first-page":"68","DOI":"10.1038\/nature15393","article-title":"A global reference for human genetic variation","volume":"526","author":"Auton","year":"2015","journal-title":"Nature"},{"key":"2025071509025851300_btaf209-B4","first-page":"1276","author":"Avdeyev","year":"2023"},{"key":"2025071509025851300_btaf209-B5","author":"Azizi","year":"2023"},{"key":"2025071509025851300_btaf209-B6","author":"Burnard","year":"2023"},{"key":"2025071509025851300_btaf209-B7","doi-asserted-by":"crossref","first-page":"eadg7492","DOI":"10.1126\/science.adg7492","article-title":"Accurate proteome-wide missense variant effect prediction with alphamissense","volume":"381","author":"Cheng","year":"2023","journal-title":"Science"},{"key":"2025071509025851300_btaf209-B8","author":"Dang","year":"2023"},{"key":"2025071509025851300_btaf209-B9","doi-asserted-by":"publisher","author":"Devlin","year":"2019","DOI":"10.18653\/V1\/N19-1423"},{"key":"2025071509025851300_btaf209-B10","first-page":"8780","article-title":"Diffusion models beat GANS on image synthesis","volume":"34","author":"Dhariwal","year":"2021","journal-title":"Adv Neural Inf Process Syst"},{"key":"2025071509025851300_btaf209-B11","volume-title":"Transact Mach Learn 
Res","author":"Dockhorn","year":"2023"},{"key":"2025071509025851300_btaf209-B12","doi-asserted-by":"crossref","first-page":"1895","DOI":"10.1101\/gr.225672.117","article-title":"Detection of long repeat expansions from PCR-free whole-genome sequence data","volume":"27","author":"Dolzhenko","year":"2017","journal-title":"Genome Res"},{"key":"2025071509025851300_btaf209-B13","author":"Dosovitskiy","year":"2021"},{"key":"2025071509025851300_btaf209-B14","first-page":"8717","author":"Duan","year":"2023"},{"key":"2025071509025851300_btaf209-B15","first-page":"17","author":"Fredrikson","year":"2014"},{"key":"2025071509025851300_btaf209-B16","doi-asserted-by":"crossref","first-page":"136","DOI":"10.1038\/s44222-023-00114-9","article-title":"Diffusion models in bioinformatics and computational biology","volume":"2","author":"Guo","year":"2024","journal-title":"Nat Rev Bioeng"},{"key":"2025071509025851300_btaf209-B17","first-page":"6629","article-title":"Gans trained by a two time-scale update rule converge to a local NASH equilibrium","volume":"30","author":"Heusel","year":"2017","journal-title":"Adv Neural Inf Process Syst"},{"key":"2025071509025851300_btaf209-B18","article-title":"Classifier-free diffusion guidance","author":"Ho","year":"2022"},{"key":"2025071509025851300_btaf209-B19","doi-asserted-by":"crossref","first-page":"583","DOI":"10.1038\/s41586-021-03819-2","article-title":"Highly accurate protein structure prediction with 
alphafold","volume":"596","author":"Jumper","year":"2021","journal-title":"Nature"},{"key":"2025071509025851300_btaf209-B20","author":"Li","year":"2023"},{"key":"2025071509025851300_btaf209-B21","first-page":"114","author":"Luo","year":"2023"},{"key":"2025071509025851300_btaf209-B22","author":"Nguyen","year":"2023"},{"key":"2025071509025851300_btaf209-B23","first-page":"1379","author":"Perera","year":"2022"},{"key":"2025071509025851300_btaf209-B24","doi-asserted-by":"crossref","first-page":"1537","DOI":"10.1038\/s41431-018-0177-4","article-title":"Project mine: study design and pilot analyses of a large-scale whole-genome sequencing study in amyotrophic lateral sclerosis","volume":"26","author":"Project MinE ALS Sequencing Consortium","year":"2018","journal-title":"Eur J Hum Genet"},{"key":"2025071509025851300_btaf209-B25","first-page":"10684","author":"Rombach","year":"2022"},{"key":"2025071509025851300_btaf209-B26","first-page":"234","volume-title":"Medical Image Computing and Computer-Assisted Intervention \u2013 MICCAI 2015","author":"Ronneberger","year":"2015"},{"key":"2025071509025851300_btaf209-B27","first-page":"2234","article-title":"Improved techniques for training gans","volume":"29","author":"Salimans","year":"2016","journal-title":"Adv Neural Inf Process Syst"},{"key":"2025071509025851300_btaf209-B28","article-title":"Designing DNA with tunable regulatory activity using discrete diffusion","author":"Sarkar","year":"2024"},{"key":"2025071509025851300_btaf209-B29","author":"Schiff","year":"2024"},{"key":"2025071509025851300_btaf209-B30","doi-asserted-by":"crossref","first-page":"1122","DOI":"10.1038\/gim.2017.247","article-title":"Are whole-exome and whole-genome sequencing approaches cost-effective? 
a systematic review of the literature","volume":"20","author":"Schwarze","year":"2018","journal-title":"Genet Med"},{"key":"2025071509025851300_btaf209-B31","article-title":"DNA-diffusion: leveraging generative models for controlling chromatin accessibility and gene expression via synthetic regulatory elements","author":"Senan","year":"2024"},{"key":"2025071509025851300_btaf209-B32","doi-asserted-by":"publisher","DOI":"10.3389\/fsysb.2022.877717","article-title":"A brief review on deep learning applications in genomic studies","volume":"2","author":"Shen","year":"2022","journal-title":"Front Syst Biol"},{"key":"2025071509025851300_btaf209-B33","first-page":"47783","article-title":"Understanding and mitigating copying in diffusion models","volume":"36","author":"Somepalli","year":"2023","journal-title":"Adv Neural Inf Process Syst"},{"key":"2025071509025851300_btaf209-B34","author":"Song","year":"2021"},{"key":"2025071509025851300_btaf209-B35","first-page":"110","author":"Szatkownik","year":"2024"},{"key":"2025071509025851300_btaf209-B36","article-title":"Attention is all you need","volume":"30","author":"Vaswani","year":"2017","journal-title":"Adv Neural Inf Process Syst"},{"key":"2025071509025851300_btaf209-B37","doi-asserted-by":"publisher","first-page":"btad535","DOI":"10.1093\/bioinformatics\/btad535","article-title":"HAPNEST: efficient, large-scale generation and evaluation of synthetic datasets for genotypes and phenotypes","volume":"39","author":"Wharrie","year":"2023","journal-title":"Bioinformatics"},{"key":"2025071509025851300_btaf209-B38","doi-asserted-by":"crossref","first-page":"177","DOI":"10.1038\/s41586-023-06887-8","article-title":"Discovery of a structural class of antibiotics with explainable deep 
learning","volume":"626","author":"Wong","year":"2024","journal-title":"Nature"},{"key":"2025071509025851300_btaf209-B39","first-page":"244","volume-title":"Neurocomputing","author":"Yale"},{"key":"2025071509025851300_btaf209-B40","doi-asserted-by":"crossref","first-page":"e1011584","DOI":"10.1371\/journal.pcbi.1011584","article-title":"Deep convolutional and conditional neural networks for large-scale genomic data generation","volume":"19","author":"Yelmen","year":"2023","journal-title":"PLoS Comput Biol"},{"key":"2025071509025851300_btaf209-B41","doi-asserted-by":"crossref","first-page":"e1009303","DOI":"10.1371\/journal.pgen.1009303","article-title":"Creating artificial human genomes using generative neural networks","volume":"17","author":"Yelmen","year":"2021","journal-title":"PLoS Genet"},{"key":"2025071509025851300_btaf209-B42","author":"Zhang","year":"2020"},{"key":"2025071509025851300_btaf209-B43","doi-asserted-by":"crossref","DOI":"10.1101\/2023.07.11.548628","article-title":"Dnagpt: a generalized pretrained tool for multiple DNA sequence analysis 
tasks","author":"Zhang","year":"2023"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/summer-heart-0930.chufeiyun1688.workers.dev:443\/https\/academic.oup.com\/bioinformatics\/article-pdf\/41\/Supplement_1\/i484\/63745659\/btaf209.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/summer-heart-0930.chufeiyun1688.workers.dev:443\/https\/academic.oup.com\/bioinformatics\/article-pdf\/41\/Supplement_1\/i484\/63745659\/btaf209.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,7,15]],"date-time":"2025-07-15T13:03:07Z","timestamp":1752584587000},"score":1,"resource":{"primary":{"URL":"https:\/\/summer-heart-0930.chufeiyun1688.workers.dev:443\/https\/academic.oup.com\/bioinformatics\/article\/41\/Supplement_1\/i484\/8199400"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,7,1]]},"references-count":43,"journal-issue":{"issue":"Supplement_1","published-print":{"date-parts":[[2025,7,1]]}},"URL":"https:\/\/summer-heart-0930.chufeiyun1688.workers.dev:443\/https\/doi.org\/10.1093\/bioinformatics\/btaf209","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2025,7]]},"published":{"date-parts":[[2025,7,1]]}}}