Minten Thomas, Bick Sarah, Adelson Sophia, Gehlenborg Nils, Amendola Laura M, Boemer François, Coffey Alison J, Encina Nicolas, Ferlini Alessandra, Kirschner Janbernd, Russell Bianca E, Servais Laurent, Sund Kristen L, Taft Ryan J, Tsipouras Petros, Zouk Hana, Bick David, Green Robert C, Gold Nina B
KU Leuven, Leuven, Belgium.
Boston Children's Hospital, Boston, MA; Massachusetts General Hospital, Boston, MA; Harvard Medical School, Boston, MA.
Genet Med. 2025 Jul;27(7):101443. doi: 10.1016/j.gim.2025.101443. Epub 2025 May 9.
Over 30 international studies are exploring newborn sequencing (NBSeq) to expand the range of genetic disorders included in newborn screening. Gene selection varies substantially across programs, highlighting the need for a systematic approach to prioritizing genes.
We assembled a data set comprising 25 characteristics about each of the 4390 genes included in 27 NBSeq programs. We used regression analysis to identify several predictors of inclusion and developed a machine learning model to rank genes for public health consideration.
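As a rough illustration of the kind of outcome such a data set supports, the sketch below computes, for each gene, the fraction of programs that include it and ranks genes by that fraction. The program names, gene names, and function names are hypothetical placeholders, not data or code from the study, and the abstract does not specify how the authors encoded inclusion.

```python
from collections import Counter


def inclusion_fraction(program_gene_lists):
    """Return {gene: fraction of programs whose gene list includes it}."""
    n_programs = len(program_gene_lists)
    counts = Counter()
    for genes in program_gene_lists.values():
        counts.update(set(genes))  # count each gene at most once per program
    return {gene: counts[gene] / n_programs for gene in counts}


def rank_genes(fractions):
    """Sort genes by descending inclusion fraction, breaking ties by name."""
    return sorted(fractions, key=lambda gene: (-fractions[gene], gene))


# Toy input: hypothetical programs and genes, for illustration only.
toy_programs = {
    "program_1": ["GENE_A", "GENE_B", "GENE_C"],
    "program_2": ["GENE_A", "GENE_B"],
    "program_3": ["GENE_A", "GENE_D"],
    "program_4": ["GENE_A", "GENE_B", "GENE_D"],
    "program_5": ["GENE_A"],
}

fractions = inclusion_fraction(toy_programs)
ranking = rank_genes(fractions)
```

In this toy example only GENE_A appears in every program, so it ranks first; a per-gene fraction of this kind could serve as the quantity that gene characteristics are regressed against.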
Among 27 NBSeq programs, the number of genes analyzed ranged from 134 to 4299, with only 74 (1.7%) genes included by over 80% of programs. The most significant associations with gene inclusion across programs were presence on the US Recommended Uniform Screening Panel (inclusion increase of 74.7%, CI: 71.0%-78.4%), robust evidence on the natural history (29.5%, CI: 24.6%-34.4%), and treatment efficacy (17.0%, CI: 12.3%-21.7%) of the associated genetic disease. A boosted trees machine learning model using 13 predictors achieved high accuracy in predicting gene inclusion across programs (area under the curve = 0.915, R² = 84%).
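For readers unfamiliar with the area under the curve, the following minimal sketch computes it as the probability that a randomly chosen included gene receives a higher model score than a randomly chosen excluded gene. It assumes inclusion is treated as a binary label, which the abstract does not spell out; the labels and scores are made-up toy values, and `roc_auc` is a hypothetical helper, not the authors' evaluation code.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Counts the share of (positive, negative) pairs in which the positive
    has the higher score; tied scores count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy values: 1 = gene included, 0 = gene not included (illustrative only).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
auc = roc_auc(labels, scores)
```

Here 8 of the 9 positive-negative pairs are ordered correctly, so the toy AUC is 8/9. A value of 0.5 corresponds to random ranking and 1.0 to perfect separation.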
The machine learning model developed here provides a ranked list of genes that can adapt to emerging evidence and regional needs, enabling more consistent and informed gene selection in NBSeq initiatives.