Zhang Tao, Auer Paul, Spellman Stephen R, Dong Jing, Saber Wael, Bolon Yung-Tsi
CIBMTR® (Center for International Blood and Marrow Transplant Research), NMDP (National Marrow Donor Program), Minneapolis, MN 55401, USA.
Division of Biostatistics, Institute for Health and Equity, Medical College of Wisconsin, Milwaukee, WI 53226, USA.
Life (Basel). 2025 Jun 9;15(6):929. doi: 10.3390/life15060929.
(1) Background: Although whole genome sequencing (WGS) has enabled comprehensive analysis of structural variants (SVs), more accurate and efficient methods are needed to distinguish large somatic SVs (SV size ≥ 1 Mb), traditionally detected through cytogenetic testing, from germline SVs. (2) Methods: A customized machine learning pipeline (CYTO-SV-ML), automated as a Snakemake workflow and equipped with a user interface, was developed to identify somatic cytogenetic SVs in WGS data. The tool was applied to characterize structural variation profiles in the whole blood of patients with myelodysplastic syndromes (MDSs). Known SVs mapped from well-established open databases were split into training and validation subsets for an AUTO-ML machine learning model within the CYTO-SV-ML pipeline. (3) Results: In benchmarking of somatic cytogenetic SV classification, the CYTO-SV-ML pipeline achieved an area under the receiver operating characteristic curve (AUCROC) of 0.94 for translocations and 0.92 for non-translocations, a sensitivity of 0.83 for translocations and 0.85 for non-translocations, and a specificity of 0.96 for translocations and 0.82 for non-translocations. In an independent validation against clinical cytogenetic records, our method (207 somatic cytogenetic SVs) outperformed a conventional SV calling pipeline (143 somatic cytogenetic SVs). In addition, the CYTO-SV-ML pipeline uncovered novel somatic cytogenetic SVs in 49 (89%) of 55 patients without successful clinical cytogenetic results. (4) Conclusions: Our study demonstrates the high performance of the CYTO-SV-ML machine learning approach in benchmark SV classification from genomic sequencing data. Further validation of the novel anomalies by orthogonal methods will be essential to unlock its full clinical potential for cytogenetic diagnostics.
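For readers less familiar with the benchmarking metrics reported in the Results, the following is a minimal sketch of how sensitivity, specificity, and AUCROC can be computed for a binary somatic-versus-germline classifier. It is not taken from the CYTO-SV-ML code base: the function names, the 0.5 decision threshold, and the toy labels and scores are all assumptions made purely for illustration.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels, where 1 = somatic and 0 = germline."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auc_roc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUCROC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, NOT results from the study.
toy_labels = [1, 1, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1]
# Assumed decision threshold of 0.5 for turning scores into class calls.
toy_calls = [1 if s >= 0.5 else 0 for s in toy_scores]

sens, spec = sensitivity_specificity(toy_labels, toy_calls)
auc = auc_roc(toy_labels, toy_scores)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} AUCROC={auc:.2f}")
```

Sensitivity and specificity depend on the chosen threshold, whereas AUCROC summarizes ranking quality across all thresholds, which is why the abstract reports all three separately for translocations and non-translocations.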