parSMURF, a high-performance computing tool for the genome-wide detection of pathogenic variants.

Affiliations

Università degli Studi di Milano, AnacletoLab - Dipartimento di Informatica, via Giovanni Celoria 18, 20135 Milano, Italy.

Berlin Institute of Health (BIH), Anna-Louisa-Karsch-Straße 2, 10178 Berlin, Germany.

Publication information

Gigascience. 2020 May 1;9(5). doi: 10.1093/gigascience/giaa052.

DOI: 10.1093/gigascience/giaa052

PMID: 32444882

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7244787/

Abstract

BACKGROUND

Several prediction problems in computational biology and genomic medicine are characterized by both big data and a high imbalance between the examples to be learned, whereby positive examples can represent a tiny minority with respect to negative examples. For instance, deleterious or pathogenic variants are overwhelmed by the sea of neutral variants in the non-coding regions of the genome. The prediction of deleterious variants is therefore a challenging, highly imbalanced classification problem: classical prediction tools either fail to detect the rare pathogenic examples among the huge number of neutral variants or face severe restrictions in managing big genomic data.

RESULTS

To overcome these limitations we propose parSMURF, a method that adopts a hyper-ensemble approach and oversampling and undersampling techniques to deal with imbalanced data, and parallel computational techniques to both manage big genomic data and substantially speed up the computation. The synergy between Bayesian optimization techniques and the parallel nature of parSMURF enables efficient and user-friendly automatic tuning of the hyper-parameters of the algorithm, and allows specific learning problems in genomic medicine to be easily fit. Moreover, by using MPI parallel and machine learning ensemble techniques, parSMURF can manage big data by partitioning them across the nodes of a high-performance computing cluster. Results with synthetic data and with single-nucleotide variants associated with Mendelian diseases and with genome-wide association study hits in the non-coding regions of the human genome, involving millions of examples, show that parSMURF achieves state-of-the-art results and an 80-fold speed-up with respect to the sequential version.
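The combination of partitioning, oversampling, undersampling and ensembling described above can be illustrated with a minimal sketch. It is not the parSMURF implementation: the function names, the default parameters, the noisy-copy oversampling and the nearest-centroid base learner are all assumptions chosen to keep the example dependency-free. The sketch partitions the negative (majority) class, rebalances each partition, trains one learner per partition and averages their scores.

```python
import random
from statistics import mean

def partition(items, n_parts):
    """Split items into n_parts disjoint, roughly equal chunks."""
    return [items[i::n_parts] for i in range(n_parts)]

def oversample(positives, factor, rng, jitter=0.05):
    """Positives plus (factor - 1) noisy copies of each one.
    Assumption: a simple stand-in for a real oversampling scheme."""
    out = [list(x) for x in positives]
    for _ in range(factor - 1):
        for x in positives:
            out.append([v + rng.gauss(0.0, jitter) for v in x])
    return out

def undersample(negatives, n_keep, rng):
    """Keep a random subset of at most n_keep negatives."""
    return rng.sample(negatives, min(n_keep, len(negatives)))

def centroid(rows):
    return [mean(col) for col in zip(*rows)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_base_learner(pos, neg):
    """Nearest-centroid scorer (hypothetical placeholder base learner)."""
    c_pos, c_neg = centroid(pos), centroid(neg)
    def score(x):
        d_pos, d_neg = sq_dist(x, c_pos), sq_dist(x, c_neg)
        # Value in [0, 1]; higher means closer to the positive centroid.
        return d_neg / (d_pos + d_neg + 1e-12)
    return score

def train_hyper_ensemble(positives, negatives, n_parts=5,
                         over_factor=3, ratio=2, seed=0):
    """One learner per negative partition; the ensemble averages them."""
    rng = random.Random(seed)
    shuffled = list(negatives)
    rng.shuffle(shuffled)
    learners = []
    for neg_part in partition(shuffled, n_parts):
        pos_aug = oversample(positives, over_factor, rng)
        neg_sub = undersample(neg_part, ratio * len(pos_aug), rng)
        learners.append(train_base_learner(pos_aug, neg_sub))
    return lambda x: mean(f(x) for f in learners)

# Synthetic, highly imbalanced toy data: 10 positives vs 1000 negatives.
data_rng = random.Random(42)
positives = [[data_rng.gauss(2.0, 0.5), data_rng.gauss(2.0, 0.5)] for _ in range(10)]
negatives = [[data_rng.gauss(0.0, 0.5), data_rng.gauss(0.0, 0.5)] for _ in range(1000)]
model = train_hyper_ensemble(positives, negatives)
print(model([2.0, 2.0]), model([0.0, 0.0]))
```

Because every partition is trained independently, the loop over partitions is the natural unit of work to spread across threads or cluster nodes, which is the property the abstract attributes to the parallel design.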

CONCLUSIONS

parSMURF is a parallel machine learning tool that can be trained to learn different genomic problems, and its multiple levels of parallelization and high scalability allow us to efficiently fit problems characterized by big and imbalanced genomic data. The C++ OpenMP multi-core version tailored to a single workstation and the C++ MPI/OpenMP hybrid multi-core and multi-node parSMURF version tailored to a High Performance Computing cluster are both available at https://github.com/AnacletoLAB/parSMURF.
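The two levels of parallelism mentioned above (multi-node and multi-core) can be pictured with a small sketch. This is only an illustration under stated assumptions: parSMURF itself is written in C++ with MPI and OpenMP, whereas here a plain loop stands in for the nodes and a Python thread pool stands in for the cores. The round-robin assignment and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def assign_round_robin(n_partitions, n_nodes):
    """Node k receives partitions k, k + n_nodes, k + 2 * n_nodes, ..."""
    return [list(range(node, n_partitions, n_nodes)) for node in range(n_nodes)]

def run_node(partition_ids, train_fn, n_threads):
    """Inner level: one node works through its partitions with a thread pool."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(train_fn, partition_ids))

def run_cluster(n_partitions, n_nodes, n_threads, train_fn):
    """Outer level: iterate over nodes. On a real cluster the nodes would
    run concurrently; this loop is sequential and only shows the mapping."""
    results = {}
    for ids in assign_round_robin(n_partitions, n_nodes):
        for pid, res in zip(ids, run_node(ids, train_fn, n_threads)):
            results[pid] = res
    return results

# Toy "training" function: returns the square of the partition index.
print(assign_round_robin(10, 3))
print(run_cluster(10, 3, 2, lambda pid: pid * pid))
```

The point of the split is that partitions never need to exchange data during training, so only the final per-partition results have to be gathered.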

Figure 1 (image): https://cdn.ncbi.nlm.nih.gov/pmc/blobs/342e/7244787/1878c077246d/giaa052fig1.jpg

Similar articles

parSMURF, a high-performance computing tool for the genome-wide detection of pathogenic variants.

Gigascience. 2020 May 1;9(5). doi: 10.1093/gigascience/giaa052.

Imbalance-Aware Machine Learning for Predicting Rare and Common Disease-Associated Non-Coding Variants.

Sci Rep. 2017 Jun 7;7(1):2959. doi: 10.1038/s41598-017-03011-5.

Genome-wide association data classification and SNPs selection using two-stage quality-based Random Forests.

BMC Genomics. 2015;16 Suppl 2(Suppl 2):S5. doi: 10.1186/1471-2164-16-S2-S5. Epub 2015 Jan 21.

Machine learning random forest for predicting oncosomatic variant NGS analysis.

Sci Rep. 2021 Nov 8;11(1):21820. doi: 10.1038/s41598-021-01253-y.

Genomic prediction using machine learning: a comparison of the performance of regularized regression, ensemble, instance-based and deep learning methods on synthetic and empirical data.

BMC Genomics. 2024 Feb 7;25(1):152. doi: 10.1186/s12864-023-09933-x.

Pipeline design to identify key features and classify the chemotherapy response on lung cancer patients using large-scale genetic data.

BMC Syst Biol. 2018 Nov 20;12(Suppl 5):97. doi: 10.1186/s12918-018-0615-5.

DrivR-Base: a feature extraction toolkit for variant effect prediction model construction.

Bioinformatics. 2024 Mar 29;40(4). doi: 10.1093/bioinformatics/btae197.

ACID: Association Correction for Imbalanced Data in GWAS.

IEEE/ACM Trans Comput Biol Bioinform. 2018 Jan-Feb;15(1):316-322. doi: 10.1109/TCBB.2016.2608819. Epub 2016 Sep 13.

Towards a HPC-oriented parallel implementation of a learning algorithm for bioinformatics applications.

BMC Bioinformatics. 2014;15 Suppl 5(Suppl 5):S2. doi: 10.1186/1471-2105-15-S5-S2. Epub 2014 May 6.

ADS-HCSpark: A scalable HaplotypeCaller leveraging adaptive data segmentation to accelerate variant calling on Spark.

BMC Bioinformatics. 2019 Feb 14;20(1):76. doi: 10.1186/s12859-019-2665-0.

Cited by

AI-powered precision medicine: utilizing genetic risk factor optimization to revolutionize healthcare.

NAR Genom Bioinform. 2025 May 5;7(2):lqaf038. doi: 10.1093/nargab/lqaf038. eCollection 2025 Jun.

Molecular Dynamics Investigations of Human DNA-Topoisomerase I Interacting with Novel Dewar Valence Photo-Adducts: Insights into Inhibitory Activity.

Int J Mol Sci. 2023 Dec 23;25(1):234. doi: 10.3390/ijms25010234.

Risk Factor Analysis of Cryopreserved Autologous Bone Flap Resorption in Adult Patients Undergoing Cranioplasty with Volumetry Measurement Using Conventional Statistics and Machine-Learning Technique.

J Korean Neurosurg Soc. 2024 Jan;67(1):103-114. doi: 10.3340/jkns.2023.0143. Epub 2023 Sep 15.

MD-Ligand-Receptor: A High-Performance Computing Tool for Characterizing Ligand-Receptor Binding Interactions in Molecular Dynamics Trajectories.

Int J Mol Sci. 2023 Jul 19;24(14):11671. doi: 10.3390/ijms241411671.

The Regulatory Mendelian Mutation score for GRCh38.

Gigascience. 2022 Dec 28;12. doi: 10.1093/gigascience/giad024. Epub 2023 Apr 21.

Boosting tissue-specific prediction of active cis-regulatory regions through deep learning and Bayesian optimization techniques.

BMC Bioinformatics. 2022 Dec 12;23(Suppl 2):154. doi: 10.1186/s12859-022-04582-5.

Interpretable prioritization of splice variants in diagnostic next-generation sequencing.

Am J Hum Genet. 2021 Sep 2;108(9):1564-1577. doi: 10.1016/j.ajhg.2021.06.014. Epub 2021 Jul 21.
