Gao Mu, Lund-Andersen Peik, Morehead Alex, Mahmud Sajid, Chen Chen, Chen Xiao, Giri Nabin, Roy Raj S, Quadir Farhan, Effler T Chad, Prout Ryan, Abraham Subil, Elwasif Wael, Haas N Quentin, Skolnick Jeffrey, Cheng Jianlin, Sedova Ada
Georgia Institute of Technology, Atlanta, GA.
University of Idaho, Moscow, ID.
Workshop Mach Learn HPC Environ. 2021 Nov;2021:46-57. doi: 10.1109/mlhpc54614.2021.00010. Epub 2021 Dec 27.
Computational biology is one of many scientific disciplines ripe for innovation and acceleration with the advent of high-performance computing (HPC). In recent years, the field of machine learning has also benefited significantly from adopting HPC practices. In this work, we present a novel HPC pipeline that incorporates various machine-learning approaches for structure-based functional annotation of proteins on the scale of whole genomes. Our pipeline makes extensive use of deep learning and provides computational insights into best practices for training advanced deep-learning models on high-throughput data such as proteomics data. We showcase the methodologies our pipeline currently supports and detail future tasks for it to encompass, including large-scale sequence comparison using SAdLSA and prediction of protein tertiary structures using AlphaFold2.