Department of Biochemistry, Faculty of Science, Mahidol University, Bangkok, Thailand.
Department of Biology, Faculty of Science, Khon Kaen University, Khon Kaen, Thailand.
IUBMB Life. 2022 Dec;74(12):1273-1287. doi: 10.1002/iub.2693. Epub 2022 Nov 17.
Predicting phenotypes and complex traits from genomic variations has long been a major challenge in molecular biology, at least in part because the task is often complicated by the influence of external stimuli and the environment on the regulation of gene expression. With today's abundance of omic data and advances in high-throughput computing and machine learning (ML), we now have an unprecedented opportunity to uncover the missing links and molecular mechanisms that control gene expression and phenotypes. To empower molecular biologists and researchers in related fields to start using ML for in-depth analyses of their large-scale data, we provide here a summary of the fundamental concepts of ML and describe a wide range of research questions and scenarios in molecular biology where ML has been implemented. Because transcriptomic data are abundant, reproducible, and genome-wide in coverage, we focus on transcriptomics and on two ML tasks involving it: (a) predicting transcriptomic profiles or transcription levels from genomic variations in DNA, and (b) predicting phenotypes of interest from transcriptomic profiles or transcription levels. Similar approaches can also be applied to more complex data, such as those from multi-omic studies. We envisage that the concepts and examples described here will raise awareness and promote the application of ML among molecular biologists, and eventually help improve a framework for the systematic design and prediction of gene expression and phenotypes for synthetic biology applications.