Gladstone Institutes, San Francisco, CA, USA.
Department of Genetics, Stanford University, Stanford, CA, USA.
Nat Rev Genet. 2022 Mar;23(3):169-181. doi: 10.1038/s41576-021-00434-9. Epub 2021 Nov 26.
The scale of genetic, epigenomic, transcriptomic, cheminformatic and proteomic data available today, coupled with easy-to-use machine learning (ML) toolkits, has propelled the application of supervised learning in genomics research. However, the assumptions behind the statistical models and performance evaluations in ML software frequently are not met in biological systems. In this Review, we illustrate the impact of several common pitfalls encountered when applying supervised ML in genomics. We explore how the structure of genomics data can bias performance evaluations and predictions. To address the challenges associated with applying cutting-edge ML methods to genomics, we describe solutions and appropriate use cases where ML modelling shows great potential.
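As a rough illustration of how data structure can bias a performance evaluation, the sketch below builds a toy dataset in which samples come in groups of near-duplicates that share a label, then scores a 1-nearest-neighbour classifier with two cross-validation schemes. Everything here is an assumption made for illustration and is not taken from the Review: the synthetic data, the group structure, the classifier, and the helper names (`make_data`, `random_folds`, `grouped_folds`, `cv_accuracy`) are all hypothetical. Labels are assigned at random per group, so there is no generalisable signal to learn.

```python
import random


def make_data(n_groups=100, per_group=4, seed=0):
    """Toy data: samples within a group are near-duplicates sharing one random label."""
    rng = random.Random(seed)
    X, y, groups = [], [], []
    for g in range(n_groups):
        label = rng.randint(0, 1)  # random per group: no signal beyond group identity
        for _ in range(per_group):
            X.append(g + rng.uniform(-0.1, 0.1))  # members cluster tightly around g
            y.append(label)
            groups.append(g)
    return X, y, groups


def random_folds(n, k, rng):
    """Split sample indices into k folds, ignoring group membership."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def grouped_folds(groups, k, rng):
    """Split sample indices into k folds so that each group falls in exactly one fold."""
    ids = sorted(set(groups))
    rng.shuffle(ids)
    fold_of = {g: i % k for i, g in enumerate(ids)}
    folds = [[] for _ in range(k)]
    for i, g in enumerate(groups):
        folds[fold_of[g]].append(i)
    return folds


def cv_accuracy(X, y, folds):
    """Pooled cross-validated accuracy of a 1-nearest-neighbour classifier."""
    n = len(X)
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [j for j in range(n) if j not in held_out]
        for i in test_idx:
            nearest = min(train_idx, key=lambda j: abs(X[i] - X[j]))
            correct += y[nearest] == y[i]
    return correct / n


X, y, groups = make_data()
rng = random.Random(1)
random_acc = cv_accuracy(X, y, random_folds(len(X), 5, rng))
grouped_acc = cv_accuracy(X, y, grouped_folds(groups, 5, rng))
print(f"random-split CV accuracy:  {random_acc:.2f}")
print(f"group-aware CV accuracy:   {grouped_acc:.2f}")
```

In this sketch the random split lets the classifier find a near-duplicate of most test samples in the training set, so its accuracy reflects memorised group identity. The group-aware split holds out whole groups, and the score falls towards chance, which is the honest estimate for data with no learnable signal.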