Institute for Genomic Diversity, Cornell University, Ithaca, NY 14853.
Proc Natl Acad Sci U S A. 2019 Mar 19;116(12):5542-5549. doi: 10.1073/pnas.1814551116. Epub 2019 Mar 6.
Deep learning methodologies have revolutionized prediction in many fields and show potential to do the same in molecular biology and genetics. However, applying these methods in their current forms ignores evolutionary dependencies within biological systems and can result in false positives and spurious conclusions. We developed two approaches that account for evolutionary relatedness in machine learning models: (i) gene-family-guided splitting and (ii) ortholog contrasts. The first approach accounts for evolution by constraining the model training and testing sets to contain different gene families, so that no family contributes genes to both sets. The second approach uses evolutionarily informed comparisons between orthologous genes to both control for and leverage evolutionary divergence during the training process. The two approaches were explored and validated in the context of mRNA expression level prediction, where they achieved area under the ROC curve (auROC) values ranging from 0.75 to 0.94. Inspection of the model weights showed biologically interpretable patterns, leading to the hypothesis that the 3' UTR is more important for fine-tuning mRNA abundance levels, while the 5' UTR is more important for large-scale changes.
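The idea behind gene-family-guided splitting can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `family_guided_split`, the gene identifiers, the family labels, and the `test_fraction` parameter are all hypothetical, and the sketch assumes that a gene-to-family assignment is already available. It only shows the constraint the abstract describes, namely that whole families are assigned to either the training set or the testing set, never both.

```python
import random
from collections import defaultdict


def family_guided_split(gene_to_family, test_fraction=0.2, seed=0):
    """Split genes into train/test so that no gene family spans both sets.

    gene_to_family: dict mapping a gene ID to its family ID (hypothetical input).
    test_fraction: approximate share of genes to place in the test set.
    """
    # Group genes by family so that each family moves as one unit.
    families = defaultdict(list)
    for gene, family in gene_to_family.items():
        families[family].append(gene)

    # Shuffle family IDs reproducibly; sorting first makes the order
    # independent of dict insertion order.
    family_ids = sorted(families)
    random.Random(seed).shuffle(family_ids)

    # Assign whole families to the test set until the target size is
    # reached, then send the remaining families to the training set.
    target = test_fraction * len(gene_to_family)
    train, test = [], []
    for fam in family_ids:
        if len(test) < target:
            test.extend(families[fam])
        else:
            train.extend(families[fam])
    return train, test


# Toy example with made-up genes and families (illustration only).
genes = {
    "geneA1": "fam1", "geneA2": "fam1",
    "geneB1": "fam2", "geneB2": "fam2",
    "geneC1": "fam3", "geneC2": "fam3",
    "geneD1": "fam4", "geneD2": "fam4",
}
train_genes, test_genes = family_guided_split(genes, test_fraction=0.25)

train_families = {genes[g] for g in train_genes}
test_families = {genes[g] for g in test_genes}
print("shared families:", train_families & test_families)
```

A plain random split over genes would instead allow close paralogs to land on both sides of the split, which is the kind of evolutionary dependency the abstract warns can inflate apparent performance. Because families are kept whole, the realized test share can overshoot `test_fraction` when families are large.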