Department of Genetics, University of North Carolina at Chapel Hill, 120 Mason Farm Road, UNC-Chapel Hill, Chapel Hill, NC 27599-7264, USA.
Biological and Biomedical Sciences Program, University of North Carolina at Chapel Hill, 130 Mason Farm Road, UNC-Chapel Hill, Chapel Hill, NC 27599-7264, USA.
Syst Biol. 2020 Mar 1;69(2):221-233. doi: 10.1093/sysbio/syz060.
Reconstructing the phylogenetic relationships between species is one of the most formidable tasks in evolutionary biology. Multiple methods exist to reconstruct phylogenetic trees, each with its own strengths and weaknesses. Both simulation and empirical studies have identified several "zones" of parameter space where the accuracy of some methods can plummet, even for four-taxon trees. Further, some methods can have undesirable statistical properties such as statistical inconsistency and/or the tendency to be positively misleading (i.e., to assert strong support for the incorrect tree topology). Recently, deep learning techniques have made inroads on a number of both new and longstanding problems in biological research. In this study, we designed a deep convolutional neural network (CNN) to infer quartet topologies from multiple sequence alignments. This CNN can readily be trained to make inferences using both gapped and ungapped data. We show that our approach is highly accurate on simulated data, often outperforming traditional methods, and is remarkably robust to bias-inducing regions of parameter space such as the Felsenstein zone and the Farris zone. We also demonstrate that the confidence scores produced by our CNN can more accurately assess support for the chosen topology than bootstrap and posterior probability scores from traditional methods. Although numerous practical challenges remain, these findings suggest that deep learning approaches such as ours have the potential to produce more accurate phylogenetic inferences.
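To make the classification framing concrete, the following is a minimal sketch of how a four-taxon alignment, with or without gaps, might be turned into numeric input and how three class scores might be mapped back to a quartet topology. The one-hot encoding, the alphabet ordering, and the names `one_hot_alignment` and `predicted_topology` are illustrative assumptions; the abstract does not specify the authors' preprocessing or network architecture, and no CNN is implemented here.

```python
# Illustrative sketch only: the encoding and names below are assumptions,
# not the preprocessing or architecture described by the authors.

ALPHABET = "ACGT-"  # four nucleotides plus a gap symbol (assumed encoding)

# The three possible unrooted topologies for four taxa A, B, C, D.
QUARTET_TOPOLOGIES = ["((A,B),(C,D))", "((A,C),(B,D))", "((A,D),(B,C))"]


def one_hot_alignment(seqs):
    """Encode a four-taxon alignment as a [taxon][site][channel] nested list."""
    if len(seqs) != 4:
        raise ValueError("a quartet alignment needs exactly four sequences")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("aligned sequences must have equal length")
    encoded = []
    for seq in seqs:
        rows = []
        for ch in seq.upper():
            row = [0] * len(ALPHABET)
            idx = ALPHABET.find(ch)
            if idx >= 0:  # unknown symbols (e.g. N) stay all-zero
                row[idx] = 1
            rows.append(row)
        encoded.append(rows)
    return encoded


def predicted_topology(class_scores):
    """Map three classifier scores to the highest-scoring topology."""
    if len(class_scores) != 3:
        raise ValueError("expected one score per quartet topology")
    best = max(range(3), key=lambda i: class_scores[i])
    return QUARTET_TOPOLOGIES[best], class_scores[best]
```

In a setup like this, a classifier would consume the encoded alignment and emit one score per topology; the largest score would play the role of the confidence score that the abstract compares against bootstrap and posterior probability support.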