Mahbub Sazan, Sawmya Shashata, Saha Arpita, Reaz Rezwana, Rahman M Sohel, Bayzid Md Shamsuzzoha
Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh.
Department of Computer Science, University of Maryland, College Park, Maryland, USA.
J Comput Biol. 2022 Nov;29(11):1156-1172. doi: 10.1089/cmb.2022.0212. Epub 2022 Sep 1.
Species tree estimation is frequently based on phylogenomic approaches that use multiple genes from throughout the genome. However, for a combination of reasons (ranging from sampling biases to biological causes such as gene birth and loss), gene trees are often incomplete, meaning that the species of interest do not all share a common set of genes. Incomplete gene trees can potentially impact the accuracy of phylogenomic inference. We introduce, for the first time, the problem of imputing the quartet distribution induced by a set of incomplete gene trees, which involves adding the missing quartets back to the quartet distribution. We present Quartet based Gene tree Imputation using Deep Learning (QT-GILD), an automated and specially tailored unsupervised deep learning technique, drawing on cues from natural language processing, which learns the quartet distribution in a given set of incomplete gene trees and generates a complete set of quartets accordingly. QT-GILD is a general-purpose technique that requires no explicit modeling of the subject system or of the reasons for missing data or gene tree heterogeneity. Experimental studies on a collection of simulated and empirical datasets suggest that QT-GILD can effectively impute the quartet distribution, which results in a dramatic improvement in species tree accuracy. Remarkably, QT-GILD not only imputes the missing quartets but can also account for gene tree estimation error. QT-GILD therefore advances the state of the art in species tree estimation from gene trees in the face of missing data.
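To make the notion of a quartet distribution induced by incomplete gene trees concrete, the following minimal Python sketch counts the quartet topologies induced by a few toy gene trees and reports, for each set of four taxa, how many gene trees fail to contribute a quartet. It is an illustration only and not the QT-GILD method: the nested-tuple tree encoding, the function names, the toy trees, and the definition of "shortfall" are all assumptions made for this example, and the deep learning imputation step itself is not shown.

```python
from collections import Counter
from itertools import combinations


def clusters(tree):
    """Return (leaf set, leaf sets of all internal subtrees) for a nested-tuple tree."""
    if isinstance(tree, str):          # a leaf is just a taxon name
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_found = clusters(child)
        leaves |= child_leaves
        found.extend(child_found)
    found.append(leaves)
    return leaves, found


def quartet(a, b, c, d):
    """Hashable key for the quartet topology ab|cd (hypothetical helper)."""
    return frozenset([frozenset([a, b]), frozenset([c, d])])


def induced_quartets(tree):
    """All quartet topologies induced by one gene tree on the taxa it contains."""
    leaves, found = clusters(tree)
    result = set()
    for four in combinations(sorted(leaves), 4):
        four_set = frozenset(four)
        for cluster in found:
            side = cluster & four_set
            if len(side) == 2:         # this cluster separates the four taxa 2 vs 2
                result.add(frozenset([side, four_set - side]))
                break
    return result


def quartet_distribution(trees):
    """Count how many gene trees induce each quartet topology."""
    counts = Counter()
    for tree in trees:
        counts.update(induced_quartets(tree))
    return counts


def shortfall(trees, taxa):
    """For each four-taxon set, the number of gene trees that induce no quartet on it."""
    per_set = Counter()
    for q, n in quartet_distribution(trees).items():
        per_set[frozenset().union(*q)] += n
    return {frozenset(f): len(trees) - per_set[frozenset(f)]
            for f in combinations(sorted(taxa), 4)}


# Toy input (assumed for illustration): taxon E is missing from two of three gene trees.
gene_trees = [
    (("A", "B"), ("C", ("D", "E"))),
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
]
distribution = quartet_distribution(gene_trees)
gaps = shortfall(gene_trees, "ABCDE")
```

In this toy input, every four-taxon set containing E is supported by only one of the three gene trees; those gaps are the kind of missing quartets that an imputation method would aim to fill.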