Applying Machine Learning to Classify the Origins of Gene Duplications.

Affiliation

Department of Ecology & Evolutionary Biology, University of Arizona, Tucson, AZ, USA.

Publication information

Methods Mol Biol. 2023;2545:91-119. doi: 10.1007/978-1-0716-2561-3_5.

Abstract

Nearly all lineages of land plants have experienced at least one whole-genome duplication (WGD) in their history. The legacy of these ancient WGDs is still observable in the diploidized genomes of extant plants. Genes originating from WGD (paleologs) can be maintained in diploidized genomes for millions of years. These paleologs have the potential to shape plant evolution through sub- and neofunctionalization, increased genetic diversity, and reciprocal gene loss among lineages. Current methods for classifying paleologs often rely on only a subset of potential genomic features, have varying levels of accuracy, and often require significant data and/or computational time. Here, we developed a supervised machine learning approach to classify paleologs from a target WGD in diploidized genomes across a broad range of different duplication histories. We collected empirical data on syntenic block sizes and other genomic features from 27 plant species, each with a different history of paleopolyploidy. Features from these genomes were used to develop simulations of syntenic blocks and paleologs to train a gradient boosted decision tree. Using this approach, Frackify (Fractionation Classify), we were able to accurately identify and classify paleologs across a broad range of parameter space, including cases with multiple overlapping WGDs. We then compared Frackify with other paleolog inference approaches in six species with paleotetraploid and paleohexaploid ancestries. Frackify provides a way to combine multiple genomic features to quickly classify paleologs while providing a high degree of consistency with existing approaches.
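
The general idea of training a gradient boosted classifier on simulated duplicate pairs can be illustrated with a toy sketch. This is not Frackify's implementation: the two features (synonymous divergence `ks` and syntenic block size), their distributions, and all function names are assumptions chosen for illustration, and the model is a minimal pure-Python ensemble of depth-1 trees (stumps) rather than a production gradient boosting library.

```python
import math
import random


def simulate_gene_pairs(n, seed=0):
    """Simulate toy duplicate pairs (all distributions are made-up assumptions).

    Label 1 = paleolog from the target WGD, label 0 = any other duplicate.
    Features: (ks, syntenic_block_size_in_genes).
    """
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n):
        is_paleolog = rng.random() < 0.5
        if is_paleolog:
            ks = rng.gauss(0.8, 0.15)    # clustered near an assumed WGD Ks peak
            block = rng.gauss(40, 10)    # assumed larger syntenic blocks
        else:
            ks = rng.uniform(0.0, 3.0)   # background duplicates span a wide Ks range
            block = rng.gauss(8, 5)      # assumed small or fragmented blocks
        rows.append((ks, max(block, 0.0)))
        labels.append(1 if is_paleolog else 0)
    return rows, labels


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(rows, residuals):
    """Return (feature, threshold, left_value, right_value) for the
    single-threshold split that minimizes squared error on the residuals."""
    n = len(rows)
    total = sum(residuals)
    best = None
    for f in range(len(rows[0])):
        order = sorted(range(n), key=lambda i: rows[i][f])
        left_sum = 0.0
        for k in range(1, n):
            left_sum += residuals[order[k - 1]]
            lo, hi = rows[order[k - 1]][f], rows[order[k]][f]
            if lo == hi:
                continue  # cannot split between identical feature values
            right_sum = total - left_sum
            # Maximizing this quantity is equivalent to minimizing squared error.
            gain = left_sum ** 2 / k + right_sum ** 2 / (n - k)
            if best is None or gain > best[0]:
                best = (gain, f, (lo + hi) / 2, left_sum / k, right_sum / (n - k))
    return best[1:]


def train_boosted_stumps(rows, labels, n_rounds=60, learning_rate=0.5):
    """Gradient boosting with log loss: each stump fits the negative gradient."""
    scores = [0.0] * len(rows)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - sigmoid(s) for y, s in zip(labels, scores)]
        f, thr, left, right = fit_stump(rows, residuals)
        # Store leaf values already scaled by the learning rate.
        stumps.append((f, thr, learning_rate * left, learning_rate * right))
        for i, x in enumerate(rows):
            scores[i] += stumps[-1][2] if x[f] <= thr else stumps[-1][3]
    return stumps


def predict_proba(stumps, x):
    """Probability that feature vector x is a paleolog under the toy model."""
    score = sum(left if x[f] <= thr else right for f, thr, left, right in stumps)
    return sigmoid(score)


train_rows, train_labels = simulate_gene_pairs(400, seed=1)
test_rows, test_labels = simulate_gene_pairs(200, seed=2)
model = train_boosted_stumps(train_rows, train_labels)
predictions = [1 if predict_proba(model, x) >= 0.5 else 0 for x in test_rows]
accuracy = sum(p == y for p, y in zip(predictions, test_labels)) / len(test_labels)
print(f"held-out accuracy on simulated pairs: {accuracy:.2f}")
```

Because the simulated classes are well separated by construction, the held-out accuracy here says nothing about performance on real genomes; the sketch only shows how several genomic features can be combined in one boosted model.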

