Leveraging genomic redundancy to improve inference and alignment of orthologous proteins.

Affiliations

Howard Hughes Medical Institute, University of California Berkeley, Berkeley, CA 94720, USA.

Department of Molecular and Cell Biology, University of California Berkeley, Berkeley, CA 94720, USA.

Publication information

G3 (Bethesda). 2023 Dec 6;13(12). doi: 10.1093/g3journal/jkad222.

DOI:10.1093/g3journal/jkad222

PMID:37770067

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10700111/

Abstract

Identifying protein sequences with common ancestry is a core task in bioinformatics and evolutionary biology. However, methods for inferring and aligning such sequences in annotated genomes have not kept pace with the increasing scale and complexity of the available data. Thus, in this work, we implemented several improvements to the traditional methodology that more fully leverage the redundancy of closely related genomes and the organization of their annotations. Two highlights include the application of the more flexible k-clique percolation algorithm for identifying clusters of orthologous proteins and the development of a novel technique for removing poorly supported regions of alignments with a phylogenetic hidden Markov model (phylo-HMM). In making the latter, we wrote a fully documented Python package Homomorph that implements standard HMM algorithms and created a set of tutorials to promote its use by a wide audience. We applied the resulting pipeline to a set of 33 annotated Drosophila genomes, generating 22,813 orthologous groups and 8,566 high-quality alignments.
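To make the clustering idea concrete, the following is a minimal sketch of k-clique percolation on a toy graph. It is not the authors' pipeline code: the function names, the brute-force clique search, and the protein labels are all assumptions chosen for illustration. In this scheme, two k-cliques belong to the same community when they are connected through a chain of k-cliques in which neighbors share k-1 nodes.

```python
from itertools import combinations


def find_k_cliques(adj, k):
    """Return every set of k mutually adjacent nodes (brute force, toy scale only)."""
    cliques = []
    for nodes in combinations(sorted(adj), k):
        if all(v in adj[u] for u, v in combinations(nodes, 2)):
            cliques.append(frozenset(nodes))
    return cliques


def k_clique_communities(edges, k):
    """Merge k-cliques that share k-1 nodes and return the resulting node sets."""
    # Build an undirected adjacency map from the edge list.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    cliques = find_k_cliques(adj, k)

    # Union-find over clique indices.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Two distinct k-cliques are adjacent when they overlap in exactly k-1 nodes.
    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)

    # Each community is the union of the nodes of its merged cliques.
    communities = {}
    for i, clique in enumerate(cliques):
        communities.setdefault(find(i), set()).update(clique)
    return sorted(communities.values(), key=lambda c: sorted(c))


# Hypothetical graph: nodes are proteins, edges are reciprocal best hits.
toy_edges = [
    ("A", "B"), ("B", "C"), ("A", "C"),  # triangle A-B-C
    ("B", "D"), ("C", "D"),              # triangle B-C-D, shares B and C with the first
    ("D", "E"),                          # bridging edge that lies in no triangle
    ("E", "F"), ("F", "G"), ("E", "G"),  # triangle E-F-G
]
print([sorted(c) for c in k_clique_communities(toy_edges, 3)])
# [['A', 'B', 'C', 'D'], ['E', 'F', 'G']]
```

With k = 3, the two overlapping triangles merge into one group, while the single edge between D and E does not join the groups because it is part of no triangle. The brute-force search is only practical for small graphs; a real analysis would need a more efficient clique enumeration.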
