Clustering Highly Divergent Homologous Proteins: An Alignment-Free Method.

Author information

Muñoz-Baena Laura, Poon Art F Y

Affiliations

Department of Microbiology and Immunology, Western University, London, Ontario, Canada.

Department of Pathology and Laboratory Medicine, Western University, London, Ontario, Canada.

Publication information

Curr Protoc. 2023 Feb;3(2):e666. doi: 10.1002/cpz1.666.

DOI:10.1002/cpz1.666

PMID:36809686

Abstract

The comparative analysis of amino acid sequences is an important tool in molecular biology that often requires multiple sequence alignments. In comparisons between less closely related genomes, however, it becomes more difficult to accurately align protein-coding sequences, or even to identify homologous regions in different genomes. In this article, we describe an alignment-free method for the classification of homologous protein-coding regions from different genomes. This methodology was originally developed for comparing genomes within virus families, but may be adapted for other organisms. We quantify sequence homology from the overlap (intersection distance) of the k-mer (word) frequency distributions for different protein sequences. Next, we extract groups of homologous sequences from the resulting distance matrix using a combination of dimensionality reduction and hierarchical clustering methods. Finally, we demonstrate how to generate visualizations of the composition of clusters with respect to protein annotations, and by coloring protein-coding regions of genomes by cluster assignments. These provide a useful means to quickly assess the reliability of the clustering results based on the distribution of homologous genes among genomes. © 2023 Wiley Periodicals LLC.

Basic Protocol 1: Data collection and processing
Basic Protocol 2: Calculating k-mer distances
Basic Protocol 3: Extracting clusters of homology
Support Protocol: Genome plot based on clustering results
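As an illustration of the k-mer step described in the abstract, the following is a minimal sketch of an intersection distance between two protein sequences. It is not the authors' implementation. The function names (`kmer_freqs`, `intersection_distance`, `distance_matrix`), the default word length `k=3`, and the normalization (one minus the summed minimum of relative k-mer frequencies) are assumptions made for illustration; the published protocol may normalize differently.

```python
from collections import Counter
from itertools import combinations


def kmer_freqs(seq, k=3):
    """Return the relative frequency of each overlapping k-mer (word) in seq."""
    if len(seq) < k:
        raise ValueError("sequence is shorter than the word length k")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def intersection_distance(seq_a, seq_b, k=3):
    """Assumed definition: 1 minus the overlap of two k-mer distributions.

    The overlap is the sum, over words present in both sequences, of the
    smaller of the two relative frequencies. Identical distributions give a
    distance near 0; sequences sharing no k-mers give exactly 1.
    """
    fa, fb = kmer_freqs(seq_a, k), kmer_freqs(seq_b, k)
    overlap = sum(min(fa[w], fb[w]) for w in fa.keys() & fb.keys())
    return 1.0 - overlap


def distance_matrix(seqs, k=3):
    """Build a symmetric pairwise distance table from a {name: sequence} dict."""
    names = list(seqs)
    dist = {(name, name): 0.0 for name in names}
    for a, b in combinations(names, 2):
        d = intersection_distance(seqs[a], seqs[b], k)
        dist[(a, b)] = dist[(b, a)] = d
    return dist
```

A table built this way could then be converted to an array and passed to a dimensionality reduction and hierarchical clustering step of the kind the abstract mentions; that step is omitted here because the abstract does not specify which methods are used.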
