Density Peak clustering of protein sequences associated to a Pfam clan reveals clear similarities and interesting differences with respect to manual family annotation.

Affiliations

SISSA, 34136, Trieste, Italy.

Centre for Evolution and Cancer, The Institute of Cancer Research, London, SM2 5NG, UK.

Publication information

BMC Bioinformatics. 2021 Mar 12;22(1):121. doi: 10.1186/s12859-021-04013-x.

DOI:10.1186/s12859-021-04013-x

PMID:33711918

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC7955657/

Abstract

BACKGROUND

The identification of protein families is of outstanding practical importance for in silico protein annotation and underlies several bioinformatic resources. Pfam is possibly the best-known protein family database, built over many years of work by domain experts with extensive use of manual curation. This approach is generally very accurate, but it is quite time-consuming and may suffer from a bias introduced by the hand-curation itself, which is often guided by the available experimental evidence.

RESULTS

We introduce a procedure that aims to automatically identify putative protein families. The procedure is based on Density Peak Clustering and uses as input only local pairwise alignments between protein sequences. In the experiment we present here, we ran the algorithm on about 4000 full-length proteins with at least one domain classified by Pfam as belonging to the Pseudouridine synthase and Archaeosine transglycosylase (PUA) clan. We obtained 71 automatically generated sequence clusters with at least 100 members. While our clusters were largely consistent with the Pfam classification, showing good overlap with either single- or multi-domain Pfam family architectures, we also observed some inconsistencies. The latter were inspected using structural and sequence-based evidence, which suggested that the automatic classification captured evolutionary signals reflecting non-trivial features of protein family architectures. Based on this analysis, we identified a putative novel pre-PUA domain as well as alternative boundaries for a few PUA or PUA-associated families. As a first indication that our approach was unlikely to be clan-specific, we performed the same analysis on the P53 clan, obtaining comparable results.
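For readers unfamiliar with Density Peak Clustering, the following is a minimal sketch of its basic form, applied to a precomputed distance matrix. It is not the procedure used in the paper: the abstract does not say how alignment scores are turned into distances, how densities are estimated, or how cluster centers are selected. The function name `density_peak_clustering`, the cutoff `d_c`, and the fixed `n_clusters` are all assumptions made for illustration.

```python
import numpy as np

def density_peak_clustering(dist, d_c, n_clusters):
    """Basic Density Peak Clustering on a symmetric distance matrix.

    dist       : (n, n) array of pairwise distances (hypothetical input,
                 e.g. derived from alignment scores in some way).
    d_c        : cutoff distance used to count neighbours (assumed parameter).
    n_clusters : number of centers to pick (assumed; chosen by the user here).
    """
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]

    # Local density: number of other points closer than d_c
    # (subtract 1 to exclude the point itself, whose distance is 0).
    rho = (dist < d_c).sum(axis=1) - 1

    # Visit points from highest to lowest density; ties broken by index.
    order = np.argsort(-rho, kind="stable")

    delta = np.zeros(n)                 # distance to nearest denser point
    nearest_denser = np.full(n, -1)     # index of that point
    for rank, i in enumerate(order):
        if rank == 0:
            # Densest point: by convention, delta is its largest distance.
            delta[i] = dist[i].max()
            continue
        denser = order[:rank]
        j = denser[np.argmin(dist[i, denser])]
        delta[i] = dist[i, j]
        nearest_denser[i] = j

    # Centers combine high density with a large distance to denser points.
    gamma = rho * delta
    gamma[order[0]] = np.inf            # the densest point is always a center
    centers = np.argsort(-gamma, kind="stable")[:n_clusters]

    # Every other point inherits the label of its nearest denser neighbour.
    labels = np.full(n, -1)
    labels[centers] = np.arange(n_clusters)
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[nearest_denser[i]]
    return labels, centers


# Toy example: two well-separated groups on a line.
points = np.array([0.0, 0.1, 0.2, 0.3, 10.0, 10.1, 10.2])
toy_dist = np.abs(points[:, None] - points[None, :])
toy_labels, toy_centers = density_peak_clustering(toy_dist, d_c=0.25, n_clusters=2)
```

In this toy case the first four points end up in one cluster and the last three in another. Real sequence data would need a distance derived from the alignments and a principled choice of centers, neither of which this sketch attempts.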

CONCLUSIONS

The clustering procedure described in this work takes advantage of the information contained in a large set of pairwise alignments and successfully identifies a set of putative families and family architectures in an unsupervised manner. Comparison with the Pfam classification highlights significant overlap and points to interesting differences, suggesting that our new algorithm could have potential in applications related to automatic protein classification. Testing this hypothesis, however, will require further experiments on large and diverse sequence datasets.