SISSA, 34136, Trieste, Italy.
Centre for Evolution and Cancer, The Institute of Cancer Research, London, SM2 5NG, UK.
BMC Bioinformatics. 2021 Mar 12;22(1):121. doi: 10.1186/s12859-021-04013-x.
The identification of protein families is of great practical importance for in silico protein annotation and underlies several bioinformatic resources. Pfam is possibly the best-known protein family database, built over many years of work by domain experts with extensive use of manual curation. This approach is generally very accurate, but it is time-consuming and may suffer from a bias introduced by the hand curation itself, which is often guided by the available experimental evidence.
We introduce a procedure that aims to identify putative protein families automatically. The procedure is based on Density Peak Clustering and uses only local pairwise alignments between protein sequences as input. In the experiment presented here, we ran the algorithm on about 4000 full-length proteins with at least one domain classified by Pfam as belonging to the Pseudouridine synthase and Archaeosine transglycosylase (PUA) clan. We obtained 71 automatically generated sequence clusters with at least 100 members. While our clusters were largely consistent with the Pfam classification, showing good overlap with either single-domain or multi-domain Pfam family architectures, we also observed some inconsistencies. These were inspected using structural and sequence-based evidence, which suggested that the automatic classification captured evolutionary signals reflecting non-trivial features of protein family architectures. Based on this analysis, we identified a putative novel pre-PUA domain as well as alternative boundaries for a few PUA or PUA-associated families. As a first indication that our approach is unlikely to be clan-specific, we performed the same analysis on the P53 clan and obtained comparable results.
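To make the clustering idea concrete, the following is a minimal sketch of generic Density Peak Clustering on a precomputed distance matrix. It is not the pipeline used in this work: the function name `density_peak_clustering`, the cutoff `d_c`, the choice of `rho * delta` to rank candidate centers, and the toy one-dimensional data are all assumptions made for illustration. In a real application the distances would have to be derived from the pairwise alignments in some way that is not specified here.

```python
def density_peak_clustering(dist, d_c, n_clusters):
    """Generic Density Peak Clustering on a symmetric distance matrix.

    dist       -- list of lists, dist[i][j] is the distance between items i and j
    d_c        -- density cutoff (hypothetical parameter, chosen by the user)
    n_clusters -- number of centers to keep (assumed known for this sketch)
    """
    n = len(dist)
    # Local density: number of other items closer than the cutoff.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    # Items in order of decreasing density; ties are broken by index.
    order = sorted(range(n), key=lambda i: (-rho[i], i))

    global_max = max(max(row) for row in dist)
    delta = [0.0] * n    # distance to the nearest item of higher density
    parent = [-1] * n    # that nearest denser item
    for rank, i in enumerate(order):
        if rank == 0:
            # Densest item: use the largest distance in the matrix, so that
            # it is always selected as a center (a simplifying assumption).
            delta[i] = global_max
            continue
        j = min(order[:rank], key=lambda k: dist[i][k])
        delta[i] = dist[i][j]
        parent[i] = j

    # Centers: items that are both dense and far from any denser item.
    centers = sorted(range(n), key=lambda i: (-rho[i] * delta[i], i))[:n_clusters]
    labels = [-1] * n
    for cluster_id, c in enumerate(centers):
        labels[c] = cluster_id
    # Every other item inherits the label of its nearest denser neighbour;
    # visiting items by decreasing density guarantees the parent is labelled.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels, centers


# Toy usage: two well-separated groups of points on a line.
points = [0, 1, 2, 50, 51, 52]
toy_dist = [[abs(a - b) for b in points] for a in points]
toy_labels, toy_centers = density_peak_clustering(toy_dist, d_c=1.5, n_clusters=2)
```

The sketch fixes the number of clusters in advance for simplicity; in practice, centers are often chosen by inspecting how density and distance to denser items are distributed across the data rather than by a preset count.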
The clustering procedure described in this work takes advantage of the information contained in a large set of pairwise alignments and successfully identifies a set of putative families and family architectures in an unsupervised manner. Comparison with the Pfam classification highlights significant overlap and points to interesting differences, suggesting that our new algorithm has potential in applications related to automatic protein classification. Testing this hypothesis, however, will require further experiments on large and diverse sequence datasets.