

Clustering protein sequences with a novel metric transformed from sequence similarity scores and sequence alignments with neural networks.

Authors

Ma Qicheng, Chirn Gung-Wei, Cai Richard, Szustakowski Joseph D, Nirmala N R

Affiliation

Biomedical Computing, Genome and Proteome Sciences, Novartis Institutes for BioMedical Research, Inc., Cambridge, MA 02139, USA.

Publication

BMC Bioinformatics. 2005 Oct 3;6:242. doi: 10.1186/1471-2105-6-242.

Abstract

BACKGROUND

The sequencing of the human genome has given us access to a comprehensive list of genes (both experimental and predicted) for further analysis. While a majority of the approximately 30,000 known and predicted human coding genes are characterized and have been assigned at least one function, a fair number of genes (about 12,000) remain without any annotation. The recent sequencing of other genomes has provided a large amount of auxiliary sequence data that could help in characterizing the human genes. Clustering these sequences into families is one of the first steps in comparative studies across several genomes.

RESULTS

Here we report a novel clustering algorithm (CLUGEN) that has been used to cluster sequences of experimentally verified and predicted proteins from all sequenced genomes. It uses a novel distance metric: a neural network score computed for a pair of protein sequences. This metric is based on the pairwise sequence similarity score and the similarity between the domain structures of the two sequences. It is the probability that the two sequences belong to the same InterPro family/domain, which facilitates the modelling of transitive homology closure to detect remote homologues. The hierarchical average clustering method is then applied with the new distance metric.
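The clustering step described above can be illustrated with a minimal sketch of average-linkage hierarchical clustering. This is not the CLUGEN implementation. It assumes that the pairwise probabilities have already been produced by some scoring model, that distance is taken as 1 minus that probability, and that merging stops at a fixed distance cutoff. The function name `average_linkage`, the `prob` dictionary, the cutoff value, and the example probabilities are all hypothetical.

```python
from itertools import combinations


def average_linkage(names, prob, cutoff):
    """Agglomerative average-linkage clustering (illustrative sketch).

    names  : list of sequence identifiers
    prob   : dict mapping frozenset({a, b}) to the probability that a and b
             share a family (hypothetical stand-in for a neural network score)
    cutoff : merging stops once the closest clusters are farther apart than this
    """
    def dist(a, b):
        # Assumption: distance = 1 - probability; unlisted pairs count as unrelated.
        return 1.0 - prob.get(frozenset((a, b)), 0.0)

    def cluster_dist(c1, c2):
        # Average linkage: mean distance over all cross-cluster pairs.
        return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    clusters = [[n] for n in names]
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average distance.
        (i, j), d = min(
            (((i, j), cluster_dist(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if d > cutoff:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Made-up probabilities: A-B and B-C are strong, A-C is weak, D is unrelated.
prob = {
    frozenset(("A", "B")): 0.9,
    frozenset(("B", "C")): 0.8,
    frozenset(("A", "C")): 0.3,
    frozenset(("C", "D")): 0.05,
}
families = average_linkage(["A", "B", "C", "D"], prob, cutoff=0.5)
print(families)  # [['A', 'B', 'C'], ['D']]
```

In this toy input, A and C are not similar enough to be linked directly (distance 0.7), yet they end up in one cluster because both are close to B. That is the kind of transitive grouping the abstract refers to, shown here only on invented numbers.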

CONCLUSION

Benchmarking studies of our algorithm against those reported in the literature show that our algorithm provides clustering results with lower false positive and false negative rates. The clustering algorithm has been applied to several eukaryotic genomes and several dozen prokaryotic genomes.
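The abstract does not state how false positives and false negatives were counted. One common convention, assumed here purely for illustration, is to compare every pair of sequences against a reference family assignment: a false positive is a pair placed in the same cluster but in different reference families, and a false negative is a pair in the same reference family but split across clusters. The function name `pairwise_error_rates` and the example labels are hypothetical.

```python
from itertools import combinations


def pairwise_error_rates(predicted, reference):
    """Return (false_positive_rate, false_negative_rate) over sequence pairs.

    predicted, reference : dicts mapping sequence id -> cluster or family label.
    Only ids present in both dicts are compared.
    """
    ids = sorted(set(predicted) & set(reference))
    fp = fn = same_ref = diff_ref = 0
    for a, b in combinations(ids, 2):
        together_pred = predicted[a] == predicted[b]
        together_ref = reference[a] == reference[b]
        if together_ref:
            same_ref += 1
            if not together_pred:
                fn += 1  # reference groups the pair, clustering splits it
        else:
            diff_ref += 1
            if together_pred:
                fp += 1  # clustering groups the pair, reference separates it
    fp_rate = fp / diff_ref if diff_ref else 0.0
    fn_rate = fn / same_ref if same_ref else 0.0
    return fp_rate, fn_rate


# Invented example: cluster 1 wrongly absorbs C, which belongs with D.
predicted = {"A": 1, "B": 1, "C": 1, "D": 2}
reference = {"A": "x", "B": "x", "C": "y", "D": "y"}
rates = pairwise_error_rates(predicted, reference)
print(rates)  # (0.5, 0.5)
```

Pair counting treats all clusters uniformly and needs no matching between predicted clusters and reference families, which is why it is a convenient default. The published benchmark may well use a different definition.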


