Remote homology detection based on oligomer distances.

Authors

Lingner Thomas, Meinicke Peter

Affiliation

Abteilung Bioinformatik, Institut für Mikrobiologie und Genetik, Georg-August-Universität Göttingen, Goldschmidtstr. 1, 37077 Göttingen, Germany.

Publication information

Bioinformatics. 2006 Sep 15;22(18):2224-31. doi: 10.1093/bioinformatics/btl376. Epub 2006 Jul 12.

Abstract

MOTIVATION

Remote homology detection is among the most intensively researched problems in bioinformatics. Currently, discriminative approaches, especially kernel-based methods, provide the most accurate results. However, kernel methods also have several drawbacks: in many cases, prediction for new sequences is computationally expensive; kernels often lack an interpretable model for analyzing characteristic sequence features; and most approaches rely on so-called hyperparameters, which complicate the application of methods across different datasets.

RESULTS

We introduce a feature vector representation for protein sequences based on distances between short oligomers. The corresponding feature space arises from distance histograms for every possible pair of K-mers. Our distance-based approach offers important advantages in computational speed, while on common test data its prediction performance is highly competitive with state-of-the-art methods for protein remote homology detection. Furthermore, the learnt model can easily be analyzed in terms of discriminative features and, in contrast to other methods, our representation does not require any tuning of kernel hyperparameters.
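To make the idea of distance histograms over K-mer pairs concrete, the following is a minimal Python sketch, not the authors' Matlab implementation. The function names, the inclusion of distance 0, the optional `max_dist` cap, and the ordering of vector entries are all assumptions made for illustration; the abstract does not specify them.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids; the alphabet choice is an assumption.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def oligomer_distance_counts(seq, k=1, max_dist=None):
    """Count (kmer_a, kmer_b, distance) triples within one sequence.

    A triple (a, b, d) is counted once for every position i where
    k-mer a starts at i and k-mer b starts at i + d.
    Distance 0 is included here as an illustrative assumption.
    """
    n_pos = len(seq) - k + 1  # number of k-mer start positions
    if n_pos <= 0:
        return Counter()
    if max_dist is None:
        max_dist = n_pos - 1  # largest distance possible in this sequence
    counts = Counter()
    for i in range(n_pos):
        a = seq[i:i + k]
        # Do not run past the last k-mer start position.
        for d in range(min(max_dist, n_pos - 1 - i) + 1):
            b = seq[i + d:i + d + k]
            counts[(a, b, d)] += 1
    return counts


def to_vector(counts, k, max_dist, alphabet=AMINO_ACIDS):
    """Flatten the counts into a fixed-length vector.

    One histogram of length max_dist + 1 is emitted per ordered
    k-mer pair; the ordering is a hypothetical convention.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    return [
        counts.get((a, b, d), 0)
        for a in kmers
        for b in kmers
        for d in range(max_dist + 1)
    ]


if __name__ == "__main__":
    c = oligomer_distance_counts("ACA", k=1)
    print(to_vector(c, k=1, max_dist=2, alphabet="AC"))
```

With the full amino acid alphabet and k = 1, the vector has 20 × 20 × (max_dist + 1) entries, so in this sketch the distance cap is the only quantity that would need to be fixed in advance, for example from the longest sequence in a dataset.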

AVAILABILITY

Normalized kernel matrices for the experimental setup can be downloaded at www.gobics.de/thomas. Matlab code for computing the kernel matrices is available upon request.

CONTACT

thomas@gobics.de, peter@gobics.de.
