Application of latent semantic analysis to protein remote homology detection.

Author information

Dong Qi-Wen, Wang Xiao-Long, Lin Lei

Affiliation

School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China.

Publication information

Bioinformatics. 2006 Feb 1;22(3):285-90. doi: 10.1093/bioinformatics/bti801. Epub 2005 Nov 29.

DOI:10.1093/bioinformatics/bti801

PMID:16317074

Abstract

MOTIVATION

Remote homology detection between protein sequences is a central problem in computational biology. Discriminative methods such as the support vector machine (SVM) are among the most effective approaches. Many SVM-based methods focus on finding useful representations of protein sequences, using either explicit feature vectors or kernel functions. Such representations may suffer from the peaking phenomenon seen in many machine-learning methods, because the feature space is usually very large and noisy data may be introduced. Based on these observations, this research focuses on feature extraction and efficient representation of protein vectors for SVM protein classification.

RESULTS

In this study, latent semantic analysis (LSA), an efficient feature extraction technique from natural language processing, is introduced to protein remote homology detection. Several basic building blocks of protein sequences are investigated as the 'words' of a 'protein sequence language', including N-grams, patterns and motifs. Each protein sequence is treated as a 'document' represented as a bag of words. A word-document matrix is constructed first. LSA is then performed on this matrix to produce latent semantic representation vectors of the protein sequences, which removes noise and yields a compact description of each sequence. The latent semantic representation vectors are then evaluated with an SVM. The method is tested on the SCOP 1.53 database. The results show that the LSA model significantly improves the performance of remote homology detection compared with the basic formalisms. Furthermore, the performance of this method is comparable with that of complex kernel methods such as SVM-LA and better than that of other sequence-based methods such as PSI-BLAST and SVM-pairwise.
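
The pipeline described above (words, word-document matrix, LSA) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes N-grams with N = 2 as the words, raw counts as matrix entries, and a truncated singular value decomposition via NumPy as the LSA step. The function names, the toy sequences and the choice of rank are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import product

import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def ngram_counts(sequence, n=2):
    """Count overlapping N-grams ('words') in one protein sequence ('document')."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


def word_document_matrix(sequences, n=2):
    """Rows are N-gram words, columns are sequences, entries are raw counts."""
    vocab = ["".join(p) for p in product(AMINO_ACIDS, repeat=n)]
    index = {word: i for i, word in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(sequences)))
    for j, seq in enumerate(sequences):
        for word, count in ngram_counts(seq, n).items():
            if word in index:  # skip N-grams with non-standard residues
                matrix[index[word], j] = count
    return matrix


def lsa_vectors(matrix, rank):
    """Truncated SVD: keep only the `rank` largest singular values.

    Returns one `rank`-dimensional latent vector per document (column).
    Scaling the document coordinates by the singular values is an
    assumption of this sketch, not something the abstract specifies.
    """
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (np.diag(s[:rank]) @ vt[:rank]).T


# Hypothetical toy sequences: two near-identical pairs.
toy_sequences = [
    "MKVLAAGIVGLLLAQ",
    "MKVLAAGIAGLLLAQ",
    "GSHMTTPSQEELLKK",
    "GSHMTTPSNEELLKR",
]
W = word_document_matrix(toy_sequences, n=2)
vectors = lsa_vectors(W, rank=2)
```

In a full experiment the rows of `vectors` would be passed to an SVM classifier (for example scikit-learn's `SVC`) in place of the much larger raw count vectors. The rank would be tuned on held-out data rather than fixed at 2.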
