Machine learning on normalized protein sequences.

Author information

Heider Dominik, Verheyen Jens, Hoffmann Daniel

Affiliation

Department of Bioinformatics, Center of Medical Biotechnology, University of Duisburg-Essen, Universitaetsstr. 2, 45117 Essen, Germany.

Publication information

BMC Res Notes. 2011 Mar 31;4:94. doi: 10.1186/1756-0500-4-94.

DOI:10.1186/1756-0500-4-94

PMID:21453485

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3079662/

Abstract

BACKGROUND

Machine learning techniques have been widely applied to biological sequences, e.g. to predict HIV-1 drug resistance from sequences of drug target proteins, or to predict protein functional classes. As deletions and insertions are frequent in biological sequences, a major limitation of current methods is their inability to handle varying sequence lengths.

FINDINGS

We propose to normalize sequences to uniform length. To this end, we tested one linear and four different non-linear interpolation methods for the normalization of sequence lengths of 19 classification datasets. Classification tasks included prediction of HIV-1 drug resistance from drug target sequences and sequence-based prediction of protein function. We applied random forests to the classification of sequences into "positive" and "negative" samples. Statistical tests showed that the linear interpolation outperforms the non-linear interpolation methods in most of the analyzed datasets, while in a few cases non-linear methods had a small but significant advantage. Compared to other published methods, our prediction scheme leads to an improvement in prediction accuracy by up to 14%.
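The length normalization described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract names neither the numeric descriptor nor the target length, so the Kyte-Doolittle hydropathy scale and the helper names `encode` and `normalize_length` are assumptions chosen for illustration. Each residue is mapped to a number, and the resulting profile is resampled by linear interpolation onto a fixed number of evenly spaced points.

```python
# Kyte-Doolittle hydropathy values per amino acid.
# ASSUMPTION: the abstract does not say which descriptor was used.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(seq):
    """Map a protein sequence to a list of numeric descriptor values."""
    return [HYDROPATHY[aa] for aa in seq.upper()]


def normalize_length(values, target_len):
    """Resample `values` to `target_len` points by linear interpolation.

    The first and last input values are preserved; points in between are
    weighted averages of their two nearest neighbours in the input.
    """
    n = len(values)
    if n == 0:
        raise ValueError("cannot normalize an empty sequence")
    if target_len < 1:
        raise ValueError("target_len must be at least 1")
    if n == 1 or target_len == 1:
        return [values[0]] * target_len

    step = (n - 1) / (target_len - 1)  # spacing in input coordinates
    out = []
    for j in range(target_len):
        x = j * step
        i = min(int(x), n - 2)  # left neighbour, clamped at the end
        frac = x - i            # distance from the left neighbour
        out.append(values[i] * (1 - frac) + values[i + 1] * frac)
    return out


# Two sequences of different length become vectors of equal length.
short = normalize_length(encode("ACDEFGHIK"), 12)
long_ = normalize_length(encode("ACDEFGHIKLMNPQRSTVWY"), 12)
```

Fixed-length vectors such as `short` and `long_` can then be passed as feature rows to a standard classifier such as a random forest, which is the classifier the abstract reports using.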

CONCLUSIONS

We found that machine learning on sequences normalized by simple linear interpolation gave better or at least competitive results compared to state-of-the-art procedures, and thus, is a promising alternative to existing methods, especially for protein sequences of variable length.
