Ten quick tips for sequence-based prediction of protein properties using machine learning.

Affiliations

Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Shandong, P. R. China.

National Institute of Health Data Science of China, Shandong University, Shandong, P. R. China.

Publication information

PLoS Comput Biol. 2022 Dec 1;18(12):e1010669. doi: 10.1371/journal.pcbi.1010669. eCollection 2022 Dec.

DOI:10.1371/journal.pcbi.1010669

PMID:36454728

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC9714715/

Abstract

The ubiquitous availability of genome sequencing data explains the popularity of machine learning-based methods for the prediction of protein properties from their amino acid sequences. Over the years, while revising our own work, reading submitted manuscripts as well as published papers, we have noticed several recurring issues, which make some reported findings hard to understand and replicate. We suspect this may be due to biologists being unfamiliar with machine learning methodology, or conversely, machine learning experts may miss some of the knowledge needed to correctly apply their methods to proteins. Here, we aim to bridge this gap for developers of such methods. The most striking issues are linked to a lack of clarity: how were annotations of interest obtained; which benchmark metrics were used; how are positives and negatives defined. Others relate to a lack of rigor: If you sneak in structural information, your method is not sequence-based; if you compare your own model to "state-of-the-art," take the best methods; if you want to conclude that some method is better than another, obtain a significance estimate to support this claim. These, and other issues, we will cover in detail. These points may have seemed obvious to the authors during writing; however, they are not always clear-cut to the readers. We also expect many of these tips to hold for other machine learning-based applications in biology. Therefore, many computational biologists who develop methods in this particular subject will benefit from a concise overview of what to avoid and what to do instead.
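
One way to obtain the significance estimate the abstract asks for, when two methods are scored on the same test proteins, is a paired sign-flip permutation test. The sketch below is an illustration only and is not taken from the paper: the function name `paired_permutation_test`, its parameters, and the example scores are all hypothetical, and it assumes one performance score per protein per method (for example, a per-protein F1).

```python
import random
from statistics import mean


def paired_permutation_test(scores_a, scores_b, n_permutations=10000, seed=0):
    """Two-sided paired sign-flip permutation test (illustrative sketch).

    scores_a and scores_b hold one score per test protein for methods A and B,
    in the same protein order. Returns (mean difference A - B, p-value).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both methods must be scored on the same proteins")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = mean(diffs)
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the two method labels are exchangeable,
        # so each per-protein difference keeps or flips its sign at random.
        permuted = mean(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(permuted) >= abs(observed):
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical per-protein scores, invented for this example.
method_a = [0.71, 0.64, 0.80, 0.58, 0.69, 0.75, 0.62, 0.77]
method_b = [0.66, 0.65, 0.74, 0.55, 0.63, 0.70, 0.60, 0.71]
difference, p = paired_permutation_test(method_a, method_b)
print(f"mean difference = {difference:.3f}, p = {p:.4f}")
```

A paired test is used here because both methods see the same proteins, so per-protein difficulty cancels out in the differences. With few test proteins the smallest attainable p-value is limited by the number of possible sign patterns.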

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/96dd/9714715/f4f8a4965493/pcbi.1010669.g001.jpg
