Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Shandong, P. R. China.
National Institute of Health Data Science of China, Shandong University, Shandong, P. R. China.
PLoS Comput Biol. 2022 Dec 1;18(12):e1010669. doi: 10.1371/journal.pcbi.1010669. eCollection 2022 Dec.
The ubiquitous availability of genome sequencing data explains the popularity of machine learning-based methods for the prediction of protein properties from their amino acid sequences. Over the years, while revising our own work and reading submitted manuscripts as well as published papers, we have noticed several recurring issues that make some reported findings hard to understand and replicate. We suspect this may be due either to biologists being unfamiliar with machine learning methodology or, conversely, to machine learning experts lacking some of the knowledge needed to correctly apply their methods to proteins. Here, we aim to bridge this gap for developers of such methods. The most striking issues are linked to a lack of clarity: how were annotations of interest obtained; which benchmark metrics were used; how are positives and negatives defined. Others relate to a lack of rigor: if you sneak in structural information, your method is not sequence-based; if you compare your own model to "state-of-the-art," take the best methods; if you want to conclude that some method is better than another, obtain a significance estimate to support this claim. We cover these and other issues in detail. These points may have seemed obvious to the authors during writing; however, they are not always clear-cut to the readers. We also expect many of these tips to hold for other machine learning-based applications in biology. Therefore, many computational biologists who develop methods in this particular subject will benefit from a concise overview of what to avoid and what to do instead.
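As an illustration of the advice to obtain a significance estimate before claiming that one method is better than another, the following is a minimal sketch of a paired sign-flip permutation test. It is one possible approach, not necessarily the procedure the authors recommend. It assumes that both methods were evaluated on the same test proteins and that each prediction is scored as correct (1) or incorrect (0); the function name `paired_permutation_test` and its parameters are hypothetical.

```python
import random


def paired_permutation_test(correct_a, correct_b, n_resamples=10000, seed=0):
    """Two-sided sign-flip permutation test on paired per-protein scores.

    correct_a, correct_b: equal-length lists of 0/1 scores, one entry per
    test protein, for method A and method B (an assumed input format).
    Returns (observed accuracy difference A - B, estimated p-value).
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both methods must be scored on the same proteins")
    if not correct_a:
        raise ValueError("need at least one test protein")

    rng = random.Random(seed)
    n = len(correct_a)
    diffs = [a - b for a, b in zip(correct_a, correct_b)]
    observed = sum(diffs) / n

    # Under the null hypothesis that the two methods are interchangeable,
    # each paired difference is equally likely to have either sign.
    at_least_as_extreme = 0
    for _ in range(n_resamples):
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs) / n
        if abs(permuted) >= abs(observed) - 1e-12:
            at_least_as_extreme += 1

    # Add-one correction keeps the estimated p-value strictly above zero.
    p_value = (at_least_as_extreme + 1) / (n_resamples + 1)
    return observed, p_value
```

A small p-value suggests that the observed accuracy difference is unlikely to arise from chance alone on this test set. The test says nothing about whether the test set itself is representative or free of overlap with the training data.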