Hoffmann Martin A, Kretschmer Fleming, Ludwig Marcus, Böcker Sebastian
Chair for Bioinformatics, Institute for Computer Science, Friedrich-Schiller-University Jena, 07743 Jena, Germany.
Metabolites. 2023 Feb 21;13(3):314. doi: 10.3390/metabo13030314.
Metabolites provide a direct functional signature of cellular state. Untargeted metabolomics usually relies on mass spectrometry, a technology capable of detecting thousands of compounds in a biological sample. Metabolites are annotated using tandem mass spectrometry. Spectral library search is far from comprehensive, and numerous compounds remain unannotated. So-called in silico methods overcome the restrictions of spectral libraries by searching much larger molecular structure databases. Yet, after more than a decade of method development, in silico methods still do not reach the rates of correct annotation that users would wish for. Here, we present a novel computational method called Mad Hatter for this task. Mad Hatter combines CSI:FingerID results with information from the searched structure database via a metascore. This compound information includes the melting point and the number of words in the compound description that start with the letter 'u'. We then show that Mad Hatter reaches a stunning 97.6% correct annotations when searching PubChem, one of the largest and most comprehensive molecular structure databases. Unfortunately, Mad Hatter is not a real method. Rather, we developed Mad Hatter solely to demonstrate common issues in computational method development and evaluation. We explain which evaluation glitches were necessary for Mad Hatter to reach this annotation level, what is wrong with similar metascores in general, and why metascores may screw up not only method evaluations but also the analysis of biological experiments. This paper may serve as an example of problems in the development and evaluation of machine learning models for metabolite annotation.