Section of Bioinformatics, Division of Systems Medicine, Department of Metabolism, Digestion and Reproduction, Imperial College, 131 Sir Alexander Fleming Building, South Kensington Campus, London, UK.
Metabolomics. 2022 Dec 5;18(12):102. doi: 10.1007/s11306-022-01962-z.
Compound identification remains a critical bottleneck in the process of exploiting Nuclear Magnetic Resonance (NMR) metabolomics data, especially for ¹H 1-dimensional (¹H 1D) data. As databases of reference compound spectra have grown, workflows have come to rely heavily on their search functions, which generate lists of potential metabolites found in complex mixture data and thereby facilitate annotation and identification. However, approaches for validating and communicating annotations are most often guided by expert knowledge, and are therefore highly variable despite repeated efforts to align practices and define community standards.
This review aims to broaden the application of automated annotation tools by discussing the key ideas of spectral matching and beginning to describe a set of terms to classify this information, thus advancing standards for communicating annotation confidence. Additionally, we hope that this review will facilitate the growing collaboration between chemical data scientists, software developers and the NMR metabolomics community, aiding the development of long-term software solutions.
We begin with a brief discussion of the typical untargeted NMR identification workflow. We differentiate between annotation (hypothesis generation, filtering), and identification (hypothesis testing, verification), and note the utility of different NMR data features for annotation. We then touch on three parts of annotation: (1) generation of queries, (2) matching queries to reference data, and (3) scoring and confidence estimation of potential matches for verification. In doing so, we highlight existing approaches to automated and semi-automated annotation from the perspective of the structural information they utilize, as well as how this information can be represented computationally.
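The three annotation steps named above can be illustrated with a minimal Python sketch. It is not a method taken from the review: the function names (`match_fraction`, `annotate`), the chemical-shift tolerance `tol`, the score threshold `min_score`, the placeholder compound names and all peak positions are assumptions chosen only for illustration. A query is represented as a list of peak positions in ppm, matching is a simple within-tolerance comparison against each reference peak list, and the score is the fraction of reference peaks that were found.

```python
from typing import Dict, List, Tuple


def match_fraction(query: List[float], reference: List[float], tol: float = 0.02) -> float:
    """Return the fraction of reference peaks that have a query peak within tol (ppm)."""
    if not reference:
        return 0.0
    hits = sum(1 for r in reference if any(abs(r - q) <= tol for q in query))
    return hits / len(reference)


def annotate(
    query: List[float],
    library: Dict[str, List[float]],
    tol: float = 0.02,
    min_score: float = 0.5,
) -> List[Tuple[str, float]]:
    """Score every library entry against the query and rank the plausible candidates."""
    # Step 2: match the query against each reference peak list.
    scored = [(name, match_fraction(query, peaks, tol)) for name, peaks in library.items()]
    # Step 3: keep candidates above the (assumed) threshold, best score first.
    candidates = [(name, score) for name, score in scored if score >= min_score]
    return sorted(candidates, key=lambda item: item[1], reverse=True)


# Hypothetical reference library: placeholder names and illustrative shifts in ppm.
library = {
    "compound_A": [1.33, 4.11],
    "compound_B": [1.48, 3.78],
    "compound_C": [2.05, 2.35, 3.76],
}

# Step 1: the query, here a peak list picked from a mixture spectrum (illustrative values).
query_peaks = [1.32, 1.47, 2.04, 3.77, 4.10]

ranked = annotate(query_peaks, library)
for name, score in ranked:
    print(f"{name}: {score:.2f}")
```

The ranked list is a set of annotations, that is, hypotheses: a high score only indicates that a candidate is worth verifying, and identification still requires separate testing. Scoring on reference coverage alone ignores intensities, multiplicities and peak overlap, which a practical tool would need to account for.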