Center for Proteomics and Metabolomics, Leiden University Medical Center, 2300 RC Leiden, The Netherlands.
Institute of Chemistry, University of Tartu, Ravila 14a, 50411 Tartu, Estonia.
Anal Chem. 2022 Nov 8;94(44):15464-15471. doi: 10.1021/acs.analchem.2c03565. Epub 2022 Oct 25.
A major obstacle to reusing and integrating existing data is finding the data most relevant to a given context. The primary source of metadata is the scientific literature describing the experiments that produced the data. To stimulate the development of natural language processing methods for extracting this information from articles, we manually annotated 100 recent open access publications in Analytical Chemistry as semantic graphs. We focused on articles that mention mass spectrometry in their experimental sections, both because this topic is of particular interest to us and because it falls within the domain of several ontologies and controlled vocabularies. The resulting gold standard dataset is publicly available and directly applicable to validating automated methods for retrieving this metadata from the literature. In the process, we also made a number of observations on the structure and description of experiments and on open access publication in this journal.
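As an illustration of how a gold standard of this kind might be used for validation, the following is a minimal sketch that scores automatically extracted graph edges against manually annotated ones using precision, recall, and F1. It assumes each semantic graph can be flattened into (subject, predicate, object) triples and that matching is exact. The `Triple` type, the `score` function, and the example labels are hypothetical and are not taken from the published dataset or its file format.

```python
from typing import Iterable, NamedTuple, Tuple


class Triple(NamedTuple):
    """One edge of a semantic graph (hypothetical flattened representation)."""
    subject: str
    predicate: str
    obj: str


def score(gold: Iterable[Triple], predicted: Iterable[Triple]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) using exact matching of triples."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical annotations for one article (labels invented for illustration).
gold = [
    Triple("sample", "analyzed_by", "mass spectrometry"),
    Triple("mass spectrometry", "uses_instrument", "Orbitrap"),
    Triple("sample", "separated_by", "liquid chromatography"),
]
# Hypothetical output of an automated extraction method.
predicted = [
    Triple("sample", "analyzed_by", "mass spectrometry"),
    Triple("sample", "separated_by", "gas chromatography"),
]

precision, recall, f1 = score(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Exact matching is the strictest choice. An evaluation that maps terms to ontology identifiers before comparison would give credit for synonyms, at the cost of depending on the quality of that mapping.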