Spasić Irena, Schober Daniel, Sansone Susanna-Assunta, Rebholz-Schuhmann Dietrich, Kell Douglas B, Paton Norman W
Manchester Centre for Integrative Systems Biology, The University of Manchester, 131 Princess Street, Manchester, M1 7ND, UK.
BMC Bioinformatics. 2008 Apr 29;9 Suppl 5(Suppl 5):S5. doi: 10.1186/1471-2105-9-S5-S5.
Many bioinformatics applications rely on controlled vocabularies or ontologies to consistently interpret and seamlessly integrate information scattered across public resources. Experimental data sets from metabolomics studies need to be integrated with one another, but also with data produced by other types of omics studies in the spirit of systems biology, hence the pressing need for vocabularies and ontologies in metabolomics. However, constructing these resources manually is time-consuming and nontrivial.
We describe a methodology for rapid development of controlled vocabularies, a study originally motivated by the need for vocabularies describing metabolomics technologies. We present case studies involving two controlled vocabularies (for nuclear magnetic resonance spectroscopy and gas chromatography) whose development is currently underway as part of the Metabolomics Standards Initiative. The initial vocabularies were compiled manually, providing 243 and 152 terms, respectively. A further 5,699 and 2,612 new terms, respectively, were acquired automatically from the literature. Analysis of the results showed that full-text articles (especially the Materials and Methods sections), rather than paper abstracts, are the major source of technology-specific terms.
We suggest a text mining method for efficient corpus-based term acquisition as a way of rapidly expanding a set of controlled vocabularies with the terms used in the scientific literature. We adopted an integrative approach, combining relatively generic software and data resources for time- and cost-effective development of a text mining tool for expansion of controlled vocabularies across various domains, as a practical alternative to both manual term collection and tailor-made named entity recognition methods.
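As a rough illustration of what corpus-based term acquisition can look like, the sketch below collects word n-grams from a small corpus, discards any that contain a stopword or already appear in a manually compiled seed vocabulary, and keeps those that recur. This is not the tool described in the paper: the frequency-based n-gram filter is a simplified stand-in, and the names `STOPWORDS`, `candidate_terms` and `acquire_terms`, the stopword list, and the `min_count` threshold are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumed, deliberately tiny stopword list (not taken from the paper).
STOPWORDS = {"the", "a", "an", "of", "in", "and", "was", "were", "with",
             "by", "to", "for", "on", "at", "is", "are", "from"}


def candidate_terms(text, max_len=3):
    """Yield stopword-free word n-grams (1..max_len words) from each sentence."""
    for sentence in re.split(r"[.;:!?]", text):
        tokens = re.findall(r"[a-z][a-z0-9\-]*", sentence.lower())
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                if not any(tok in STOPWORDS for tok in gram):
                    yield " ".join(gram)


def acquire_terms(corpus, seed_vocabulary, min_count=2):
    """Return (term, count) pairs for recurring candidates not in the seed vocabulary."""
    seed = {term.lower() for term in seed_vocabulary}
    counts = Counter()
    for document in corpus:
        counts.update(candidate_terms(document))
    return [(term, count) for term, count in counts.most_common()
            if count >= min_count and term not in seed]


if __name__ == "__main__":
    corpus = [
        "Samples were analysed by gas chromatography with a capillary column.",
        "The capillary column was heated. Gas chromatography was used.",
    ]
    # "gas chromatography" is already in the seed list, so it is not reported.
    for term, count in acquire_terms(corpus, ["gas chromatography"]):
        print(term, count)
```

A raw frequency cut-off like this one also lets through fragments such as single words from a longer term, so in practice the candidates would still need ranking by a proper termhood measure and review by a curator before being added to a vocabulary.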