The value of an in-domain lexicon in genomics QA.

Authors

Sasaki Yutaka, McNaught John, Ananiadou Sophia

Affiliation

National Centre for Text Mining, School of Computer Science, University of Manchester, MIB, 131 Princess Street, Manchester M1 7DN, United Kingdom.

Publication

J Bioinform Comput Biol. 2010 Feb;8(1):147-61. doi: 10.1142/s0219720010004513.

Abstract

This paper demonstrates that a large-scale lexicon tailored for the biology domain is effective in improving question analysis for genomics Question Answering (QA). We use the TREC Genomics Track data to evaluate the performance of different question analysis methods. It is hard to process textual information in biology, especially in molecular biology, due to a huge number of technical terms which rarely appear in general English documents and dictionaries. To support biological Text Mining, we have developed a domain-specific resource, the BioLexicon. Started in 2006 from scratch, this lexicon currently includes more than four million biomedical terms consisting of newly curated terms and terms collected from existing biomedical databases. While conventional genomics QA systems provide query expansion based on thesauri and dictionaries, it is not clear to what extent a biology-oriented lexical resource is effective for question pre-processing for genomics QA. Experiments on the genomics QA data set show that question analysis using the BioLexicon performs slightly better than that using n-grams and the UMLS Specialist Lexicon.
