National Centre for Text Mining, School of Computer Science, University of Manchester, Manchester, UK.
Bioinformatics. 2010 Mar 1;26(5):661-7. doi: 10.1093/bioinformatics/btq002. Epub 2010 Jan 6.
Text mining technologies have been shown to reduce the laborious work involved in organizing the vast amount of information hidden in the literature. One challenge in text mining is linking ambiguous word forms to unambiguous biological concepts. This article reports a comprehensive study of resolving the ambiguity in mentions of biomedical named entities with respect to model organisms and presents an array of approaches, with a focus on methods that use natural language parsers.
We build a corpus for organism disambiguation in which every occurrence of a protein/gene entity is manually tagged with a species ID, and evaluate a number of methods on it. Promising results are obtained by training a machine learning model on syntactic parse trees; the model is then used to decide whether an entity belongs to the model organism denoted by a neighbouring species-indicating word (e.g. yeast). The parser-based approaches are also compared with a supervised classification method, and the results indicate that the former are the more favourable choice when domain portability is a concern. The best overall performance is obtained by combining the strengths of syntactic features and supervised classification.
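To make the task concrete, the following is a minimal sketch of a much simpler proximity baseline, not the parser-based or supervised methods evaluated in the article: it assigns an entity the species ID of the nearest species-indicating word in the same sentence. The lexicon `SPECIES_WORDS`, the tokenizer and the function names are hypothetical and chosen only for illustration; the IDs are NCBI Taxonomy identifiers.

```python
import re

# Hypothetical mini-lexicon: species-indicating word -> NCBI Taxonomy ID.
# A real system would use a far larger dictionary.
SPECIES_WORDS = {
    "human": 9606,
    "mouse": 10090,
    "yeast": 4932,
}


def tokenize(sentence):
    """Split a sentence into simple alphanumeric tokens (illustrative only)."""
    return re.findall(r"[A-Za-z0-9-]+", sentence)


def nearest_species(tokens, entity_index):
    """Return the taxonomy ID of the species word closest to the entity.

    Distance is measured in tokens; ties go to the earlier species word.
    Returns None when the sentence contains no known species word.
    """
    best = None  # (distance, tax_id)
    for i, tok in enumerate(tokens):
        tax_id = SPECIES_WORDS.get(tok.lower())
        if tax_id is None:
            continue
        distance = abs(i - entity_index)
        if best is None or distance < best[0]:
            best = (distance, tax_id)
    return best[1] if best else None


tokens = tokenize("Yeast Cdc28 is homologous to human CDK1")
cdc28_species = nearest_species(tokens, tokens.index("Cdc28"))
cdk1_species = nearest_species(tokens, tokens.index("CDK1"))
```

A token-distance heuristic of this kind ignores sentence structure, which is the gap that features drawn from syntactic parse trees are meant to close.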
The corpus and demo are available at http://www.nactem.ac.uk/deca_details/start.cgi, and the software is freely available as U-Compare components (Kano et al., 2009): NaCTeM Species Word Detector and NaCTeM Species Disambiguator. U-Compare is available at http://u-compare.org/