Caporaso J Gregory, Baumgartner William A, Randolph David A, Cohen K Bretonnel, Hunter Lawrence
Department of Biochemistry and Molecular Genetics, Center for Computational Pharmacology, University of Colorado Health Sciences Center, Aurora, CO, USA.
J Bioinform Comput Biol. 2007 Dec;5(6):1233-59. doi: 10.1142/s0219720007003144.
The primary biomedical literature is being generated at an unprecedented rate, and researchers cannot keep abreast of new developments in their fields. Biomedical natural language processing is being developed to address this issue, but building reliable systems often requires many expert-hours. We present an approach for automatically developing collections of regular expressions to drive high-performance concept recognition systems with minimal human interaction. We applied our approach to develop MutationFinder, a system for automatically extracting mentions of point mutations from text. MutationFinder achieves performance equivalent to or better than manually developed mutation recognition systems, but generating its 759 patterns required only 5.5 expert-hours. We also discuss the development and evaluation of our recently published high-quality, human-annotated gold standard corpus, which contains 1,515 complete point mutation mentions annotated in 813 abstracts. Both MutationFinder and the complete corpus are publicly available at http://mutationfinder.sourceforge.net/.
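To illustrate the general idea of regular-expression-driven mutation recognition described in the abstract, the following is a minimal sketch in Python. It is not MutationFinder's code and does not reproduce any of its 759 patterns. The two patterns, the `find_point_mutations` function, and the normalized `A123T` output form are all assumptions made for this example. It only handles compact mentions such as `A123T` and `Ala123Thr`.

```python
import re

# Three-letter to one-letter amino acid codes (the 20 standard residues).
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

ONE = "ACDEFGHIKLMNPQRSTVWY"
THREE = "|".join(THREE_TO_ONE)

# Hypothetical illustrative patterns, not taken from MutationFinder.
# Each captures a wild-type residue, a sequence position, and a mutant residue.
PATTERNS = [
    # One-letter form, e.g. "A123T"
    re.compile(rf"\b(?P<wt>[{ONE}])(?P<pos>[1-9]\d*)(?P<mut>[{ONE}])\b"),
    # Three-letter form, e.g. "Ala123Thr"
    re.compile(rf"\b(?P<wt>{THREE})(?P<pos>[1-9]\d*)(?P<mut>{THREE})\b"),
]


def normalize(residue):
    """Map a three-letter code to one letter; leave one-letter codes as-is."""
    return THREE_TO_ONE.get(residue, residue)


def find_point_mutations(text):
    """Return a sorted list of normalized mentions such as 'A123T'."""
    found = set()
    for pattern in PATTERNS:
        for m in pattern.finditer(text):
            wt, mut = normalize(m["wt"]), normalize(m["mut"])
            # A point mutation must change the residue.
            if wt != mut:
                found.add(f"{wt}{m['pos']}{mut}")
    return sorted(found)


print(find_point_mutations("The A123T and Ala45Gly substitutions"))
```

Patterns this simple will also match tokens that merely look like mutations, such as some cell line names, which is one reason a practical system needs a much larger and more carefully tuned pattern set.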