Dobrokhotov Pavel B, Goutte Cyril, Veuthey Anne-Lise, Gaussier Eric
Swiss Institute of Bioinformatics, CMU, 1 Michel-Servet, CH-1211 Genève 4, Switzerland.
Int J Med Inform. 2005 Mar;74(2-4):317-24. doi: 10.1016/j.ijmedinf.2004.04.017.
Biomedical knowledge bases are valuable resources for the research community, and original scientific publications are the main source used to annotate them. Medical annotation in Swiss-Prot specifically targets finding and extracting data about human genetic diseases and polymorphisms. Curators have to scan hundreds of publications to select the relevant ones. Bio-text mining techniques can greatly reduce this workload. Using a combination of natural language processing (NLP) techniques and statistical classifiers, we achieve a recall of up to 84% on potentially interesting documents and a precision of more than 96% in detecting irrelevant documents. Careful analysis of the document pre-processing chain allows us to measure the impact of some steps on the overall result and to test different classifier configurations. The best combination was used to build a prototype search and classification tool that is currently being tested by the database curators.
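The two figures quoted in the abstract are measured on different classes: recall on the potentially interesting documents, and precision on the documents flagged as irrelevant. The following Python sketch shows how such per-class metrics could be computed. The function names, class labels, and toy data are assumptions made for illustration; they are not taken from the paper and do not reproduce its results.

```python
def recall(y_true, y_pred, label):
    """Fraction of documents truly in `label` that were predicted as `label`."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    actual = sum(1 for t in y_true if t == label)
    return true_pos / actual if actual else 0.0


def precision(y_true, y_pred, label):
    """Fraction of documents predicted as `label` that truly are `label`."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    predicted = sum(1 for p in y_pred if p == label)
    return true_pos / predicted if predicted else 0.0


# Toy curator judgements and classifier outputs (invented, not from the paper).
y_true = ["relevant", "relevant", "relevant",
          "irrelevant", "irrelevant", "irrelevant", "irrelevant", "irrelevant"]
y_pred = ["relevant", "relevant", "irrelevant",
          "irrelevant", "irrelevant", "irrelevant", "relevant", "irrelevant"]

# Recall on the interesting class: how many relevant papers were kept.
relevant_recall = recall(y_true, y_pred, "relevant")
# Precision on the irrelevant class: how safe it is to discard flagged papers.
irrelevant_precision = precision(y_true, y_pred, "irrelevant")

print(f"recall (relevant): {relevant_recall:.2f}")
print(f"precision (irrelevant): {irrelevant_precision:.2f}")
```

In a triage setting like the one described, high precision on the irrelevant class means that documents the classifier discards are rarely ones a curator would have wanted, while recall on the relevant class indicates how many of the wanted documents survive the filter.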