Crasto Chiquito J, Marenco Luis N, Migliore Michele, Mao Buqing, Nadkarni Prakash M, Miller Perry, Shepherd Gordon M
Center for Medical Informatics, Yale University, New Haven, CT, USA.
Neuroinformatics. 2003;1(3):215-37. doi: 10.1385/NI:1:3:215.
We have developed a program, NeuroText, to populate the neuroscience databases in SenseLab (http://senselab.med.yale.edu/senselab) by mining the natural-language text of neuroscience articles. NeuroText uses a two-step approach to identify relevant articles. The first step (pre-processing), which aims at 100% sensitivity, identifies abstracts containing database keywords. In the second step, the potentially relevant abstracts identified in the first step are processed for specificity, as dictated by the database architecture and by neuroscience, lexical, and semantic contexts. NeuroText results were presented to experts for validation through a dynamically generated interface that also allows expert-validated articles to be deposited automatically into the databases. Of the test set of 912 articles, 735 were rejected at the pre-processing step. For the remaining 177 articles, the accuracy of predicting database-relevant articles was 85%. Twenty-two articles were erroneously identified, and NeuroText deferred decisions on 29 articles to the expert. A comparison of NeuroText results with the experts' analyses revealed that the program failed to identify an article's relevance correctly when the relevant concepts did not yet exist in the knowledgebase or when the abstract presented the information vaguely. NeuroText uses two "evolution" techniques (supervised and unsupervised) that play an important role in the continual improvement of the retrieval results. Software that uses the NeuroText approach can facilitate the creation of curated, special-interest bibliographic databases.
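The two-step approach described in the abstract can be illustrated with a minimal Python sketch. This is not the NeuroText implementation: the keyword list, the context cues, the substring-matching rule, and the function names `preprocess` and `classify` are all hypothetical stand-ins chosen for illustration. The sketch only shows the general shape of a high-sensitivity keyword pre-filter followed by a stricter context check that can defer to an expert.

```python
import re

# Hypothetical database keywords and context cues (assumptions for
# illustration only; they are not taken from the NeuroText knowledgebase).
DATABASE_KEYWORDS = {"sodium channel", "mitral cell", "dendrite"}
SUPPORTING_CUES = {"recorded", "expressed", "current", "conductance"}
NEGATING_CUES = {"no evidence", "not detected", "absent"}


def normalize(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def preprocess(abstract: str) -> set[str]:
    """Step 1: return every database keyword found in the abstract.

    Plain substring matching keeps sensitivity high at the cost of
    false positives, which step 2 is meant to weed out.
    """
    text = normalize(abstract)
    return {kw for kw in DATABASE_KEYWORDS if kw in text}


def classify(abstract: str) -> str:
    """Step 2: apply a stricter context check to abstracts that passed step 1."""
    if not preprocess(abstract):
        return "rejected"  # no database keyword: dropped at pre-processing
    text = normalize(abstract)
    support = any(cue in text for cue in SUPPORTING_CUES)
    negate = any(cue in text for cue in NEGATING_CUES)
    if support and not negate:
        return "relevant"
    if negate and not support:
        return "not relevant"
    return "deferred"  # ambiguous context: leave the decision to the expert
```

Calling `classify("Sodium channel currents were recorded in mitral cell dendrites.")` returns `"relevant"`, while an abstract that mentions a keyword without any context cue, such as `"The role of the mitral cell remains unclear."`, returns `"deferred"`. The explicit `"deferred"` outcome mirrors the abstract's statement that some decisions are passed to the expert rather than forced by the program.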