Vishnyakova Dina, Pasche Emilie, Gobeill Julien, Gaudinat Arnaud, Lovis Christian, Ruch Patrick
University Hospitals of Geneva, Geneva, Switzerland.
Stud Health Technol Inform. 2012;180:210-4.
We present a new approach to biomedical document classification and prioritization for the Comparative Toxicogenomics Database (CTD). The approach is motivated by needs such as literature curation, in particular as applied to the domain of human health and the environment. The unique integration of chemical, gene/protein and disease data in the biomedical literature may advance the identification of exposure and disease biomarkers, mechanisms of chemical action, and the complex aetiologies of chronic diseases. Our approach aims to assist biomedical researchers searching for articles relevant to CTD. The task is functionally defined as a binary classification task in which the selected articles must also be ranked by relevance. We design an SVM classifier, which combines three main feature sets: an information retrieval system (EAGLi), a biomedical named-entity recognizer (MeSH term extraction), a gene normalization (GN) service (NormaGene) and an ad hoc keyword recognizer for diseases and chemicals. The gene identification module was evaluated on the BioCreative III test data. Disease normalization is achieved with 95% precision and 93% recall. The classification was evaluated on the corpus provided by the BioCreative organizers in 2012. The approach showed promising performance on the test data.
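The task described above, a binary decision followed by a ranking of the selected articles, can be illustrated with a minimal sketch. It assumes a linear decision function of the kind a trained linear SVM would produce; the weights, bias, feature layout, article identifiers and feature values are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Article:
    pmid: str              # hypothetical identifier, for illustration only
    features: List[float]  # assumed layout: [retrieval score, MeSH term hits,
                           #                 normalized gene hits, keyword hits]


def decision_score(weights: List[float], bias: float, features: List[float]) -> float:
    """Signed score of a linear decision function: w . x + b."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def classify_and_rank(articles: List[Article], weights: List[float],
                      bias: float) -> List[Tuple[str, float]]:
    """Keep articles with a positive score, then sort them by descending score."""
    scored = [(a.pmid, decision_score(weights, bias, a.features)) for a in articles]
    selected = [(pmid, score) for pmid, score in scored if score > 0]
    return sorted(selected, key=lambda pair: pair[1], reverse=True)


# Hypothetical parameters standing in for a trained model.
WEIGHTS = [1.0, 0.5, 0.5, 0.25]
BIAS = -2.0

articles = [
    Article("A", [0.9, 3, 2, 4]),
    Article("B", [0.2, 0, 1, 0]),
    Article("C", [0.6, 2, 1, 2]),
]

ranked = classify_and_rank(articles, WEIGHTS, BIAS)
for pmid, score in ranked:
    print(pmid, round(score, 2))
```

Using the signed score for both steps means one model yields the selection and the ordering; whether the authors ranked by SVM decision value is not stated in the abstract.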