Vishnyakova Dina, Pasche Emilie, Teodoro Douglas, Lovis Christian, Ruch Patrick
BiTeM Group.
Stud Health Technol Inform. 2012;174:89-93.
We present a new approach to normalizing pathogens and gene products in the biomedical literature. The approach is motivated by needs such as literature curation, particularly in the field of infectious diseases, where both bacterial species (S. aureus, Staphylococcus aureus, ...) and their gene products (protein ArsC, Arsenical pump modifier, Arsenate reductase, ...) appear under many name variants. It combines an Ontology Look-up Service (OLS), a Gene Ontology Categorizer (GOCat) and gene normalization (GN) methods. In the pathogen detection task, OLS is used to disambiguate the pathogen names found. GOCat results are incorporated into the overall scoring system to support and confirm decision making in the normalization of pathogens and their genomes. The evaluation was carried out on two test sets of the BioCreative III (BCIII) benchmark: a gold standard of 50 manually curated articles and a silver standard of 507 articles curated from the collective results of BCIII participants. For cross-species GN we achieved a precision of 46% on the silver set and 27% on the gold set. Pathogen normalization reached 95% precision and 93% recall. GOCat improves the results of both pathogen and gene normalization, chiefly by confirming identified pathogens and by boosting correct gene identifiers to the top of the results list ranked by confidence. Correct identification of the pathogen can significantly improve normalization effectiveness and resolve gene ambiguity.
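The abstract does not give the scoring formula, so the following is only a minimal sketch of the general idea of boosting gene candidates whose species matches a confirmed pathogen before ranking by confidence. The `GeneCandidate` type, the `rerank` function, the additive `boost` value, the gene identifiers and the scores are all hypothetical and chosen purely for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneCandidate:
    gene_id: str       # hypothetical identifier, not from the paper
    species: str       # species the candidate gene is attributed to
    base_score: float  # assumed confidence from a gene normalization step


def rerank(candidates, confirmed_species, boost=0.5):
    """Sort candidates by confidence, highest first.

    Candidates whose species is in `confirmed_species` receive an additive
    bonus. The additive form and the default value are assumptions.
    """
    def final_score(candidate):
        bonus = boost if candidate.species in confirmed_species else 0.0
        return candidate.base_score + bonus

    return sorted(candidates, key=final_score, reverse=True)


# Illustrative data only.
candidates = [
    GeneCandidate("gene-A", "Escherichia coli", 0.70),
    GeneCandidate("gene-B", "Staphylococcus aureus", 0.60),
    GeneCandidate("gene-C", "Bacillus subtilis", 0.40),
]
confirmed = {"Staphylococcus aureus"}

ranked = rerank(candidates, confirmed)
print([c.gene_id for c in ranked])
```

With these made-up numbers, the candidate from the confirmed species moves ahead of a candidate that had a higher base score, which mirrors the effect described for GOCat in the abstract without claiming to reproduce the authors' scoring.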