European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Cambridge, United Kingdom; National Institute of Informatics, Tokyo, Japan.
PLoS One. 2013 Oct 14;8(10):e72965. doi: 10.1371/journal.pone.0072965. eCollection 2013.
The identification of phenotype descriptions in the scientific literature, case reports and patient records is a rewarding task for biomedical text mining. Any progress will support knowledge discovery and linkage to other resources. However, because phenotype descriptions vary widely, a number of challenges remain in their identification and semantic normalisation before they can be fully exploited for research purposes. This paper presents novel techniques for identifying potential complex phenotype mentions by exploiting a hybrid model based on machine learning, rules and dictionary matching. We systematically study how to combine sequence labels from these modules, as well as the merits of various ontological resources. We evaluated our approach on a subset of Medline abstracts related to auto-immune diseases that are cited by the Online Mendelian Inheritance in Man (OMIM) database. Using partial matching, the best micro-averaged F-score for phenotypes and five other entity classes was 79.9%. A best performance of 75.3% was achieved for phenotype candidates using all semantic resources. We observed an advantage for SVM-based learning-to-rank over maximum entropy and a priority list approach when combining sequence labels. The results indicate that the identification of simple entity types such as chemicals and genes is robustly supported by single semantic resources, whereas phenotypes require combinations of resources. Altogether, we conclude that our approach coped well with the compositional structure of phenotypes in the auto-immune domain.
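As an illustration of the evaluation measure named in the abstract, the following is a minimal sketch of a micro-averaged F-score under partial matching. The abstract does not state the exact matching criterion, so the rule used here (a predicted span counts as correct when it shares the entity class and at least one character with a gold span) is an assumption, and the names `Span`, `overlaps` and `micro_f1_partial` are hypothetical.

```python
from typing import List, Tuple

# (start offset, end offset exclusive, entity class). Hypothetical representation.
Span = Tuple[int, int, str]


def overlaps(a: Span, b: Span) -> bool:
    """True when two spans have the same class and share at least one character."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]


def micro_f1_partial(gold_docs: List[List[Span]], pred_docs: List[List[Span]]) -> float:
    """Micro-averaged F-score: counts are pooled over all documents and classes.

    Assumption: partial matching means any same-class character overlap.
    """
    matched_pred = matched_gold = n_pred = n_gold = 0
    for gold, pred in zip(gold_docs, pred_docs):
        n_gold += len(gold)
        n_pred += len(pred)
        # Predicted spans that partially match some gold span (for precision).
        matched_pred += sum(any(overlaps(p, g) for g in gold) for p in pred)
        # Gold spans that are partially matched by some prediction (for recall).
        matched_gold += sum(any(overlaps(g, p) for p in pred) for g in gold)
    precision = matched_pred / n_pred if n_pred else 0.0
    recall = matched_gold / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    gold = [[(0, 10, "PHENOTYPE"), (20, 25, "GENE")]]
    pred = [[(3, 10, "PHENOTYPE"), (20, 25, "CHEMICAL"), (30, 35, "GENE")]]
    print(round(micro_f1_partial(gold, pred), 3))  # precision 1/3, recall 1/2
```

Pooling the counts before computing precision and recall is what makes the score micro-averaged: frequent entity classes contribute more than rare ones, in contrast to a macro average that would weight each class equally.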