School of ITEE, The University of Queensland, Australia.
PLoS One. 2013;8(2):e55656. doi: 10.1371/journal.pone.0055656. Epub 2013 Feb 8.
Phenotype descriptions are important for our understanding of genetics, because they enable the computation and analysis of a wide range of questions about the genetic and developmental bases of correlated characters. The literature contains a wealth of such phenotype descriptions, usually reported as free-text entries similar to typical clinical summaries. In this paper, we create and make available an annotated corpus of skeletal phenotype descriptions. In addition, we present and evaluate a hybrid Machine Learning approach for mining phenotype descriptions from free text. Our hybrid approach uses an ensemble of four classifiers, and we experiment with several techniques for aggregating their outputs. The best-scoring technique achieves an F1 score of 71.52%, which is close to the state of the art in other domains where training data is abundant. Finally, we discuss how the features chosen for the model influence the overall performance of the method.
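To make the idea of aggregating an ensemble concrete, the sketch below combines per-token labels from four classifiers by simple majority vote and scores the result with F1. It is an illustration only: the abstract does not say which aggregation techniques were tried, so majority voting is an assumption here, and the label names `PHENO` and `O`, the tie-breaking rule, and both function names are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]], tie_label: str = "O") -> List[str]:
    """Aggregate per-token labels from several classifiers by simple majority.

    Assumption: every classifier labels the same token sequence. Ties between
    the two most frequent labels fall back to `tie_label` (a hypothetical rule).
    """
    n_tokens = len(predictions[0])
    if any(len(p) != n_tokens for p in predictions):
        raise ValueError("all classifiers must label the same number of tokens")

    aggregated = []
    for labels in zip(*predictions):  # labels given to one token by each classifier
        counts = Counter(labels).most_common()
        top_label, top_count = counts[0]
        if len(counts) > 1 and counts[1][1] == top_count:
            aggregated.append(tie_label)
        else:
            aggregated.append(top_label)
    return aggregated


def f1_score(gold: Sequence[str], predicted: Sequence[str], positive: str = "PHENO") -> float:
    """Token-level F1 for one positive label: harmonic mean of precision and recall."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy outputs of four hypothetical classifiers over the same four tokens.
classifier_outputs = [
    ["O", "PHENO", "PHENO", "O"],
    ["O", "PHENO", "O", "O"],
    ["O", "PHENO", "PHENO", "PHENO"],
    ["O", "O", "PHENO", "PHENO"],
]
gold_labels = ["O", "PHENO", "PHENO", "PHENO"]

ensemble_labels = majority_vote(classifier_outputs)
ensemble_f1 = f1_score(gold_labels, ensemble_labels)
```

In this toy run the last token splits two votes to two, so the tie rule decides its label; alternatives such as weighted voting or training a second-stage model on the classifier outputs would resolve such cases differently.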