IEEE Trans Nanobioscience. 2018 Jul;17(3):172-180. doi: 10.1109/TNB.2018.2838137. Epub 2018 May 18.
Phenotypic descriptions are an important factor in the development of named entity recognition, yet biomedical literature expresses them in many different ways and with complicated semantics. In this paper, we propose a novel approach that identifies plant phenotypes by cascading a word embedding method into a sentence embedding method. The word embedding method first finds high-frequency phenotypes, and the original sentences are then used as input to the sentence embedding method. In this way, a variety of complicated phenotypic expressions can be recognized accurately. We also compared state-of-the-art word representation models and selected skip-gram with negative sampling because it performed best. To evaluate our approach, we applied it to a dataset of 56,748 PubMed abstracts on the model organism Arabidopsis thaliana. The experimental results showed that our approach yielded the best performance, achieving a 2.588-fold increase in the number of new phenotypic descriptions compared to the original phenotype ontology.
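The cascade described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy three-dimensional vectors stand in for vectors trained with skip-gram with negative sampling, the helper names (`similar_words`, `sentence_vector`, `rank_sentences`) are hypothetical, and averaging word vectors is a simple stand-in for a sentence embedding method, not necessarily the one used in the paper.

```python
import math

# Toy word vectors, invented for illustration. A real run would load vectors
# trained with skip-gram with negative sampling on the PubMed abstracts.
TOY_VECTORS = {
    "dwarf": [0.9, 0.1, 0.0],
    "stunted": [0.8, 0.2, 0.1],
    "growth": [0.7, 0.3, 0.0],
    "late": [0.1, 0.9, 0.0],
    "flowering": [0.2, 0.8, 0.1],
    "kinase": [0.0, 0.1, 0.9],
    "binds": [0.1, 0.0, 0.8],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similar_words(seed, vectors, threshold):
    """Stage 1 (word level): words whose vectors lie close to a seed phenotype word."""
    seed_vec = vectors[seed]
    return sorted(
        word
        for word, vec in vectors.items()
        if word != seed and cosine(seed_vec, vec) >= threshold
    )


def sentence_vector(sentence, vectors):
    """Stage 2 (sentence level): average the vectors of the known words.

    Averaging is an assumed stand-in for a sentence embedding method.
    Returns None when no word of the sentence has a vector.
    """
    known = [vectors[tok] for tok in sentence.lower().split() if tok in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def rank_sentences(query, sentences, vectors):
    """Order sentences by similarity to a phenotype phrase, best match first."""
    query_vec = sentence_vector(query, vectors)
    scored = []
    for sentence in sentences:
        vec = sentence_vector(sentence, vectors)
        if vec is not None:
            scored.append((cosine(query_vec, vec), sentence))
    return sorted(scored, reverse=True)


candidates = similar_words("dwarf", TOY_VECTORS, threshold=0.9)
ranked = rank_sentences(
    "stunted growth",
    ["plants show dwarf growth", "the kinase binds", "late flowering"],
    TOY_VECTORS,
)
```

With these toy vectors, the word-level stage groups "growth" and "stunted" with the seed "dwarf", and the sentence-level stage ranks the sentence about dwarf growth above the two unrelated sentences. The threshold of 0.9 is arbitrary and would need tuning on real embeddings.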