Amol University of Special Modern Technologies, Mazandaran, Iran.
School of Electrical and Computer Engineering, University of Tehran, Tehran, Iran.
BMC Bioinformatics. 2023 Apr 12;24(1):144. doi: 10.1186/s12859-023-05236-w.
Extracting associations between single nucleotide polymorphisms (SNPs) and phenotypes from biomedical literature is a vital task in BioNLP. Several methods have recently been developed to extract mutation-disease associations. However, no available method for extracting SNP-phenotype associations from text considers their degree of certainty. In this paper, several machine learning methods were developed to extract ranked SNP-phenotype associations from biomedical abstracts and were then compared with each other. The methods comprise shallow machine learning methods (random forest, logistic regression, and decision tree), two kernel-based methods (subtree and local context), a rule-based method, a deep CNN-LSTM-based method, and two BERT-based methods. The experiments indicated that, although the linguistic features used could support an association extraction method that outperforms its kernel-based counterparts, the deep learning and BERT-based methods performed best; among all the developed methods, PubMedBERT-LSTM performed best. Similar experiments were conducted to estimate the degree of certainty of each extracted association, which can be used to assess the strength of the reported association. In those experiments, our proposed PubMedBERT-CNN-LSTM method outperformed the other sophisticated methods on the task.
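As an illustration of what a "ranked" association could look like in practice, the following minimal Python sketch stores each extracted SNP-phenotype pair with a certainty label and orders the pairs from most to least certain. It is not the authors' implementation: the three-level scale (`weak`, `moderate`, `strong`), the `Association` record, the `rank_associations` helper, and the placeholder identifiers are all assumptions made for this example, since the abstract does not name the certainty levels or the output format.

```python
from dataclasses import dataclass

# Hypothetical certainty scale; the abstract does not specify the levels.
CERTAINTY_RANK = {"weak": 0, "moderate": 1, "strong": 2}


@dataclass(frozen=True)
class Association:
    """One extracted SNP-phenotype association (illustrative record)."""
    snp: str        # SNP identifier, e.g. an rs number
    phenotype: str  # phenotype mention from the abstract
    certainty: str  # one of the keys in CERTAINTY_RANK


def rank_associations(associations):
    """Return the associations sorted from most to least certain."""
    return sorted(
        associations,
        key=lambda a: CERTAINTY_RANK[a.certainty],
        reverse=True,
    )


# Made-up placeholder records, not results from the paper.
examples = [
    Association("rs0000001", "phenotype A", "weak"),
    Association("rs0000002", "phenotype B", "strong"),
    Association("rs0000003", "phenotype C", "moderate"),
]

ranked = rank_associations(examples)
for assoc in ranked:
    print(assoc.snp, assoc.phenotype, assoc.certainty)
```

In a real pipeline the certainty label would come from a trained classifier such as the ones compared above; here it is supplied by hand only to show the ranking step.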