Programa de pós-graduação em Engenharia Elétrica, Universidade Federal do Pará, Belém, Pará, Brazil.
Biological Science Institute, Universidade Federal do Pará, Belém, Pará, Brazil.
PeerJ. 2022 May 5;10:e13351. doi: 10.7717/peerj.13351. eCollection 2022.
Antimicrobial resistance is a significant public health problem worldwide. In recent years, the scientific community has intensified efforts to combat it: many experiments have been conducted and many articles published in this area. However, the growing volume of biological literature makes biocuration more difficult because of the cost and time it requires. Modern text mining tools that adopt artificial intelligence can assist this research. In this article, we propose a text mining model capable of identifying and ranking scientific articles by priority in the context of antimicrobial resistance. We retrieved scientific articles from the PubMed database, adopted machine learning techniques to generate vector representations of the retrieved articles, and measured their similarity to the context. This process yielded a dataset labeled "Relevant" and "Irrelevant", which we used to train a supervised learning algorithm to classify new records. The model reached 90% overall accuracy, and the F-measure (the harmonic mean of precision and recall) reached 82% for the positive class and 93% for the negative class, indicating good quality in identifying scientific articles relevant to the context. The dataset, scripts and models are available at https://github.com/engbiopct/TextMiningAMR.
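The labeling step and the reported per-class F-measure can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which vector representation or similarity threshold was used, so the raw term-count vectors, the `threshold=0.2` default, the example texts, and all function names below are assumptions chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def label_by_similarity(abstracts, context, threshold=0.2):
    """Label each abstract by its similarity to a context description.

    The threshold is a hypothetical value, not one taken from the paper.
    """
    context_vec = Counter(tokenize(context))
    labels = []
    for text in abstracts:
        score = cosine_similarity(Counter(tokenize(text)), context_vec)
        labels.append("Relevant" if score >= threshold else "Irrelevant")
    return labels


def f_measure(y_true, y_pred, positive):
    """F-measure for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative inputs (made up for this sketch, not from the dataset).
context = "antimicrobial resistance genes in bacteria"
abstracts = [
    "Antimicrobial resistance genes were detected in clinical bacteria isolates",
    "Deep learning improves image segmentation of satellite data",
]
labels = label_by_similarity(abstracts, context)

# Per-class F-measure on a small hand-made set of true and predicted labels.
y_true = ["Relevant", "Irrelevant", "Irrelevant", "Relevant"]
y_pred = ["Relevant", "Irrelevant", "Relevant", "Relevant"]
f_relevant = f_measure(y_true, y_pred, "Relevant")
f_irrelevant = f_measure(y_true, y_pred, "Irrelevant")
```

Computing the F-measure separately for each class, as the abstract does, shows how a classifier can score differently on "Relevant" and "Irrelevant" records even when overall accuracy is high.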