Information Retrieval Using Machine Learning for Biomarker Curation in the Exposome-Explorer.

Authors

Lamurias Andre, Jesus Sofia, Neveu Vanessa, Salek Reza M, Couto Francisco M

Affiliations

LASIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, Lisbon, Portugal.

International Agency for Research on Cancer, Lyon, France.

Publication information

Front Res Metr Anal. 2021 Aug 19;6:689264. doi: 10.3389/frma.2021.689264. eCollection 2021.

Abstract

In 2016, the International Agency for Research on Cancer, part of the World Health Organization, released the Exposome-Explorer, the first database dedicated to biomarkers of exposure for environmental risk factors for diseases. The database contents resulted from a manual literature search that yielded over 8,500 citations, but only a small fraction of these publications were used in the final database. Manually curating a database is time-consuming and requires domain expertise to gather relevant data scattered throughout millions of articles. This work proposes a supervised machine learning pipeline to assist the manual literature retrieval process. The manually retrieved corpus of scientific publications used in the Exposome-Explorer served as the training and testing sets for the machine learning models (classifiers). Several parameters and algorithms were evaluated to predict an article's relevance based on different datasets composed of titles, abstracts and metadata. The top-performing classifier was built with the Logistic Regression algorithm using the title and abstract set, achieving an F2-score of 70.1%. Furthermore, we extracted 1,143 entities from these articles with a classifier trained for biomarker entity recognition. Of these, we manually validated 45 new candidate entries to the database. Our methodology reduced the number of articles to be manually screened by the database curators by nearly 90%, while misclassifying only 22.1% of the relevant articles. We expect that this methodology can also be applied to similar biomarker datasets or be adapted to assist the manual curation process of similar chemical or disease databases.
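
The F2-score reported above is an F-beta score with beta = 2, which weights recall more heavily than precision. That emphasis generally suits literature screening, where missing a relevant article tends to cost more than reviewing an irrelevant one. The following is a minimal sketch of the metric only, not the authors' evaluation code: the helper name `f_beta` and the confusion-matrix counts are hypothetical and are not taken from the paper.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """Return the F-beta score from confusion-matrix counts.

    beta > 1 weights recall more than precision; beta = 1 gives the F1-score.
    """
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts for illustration only (NOT results from the paper):
# 80 relevant articles found, 60 irrelevant articles flagged, 20 relevant missed.
tp, fp, fn = 80, 60, 20

recall = tp / (tp + fn)              # share of relevant articles retrieved
f1 = f_beta(tp, fp, fn, beta=1.0)    # balanced precision/recall
f2 = f_beta(tp, fp, fn, beta=2.0)    # recall-weighted, as in an F2-score

print(f"recall={recall:.3f} F1={f1:.3f} F2={f2:.3f}")
```

With these made-up counts recall exceeds precision, so the F2-score comes out higher than the F1-score; a classifier tuned for precision at the expense of recall would show the opposite ordering.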
