
Information Retrieval Using Machine Learning for Biomarker Curation in the Exposome-Explorer.

Authors

Andre Lamurias, Sofia Jesus, Vanessa Neveu, Reza M. Salek, Francisco M. Couto

Affiliations

LASIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, Lisbon, Portugal.

International Agency for Research on Cancer, Lyon, France.

Publication information

Front Res Metr Anal. 2021 Aug 19;6:689264. doi: 10.3389/frma.2021.689264. eCollection 2021.

DOI: 10.3389/frma.2021.689264
PMID: 34490412
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8417071/
Abstract

In 2016, the International Agency for Research on Cancer, part of the World Health Organization, released the Exposome-Explorer, the first database dedicated to biomarkers of exposure for environmental risk factors for diseases. The database contents resulted from a manual literature search that yielded over 8,500 citations, but only a small fraction of these publications were used in the final database. Manually curating a database is time-consuming and requires domain expertise to gather relevant data scattered throughout millions of articles. This work proposes a supervised machine learning pipeline to assist the manual literature retrieval process. The manually retrieved corpus of scientific publications used in the Exposome-Explorer was used as training and testing sets for the machine learning models (classifiers). Several parameters and algorithms were evaluated to predict an article's relevance based on different datasets made of titles, abstracts and metadata. The top-performing classifier was built with the Logistic Regression algorithm using the title and abstract set, achieving an F2-score of 70.1%. Furthermore, we extracted 1,143 entities from these articles with a classifier trained for biomarker entity recognition. Of these, we manually validated 45 new candidate entries to the database. Our methodology reduced the number of articles to be manually screened by the database curators by nearly 90%, while only misclassifying 22.1% of the relevant articles. We expect that this methodology can also be applied to similar biomarker datasets or be adapted to assist the manual curation process of similar chemical or disease databases.
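
As a reading aid: the F2-score mentioned in the abstract is the F-beta measure with beta = 2, which weights recall more heavily than precision and so suits screening tasks where missing a relevant article is costlier than reviewing an irrelevant one. The sketch below shows how such a score is computed from confusion counts. The function name `fbeta_score` and the counts used are illustrative assumptions, not the authors' code or figures from the paper.

```python
def fbeta_score(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """Compute F-beta from confusion counts.

    tp: relevant articles correctly flagged
    fp: irrelevant articles flagged as relevant
    fn: relevant articles missed
    beta > 1 weights recall above precision (beta = 2 gives the F2-score).
    """
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts for illustration only (not taken from the paper).
tp, fp, fn = 80, 60, 20
f1 = fbeta_score(tp, fp, fn, beta=1.0)
f2 = fbeta_score(tp, fp, fn, beta=2.0)
print(f"F1={f1:.3f} F2={f2:.3f}")
```

With these made-up counts recall (0.80) exceeds precision (about 0.57), so the F2 value comes out higher than F1, which is the effect of weighting recall more heavily.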

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3cd4/8417071/709ffb0a1098/frma-06-689264-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3cd4/8417071/81c08448363b/frma-06-689264-g001.jpg
