Automated annotation of scientific texts for ML-based keyphrase extraction and validation.

Authors

Amusat Oluwamayowa O, Hegde Harshad, Mungall Christopher J, Giannakou Anna, Byers Neil P, Gunter Dan, Fagnan Kjiersten, Ramakrishnan Lavanya

Affiliations

Scientific Data Division, Lawrence Berkeley National Laboratory, 1 Cyclotron road, Berkeley, CA 94720, United States.

Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, 1 Cyclotron road, Berkeley, CA 94720, United States.

Publication information

Database (Oxford). 2024 Sep 27;2024. doi: 10.1093/database/baae093.

DOI:10.1093/database/baae093
PMID:39331731
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11959184/
Abstract

Advanced omics technologies and facilities generate a wealth of valuable data daily; however, the data often lack the essential metadata required for researchers to find, curate, and search them effectively. The lack of metadata poses a significant challenge in the utilization of these data sets. Machine learning (ML)-based metadata extraction techniques have emerged as a potentially viable approach to automatically annotating scientific data sets with the metadata necessary for enabling effective search. Text labeling, usually performed manually, plays a crucial role in validating machine-extracted metadata. However, manual labeling is time-consuming and not always feasible; thus, there is a need to develop automated text labeling techniques in order to accelerate the process of scientific innovation. This need is particularly urgent in fields such as environmental genomics and microbiome science, which have historically received less attention in terms of metadata curation and creation of gold-standard text mining data sets. In this paper, we present two novel automated text labeling approaches for the validation of ML-generated metadata for unlabeled texts, with specific applications in environmental genomics. Our techniques show the potential of two new ways to leverage existing information that is only available for select documents within a corpus to validate ML models, which can then be used to describe the remaining documents in the corpus. The first technique exploits relationships between different types of data sources related to the same research study, such as publications and proposals. The second technique takes advantage of domain-specific controlled vocabularies or ontologies. In this paper, we detail applying these approaches in the context of environmental genomics research for ML-generated metadata validation. 
Our results show that the proposed label assignment approaches can generate both generic and highly specific text labels for the unlabeled texts, with up to 44% of the labels matching those suggested by an ML keyword extraction algorithm.
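
The two labeling ideas described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the plain string matching, the data layout (a dictionary of document links and a dictionary of known keywords), and the example terms are all assumptions made for illustration. The first function borrows labels from documents linked to the same study, the second looks up controlled-vocabulary terms in the text, and the third computes the share of derived labels that also appear among ML-extracted keywords.

```python
import re


def labels_from_linked_documents(doc_id, links, known_keywords):
    """Hypothetical helper: collect labels for an unlabeled document from the
    known keywords of documents linked to the same study (for example,
    publications linked to a proposal)."""
    labels = set()
    for linked_id in links.get(doc_id, []):
        labels.update(k.lower() for k in known_keywords.get(linked_id, []))
    return labels


def labels_from_vocabulary(text, vocabulary):
    """Hypothetical helper: return the controlled-vocabulary terms that occur
    in the text as whole words or phrases (case-insensitive)."""
    lowered = text.lower()
    labels = set()
    for term in vocabulary:
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        if re.search(pattern, lowered):
            labels.add(term.lower())
    return labels


def match_fraction(derived_labels, ml_keywords):
    """Fraction of derived labels that also appear among ML-extracted keywords.
    Exact, case-insensitive matching is an assumption of this sketch."""
    if not derived_labels:
        return 0.0
    ml = {k.lower() for k in ml_keywords}
    return len(derived_labels & ml) / len(derived_labels)


# Illustrative, made-up inputs (not data from the paper).
links = {"proposal-1": ["paper-A", "paper-B"]}
known_keywords = {"paper-A": ["Methane", "soil"], "paper-B": ["soil", "Wetland"]}
text = "Microbial guilds mediate soil methane cycling along a wetland salinity gradient."
vocabulary = ["soil", "methane", "wetland", "root exudate"]

linked_labels = labels_from_linked_documents("proposal-1", links, known_keywords)
vocab_labels = labels_from_vocabulary(text, vocabulary)
overlap = match_fraction(vocab_labels, ["methane", "salinity"])
```

Exact string matching keeps the sketch short. A real pipeline would more likely normalize terms (synonyms, plural forms, ontology identifiers) before comparing labels with model output.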

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dd92/11959184/33668f060d39/baae093fa1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dd92/11959184/d988af5e78ff/baae093f1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dd92/11959184/bae54ac61fbb/baae093f2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dd92/11959184/884fc3f81875/baae093f3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dd92/11959184/c3283b60e84c/baae093f4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dd92/11959184/0ff38de4efe3/baae093f5.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dd92/11959184/efc8f8c5a8a1/baae093f6.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dd92/11959184/80cc8b8131ec/baae093f7.jpg
