Amusat Oluwamayowa O, Hegde Harshad, Mungall Christopher J, Giannakou Anna, Byers Neil P, Gunter Dan, Fagnan Kjiersten, Ramakrishnan Lavanya
Scientific Data Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, United States.
Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, United States.
Database (Oxford). 2024 Sep 27;2024. doi: 10.1093/database/baae093.
Advanced omics technologies and facilities generate a wealth of valuable data daily; however, the data often lack the essential metadata required for researchers to find, curate, and search them effectively. The lack of metadata poses a significant challenge in the utilization of these data sets. Machine learning (ML)-based metadata extraction techniques have emerged as a potentially viable approach to automatically annotating scientific data sets with the metadata necessary for enabling effective search. Text labeling, usually performed manually, plays a crucial role in validating machine-extracted metadata. However, manual labeling is time-consuming and not always feasible; thus, there is a need to develop automated text labeling techniques in order to accelerate the process of scientific innovation. This need is particularly urgent in fields such as environmental genomics and microbiome science, which have historically received less attention in terms of metadata curation and creation of gold-standard text mining data sets. In this paper, we present two novel automated text labeling approaches for the validation of ML-generated metadata for unlabeled texts, with specific applications in environmental genomics. Our techniques show the potential of two new ways to leverage existing information that is only available for select documents within a corpus to validate ML models, which can then be used to describe the remaining documents in the corpus. The first technique exploits relationships between different types of data sources related to the same research study, such as publications and proposals. The second technique takes advantage of domain-specific controlled vocabularies or ontologies. In this paper, we detail applying these approaches in the context of environmental genomics research for ML-generated metadata validation. 
Our results show that the proposed label assignment approaches can generate both generic and highly specific text labels for the unlabeled texts, with up to 44% of the labels matching those suggested by an ML keyword extraction algorithm.
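To make the ontology-based idea and the reported match percentage concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a tiny hypothetical vocabulary (not a real ontology), whole-word case-insensitive string matching for label assignment, and exact string equality when comparing labels against ML-extracted keywords. The function names `assign_labels` and `match_rate` are illustrative only; the abstract does not specify how matching was performed.

```python
import re


def assign_labels(text, vocabulary):
    """Return the vocabulary terms that occur in `text` as whole words.

    Matching is case-insensitive; returned labels are lowercased.
    """
    lowered = text.lower()
    labels = set()
    for term in vocabulary:
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        if re.search(pattern, lowered):
            labels.add(term.lower())
    return labels


def match_rate(labels, ml_keywords):
    """Fraction of assigned labels that also appear among the ML keywords.

    Assumption: a "match" means exact string equality after lowercasing.
    """
    if not labels:
        return 0.0
    ml = {keyword.lower() for keyword in ml_keywords}
    return len(labels & ml) / len(labels)


# Hypothetical controlled vocabulary and document text, for illustration only.
vocabulary = ["soil", "rhizosphere", "metagenome", "marine sediment"]
text = "We sequenced the metagenome of rhizosphere soil samples."

labels = assign_labels(text, vocabulary)

# Hypothetical output of an ML keyword extractor for the same text.
ml_keywords = ["metagenome", "soil samples", "sequencing"]
rate = match_rate(labels, ml_keywords)

print(sorted(labels), rate)
```

Under exact matching, the label "soil" does not match the keyword "soil samples"; a looser criterion (for example substring or synonym matching) would raise the rate, so any reported percentage depends on the matching rule chosen.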