Ikeda Shuya, Zou Zhaonan, Bono Hidemasa, Moriya Yuki, Kawashima Shuichi, Katayama Toshiaki, Oki Shinya, Ohta Tazro
Database Center for Life Science, Joint Support-Center for Data Science Research, Research Organization of Information and Systems, Univ. of Tokyo Kashiwanoha-campus Station Satellite 6F. 178-4-4 Wakashiba, Kashiwa-shi, Chiba 277-0871, JAPAN.
Graduate School of Integrated Sciences for Life, Hiroshima University, 1-4-4 Kagamiyama, Higashihiroshima-shi, Hiroshima 739-8528, JAPAN.
Gigascience. 2025 Jan 6;14. doi: 10.1093/gigascience/giaf070.
BioSample is a repository of experimental sample metadata. It is a comprehensive archive that enables searches of experiments, regardless of type. However, the submitted metadata vary substantially, because comprehensive rules for describing samples are difficult to define and submitters have limited awareness of best practices for creating metadata. This inconsistency poses considerable challenges to the findability and reusability of archived data. Given the scale of BioSample, which hosts over 40 million records, manual curation is impractical. Automatic rule-based ontology mapping methods have been proposed to address this issue, but their effectiveness is limited by the heterogeneity of the metadata. Recently, large language models (LLMs) have gained attention in natural language processing and are promising tools for automating metadata curation. In this study, we evaluated the performance of LLMs in extracting cell line names from BioSample descriptions using a gold-standard dataset derived from ChIP-Atlas, a secondary database of epigenomics experiment data in which samples were manually curated. The LLM-assisted methods outperformed traditional approaches, achieving higher accuracy and coverage. We further extended these methods to extract information about experimentally manipulated genes from metadata, a task for which manual curation had not yet been applied in ChIP-Atlas. This extension was also successful: it enabled more precise filtering of the data and helped prevent misinterpretations caused by the inclusion of unintended data. These findings underscore the potential of LLMs to improve the findability and reusability of experimental data in general, which would considerably reduce user workload and enable more effective scientific data management.
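To make the evaluation described above concrete, the following is a minimal sketch of how extracted cell line names might be scored against a manually curated gold standard. It is not the authors' pipeline. The function names, the normalization rule, the sample identifiers, and the toy data are all hypothetical, and the definitions used here (coverage as the share of gold-standard samples for which a method returns any name, accuracy as the share of returned names that match the curated name) are assumptions, since the abstract does not define either metric.

```python
from typing import Optional


def normalize(name: str) -> str:
    """Lowercase and drop non-alphanumeric characters so that
    'HeLa', 'hela', and 'He-La' compare as equal (an assumed rule)."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def evaluate(
    predictions: dict[str, Optional[str]],
    gold: dict[str, str],
) -> dict[str, float]:
    """Score predicted cell line names against curated ones.

    coverage: answered samples / all gold-standard samples (assumed definition)
    accuracy: correct answers / answered samples (assumed definition)
    """
    answered = 0
    correct = 0
    for sample_id, gold_name in gold.items():
        predicted = predictions.get(sample_id)
        if not predicted:
            # The method returned nothing for this sample.
            continue
        answered += 1
        if normalize(predicted) == normalize(gold_name):
            correct += 1
    total = len(gold)
    return {
        "coverage": answered / total if total else 0.0,
        "accuracy": correct / answered if answered else 0.0,
    }


# Hypothetical toy data; the identifiers are placeholders, not real accessions.
gold = {
    "sample_1": "HeLa",
    "sample_2": "K-562",
    "sample_3": "MCF-7",
    "sample_4": "HepG2",
}
predictions = {
    "sample_1": "hela",    # matches after normalization
    "sample_2": "K562",    # matches after normalization
    "sample_3": None,      # no answer, lowers coverage only
    "sample_4": "HEK293",  # wrong answer, lowers accuracy
}

scores = evaluate(predictions, gold)
print(scores)
```

Reporting the two numbers separately keeps a method that answers rarely but correctly distinguishable from one that answers often but wrongly, which matters when comparing a rule-based mapper with an LLM-assisted one.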