Biological databases in the age of generative artificial intelligence.

Author information

Pop Mihai, Attwood Teresa K, Blake Judith A, Bourne Philip E, Conesa Ana, Gaasterland Terry, Hunter Lawrence, Kingsford Carl, Kohlbacher Oliver, Lengauer Thomas, Markel Scott, Moreau Yves, Noble William S, Orengo Christine, Ouellette B F Francis, Parida Laxmi, Przulj Natasa, Przytycka Teresa M, Ranganathan Shoba, Schwartz Russell, Valencia Alfonso, Warnow Tandy

Affiliations

Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, United States.

Department of Computer Science, The University of Manchester, Manchester M13 9PL, United Kingdom.

Publication information

Bioinform Adv. 2025 Mar 20;5(1):vbaf044. doi: 10.1093/bioadv/vbaf044. eCollection 2025.

DOI:10.1093/bioadv/vbaf044

PMID:40177265

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11964588/

Abstract

SUMMARY

Modern biological research critically depends on public databases. The introduction and propagation of errors within and across databases can lead to wasted resources as scientists are led astray by bad data or have to conduct expensive validation experiments. The emergence of generative artificial intelligence systems threatens to compound this problem owing to the ease with which massive volumes of synthetic data can be generated. We provide an overview of several key issues that occur within the biological data ecosystem and make several recommendations aimed at reducing data errors and their propagation. We specifically highlight the critical importance of improved educational programs aimed at biologists and life scientists that emphasize best practices in data engineering. We also argue for increased theoretical and empirical research on data provenance, error propagation, and on understanding the impact of errors on analytic pipelines. Furthermore, we recommend enhanced funding for the stewardship and maintenance of public biological databases.

AVAILABILITY AND IMPLEMENTATION

Not applicable.
