Data stewardship and curation practices in AI-based genomics and automated microscopy image analysis for high-throughput screening studies: promoting robust and ethical AI applications.

Author information

Taddese Asefa Adimasu, Addis Assefa Chekole, Tam Bjorn T

Affiliations

Academy of Wellness and Human Development, Faculty of Arts and Social Sciences, Hong Kong Baptist University, Hong Kong SAR, China.

Department of Information Science, College of Informatics, University of Gondar, Gondar, Ethiopia.

Publication information

Hum Genomics. 2025 Feb 23;19(1):16. doi: 10.1186/s40246-025-00716-x.

DOI:10.1186/s40246-025-00716-x

PMID:39988670

Full text link:https://pmc.ncbi.nlm.nih.gov/articles/PMC11849233/

Abstract

BACKGROUND

Researchers have increasingly adopted AI and next-generation sequencing (NGS), revolutionizing genomics and high-throughput screening (HTS), and transforming our understanding of cellular processes and disease mechanisms. However, these advancements generate vast datasets requiring effective data stewardship and curation practices to maintain data integrity, privacy, and accessibility. This review consolidates existing knowledge on key aspects, including data governance, quality management, privacy measures, ownership, access control, accountability, traceability, curation frameworks, and storage systems.

METHODS

We conducted a systematic literature search up to January 10, 2024, across PubMed, MEDLINE, EMBASE, Scopus, and additional scholarly platforms to examine recent advances and challenges in managing the vast and complex datasets generated by these technologies. Our search strategy employed structured keyword queries focused on four key thematic areas: data governance and management, curation frameworks, algorithmic bias and fairness, and data storage, all within the context of AI applications in genomics and microscopy. Using a realist synthesis methodology, we integrated insights from diverse frameworks to explore the multifaceted challenges associated with data stewardship in these domains. Three independent reviewers conducted data extraction and analysis, systematically categorizing the information across critical themes, including data governance, quality management, security, privacy, ownership, and access control. The study also examined specific AI considerations, such as algorithmic bias, model explainability, and the application of advanced cryptographic techniques. The review process included six stages, starting with an extensive search across multiple research databases that yielded 273 documents. Subsequent screening by broad criteria, titles, abstracts, and full texts narrowed the pool to 38 highly relevant citations.

RESULTS

Our findings indicated that significant research was conducted in 2023, highlighting the increasing recognition of robust data governance frameworks in AI-driven genomics and microscopy. While 36 articles extensively discussed data interoperability and sharing, AI model explainability and data augmentation remained underexplored, indicating significant gaps. The integration of diverse data types, ranging from sequencing and clinical data to proteomic and imaging data, highlighted the complexity and expansive scope of AI applications in these fields. The current challenges identified in AI-based data stewardship and curation practices are a lack of infrastructure and cost optimization; ethical and privacy considerations; access control and sharing mechanisms; large-scale data handling and analysis; and transparent data-sharing policies and practices. Proposed solutions to address issues related to data quality, privacy, and bias management include advanced cryptographic techniques, federated learning, and blockchain technology. Robust data governance measures, such as GA4GH standards, DUO versioning, and attribute-based access control, are essential for ensuring data integrity, security, and ethical use. The study also emphasized the critical role of Data Management Plans (DMPs), meticulous metadata curation, and advanced cryptographic techniques in mitigating risks related to data security and identifiability. Despite advancements, significant challenges persisted in balancing data ownership with research accessibility, integrating heterogeneous data sources, ensuring platform interoperability, and maintaining data quality. Ongoing risks of unauthorized access and data breaches underscored the need for continuous innovation in data management practices and stricter adherence to legal and ethical standards.
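
As an illustration of the attribute-based access control mentioned above, the following is a minimal sketch of how a data-access decision might be derived from attributes of the dataset and of the request. It is not taken from the reviewed studies: the class names, the `is_access_permitted` function, and the consent labels are hypothetical, and the labels are plain strings rather than actual GA4GH DUO terms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    """Attributes attached to a dataset by its data steward (hypothetical)."""
    name: str
    permitted_uses: frozenset          # illustrative consent labels, not real DUO codes
    requires_ethics_approval: bool = False


@dataclass(frozen=True)
class AccessRequest:
    """Attributes declared by the researcher asking for access (hypothetical)."""
    researcher: str
    purpose: str
    has_ethics_approval: bool = False


def is_access_permitted(dataset: Dataset, request: AccessRequest) -> bool:
    """Grant access only if every attribute rule is satisfied."""
    # Rule 1: the stated purpose must be among the uses participants consented to.
    if request.purpose not in dataset.permitted_uses:
        return False
    # Rule 2: some datasets additionally require documented ethics approval.
    if dataset.requires_ethics_approval and not request.has_ethics_approval:
        return False
    return True


cohort = Dataset(
    name="cohort_a",
    permitted_uses=frozenset({"health_research"}),
    requires_ethics_approval=True,
)

approved = AccessRequest("alice", "health_research", has_ethics_approval=True)
unapproved = AccessRequest("bob", "health_research")
off_purpose = AccessRequest("carol", "marketing", has_ethics_approval=True)

print(is_access_permitted(cohort, approved))     # True
print(is_access_permitted(cohort, unapproved))   # False
print(is_access_permitted(cohort, off_purpose))  # False
```

The decision depends only on declared attributes rather than on a per-user permission list, which is the property that distinguishes attribute-based from identity-based access control. A production system would also need to verify the declared attributes and log each decision for traceability.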

CONCLUSIONS

This review explored current practices and challenges in data stewardship, offering a roadmap for strengthening the governance, security, and ethical use of AI in genomics and microscopy. While robust governance frameworks and ethical practices have established a foundation for data integrity and transparency, there remains an urgent need for collaborative efforts to develop interoperable platforms and transparent data-sharing policies. Additionally, evolving legal and ethical frameworks will be crucial to addressing emerging challenges posed by AI technologies. Fostering transparency, accountability, and ethical responsibility within the research community will be key to ensuring trust and driving ethically sound scientific advancements.
