Hosseini Mohammad, Hong Spencer, Holmes Kristi, Wetterstrand Kris, Donohue Christopher, Amaral Luis A Nunes, Stoeger Thomas
Northwestern University Feinberg School of Medicine, Chicago, IL, USA.
National Institutes of Health, Bethesda, MD, USA.
J Escience Librariansh. 2024 Mar;13(1). doi: 10.7191/jeslib.811. Epub 2024 Mar 5.
Understanding "how to optimize the production of scientific knowledge" is paramount to those who support scientific research-funders as well as research institutions-to the communities served, and to researchers. Structured archives can help all involved to learn what decisions and processes help or hinder the production of new knowledge. Using artificial intelligence (AI) and large language models (LLMs), we recently created the first structured digital representation of the historic archives of the National Human Genome Research Institute (NHGRI), part of the National Institutes of Health. This work yielded a digital knowledge base of entities, topics, and documents that can be used to probe the inner workings of the Human Genome Project, a massive international public-private effort to sequence the human genome, and several of its offshoots like The Cancer Genome Atlas (TCGA) and the Encyclopedia of DNA Elements (ENCODE). The resulting knowledge base will be instrumental in understanding not only how the Human Genome Project and genomics research developed collaboratively, but also how scientific goals come to be formulated and evolve. Given the diverse and rich data used in this project, we evaluated the ethical implications of employing AI and LLMs to process and analyze this valuable archive. As the first computational investigation of the internal archives of a massive collaborative project with multiple funders and institutions, this study will inform future efforts to conduct similar investigations while also considering and minimizing ethical challenges. Our methodology and risk-mitigating measures could also inform future initiatives in developing standards for project planning, policymaking, enhancing transparency, and ensuring ethical utilization of artificial intelligence technologies and large language models in archive exploration.