Uncertainty-Informed Screening for Safer Solvents Used in the Synthesis of Perovskites via Language Models.

Author information

Mukherjee Arpan, Giri Deepesh, Rajan Krishna

Affiliations

Department of Materials Design and Innovation, University at Buffalo, Buffalo, New York 14260-1660, United States.

Laurel Ridge Community College, Middletown, Virginia 22645, United States.

Publication information

J Chem Inf Model. 2025 Aug 11;65(15):7901-7918. doi: 10.1021/acs.jcim.5c00612. Epub 2025 Jul 22.

DOI:10.1021/acs.jcim.5c00612

PMID:40694668

Abstract

Automated data curation for niche scientific topics, where data quality and contextual accuracy are paramount, poses significant challenges. Bidirectional contextual models such as BERT and ELMo excel in contextual understanding and determinism. However, they are constrained by their narrower training corpora and inability to synthesize information across fragmented or sparse contexts. Conversely, autoregressive generative models like GPT can synthesize dispersed information by leveraging broader contextual knowledge and yet often generate plausible but incorrect ("hallucinated") information. To address these complementary limitations, we propose an ensemble approach that combines the deterministic precision of BERT/ELMo with the contextual depth of GPT. We have developed a hierarchical knowledge extraction framework to identify perovskites and their associated solvents in perovskite synthesis, progressing from broad topics to narrower details using two complementary methods. The first method leverages deterministic models like BERT/ELMo for precise entity extraction, while the second employs GPT for broader contextual synthesis and generalization. Outputs from both methods are validated through structure-matching and entity normalization, ensuring consistency and traceability. In the absence of benchmark data sets for this domain, we hold out a subset of papers for manual verification to serve as a reference set for tuning the rules for entity normalization. This enables quantitative evaluation of model precision, recall, and structural adherence while also providing a grounded estimate of model confidence. By intersecting the outputs from both methods, we generate a list of solvents with maximum confidence, combining precision with contextual depth to ensure accuracy and reliability. 
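The intersection step and the reference-set evaluation described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' pipeline: the synonym table, the `normalize` and `precision_recall` helpers, and all solvent lists are hypothetical stand-ins for the entity normalization rules, the two extraction methods, and the manually verified hold-out set.

```python
def normalize(name: str) -> str:
    """Map a raw solvent mention to a canonical key (hypothetical rules)."""
    synonyms = {
        "n,n-dimethylformamide": "dmf",
        "dimethylformamide": "dmf",
        "dimethyl sulfoxide": "dmso",
        "gamma-butyrolactone": "gbl",
    }
    key = " ".join(name.lower().split())  # lowercase, collapse whitespace
    return synonyms.get(key, key)


def precision_recall(predicted: set[str], reference: set[str]) -> tuple[float, float]:
    """Precision and recall of a predicted entity set against a reference set."""
    true_pos = len(predicted & reference)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    return precision, recall


# Made-up outputs of the two methods and a made-up verified reference set.
extractive = {normalize(s) for s in ["DMF", "Dimethyl sulfoxide", "toluene"]}
generative = {normalize(s) for s in ["N,N-Dimethylformamide", "DMSO", "GBL", "acetone"]}
reference = {"dmf", "dmso", "gbl", "toluene"}

# Keep only solvents that both methods agree on.
consensus = extractive & generative

for label, found in [("extractive", extractive),
                     ("generative", generative),
                     ("consensus", consensus)]:
    p, r = precision_recall(found, reference)
    print(f"{label}: precision={p:.2f} recall={r:.2f}")
```

In this toy data the generative list contains one solvent absent from the reference set, and the intersection drops it along with two correct solvents that only one method found, so the consensus list is fully correct but covers fewer of the reference solvents.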
This approach increases precision at the expense of recall, a trade-off we accept given that, in high-trust scientific applications, minimizing hallucinations is often more critical than achieving full coverage, especially when downstream reliability is paramount. As a case study, the curated data set is used to predict the endocrine-disrupting (ED) potential of solvents with a pretrained deep learning model. Recognizing that machine learning models may not be trained on niche data sets such as perovskite-related solvents, we have quantified epistemic uncertainty using Shannon entropy. This measure evaluates the confidence of the ML model predictions, independent of uncertainties in the NLP-based data curation process, and identifies high-risk solvents requiring further validation. Additionally, the manual verification pipeline addresses ethical considerations around trust, structure, and transparency in AI-curated data sets.
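As a rough illustration of entropy-based screening, the Shannon entropy of a predicted class distribution is H(p) = -Σ p_i log2 p_i, which is 0 bits for a fully confident prediction and 1 bit for an evenly split binary one. The sketch below assumes a binary ED classifier that returns one probability per solvent. The abstract does not state how the predictive distribution is obtained, so the solvent labels, probabilities, and the 0.9-bit threshold are all hypothetical.

```python
import math


def shannon_entropy(probs, base=2.0):
    """Shannon entropy of a discrete probability distribution (bits by default)."""
    if not math.isclose(sum(probs), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    # Terms with p == 0 contribute nothing (the limit of p*log p is 0).
    return -sum(p * math.log(p, base) for p in probs if p > 0.0)


def flag_uncertain(ed_probability, threshold_bits=0.9):
    """Return solvents whose binary prediction entropy exceeds the threshold."""
    flagged = []
    for solvent, p in ed_probability.items():
        entropy = shannon_entropy([p, 1.0 - p])
        if entropy > threshold_bits:
            flagged.append(solvent)
    return flagged


# Hypothetical model outputs: probability that each solvent is ED-active.
predictions = {"solvent_a": 0.97, "solvent_b": 0.55, "solvent_c": 0.08}

flagged = flag_uncertain(predictions)
print(flagged)
```

Here only the solvent with a probability near 0.5 is flagged, because its prediction carries almost a full bit of entropy. A flagged solvent is not predicted to be hazardous; it is one where the model's output is too uncertain to rely on without further validation.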

