Mukherjee Arpan, Giri Deepesh, Rajan Krishna
Department of Materials Design and Innovation, University at Buffalo, Buffalo, New York 14260-1660, United States.
Laurel Ridge Community College, Middletown, Virginia 22645, United States.
J Chem Inf Model. 2025 Aug 11;65(15):7901-7918. doi: 10.1021/acs.jcim.5c00612. Epub 2025 Jul 22.
Automated data curation for niche scientific topics, where data quality and contextual accuracy are paramount, poses significant challenges. Bidirectional contextual models such as BERT and ELMo excel in contextual understanding and determinism. However, they are constrained by their narrower training corpora and inability to synthesize information across fragmented or sparse contexts. Conversely, autoregressive generative models like GPT can synthesize dispersed information by leveraging broader contextual knowledge and yet often generate plausible but incorrect ("hallucinated") information. To address these complementary limitations, we propose an ensemble approach that combines the deterministic precision of BERT/ELMo with the contextual depth of GPT. We have developed a hierarchical knowledge extraction framework to identify perovskites and their associated solvents in perovskite synthesis, progressing from broad topics to narrower details using two complementary methods. The first method leverages deterministic models like BERT/ELMo for precise entity extraction, while the second employs GPT for broader contextual synthesis and generalization. Outputs from both methods are validated through structure-matching and entity normalization, ensuring consistency and traceability. In the absence of benchmark data sets for this domain, we hold out a subset of papers for manual verification to serve as a reference set for tuning the rules for entity normalization. This enables quantitative evaluation of model precision, recall, and structural adherence while also providing a grounded estimate of model confidence. By intersecting the outputs from both methods, we generate a list of solvents with maximum confidence, combining precision with contextual depth to ensure accuracy and reliability. 
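The intersection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `normalize` rules, the synonym table, the example solvent lists, and the reference set are all hypothetical and chosen only to show how normalizing entities before intersecting two extraction outputs affects precision and recall against a manually verified set.

```python
def normalize(name, synonyms):
    """Map a raw solvent mention to a canonical key (hypothetical rules)."""
    key = " ".join(name.lower().replace("-", " ").split())
    return synonyms.get(key, key)


def intersect_extractions(method_a, method_b, synonyms):
    """Keep only entities reported by both extraction methods."""
    a = {normalize(n, synonyms) for n in method_a}
    b = {normalize(n, synonyms) for n in method_b}
    return a & b


def precision_recall(predicted, reference):
    """Precision and recall of a predicted set against a reference set."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical synonym table and outputs, for illustration only.
SYNONYMS = {
    "n,n dimethylformamide": "dmf",
    "dimethylformamide": "dmf",
    "dimethyl sulfoxide": "dmso",
}
bert_like = ["DMF", "dimethyl sulfoxide", "chlorobenzene"]
gpt_like = ["N,N-dimethylformamide", "DMSO", "toluene", "ethanol"]
reference = {"dmf", "dmso", "chlorobenzene"}  # stand-in for the held-out set

high_confidence = intersect_extractions(bert_like, gpt_like, SYNONYMS)
union = ({normalize(n, SYNONYMS) for n in bert_like}
         | {normalize(n, SYNONYMS) for n in gpt_like})

print(sorted(high_confidence), precision_recall(high_confidence, reference))
print(sorted(union), precision_recall(union, reference))
```

In this toy case the intersection keeps only the two solvents both methods agree on, so every retained entry is in the reference set but one reference solvent is missed, whereas the union recovers all reference solvents at the cost of two unverified ones.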
This approach increases precision at the expense of recall, a trade-off we accept given that, in high-trust scientific applications, minimizing hallucinations is often more critical than achieving full coverage, especially when downstream reliability is paramount. As a case study, the curated data set is used to predict the endocrine-disrupting (ED) potential of solvents with a pretrained deep learning model. Recognizing that machine learning models may not be trained on niche data sets such as perovskite-related solvents, we have quantified epistemic uncertainty using Shannon entropy. This measure evaluates the confidence of the ML model predictions, independent of uncertainties in the NLP-based data curation process, and identifies high-risk solvents requiring further validation. Additionally, the manual verification pipeline addresses ethical considerations around trust, structure, and transparency in AI-curated data sets.
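As a rough illustration of the entropy-based screening, the sketch below computes the Shannon entropy of a predicted class distribution and flags solvents whose predictions are close to maximally uncertain. The abstract does not specify how the entropy is computed or thresholded, so the binary active/inactive setup, the probabilities, and the `threshold` value are assumptions made for the example.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in bits of a discrete distribution (renormalized)."""
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    entropy = 0.0
    for p in probs:
        q = p / total
        if q > 0:  # treat 0 * log(0) as 0
            entropy -= q * math.log2(q)
    return entropy


def flag_uncertain(predictions, threshold=0.9):
    """Return solvents whose binary prediction entropy meets the threshold.

    `predictions` maps a solvent name to an assumed probability of being
    ED-active; `threshold` is a hypothetical cutoff in bits (max is 1.0).
    """
    flagged = []
    for name, p_active in predictions.items():
        if shannon_entropy([p_active, 1.0 - p_active]) >= threshold:
            flagged.append(name)
    return sorted(flagged)


# Hypothetical model outputs, not results from the paper.
predictions = {"dmf": 0.95, "dmso": 0.50, "toluene": 0.35}
for name, p in predictions.items():
    print(name, round(shannon_entropy([p, 1.0 - p]), 3))
print(flag_uncertain(predictions))
```

A probability near 0.5 gives entropy near 1 bit, so such a solvent would be routed to further validation, while a confident prediction such as 0.95 gives a low entropy and passes through.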