Alshehri Abdulelah S, Horstmann Kai A, You Fengqi
Robert Frederick Smith School of Chemical and Biomolecular Engineering, Cornell University, Ithaca, New York 14853, United States.
Department of Chemical Engineering, College of Engineering, King Saud University, Riyadh 11421, Saudi Arabia.
J Chem Inf Model. 2024 Aug 12;64(15):5888-5899. doi: 10.1021/acs.jcim.4c00816. Epub 2024 Jul 15.
Chemical information disseminated in scientific documents offers an untapped potential for deep learning-assisted insights and breakthroughs. Automated extraction efforts have shifted from resource-intensive manual extraction toward applying machine learning methods to streamline chemical data extraction. While current extraction models and pipelines have ushered in notable efficiency improvements, they often exhibit modest performance, compromising the accuracy of predictive models trained on extracted data. Further, current chemical pipelines lack both transferability (where a model trained on one task can be adapted to another relevant task with limited examples) and extensibility, which enables seamless adaptability for new extraction tasks. Addressing these gaps, we present ChemREL, a versatile chemical data extraction pipeline emphasizing performance, transferability, and extensibility. ChemREL utilizes a custom, diverse data set of chemical documents, labeled through an active learning strategy to extract two properties: normal melting point and lethal dose 50 (LD50). The normal melting point is selected for its prevalence in diverse contexts and wider literature, serving as the foundation for pipeline training. In contrast, LD50 evaluates the pipeline's transferability to an unrelated property, underscoring variance in its biological nature, toxicological context, and units, among other differences. With pretraining and fine-tuning, our pipeline outperforms existing methods and GPT-4, achieving F1-scores of 96.1% for entity identification and 97.0% for relation mapping, culminating in an overall F1-score of 95.4%. More importantly, ChemREL displays high transferability, effectively transitioning from melting point extraction to LD50 extraction with 10 randomly selected training documents. Released as an open-source package, ChemREL aims to broaden access to chemical data extraction, enabling the construction of expansive relational data sets that propel discovery.