
Exploring BERT for Reaction Yield Prediction: Evaluating the Impact of Tokenization, Molecular Representation, and Pretraining Data Augmentation.

Author Information

Krzyzanowski Adrian, Pickett Stephen D, Pogány Peter

Affiliation

GSK Medicines Research Centre, Gunnels Wood Road, Stevenage SG1 2NY, U.K.

Publication Information

J Chem Inf Model. 2025 May 12;65(9):4381-4402. doi: 10.1021/acs.jcim.5c00359. Epub 2025 May 1.

Abstract

Predicting reaction yields in synthetic chemistry remains a significant challenge. This study systematically evaluates the impact of tokenization, molecular representation, pretraining data, and adversarial training on a BERT-based model for yield prediction of Buchwald-Hartwig and Suzuki-Miyaura coupling reactions using publicly available HTE data sets. We demonstrate that the choice of molecular representation (SMILES, DeepSMILES, SELFIES, Morgan fingerprint-based notation, IUPAC names) has minimal impact on model performance, while BPE and SentencePiece tokenization typically outperform other methods. WordPiece is strongly discouraged for SELFIES and fingerprint-based notation. Furthermore, pretraining with relatively small data sets (<100 K reactions) achieves performance comparable to that of larger data sets containing millions of examples. We also propose the use of artificially generated, domain-specific pretraining data; the artificially generated sets prove to be a good surrogate for reaction schemes extracted from reaction data sets such as Pistachio or Reaxys. The best performance was observed for hybrid pretraining sets combining real and domain-specific artificial data. Finally, we show that a novel adversarial training approach, which perturbs input embeddings dynamically, improves model robustness and generalizability for yield and reaction success prediction. These findings provide valuable insights for developing robust and practical machine learning models for yield prediction in synthetic chemistry. GSK's BERT training code base is made available to the community with this work.
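To make the tokenization comparison concrete, the following is a minimal, standard-library-only sketch of the byte-pair-encoding (BPE) procedure evaluated in the abstract, applied to SMILES strings. The corpus and merge count are invented for illustration; the study itself used established tokenizer implementations, not this toy code.

```python
from collections import Counter


def most_frequent_pair(tokens_list):
    """Count adjacent symbol pairs across all tokenized strings and return the most common."""
    pairs = Counter()
    for tokens in tokens_list:
        for a, b in zip(tokens, tokens[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(tokens, pair):
    """Merge every non-overlapping occurrence of `pair` into a single symbol."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe(smiles_corpus, num_merges):
    """Learn BPE merges, starting from character-level tokens of each SMILES string."""
    corpus = [list(s) for s in smiles_corpus]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(corpus)
        if pair is None:
            break
        merges.append(pair)
        corpus = [merge_pair(t, pair) for t in corpus]
    return merges, corpus


# Tiny illustrative corpus: three aryl-containing SMILES strings.
corpus = ["c1ccccc1Br", "c1ccccc1N", "CCOc1ccccc1"]
merges, tokenized = learn_bpe(corpus, 6)
# The frequent aromatic-ring fragment "c1ccccc1" drives the first merges,
# so recurring ring substrings become single vocabulary symbols.
```

Because merges are chosen by frequency, chemically recurrent substrings (here the benzene-ring pattern) are absorbed into the vocabulary first, which is the property that makes BPE and SentencePiece well suited to line notations such as SMILES.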

