González Lastre Manuel, Pou Pablo, Wiche Miguel, Ebeling Daniel, Schirmeisen Andre, Pérez Rubén
Departamento de Física Teórica de la Materia Condensada, Universidad Autónoma de Madrid, E-28049, Madrid, Spain.
Condensed Matter Physics Center (IFIMAC), Universidad Autónoma de Madrid, E-28049, Madrid, Spain.
J Cheminform. 2024 Nov 25;16(1):130. doi: 10.1186/s13321-024-00921-1.
Non-contact atomic force microscopy with CO-functionalized metal tips (referred to as HR-AFM) provides access to the internal structure of individual molecules adsorbed on a surface with unprecedented resolution. Previous works have shown that deep learning (DL) models can retrieve the chemical and structural information encoded in a 3D stack of constant-height HR-AFM images, leading to molecular identification. In this work, we overcome the limitations of those models by describing the molecular structure with a well-established topological fingerprint, the 1024-bit Extended Connectivity Chemical Fingerprint of radius 2 (ECFP4), which was developed for substructure and similarity searching. ECFPs provide local structural information about the molecule, each bit correlating with a particular substructure within it. Our DL model is able to extract this optimized structural descriptor from the 3D HR-AFM stacks and use it, through virtual screening, to identify molecules from their predicted ECFP4 with a retrieval accuracy of 95.4% on theoretical images. Furthermore, unlike previous DL models, this approach assigns a confidence score, the Tanimoto similarity, to each candidate molecule, thus providing information on the reliability of the identification. By construction, the number of times a given substructure is present in the molecule is lost during the hashing process that is needed to make the fingerprints useful for machine learning applications. We show that the fingerprint-based virtual screening can be complemented with global information from another DL model that predicts the chemical formula from the same HR-AFM stacks, boosting the identification accuracy up to 97.6%. 
Finally, we perform a limited test with experimental images, obtaining promising results towards the application of this pipeline under real conditions.
Scientific contribution
Previous works on molecular identification from AFM images used chemical descriptors that were intuitive for humans but sub-optimal for neural networks. We propose a novel method to extract the ECFP4 from AFM images and identify the molecule via virtual screening, outperforming previous state-of-the-art models.
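The screening step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: fingerprints are represented here as sets of on-bit indices, the candidate names, bit indices and formulas are invented for illustration, and the formula model is assumed to act as a hard filter on an exact formula match (the abstract does not specify how the two models are combined). The Tanimoto similarity of two bit sets A and B is |A ∩ B| / |A ∪ B|.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple

Fingerprint = FrozenSet[int]  # indices of the bits set to 1 (out of 1024)


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto similarity: shared on-bits divided by all on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def screen(
    predicted_bits: Fingerprint,
    library: Dict[str, Tuple[Fingerprint, str]],
    predicted_formula: Optional[str] = None,
    top_k: int = 3,
) -> List[Tuple[float, str]]:
    """Rank library molecules by similarity to the predicted fingerprint.

    If predicted_formula is given, candidates with a different formula are
    discarded first (assumption: hard filter on exact match).
    """
    hits = []
    for name, (bits, formula) in library.items():
        if predicted_formula is not None and formula != predicted_formula:
            continue
        hits.append((tanimoto(predicted_bits, bits), name))
    # Highest score first; ties broken alphabetically for reproducibility.
    hits.sort(key=lambda h: (-h[0], h[1]))
    return hits[:top_k]


# Toy library (all values hypothetical). candidate_A and candidate_B share
# the same on-bits, mimicking two molecules built from the same local
# substructures in different numbers, which a bit vector cannot tell apart.
LIBRARY: Dict[str, Tuple[Fingerprint, str]] = {
    "candidate_A": (frozenset({3, 17, 256, 901}), "C10H8"),
    "candidate_B": (frozenset({3, 17, 256, 901}), "C14H10"),
    "candidate_C": (frozenset({3, 17, 300}), "C10H8"),
}

# Fingerprint "predicted" from the images, with one spurious bit (950).
predicted: Fingerprint = frozenset({3, 17, 256, 901, 950})

for score, name in screen(predicted, LIBRARY):
    print(f"fingerprint only: {name} {score:.2f}")

for score, name in screen(predicted, LIBRARY, predicted_formula="C14H10"):
    print(f"with formula:     {name} {score:.2f}")
```

In this toy case the fingerprint alone ranks candidate_A and candidate_B equally, and the predicted formula removes the ambiguity. The score attached to each hit plays the role of the confidence measure mentioned in the abstract.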