Publishing neural networks in drug discovery might compromise training data privacy.

Authors

Krüger Fabian P, Östman Johan, Mervin Lewis, Tetko Igor V, Engkvist Ola

Affiliations

Discovery Sciences, Molecular AI, AstraZeneca R&D, Mölndal, 431 83, Sweden.

TUM School of Computation, Information and Technology, Department of Mathematics, Technical University of Munich, Munich, 80333, Germany.

Publication information

J Cheminform. 2025 Mar 26;17(1):38. doi: 10.1186/s13321-025-00982-w.

DOI:10.1186/s13321-025-00982-w

PMID:40140934

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11948693/

Abstract

This study investigates the risks of exposing confidential chemical structures when machine learning models trained on these structures are made publicly available. We use membership inference attacks, a common method to assess privacy that is largely unexplored in the context of drug discovery, to examine neural networks for molecular property prediction in a black-box setting. Our results reveal significant privacy risks across all evaluated datasets and neural network architectures. Combining multiple attacks increases these risks. Molecules from minority classes, often the most valuable in drug discovery, are particularly vulnerable. We also found that representing molecules as graphs and using message-passing neural networks may mitigate these risks. We provide a framework to assess privacy risks of classification models and molecular representations, available at https://github.com/FabianKruger/molprivacy . Our findings highlight the need for careful consideration when sharing neural networks trained on proprietary chemical structures, informing organisations and researchers about the trade-offs between data confidentiality and model openness.
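The abstract does not describe how the attacks work. As a rough illustration of a black-box membership inference attack, the sketch below uses a simple loss-threshold attack: molecules on which the model has unusually low loss are guessed to be training members. This is a generic baseline chosen for illustration, not necessarily one of the attacks the authors evaluated, and it does not use the molprivacy API. All function names and the simulated predictions are hypothetical.

```python
import numpy as np


def cross_entropy(probs, labels, eps=1e-12):
    """Per-sample binary cross-entropy from predicted P(y = 1)."""
    probs = np.clip(probs, eps, 1.0 - eps)
    return -(labels * np.log(probs) + (1 - labels) * np.log(1.0 - probs))


def membership_scores(probs, labels):
    """Attack score: negative loss, so a higher score suggests a training member."""
    return -cross_entropy(probs, labels)


def tpr_at_fpr(member_scores, nonmember_scores, max_fpr=0.01):
    """Fraction of true members flagged while keeping the false-positive rate <= max_fpr."""
    sorted_desc = np.sort(nonmember_scores)[::-1]
    n = len(sorted_desc)
    # At most k non-members score strictly above the (k + 1)-th largest score.
    k = min(int(np.floor(max_fpr * n)), n - 1)
    threshold = sorted_desc[k]
    return float(np.mean(member_scores > threshold))


if __name__ == "__main__":
    # Simulated black-box outputs (an assumption, not real data): the model is
    # more confident on molecules it was trained on than on unseen ones.
    rng = np.random.default_rng(0)
    labels = np.ones(1000)  # all samples are "active" for simplicity
    member_probs = rng.beta(8, 1, size=1000)
    nonmember_probs = rng.beta(4, 2, size=1000)

    tpr = tpr_at_fpr(
        membership_scores(member_probs, labels),
        membership_scores(nonmember_probs, labels),
        max_fpr=0.01,
    )
    print(f"TPR at 1% FPR: {tpr:.3f}")
```

The attacker only needs the model's predicted probabilities and a candidate molecule's label, which is what makes the setting black-box. Reporting the true-positive rate at a low false-positive rate is one common way to summarise such an attack, since it measures how many training molecules can be identified with few false accusations.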
