Splitting chemical structure data sets for federated privacy-preserving machine learning.

Authors

Simm Jaak, Humbeck Lina, Zalewski Adam, Sturm Noe, Heyndrickx Wouter, Moreau Yves, Beck Bernd, Schuffenhauer Ansgar

Affiliations

KU Leuven, ESAT-STADIUS, Kasteelpark Arenberg 10, 3001, Heverlee, Belgium.

Medicinal Chemistry Department, Boehringer Ingelheim Pharma GmbH & Co. KG, Birkendorfer Str. 65, 88397, Biberach an der Riss, Germany.

Publication information

J Cheminform. 2021 Dec 7;13(1):96. doi: 10.1186/s13321-021-00576-2.

DOI:10.1186/s13321-021-00576-2

PMID:34876230

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8650276/

Abstract

As machine learning methods are applied more widely in drug design and related fields, designing sound test sets becomes an increasingly prominent challenge. The goal is a realistic split of chemical structures (compounds) between training, validation and test sets, such that performance on the test set is a meaningful indicator of performance in a prospective application. This challenge is interesting and relevant on its own, but it becomes more complex in a federated machine learning approach, where multiple partners jointly train a model under privacy-preserving conditions and chemical structures must not be shared between the participating parties. In this work we discuss three methods that split a data set and are applicable in a federated privacy-preserving setting: (a) locality-sensitive hashing (LSH), (b) sphere exclusion clustering, and (c) scaffold-based binning (scaffold network). To evaluate these splitting methods we consider the following quality criteria, each compared to random splitting: bias in prediction performance, classification label and data imbalance, and similarity distance between the test and training set compounds. The main findings are that (a) both sphere exclusion clustering and scaffold-based binning result in high-quality splitting of the data sets, and (b) in terms of compute costs, sphere exclusion clustering is very expensive in the federated privacy-preserving setting.
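
To make the LSH idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes each partner holds binary fingerprints locally and that all partners agree in advance on a list of bit positions, a salt and a fold count. The names `lsh_fold`, `bit_positions` and `salt`, the toy 16-bit fingerprints and the use of SHA-256 are illustrative assumptions. Because the fold depends only on the selected bits, compounds that agree on those bits land in the same fold at every partner, and no structure has to leave its owner.

```python
import hashlib
from typing import Sequence


def lsh_fold(
    fingerprint: Sequence[int],
    bit_positions: Sequence[int],
    n_folds: int = 5,
    salt: str = "shared-secret",  # hypothetical value agreed on by all partners
) -> int:
    """Map a binary fingerprint to a fold using only the selected bits."""
    # Build the hash key from the agreed bit positions only.
    key = "".join("1" if fingerprint[i] else "0" for i in bit_positions)
    # A salted cryptographic hash gives a deterministic bucket for the key.
    digest = hashlib.sha256((salt + key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_folds


# Toy 16-bit fingerprints (made up for illustration).
fp_a = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0]
fp_b = list(fp_a)
fp_b[15] = 1  # differs from fp_a only in a bit that is not selected

positions = [0, 2, 3, 6, 9, 12]  # hypothetical shared choice of bits

fold_a = lsh_fold(fp_a, positions)
fold_b = lsh_fold(fp_b, positions)
print(fold_a, fold_b)  # the two similar fingerprints share a fold
```

How the bit positions are chosen, and how many are used, controls how coarse the grouping is; that choice is left open here.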
