Gedeck Peter, Skolnik Suzanne, Rodde Stephane
Peter Gedeck LLC, 2309 Grove Avenue, Falls Church, Virginia 22046, United States.
Novartis Institute for Biomedical Research, 250 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States.
J Chem Inf Model. 2017 Aug 28;57(8):1847-1858. doi: 10.1021/acs.jcim.7b00315. Epub 2017 Jul 25.
It is widely understood that QSAR models improve substantially when more data are used. However, irrespective of model quality, predictive performance degrades quickly once chemical structures diverge too far from the initial data set. Extending the applicability domain therefore requires a more diverse training set, which can be achieved by combining data from diverse sources. Public data are easy to include; proprietary data are harder to add because of intellectual property concerns. In this contribution, we present a method for the collaborative development of linear regression models that addresses this problem. Unlike previous approaches, the method shares data only in aggregated form. This prevents access to individual data points and therefore avoids disclosing confidential structural information. The final models are equivalent to models built on the combined data sets.
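The abstract does not say which aggregates are exchanged, so the following is only a minimal sketch of one way the equivalence claim can hold for ordinary least squares. It assumes each partner shares the sums X^T X and X^T y computed on its own data. The helper names `local_aggregates` and `fit_from_aggregates`, the synthetic descriptors, and the two-partner setup are all hypothetical and are not taken from the paper.

```python
import numpy as np

def local_aggregates(X, y):
    """Summary statistics one partner might share (hypothetical helper).

    Only the sums X^T X and X^T y leave the partner; the individual
    rows of X (the descriptors of each compound) are never exchanged.
    """
    Xb = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    return Xb.T @ Xb, Xb.T @ y

def fit_from_aggregates(aggregates):
    """Solve the normal equations from the summed aggregates."""
    XtX = sum(a[0] for a in aggregates)
    Xty = sum(a[1] for a in aggregates)
    return np.linalg.solve(XtX, Xty)

# Synthetic stand-ins for two partners' proprietary data sets (assumption).
rng = np.random.default_rng(0)
true_coef = np.array([1.5, -2.0, 0.5])
X_a = rng.normal(size=(40, 3))
y_a = 0.3 + X_a @ true_coef + rng.normal(scale=0.1, size=40)
X_b = rng.normal(loc=1.0, size=(60, 3))
y_b = 0.3 + X_b @ true_coef + rng.normal(scale=0.1, size=60)

# Model built from the shared aggregates only.
coef_shared = fit_from_aggregates(
    [local_aggregates(X_a, y_a), local_aggregates(X_b, y_b)]
)

# Reference model built on the pooled raw data.
X_all = np.vstack([X_a, X_b])
y_all = np.concatenate([y_a, y_b])
Xb_all = np.column_stack([np.ones(len(X_all)), X_all])
coef_pooled, *_ = np.linalg.lstsq(Xb_all, y_all, rcond=None)

# Largest coefficient difference between the two fits.
max_abs_diff = float(np.max(np.abs(coef_shared - coef_pooled)))
```

Because X^T X and X^T y are sums over rows, adding the per-partner aggregates gives the same normal equations as pooling the raw data, so the two coefficient vectors should agree up to floating-point error. The sketch covers plain least squares only and assumes all partners compute the same descriptors in the same order.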