School of Computer Science and Mathematics, Liverpool John Moores University, Liverpool L3 5UX, United Kingdom.
Department of Chemical Engineering and Biotechnology, University of Cambridge, Cambridge CB3 0AS, United Kingdom.
Proc Natl Acad Sci U S A. 2021 Dec 7;118(49). doi: 10.1073/pnas.2108013118.
Almost all machine learning (ML) is based on representing examples using intrinsic features. When there are multiple related ML problems (tasks), it is possible to transform these features into extrinsic features by first training ML models on other tasks and letting them each make predictions for each example of the new task, yielding a novel representation. We call this transformational ML (TML). TML is very closely related to, and synergistic with, transfer learning, multitask learning, and stacking. TML is applicable to improving any nonlinear ML method. We tested TML using the most important classes of nonlinear ML: random forests, gradient boosting machines, support vector machines, k-nearest neighbors, and neural networks. To ensure the generality and robustness of the evaluation, we utilized thousands of ML problems from three scientific domains: drug design, predicting gene expression, and ML algorithm selection. We found that TML significantly improved the predictive performance of all the ML methods in all the domains (4 to 50% average improvements) and that TML features generally outperformed intrinsic features. Use of TML also enhances scientific understanding through explainable ML. In drug design, we found that TML provided insight into drug target specificity, the relationships between drugs, and the relationships between target proteins. TML leads to an ecosystem-based approach to ML, where new tasks, examples, predictions, and so on synergistically interact to improve performance. To contribute to this ecosystem, all our data, code, and our ∼50,000 ML models have been fully annotated with metadata, linked, and openly published using Findability, Accessibility, Interoperability, and Reusability principles (∼100 Gbytes).
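The core TML procedure described above — train one model per related task on the intrinsic features, then use those models' predictions as an extrinsic feature vector for a new task — can be sketched as follows. This is a minimal illustration on synthetic data, not the paper's implementation; the task setup, random-forest choice, and all parameter values are assumptions for the example.

```python
# Minimal sketch of transformational ML (TML): predictions from models
# trained on related tasks become extrinsic features for a new task.
# Synthetic data; scikit-learn random forests stand in for the paper's
# nonlinear learners.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_examples, n_features, n_related_tasks = 200, 10, 5

# Shared intrinsic features; each related task has its own targets.
X = rng.normal(size=(n_examples, n_features))
task_weights = rng.normal(size=(n_related_tasks, n_features))
related_targets = X @ task_weights.T + 0.1 * rng.normal(
    size=(n_examples, n_related_tasks)
)

# Step 1: train one model per related task on the intrinsic features.
task_models = [
    RandomForestRegressor(n_estimators=50, random_state=0).fit(
        X, related_targets[:, t]
    )
    for t in range(n_related_tasks)
]

# Step 2: let every task model predict on the new task's examples,
# yielding the extrinsic (TML) representation — one feature per model.
X_new = rng.normal(size=(50, n_features))
X_tml = np.column_stack([m.predict(X_new) for m in task_models])

# Step 3: train the final model for the new task on the TML features.
y_new = X_new @ rng.normal(size=n_features)
final_model = RandomForestRegressor(n_estimators=50, random_state=0).fit(
    X_tml, y_new
)

print(X_tml.shape)
```

In practice the intrinsic and extrinsic features can also be concatenated, which the abstract's "synergistic with stacking" remark suggests; here only the extrinsic representation is used to keep the sketch short.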