Lin Run-Hsin, Lin Pinpin, Wang Chia-Chi, Tung Chun-Wei
Institute of Biotechnology and Pharmaceutical Research, National Health Research Institutes, Miaoli County, 35053, Taiwan.
Graduate Institute of Data Science, College of Management, Taipei Medical University, Taipei, 10675, Taiwan.
J Cheminform. 2024 Aug 2;16(1):91. doi: 10.1186/s13321-024-00891-4.
Data scarcity is one of the most critical issues impeding the development of prediction models for chemical effects. Multitask learning algorithms that leverage knowledge from related tasks have shown potential for tasks with limited data. However, current multitask methods mainly focus on datasets in which task labels are available for most of the training samples. Because datasets are often generated for different purposes and cover distinct chemical spaces, conventional multitask learning methods may not be suitable. This study presents a novel multitask learning method, MTForestNet, that can deal with data scarcity and learn from tasks with distinct chemical spaces. MTForestNet consists of random forest classifier nodes organized as a progressive network, where each node represents a random forest model learned from a specific task. To demonstrate its effectiveness, 48 zebrafish toxicity datasets were collected and used as an example. Two of these tasks differ markedly from the rest, with only 1.3% of chemicals in common with the other tasks. In an independent test, MTForestNet achieved an area under the receiver operating characteristic curve (AUC) of 0.911, outperforming the compared single-task and multitask methods. The overall toxicity derived from the developed zebrafish toxicity models correlates well with the experimentally determined overall toxicity. In addition, the outputs of these models can be used as features to improve the prediction of developmental toxicity.
The developed models are effective for predicting zebrafish toxicity, and the proposed MTForestNet is expected to be useful for other tasks with distinct chemical spaces.
Scientific contribution
A novel multitask learning algorithm, MTForestNet, is proposed to address the challenge of developing models from datasets with distinct chemical spaces, a common issue in cheminformatics tasks. As an example, zebrafish toxicity prediction models were developed with MTForestNet and provided superior performance over conventional single-task and multitask learning methods. In addition, the developed zebrafish toxicity prediction models can reduce animal testing.
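The abstract describes MTForestNet only at a high level: one random forest per task, arranged as a progressive network. The following Python sketch illustrates one plausible reading of that idea and is not the authors' implementation. It assumes that each layer trains one scikit-learn random forest per task on the chemicals labeled for that task, and that the next layer receives the original features concatenated with the previous layer's per-task probabilities. The function names, the fixed number of layers, the use of NaN for missing labels, and the toy data are all hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def fit_layer(X, labels, n_trees=50, seed=0):
    """Fit one random forest per task, using only rows labeled for that task."""
    models = []
    for y in labels:  # y holds 0/1 labels, with np.nan where the task has no label
        mask = ~np.isnan(y)
        rf = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
        rf.fit(X[mask], y[mask].astype(int))
        models.append(rf)
    return models


def layer_outputs(models, X):
    """Positive-class probability from every task model, for every row of X."""
    cols = [m.predict_proba(X)[:, list(m.classes_).index(1)] for m in models]
    return np.column_stack(cols)


def fit_progressive(X, labels, n_layers=2):
    """Stack layers; each new layer sees original features plus prior outputs."""
    layers, X_in = [], X
    for _ in range(n_layers):
        models = fit_layer(X_in, labels)
        layers.append(models)
        # Assumption: outputs of all tasks are appended to the original features.
        X_in = np.hstack([X, layer_outputs(models, X_in)])
    return layers


def predict_progressive(layers, X):
    """Pass new samples through every layer and return the last layer's outputs."""
    X_in = X
    for models in layers:
        out = layer_outputs(models, X_in)
        X_in = np.hstack([X, out])
    return out


# Toy data: two tasks that share only a few labeled samples (rows 50-59).
rng = np.random.default_rng(0)
X = rng.random((120, 8))
y_a = (X[:, 0] > 0.5).astype(float)
y_b = (X[:, 0] + X[:, 1] > 1.0).astype(float)
y_a[60:] = np.nan  # task A is labeled on the first 60 rows only
y_b[:50] = np.nan  # task B is labeled on the last 70 rows only
labels = [y_a, y_b]

layers = fit_progressive(X, labels, n_layers=2)
probs = predict_progressive(layers, X[:5])  # one column of probabilities per task
```

Because every task model can score every chemical, a task can borrow signal from other tasks even when few chemicals carry labels for both. For brevity, this sketch feeds in-sample predictions to the next layer; out-of-bag or cross-validated predictions would be a more cautious choice, and the number of layers would normally be selected on validation data rather than fixed.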