Methods in Medical Informatics, Department of Computer Science, University of Tübingen, Tübingen 72076, Germany.
Bioinformatics. 2023 Jun 30;39(39 Suppl 1):i76-i85. doi: 10.1093/bioinformatics/btad204.
Available omics datasets have grown steadily in recent years as technology has advanced. While larger sample sizes can improve the performance of prediction tasks in healthcare, models optimized for large datasets usually operate as black boxes. In high-stakes settings such as healthcare, using a black-box model poses safety and security issues. Without an explanation of the molecular factors and phenotypes that affected a prediction, healthcare providers have no choice but to trust the models blindly. We propose a new type of artificial neural network, named Convolutional Omics Kernel Network (COmic). By combining convolutional kernel networks with pathway-induced kernels, our method enables robust and interpretable end-to-end learning on omics datasets ranging in size from a few hundred to several hundred thousand samples. Furthermore, COmic can be easily adapted to utilize multiomics data.
We evaluated the performance of COmic on six different breast cancer cohorts. Additionally, we trained COmic models on multiomics data using the METABRIC cohort. Our models performed better than or comparably to competitors on both tasks. We show how the use of pathway-induced Laplacian kernels opens up the black box of neural networks and results in intrinsically interpretable models that eliminate the need for post hoc explanation models.
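As a rough illustration of what a pathway-induced Laplacian kernel can look like, the sketch below assumes the form k(x, y) = xᵀLy, where L is the symmetric normalized graph Laplacian of a pathway's gene interaction graph. The abstract does not state the kernel's formula, so this form, the toy three-gene pathway, and the function names `normalized_laplacian` and `pathway_kernel` are assumptions made for illustration and are not taken from the COmic source code.

```python
import numpy as np


def normalized_laplacian(adjacency):
    """Symmetric normalized Laplacian L = I - D^{-1/2} A D^{-1/2}.

    Assumes an undirected graph in which every node has at least one edge.
    """
    adjacency = np.asarray(adjacency, dtype=float)
    degree = adjacency.sum(axis=1)
    if np.any(degree <= 0):
        raise ValueError("every gene in the pathway needs at least one edge")
    inv_sqrt = 1.0 / np.sqrt(degree)
    scaled = inv_sqrt[:, None] * adjacency * inv_sqrt[None, :]
    return np.eye(len(degree)) - scaled


def pathway_kernel(X, Y, laplacian):
    """Assumed kernel form: k(x, y) = x^T L y for every pair of rows."""
    return X @ laplacian @ Y.T


# Hypothetical toy pathway: three genes connected in a chain g0 - g1 - g2.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
L = normalized_laplacian(A)

# Four synthetic samples with one expression value per pathway gene.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))

K = pathway_kernel(X, X, L)  # 4 x 4 Gram matrix restricted to this pathway
```

Because the normalized Laplacian is positive semidefinite, the resulting Gram matrix is a valid kernel matrix. Under this assumed form, each pathway contributes its own kernel, which is one way a model's output could be traced back to named pathways.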
Datasets, labels, and pathway-induced graph Laplacians used for the single-omics tasks can be downloaded at https://ibm.ent.box.com/s/ac2ilhyn7xjj27r0xiwtom4crccuobst/folder/48027287036. While datasets and graph Laplacians for the METABRIC cohort can be downloaded from the above-mentioned repository, the labels have to be downloaded from cBioPortal at https://www.cbioportal.org/study/clinicalData?id=brca_metabric. COmic source code as well as all scripts necessary to reproduce the experiments and analysis are publicly available at https://github.com/jditz/comics.