The effect of non-linear signal in classification problems using gene expression.

Affiliations

Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Pennsylvania, United States of America.

Department of Pharmacology, University of Colorado School of Medicine, Colorado, United States of America.

Publication information

PLoS Comput Biol. 2023 Mar 27;19(3):e1010984. doi: 10.1371/journal.pcbi.1010984. eCollection 2023 Mar.

DOI:10.1371/journal.pcbi.1010984

PMID:36972227

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10079219/

Abstract

Those building predictive models from transcriptomic data are faced with two conflicting perspectives. The first, based on the inherent high dimensionality of biological systems, supposes that complex non-linear models such as neural networks will better match complex biological systems. The second, imagining that complex systems will still be well predicted by simple dividing lines, prefers linear models that are easier to interpret. We compare multi-layer neural networks and logistic regression across multiple prediction tasks on GTEx and Recount3 datasets and find evidence in favor of both possibilities. We verified the presence of non-linear signal when predicting tissue and metadata sex labels from expression data by removing the predictive linear signal with Limma, and showed that the removal ablated the performance of linear methods but not non-linear ones. However, we also found that the presence of non-linear signal was not necessarily sufficient for neural networks to outperform logistic regression. Our results demonstrate that while multi-layer neural networks may be useful for making predictions from gene expression data, including a linear baseline model is critical because, while biological systems are high-dimensional, effective dividing lines for predictive models may not be.
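
The signal-removal argument in the abstract can be illustrated with a small synthetic sketch. This is not the authors' pipeline, and everything in it is an assumption made for illustration: the data are simulated, per-class mean centering stands in for the Limma-based removal of linear signal, and a logistic regression on added pairwise interaction terms stands in for a multi-layer neural network. The names `remove_linear_signal`, `fit_logistic`, and `add_interactions` are hypothetical helpers, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-in for expression data (all values are simulated).
# The label follows the sign of gene_a * gene_b, an XOR-like non-linear signal;
# gene_c carries an ordinary linear signal (its mean shifts with the label).
gene_a = rng.normal(size=n)
gene_b = rng.normal(size=n)
y = (gene_a * gene_b > 0).astype(int)
gene_c = y + rng.normal(scale=0.5, size=n)
X = np.column_stack([gene_a, gene_b, gene_c])


def remove_linear_signal(X, y):
    """Assumed stand-in for the Limma step: subtract each class's per-gene mean."""
    X_resid = X.copy()
    for label in np.unique(y):
        mask = y == label
        X_resid[mask] -= X_resid[mask].mean(axis=0)
    return X_resid


def add_bias(X):
    """Append a constant column so the model can learn an intercept."""
    return np.column_stack([X, np.ones(len(X))])


def fit_logistic(X, y, lr=0.1, steps=2000):
    """Logistic regression fitted by plain batch gradient descent."""
    Xb = add_bias(X)
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(Xb @ w)))
        w -= lr * (Xb.T @ (p - y)) / len(y)
    return w


def accuracy(w, X, y):
    """Fraction of samples whose predicted class matches the label."""
    return float(np.mean((add_bias(X) @ w > 0) == (y == 1)))


def add_interactions(X):
    """Non-linear stand-in: append all pairwise products of the features."""
    i, j = np.triu_indices(X.shape[1], k=1)
    return np.column_stack([X, X[:, i] * X[:, j]])


def evaluate(X, y, n_train=1500):
    """Fit on the first n_train samples and score on the remainder."""
    w = fit_logistic(X[:n_train], y[:n_train])
    return accuracy(w, X[n_train:], y[n_train:])


# Simplification: class means are removed over the full data set before splitting.
X_resid = remove_linear_signal(X, y)

acc_linear_raw = evaluate(X, y)
acc_linear_resid = evaluate(X_resid, y)
acc_nonlinear_raw = evaluate(add_interactions(X), y)
acc_nonlinear_resid = evaluate(add_interactions(X_resid), y)

print("linear, raw data:          ", acc_linear_raw)
print("linear, signal removed:    ", acc_linear_resid)
print("non-linear, raw data:      ", acc_nonlinear_raw)
print("non-linear, signal removed:", acc_nonlinear_resid)
```

Under these assumptions, the linear model should fall to roughly chance once the class means are removed, while the model with interaction terms should stay well above chance, mirroring the pattern the abstract describes. The same comparison also shows why a linear baseline matters: on the raw data the linear model is already well above chance, so the gap to a more complex model has to be measured rather than assumed.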
