Department of ECE, Northeastern University, Boston, Massachusetts, United States.
Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, Massachusetts, United States.
PLoS Comput Biol. 2021 Oct 11;17(10):e1009433. doi: 10.1371/journal.pcbi.1009433. eCollection 2021 Oct.
Most predictive models based on gene expression data do not leverage information related to gene splicing, even though splicing is a fundamental feature of eukaryotic gene expression. Cigarette smoking is an important environmental risk factor for many diseases, and it has profound effects on gene expression. Using smoking status as the prediction target, we developed deep neural network predictive models from gene-, exon-, and isoform-level quantifications of RNA sequencing data in 2,557 subjects in the COPDGene Study. Models using exon and isoform quantifications clearly outperformed gene-level models when restricted to data from 5 genes from a previously published prediction model. Whereas the test set performance of the previously published model was 0.82 in the original publication, our exon-based models that include an exon-to-isoform mapping layer achieved a test set AUC (area under the receiver operating characteristic curve) of 0.88, which improved to 0.94 when using exon quantifications from a larger set of genes. Isoform variability is an important source of latent information in RNA-seq data that can be used to improve clinical prediction models.
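The exon-to-isoform mapping layer mentioned in the abstract can be pictured as a linear layer whose weights are masked by transcript annotation, so that each isoform unit only receives input from the exons it contains. The following is a minimal sketch under that assumption, not the authors' implementation: the function names, the toy annotation (4 exons, 3 isoforms), the ReLU activation, and the random parameters are all hypothetical and chosen only for illustration.

```python
import numpy as np


def build_exon_isoform_mask(isoform_exons, n_exons):
    """Return an (n_exons, n_isoforms) 0/1 matrix.

    Entry [e, i] is 1 when exon e is part of isoform i (hypothetical annotation).
    """
    mask = np.zeros((n_exons, len(isoform_exons)))
    for iso_idx, exon_ids in enumerate(isoform_exons):
        mask[list(exon_ids), iso_idx] = 1.0
    return mask


def exon_to_isoform_layer(exon_values, weights, bias, mask):
    """Masked linear layer followed by ReLU.

    Multiplying the weights by the mask removes every exon-isoform
    connection that the annotation does not support.
    """
    z = exon_values @ (weights * mask) + bias
    return np.maximum(z, 0.0)


def predict_probability(exon_values, params, mask):
    """Isoform-level hidden layer, then a logistic output unit."""
    hidden = exon_to_isoform_layer(exon_values, params["W1"], params["b1"], mask)
    logits = hidden @ params["w2"] + params["b2"]
    return 1.0 / (1.0 + np.exp(-logits))


# Toy annotation (assumption): isoform 0 = exons 0,1,2; isoform 1 = exons 0,2; isoform 2 = exon 3.
isoform_exons = [(0, 1, 2), (0, 2), (3,)]
n_exons = 4
mask = build_exon_isoform_mask(isoform_exons, n_exons)

rng = np.random.default_rng(0)
params = {
    "W1": rng.normal(size=(n_exons, len(isoform_exons))),
    "b1": np.zeros(len(isoform_exons)),
    "w2": rng.normal(size=len(isoform_exons)),
    "b2": 0.0,
}

# Five hypothetical subjects with non-negative exon quantifications.
x = rng.uniform(0.0, 10.0, size=(5, n_exons))
probs = predict_probability(x, params, mask)
```

In a trained model the weights would be learned from labeled data; here they are random, so `probs` only demonstrates the shape of the computation. The mask is what encodes the splicing structure: changing the value of exon 3 can only affect isoform 2 in this toy annotation.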