Efficient feature extraction from highly sparse binary genotype data for cancer prognosis prediction using an auto-encoder.

Author information

Shen Junjie, Li Huijun, Yu Xinghao, Bai Lu, Dong Yongfei, Cao Jianping, Lu Ke, Tang Zaixiang

Affiliations

Department of Biostatistics, School of Public Health, Medical College of Soochow University, Suzhou, China.

Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, Medical College of Soochow University, Suzhou, China.

Publication information

Front Oncol. 2023 Jan 10;12:1091767. doi: 10.3389/fonc.2022.1091767. eCollection 2022.

Abstract

Genomics, which involves tens of thousands of genes, is a complex system that determines phenotype. An interesting and vital question is how to integrate highly sparse genomic data, carrying a large number of minor effects, into a prediction model to improve predictive power. We find that deep learning can extract features effectively by transforming highly sparse dichotomous data into lower-dimensional continuous data in a non-linear way, which may benefit risk prediction based on genotype data. We developed a multi-stage strategy to extract information from highly sparse binary genotype data and applied it to cancer prognosis. Specifically, we first used a univariable regression model to reduce the set of binary biomarkers to a moderate size. Then, a trainable auto-encoder was used to learn compact features from the reduced data. Next, we applied LASSO to select the optimal combination of extracted features. Lastly, we applied this feature combination to real cancer prognostic models and evaluated the raw predictive effect of the models. The results indicated that these compressed, transformed features could improve the models' original predictive performance and might avoid overfitting. This idea may be enlightening for everyone involved in cancer research, risk reduction, treatment, and patient care that integrates genomics data.
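The multi-stage strategy described in the abstract (univariable screening, auto-encoder compression, LASSO selection) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. Everything here is an assumption made for illustration: the simulated data, the continuous outcome used as a stand-in for prognosis, correlation ranking as the univariable screen, the single-hidden-layer network, and all sizes and hyperparameters. The helper names `univariable_screen`, `train_autoencoder`, and `lasso_cd` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (hypothetical, not the study's data): sparse 0/1 genotype-like matrix.
n, p = 200, 300
X = (rng.random((n, p)) < 0.05).astype(float)
beta = np.zeros(p)
beta[:30] = rng.normal(0, 0.5, 30)        # many small effects
y = X @ beta + rng.normal(0, 0.5, n)      # continuous stand-in for a prognosis outcome

def univariable_screen(X, y, keep):
    """Rank features one at a time by |correlation| with y and return the top `keep` indices."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12
    score = np.abs(Xc.T @ yc) / denom
    return np.argsort(score)[::-1][:keep]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden, epochs=300, lr=0.2, seed=0):
    """One-hidden-layer auto-encoder: tanh encoder, sigmoid decoder, cross-entropy loss."""
    r = np.random.default_rng(seed)
    n, d = X.shape
    W1 = r.normal(0, 0.1, (d, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = r.normal(0, 0.1, (n_hidden, d)); b2 = np.zeros(d)
    losses = []
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)          # compact continuous code
        R = sigmoid(H @ W2 + b2)          # reconstruction of the 0/1 input
        losses.append(-np.mean(X * np.log(R + 1e-9) + (1 - X) * np.log(1 - R + 1e-9)))
        dZ2 = (R - X) / n                 # cross-entropy gradient w.r.t. decoder pre-activation
        dW2 = H.T @ dZ2; db2 = dZ2.sum(axis=0)
        dZ1 = (dZ2 @ W2.T) * (1 - H ** 2) # backpropagate through tanh
        dW1 = X.T @ dZ1; db1 = dZ1.sum(axis=0)
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2
    encode = lambda A: np.tanh(A @ W1 + b1)
    return encode, losses

def lasso_cd(Z, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)||y - Zb||^2 + lam * ||b||_1."""
    n, k = Z.shape
    b = np.zeros(k)
    resid = y.copy()                      # residual y - Z @ b, kept up to date
    col_sq = (Z ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(k):
            rho = Z[:, j] @ resid / n + col_sq[j] * b[j]
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]  # soft-thresholding
            resid += Z[:, j] * (b[j] - new)
            b[j] = new
    return b

# Stage 1: reduce the binary markers to a moderate size.
keep = univariable_screen(X, y, keep=60)
# Stage 2: learn compact continuous features from the reduced data.
encode, losses = train_autoencoder(X[:, keep], n_hidden=10)
Z = encode(X[:, keep])
Z = (Z - Z.mean(axis=0)) / (Z.std(axis=0) + 1e-12)
# Stage 3: select a combination of the extracted features.
coef = lasso_cd(Z, y - y.mean(), lam=0.05)
selected = np.flatnonzero(coef)
print("features kept by LASSO:", selected)
```

In a real analysis, the screening step, the encoder, and the LASSO penalty would each be fitted on training data only and assessed on held-out data; fitting all stages on the full sample, as this toy example does, gives an optimistic picture of predictive performance.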


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dea8/9872139/3e427e476c79/fonc-12-1091767-g001.jpg
