Meng Yajie, Jin Min
College of Computer Science and Electronic Engineering, Hunan University, Changsha, China.
Front Cell Dev Biol. 2021 Jun 30;9:696359. doi: 10.3389/fcell.2021.696359. eCollection 2021.
The emergence of high-throughput RNA-seq data has offered unprecedented opportunities for cancer diagnosis. However, most existing approaches to cancer diagnosis struggle to capture the highly nonlinear and complex associations in biological data. In this study, we propose a novel hierarchical feature selection and second learning probability error ensemble model (named HFS-SLPEE) for precision cancer diagnosis. Specifically, we first integrated protein-coding gene expression profiles, non-coding RNA expression profiles, and DNA methylation data to provide rich information. We then designed a novel hierarchical feature selection method, which takes CpG-gene biological associations into account and selects a compact set of superior features. Next, we used four individual classifiers that differ significantly from one another and are clearly complementary to build a set of heterogeneous classifiers. Lastly, we developed a second learning probability error ensemble model, called SLPEE, that learns from new data consisting of the class probability values predicted by the classifiers together with the actual labels, allowing the model to self-correct diagnosis errors. Benchmarking comparisons on TCGA showed that HFS-SLPEE performs better than state-of-the-art approaches. Moreover, we analyzed 10 groups of selected features in depth and found several novel HFS-SLPEE-predicted epigenomic and epigenetic biomarkers for breast invasive carcinoma (BRCA) (e.g., TSLP and ADAMTS9-AS2), lung adenocarcinoma (LUAD) (e.g., HBA1 and CTB-43E15.1), and kidney renal clear cell carcinoma (KIRC) (e.g., IRX2 and BMPR1B-AS1).
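The second-learning idea described above, in which a further model is trained on the class probabilities predicted by the base classifiers together with the actual labels, can be illustrated with a minimal sketch. This is not the authors' SLPEE implementation. The synthetic data, the use of four single-feature logistic regressions as stand-ins for the four heterogeneous classifiers, the holdout split, and all function names (`fit_logistic`, `predict_proba`, `make_data`, `demo`) are assumptions made purely for illustration.

```python
import math
import random


def sigmoid(z):
    # Clamp the logit so math.exp cannot overflow on extreme inputs.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, xi):
    """Return P(class = 1 | xi) under weights w (bias stored last)."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + w[-1])


def fit_logistic(X, y, lr=0.1, epochs=200):
    """Fit logistic regression by stochastic gradient ascent; bias is the last weight."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = yi - predict_proba(w, xi)
            for j, xj in enumerate(xi):
                w[j] += lr * err * xj
            w[-1] += lr * err
    return w


def make_data(n, rng, n_features=4):
    """Synthetic two-class data (hypothetical): each feature is only weakly informative."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        center = 1.0 if label else -1.0
        X.append([rng.gauss(center, 1.5) for _ in range(n_features)])
        y.append(label)
    return X, y


def demo(seed=0):
    rng = random.Random(seed)
    X_base, y_base = make_data(200, rng)  # trains the first-level learners
    X_meta, y_meta = make_data(200, rng)  # held out: trains the second-level learner
    X_test, y_test = make_data(200, rng)  # used for the final evaluation only

    # First level: four stand-in learners, each restricted to a single feature.
    base = [fit_logistic([[x[j]] for x in X_base], y_base) for j in range(4)]

    def to_probs(X):
        # New representation: one predicted class probability per base learner.
        return [[predict_proba(w, [x[j]]) for j, w in enumerate(base)] for x in X]

    # Second level: learn from (predicted probabilities, actual label) pairs.
    meta = fit_logistic(to_probs(X_meta), y_meta)

    preds = [int(predict_proba(meta, p) >= 0.5) for p in to_probs(X_test)]
    accuracy = sum(p == t for p, t in zip(preds, y_test)) / len(y_test)
    return meta, accuracy


if __name__ == "__main__":
    meta_weights, acc = demo()
    print("second-level weights:", meta_weights)
    print("test accuracy:", acc)
```

The second-level learner is fitted on samples the base learners never saw during training, so it learns how much to trust each probability rather than memorizing first-level overfitting. A cross-validated variant would serve the same purpose while using the data more efficiently.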