Department of Biostatistics and Medical Informatics, University of Wisconsin - Madison, Madison, WI 53706, USA.
Department of Computer Science, Stony Brook University, Stony Brook, NY 11794, USA.
Bioinformatics. 2021 May 23;37(8):1115-1124. doi: 10.1093/bioinformatics/btaa935.
Gene expression and regulation, a key molecular mechanism driving human disease development, remains poorly understood, especially at early disease stages. Integrating the growing volume of population-level genomic data and understanding gene regulatory mechanisms in disease development remain challenging. Machine learning has emerged as a way to address these challenges, but many machine learning methods are limited to building an accurate prediction model as a 'black box' and offer little biological or clinical interpretability.
To address these challenges, we developed an interpretable and scalable machine learning model, ECMarker, to predict gene expression biomarkers for disease phenotypes and simultaneously reveal the underlying regulatory mechanisms. Specifically, ECMarker integrates semi-restricted and discriminative restricted Boltzmann machines into a neural network model for classification that allows lateral connections at the input gene layer. This interpretable model scales without any prior feature selection, directly models and prioritizes genes, and reveals potential gene networks (from the lateral connections) for the phenotypes. Applied to gene expression data from non-small-cell lung cancer patients, ECMarker not only achieved relatively high accuracy in predicting cancer stages but also identified biomarker genes and gene networks that suggest regulatory mechanisms in lung cancer development. In addition, ECMarker demonstrates clinical interpretability: its prioritized biomarker genes can predict survival rates of early-stage lung cancer patients (P-value < 0.005). Finally, we identified a number of drugs, currently in clinical use for late-stage lung cancer or for other cancers, that affect these early lung cancer biomarkers, suggesting potential novel candidates for early cancer treatment.
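To make the architecture described above more concrete, the following is a minimal sketch of how a discriminative restricted Boltzmann machine with lateral connections at the input layer could score a sample. It is not the ECMarker implementation. The energy form, all parameter names (`W`, `U`, `L`, `b`, `c`, `d`), and the helper `top_lateral_edges` are assumptions chosen for illustration. Under this assumed energy, the lateral term is the same for every class, so it cancels in the class posterior and only contributes to the joint model of genes and labels.

```python
import numpy as np


def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return np.logaddexp(0.0, z)


def free_energy(x, y, W, U, L, b, c, d):
    """Free energy F(x, y) with binary hidden units summed out.

    Assumed (hypothetical) shapes:
      x: (n_genes,) expression vector, y: integer class index,
      W: (n_hidden, n_genes) gene-to-hidden weights,
      U: (n_hidden, n_classes) class-to-hidden weights,
      L: (n_genes, n_genes) symmetric lateral weights with zero diagonal,
      b: (n_genes,), c: (n_hidden,), d: (n_classes,) biases.
    """
    hidden_input = c + U[:, y] + W @ x
    lateral = 0.5 * (x @ L @ x)  # gene-gene term from lateral connections
    return -(d[y] + b @ x + lateral + softplus(hidden_input).sum())


def class_posterior(x, W, U, L, b, c, d):
    # p(y | x) is a softmax over classes of the negative free energy.
    neg_f = np.array(
        [-free_energy(x, y, W, U, L, b, c, d) for y in range(d.shape[0])]
    )
    neg_f -= neg_f.max()  # subtract the max for numerical stability
    p = np.exp(neg_f)
    return p / p.sum()


def top_lateral_edges(L, k):
    # Hypothetical readout: the k gene pairs (i, j), i < j, with largest |L[i, j]|.
    iu, ju = np.triu_indices(L.shape[0], k=1)
    order = np.argsort(-np.abs(L[iu, ju]))[:k]
    return [(int(iu[o]), int(ju[o])) for o in order]
```

In this sketch, ranking entries of `L` by magnitude is only one plausible way to read a candidate gene network from lateral weights; the ranking criterion actually used by ECMarker is not stated in the abstract.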
ECMarker is available as an open-source, general-purpose tool at https://github.com/daifengwanglab/ECMarker.
Supplementary data are available at Bioinformatics online.