College of Agronomy, Gansu Agricultural University, Lanzhou, Gansu, China.
School of Agriculture, Middle Tennessee State University, Murfreesboro, Tennessee, United States of America.
PLoS Comput Biol. 2019 Aug 12;15(8):e1007264. doi: 10.1371/journal.pcbi.1007264. eCollection 2019 Aug.
Accurately predicting and testing the type of pulmonary arterial hypertension (PAH) in each patient using cost-effective microarray-based expression data and machine learning algorithms could greatly help either to identify the most targeted medication or to adopt other therapeutic measures that could correct or restore defective genetic signaling at an early stage. Furthermore, the process of constructing the prediction model can also help identify highly informative genes controlling PAH, leading to an enhanced understanding of the disease etiology and molecular pathways. In this study, we applied several different gene filtering methods to microarray expression data obtained from a high-quality PAH patient dataset. We then proposed a novel feature selection and refinement algorithm, used in conjunction with well-known machine learning methods, to identify a small set of highly informative genes. Results indicated that clusters of low-expression genes could be extremely informative for predicting and differentiating different forms of PAH. Additionally, our proposed feature refinement algorithm could significantly enhance model performance. To summarize, when state-of-the-art machine learning was integrated with the novel feature refinement algorithm, the most accurate models could provide near-perfect classification accuracy using very few (close to ten) low-expression genes.
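The workflow summarized above (filter genes by expression level, then refine the selected features against a classifier) can be illustrated with a minimal sketch. This is not the authors' algorithm, which the abstract does not specify. It assumes a simple mean-expression threshold as the filter, a nearest-centroid classifier scored by leave-one-out accuracy, and greedy backward elimination as the refinement step. The synthetic data, the threshold value, and all function names are hypothetical.

```python
import random
from statistics import mean

def filter_low_expression(X, max_mean):
    """Return indices of genes whose mean expression is at most max_mean (assumed filter)."""
    return [g for g in range(len(X[0])) if mean(row[g] for row in X) <= max_mean]

def nearest_centroid_predict(train_X, train_y, x, genes):
    """Predict the label whose class centroid, computed over `genes`, is closest to x."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, lab in zip(train_X, train_y) if lab == label]
        centroid = [mean(r[g] for r in rows) for g in genes]
        dist = sum((x[g] - c) ** 2 for g, c in zip(genes, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def loo_accuracy(X, y, genes):
    """Leave-one-out classification accuracy using only the given genes."""
    hits = 0
    for i in range(len(X)):
        train_X, train_y = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        hits += int(nearest_centroid_predict(train_X, train_y, X[i], genes) == y[i])
    return hits / len(X)

def refine_features(X, y, genes):
    """Greedy backward elimination: drop a gene whenever accuracy does not decrease."""
    selected = list(genes)
    score = loo_accuracy(X, y, selected)
    improved = True
    while improved and len(selected) > 1:
        improved = False
        for g in list(selected):
            trial = [h for h in selected if h != g]
            trial_score = loo_accuracy(X, y, trial)
            if trial_score >= score:
                selected, score, improved = trial, trial_score, True
                break
    return selected, score

# Synthetic data (hypothetical): 2 classes x 10 samples, 6 genes.
# Genes 0-1: low expression, class dependent; 2-3: low-expression noise;
# genes 4-5: high-expression noise.
rng = random.Random(0)
X, y = [], []
for label in (0, 1):
    for _ in range(10):
        informative = [rng.gauss(1.0 + 2.0 * label, 0.3) for _ in range(2)]
        low_noise = [rng.gauss(2.0, 1.0) for _ in range(2)]
        high_noise = [rng.gauss(100.0, 10.0) for _ in range(2)]
        X.append(informative + low_noise + high_noise)
        y.append(label)

kept = filter_low_expression(X, max_mean=10.0)   # assumed threshold
baseline = loo_accuracy(X, y, kept)
selected, score = refine_features(X, y, kept)
print("kept after filtering:", kept)
print("baseline accuracy:", baseline)
print("refined gene set:", selected, "accuracy:", score)
```

Because a gene is dropped only when leave-one-out accuracy does not fall, the refined set never scores below the filtered set on this criterion. On real microarray data, the refinement would need to run inside an outer cross-validation loop to avoid an optimistic accuracy estimate.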