Chen Yongnan, Liu Songsong, Papageorgiou Lazaros G, Theofilatos Konstantinos, Tsoka Sophia
Department of Informatics, Faculty of Natural, Mathematical and Engineering Sciences, King's College London, Bush House, London WC2B 4BG, UK.
School of Management, Harbin Institute of Technology, Harbin 150001, China.
Cancers (Basel). 2023 Mar 15;15(6):1787. doi: 10.3390/cancers15061787.
With advances in high-throughput technologies, there has been an enormous increase in data profiling the activity of molecules in disease. While such data provide more comprehensive information on cellular actions, their large volume and complexity make accurate classification of disease phenotypes difficult. Novel modelling methods that improve accuracy while offering interpretable means of analysis are therefore required. Biological pathways can be used to incorporate a priori knowledge of biological interactions, decreasing data dimensionality and increasing the biological interpretability of machine learning models.
A mathematical optimisation model is proposed for pathway activity inference towards precise disease phenotype prediction and is applied to RNA-Seq datasets. The model is based on mixed-integer linear programming (MILP) and infers pathway activity as a linear combination of the expression of pathway member genes: expression values are multiplied by model-determined gene weights, which are optimised to maximise discrimination of phenotype classes and minimise incorrect sample allocation.
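The linear-combination step described above can be illustrated with a minimal sketch. It covers only how a pathway activity score is computed once gene weights are available; it does not reproduce the MILP that determines those weights. The function name `pathway_activity`, the toy expression matrix, and the weight values are hypothetical and chosen purely for illustration.

```python
import numpy as np


def pathway_activity(expression, weights):
    """Return one activity score per sample for a single pathway.

    expression: array of shape (n_samples, n_genes), expression values of
                the pathway's member genes.
    weights:    array of shape (n_genes,), one weight per member gene.
                In the paper these are determined by the optimisation model;
                here they are simply passed in (assumption).
    """
    expression = np.asarray(expression, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if expression.ndim != 2 or weights.ndim != 1:
        raise ValueError("expression must be 2-D and weights must be 1-D")
    if expression.shape[1] != weights.shape[0]:
        raise ValueError("need exactly one weight per pathway member gene")
    # Activity is the weighted sum of member-gene expression for each sample.
    return expression @ weights


# Toy data (hypothetical): 2 samples, 3 member genes.
expression = np.array([
    [2.0, 0.5, 1.0],
    [0.1, 1.5, 0.2],
])
weights = np.array([0.6, -0.3, 0.1])  # illustrative values only

activity = pathway_activity(expression, weights)
```

In this sketch a negative weight lets a gene lower the pathway score, which is one way a weighted sum could separate phenotype classes; whether the published model allows negative weights is not stated in the abstract.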
The model is evaluated on breast and colorectal cancer transcriptomes, where it yields solutions of good optimality and good prediction performance on the related cancer subtypes. Two baseline pathway activity inference methods and three advanced methods are used for comparison. Sample prediction accuracy, robustness against noisy expression data, and survival analysis suggest that our model achieves competitive prediction performance while providing interpretability and insight into key pathways and genes. Overall, our work demonstrates that the flexible nature of mathematical programming lends itself well to developing efficient computational strategies for pathway activity inference and disease subtype prediction.