Lu Hao, Rezapour Mostafa, Baha Haseebullah, Khalid Khan Niazi Muhammad, Narayanan Aarthi, Nafi Gurcan Metin
Center for Artificial Intelligence Research, Wake Forest University School of Medicine, Winston-Salem, NC 27101, United States.
School of Systems Biology, College of Science, George Mason University, Fairfax, VA 22030, United States.
Brief Bioinform. 2024 Nov 22;26(1). doi: 10.1093/bib/bbaf039.
Pathway analysis plays a critical role in bioinformatics, enabling researchers to identify biological pathways associated with various conditions by analyzing gene expression data. However, the rise of large, multi-center datasets has highlighted limitations in traditional methods such as Over-Representation Analysis (ORA) and Functional Class Scoring (FCS), which struggle with low signal-to-noise ratios (SNR) and large sample sizes. To tackle these challenges, we use a deep learning-based classification method, Gene PointNet (GPNet), together with a novel $P$-value computation approach that leverages the confusion matrix to address pathway analysis tasks. We validated the effectiveness of our method through a comparative study using a simulated dataset and RNA-Seq data from The Cancer Genome Atlas breast cancer dataset. Our method was benchmarked against traditional techniques (ORA, FCS), shallow machine learning models (logistic regression, support vector machine), and deep learning approaches (DeepHisCom, PASNet). The results demonstrate that GPNet outperforms these methods in low-SNR, large-sample datasets, where it remains robust and reliable, significantly reducing Type I error and improving power. This makes our method well suited for pathway analysis in large, multi-center studies. The code can be found at https://github.com/haolu123/GPNet_pathway.
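The abstract does not say how the $P$-value is derived from the confusion matrix. As a minimal sketch of one plausible reading, assumed here purely for illustration and not taken from the GPNet code, a pathway's classifier could be scored with a one-sided Fisher's exact test on its 2x2 confusion matrix: if predictions are independent of the true labels, the pathway carries no signal. The function name `confusion_matrix_p_value` and its arguments are hypothetical.

```python
from math import comb


def confusion_matrix_p_value(tp: int, fn: int, fp: int, tn: int) -> float:
    """One-sided Fisher's exact test on a 2x2 confusion matrix.

    ASSUMPTION: this is an illustrative stand-in, not the authors' method.
    Null hypothesis: predicted labels are independent of true labels.
    Returns P(X >= tp), where X is hypergeometric given the fixed margins.
    """
    n_pos = tp + fn            # samples whose true label is positive
    n_neg = fp + tn            # samples whose true label is negative
    n_pred_pos = tp + fp       # samples predicted positive
    total = n_pos + n_neg

    # Number of ways to choose which samples are predicted positive.
    denom = comb(total, n_pred_pos)

    # Sum the upper tail: tables with at least `tp` true positives.
    upper = min(n_pos, n_pred_pos)
    tail = sum(
        comb(n_pos, k) * comb(n_neg, n_pred_pos - k)
        for k in range(tp, upper + 1)
    )
    return tail / denom


if __name__ == "__main__":
    # A classifier that separates the two classes perfectly on 10 samples.
    print(confusion_matrix_p_value(tp=5, fn=0, fp=0, tn=5))
    # A classifier at chance level on 100 samples.
    print(confusion_matrix_p_value(tp=25, fn=25, fp=25, tn=25))
```

Under this assumed test, a small value indicates that the classifier trained on a pathway's genes predicts the phenotype better than chance, which is the sense in which a classification result can be turned into a pathway-level significance score.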