Advanced Analytics Institute, Faculty of Engineering and Information Technology, University of Technology Sydney, PO Box 123, Broadway, Sydney, 2007, NSW, Australia.
Faculty of Engineering and Information Technology, University of Technology Sydney, PO Box 123, Broadway, Sydney, 2007, NSW, Australia.
BMC Med Genomics. 2019 Dec 20;12(Suppl 8):183. doi: 10.1186/s12920-019-0630-4.
The early diagnosis of lung cancer has long been a critical problem in clinical practice, and identifying differentially expressed genes as disease markers is a promising solution. However, most existing gene differential expression analysis (DEA) methods have two main drawbacks. First, they rely on fixed statistical hypotheses and are not always effective. Second, they cannot identify a definite expression level boundary when there is no obvious expression level gap between the control and experiment groups.
This paper proposes a novel approach to identify marker genes and gene expression level boundaries for lung cancer. By calculating a kernel maximum mean discrepancy, our method evaluates the expression differences between normal, normal adjacent to tumor (NAT), and tumor samples. For the potential marker genes, the expression level boundaries between the groups are defined with the information entropy method.
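The two steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the kernel, its bandwidth, or the exact entropy criterion, so the Gaussian (RBF) kernel, the `sigma` value, the biased MMD estimator, the minimum weighted-entropy cut rule, and the toy expression values below are all assumptions made for illustration. The sketch also compares only two groups, whereas the paper compares three.

```python
import math

def rbf_kernel(x, y, sigma):
    # Gaussian kernel on scalar expression values (kernel choice is an assumption).
    return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))

def mmd2(xs, ys, sigma=1.0):
    """Biased estimate of the squared kernel MMD between two 1-D samples."""
    kxx = sum(rbf_kernel(a, b, sigma) for a in xs for b in xs) / (len(xs) ** 2)
    kyy = sum(rbf_kernel(a, b, sigma) for a in ys for b in ys) / (len(ys) ** 2)
    kxy = sum(rbf_kernel(a, b, sigma) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2.0 * kxy

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    h = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        h -= p * math.log2(p)
    return h

def entropy_boundary(values, labels):
    """Cut point minimising the weighted class entropy of the two sides.

    This is one common entropy-based discretisation rule, assumed here;
    the paper's exact criterion may differ.
    """
    pairs = sorted(zip(values, labels))
    best_cut, best_h = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut can separate identical expression values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        h = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if h < best_h:
            best_cut, best_h = cut, h
    return best_cut, best_h

# Hypothetical expression values for a single gene (toy data, not from the paper).
normal = [1.0, 1.2, 0.9, 1.1]
tumor = [3.0, 3.2, 2.9, 3.1]

score = mmd2(normal, tumor)  # larger score = larger distribution difference
cut, cut_entropy = entropy_boundary(
    normal + tumor, ["normal"] * len(normal) + ["tumor"] * len(tumor)
)
```

In a sketch like this, genes would be ranked by `score`, and `cut` would serve as the expression level boundary for a selected gene. A cut entropy of zero means the boundary separates the two toy groups perfectly.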
Compared with two conventional methods, the t-test and fold change, the top genes by average rank selected by our method achieve better performance on all metrics in 10-fold cross-validation. GO and KEGG enrichment analyses are then conducted to explore the biological functions of the top 100 ranked genes. Finally, we choose the top 10 genes by average rank as lung cancer markers and calculate and report their expression boundaries.
The proposed approach is effective at identifying gene markers for lung cancer diagnosis. It is not only more accurate than conventional DEA methods but also provides a reliable way to identify gene expression level boundaries.