Mallik Saurav, Zhao Zhongming
Computer Science & Engineering, Aliah University, Newtown, Newtown 700156, India.
Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA.
Quant Biol. 2017 Dec;5(4):302-327. doi: 10.1007/s40484-017-0119-0. Epub 2017 Nov 23.
Marker detection is an important task in complex disease studies. Here we provide an association rule mining (ARM) based approach for identifying integrated markers through mutual information (MI) based statistically significant feature extraction, and apply it to acute myeloid leukemia (AML) and prostate carcinoma (PC) gene expression and methylation profiles.
We first collect the genes that have both expression and methylation values in AML as well as in PC. Next, we run the Jarque-Bera normality test on the expression/methylation data to divide the whole dataset into two parts: one that follows a normal distribution and one that does not. This gives four parts of the dataset: normally distributed expression data, normally distributed methylation data, non-normally distributed expression data, and non-normally distributed methylation data. An MI-based feature-extraction technique is then applied to each part, which yields a list of top-ranked genes. Next, we apply the Welch t-test (parametric test) and the Shrink t-test (non-parametric test) to the expression/methylation data of the top selected normally distributed genes and non-normally distributed genes, respectively. We then use a recent weighted ARM method, "RANWAR", to combine all or a specific subset of the resulting genes and generate the top oncogenic rules along with their respective integrated markers. Finally, we perform a literature search as well as KEGG pathway and Gene Ontology (GO) analyses using the Enrichr database to validate the prioritized oncogenes as markers and to label each marker as existing or novel.
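The first and third steps above (splitting genes by a normality test, then testing the normally distributed ones with a Welch t-test) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the gene labels, and the 0.05 significance threshold are hypothetical, and the Jarque-Bera p-value uses the asymptotic chi-square approximation with 2 degrees of freedom, which is rough for small sample sizes. The Welch function returns only the statistic and the Welch-Satterthwaite degrees of freedom; a p-value would need a t-distribution CDF from a statistics library.

```python
import math
from statistics import mean, variance


def jarque_bera(values):
    """Return (JB statistic, asymptotic p-value) for one gene's values.

    Assumes the values are not all identical (non-zero variance).
    """
    n = len(values)
    mu = mean(values)
    # Central moments with the 1/n normalization used by the JB statistic.
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    m4 = sum((v - mu) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    # Chi-square with 2 degrees of freedom has survival function exp(-x / 2).
    return jb, math.exp(-jb / 2.0)


def split_by_normality(profiles, alpha=0.05):
    """Partition {gene: values} into (normal, non_normal) dictionaries.

    alpha is an assumed threshold, not one taken from the paper.
    """
    normal, non_normal = {}, {}
    for gene, values in profiles.items():
        _, p_value = jarque_bera(values)
        target = normal if p_value >= alpha else non_normal
        target[gene] = values
    return normal, non_normal


def welch_t(group_a, group_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    # Per-group squared standard errors (sample variance / group size).
    se_a = variance(group_a) / len(group_a)
    se_b = variance(group_b) / len(group_b)
    t_stat = (mean(group_a) - mean(group_b)) / math.sqrt(se_a + se_b)
    dof = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(group_a) - 1) + se_b ** 2 / (len(group_b) - 1)
    )
    return t_stat, dof


# Hypothetical toy profiles: one symmetric gene, one with an extreme outlier.
profiles = {"GENE_A": [1, 2, 3, 4, 5], "GENE_B": [0] * 30 + [100]}
normal, non_normal = split_by_normality(profiles)
t_stat, dof = welch_t([1, 2, 3], [2, 3, 4])
```

In a full pipeline the same split would be run separately on the expression and the methylation matrices, giving the four data parts described above, and the non-normal parts would go to the non-parametric test instead of `welch_t`.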
The novel markers of AML are {ABCB11↑∪KRT17↓} (i.e., ABCB11 up-regulated and KRT17 down-regulated) and {AP1S1-∪KRT17↓∪NEIL2-∪DYDC1↓} (i.e., AP1S1 and NEIL2 both hypo-methylated, and KRT17 and DYDC1 both down-regulated). The novel marker of PC is {UBIAD1¶∪APBA2‡∪C4orf31‡} (i.e., UBIAD1 up-regulated and hypo-methylated, and APBA2 and C4orf31 both down-regulated and hyper-methylated).
The identified novel markers might have critical roles in AML as well as in PC. The approach can be applied to other complex diseases.