Department of Bio and Brain Engineering, KAIST, 373-1 Guseong-dong, Yuseong-gu, Daejeon 305-701, Republic of Korea.
Bioinformatics. 2010 Jun 15;26(12):1506-12. doi: 10.1093/bioinformatics/btq207. Epub 2010 Apr 21.
Gene set analysis has become an important tool for the functional interpretation of high-throughput gene expression datasets. Moreover, pattern analyses based on inferred gene set activities of individual samples have shown the ability to identify more robust disease signatures than individual gene-based pattern analyses. Although a number of approaches have been proposed for gene set-based pattern analysis, the combinatorial influence of deregulated gene sets on disease phenotype classification has not been studied sufficiently.
We propose a new approach for inferring combinatorial Boolean rules of gene sets for a better understanding of the cancer transcriptome and cancer classification. To reduce the search space of possible Boolean rules, we identify small groups of gene sets that synergistically contribute to the classification of samples into their corresponding phenotypic groups (such as normal and cancer). We then measure the significance of the candidate Boolean rules derived from each group of gene sets; the level of significance is based on the class entropy of the samples selected in accordance with the rules. By applying this approach to publicly available prostate cancer datasets, we identified 72 significant Boolean rules. Finally, we discuss several of the identified Boolean rules, such as the rule of glutathione metabolism (down) and prostaglandin synthesis regulation (down), which are consistent with known prostate cancer biology.
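The core idea of scoring a rule by the class entropy of the samples it selects can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the dictionary-based rule representation, the gene set names (`gene_set_A`, `gene_set_B`), and the toy samples are all hypothetical, and the sketch assumes that gene set activities have already been discretized into "up"/"down" states and that a rule is a logical AND of such states. How the entropy is converted into a significance level is not described here and is left out.

```python
import math
from collections import Counter


def class_entropy(labels):
    """Shannon entropy (in bits) of a list of class labels; 0.0 if empty."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


def select_samples(activities, rule):
    """Return indices of samples matching every (gene set, state) pair in the rule."""
    return [
        i
        for i, sample in enumerate(activities)
        if all(sample.get(gene_set) == state for gene_set, state in rule.items())
    ]


# Hypothetical toy data: one dict of discretized gene set states per sample.
activities = [
    {"gene_set_A": "down", "gene_set_B": "down"},
    {"gene_set_A": "down", "gene_set_B": "down"},
    {"gene_set_A": "down", "gene_set_B": "down"},
    {"gene_set_A": "down", "gene_set_B": "up"},
    {"gene_set_A": "up", "gene_set_B": "up"},
    {"gene_set_A": "up", "gene_set_B": "down"},
]
labels = ["cancer", "cancer", "cancer", "normal", "normal", "normal"]

# Hypothetical rule: gene_set_A (down) AND gene_set_B (down).
rule = {"gene_set_A": "down", "gene_set_B": "down"}

selected = select_samples(activities, rule)
selected_labels = [labels[i] for i in selected]
rule_entropy = class_entropy(selected_labels)
baseline_entropy = class_entropy(labels)

print("selected samples:", selected)
print("entropy of selected samples:", rule_entropy)
print("entropy of all samples:", baseline_entropy)
```

In this toy case the rule selects only samples of one class, so the entropy of the selected samples is lower than that of the full sample set; a rule whose selected samples are mixed across classes would score closer to the baseline.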
Scripts written in Python and R are available at http://biosoft.kaist.ac.kr/~ihpark/. The refined gene sets and the full list of the identified Boolean rules are provided in the Supplementary Material.
Supplementary data are available at Bioinformatics online.