Overlapping clustering of gene expression data using penalized weighted normalized cut.
Author information
Teran Hidalgo Sebastian J, Zhu Tingyu, Wu Mengyun, Ma Shuangge
Affiliations
Department of Biostatistics, Yale University, New Haven, Connecticut.
Department of Statistics, Xiamen University, Xiamen, China.
Publication information
Genet Epidemiol. 2018 Dec;42(8):796-811. doi: 10.1002/gepi.22164. Epub 2018 Oct 9.
Clustering is widely used in the analysis of gene expression data. For complex diseases, it has played an important role in identifying unknown functions of genes and in serving as the basis for other analyses, among other uses. A common limitation of most existing clustering approaches is the assumption that genes are separated into disjoint clusters. Because genes often have multiple functions and can thus belong to more than one functional cluster, disjoint clustering results can be unsatisfactory. In addition, due to the small sample sizes of genetic profiling studies and other factors, there may not be sufficient evidence to confirm the specific functions of some genes and assign them definitively to disjoint clusters. In this study, we develop an effective overlapping clustering approach, which takes into account the multiplicity of gene functions and the lack of certainty in practical analysis. A penalized weighted normalized cut (PWNCut) criterion is proposed based on the NCut technique and a norm constraint. It outperforms multiple competitors in simulation. The analysis of The Cancer Genome Atlas (TCGA) data on breast cancer and cervical cancer leads to biologically sensible findings that differ from those obtained using the alternatives. To facilitate implementation, we develop the function pwncut in the R package NCutYX.
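For readers unfamiliar with the NCut technique that the abstract builds on, the following is a minimal sketch of the standard normalized cut criterion for a disjoint partition. It is not the PWNCut criterion and not the pwncut function from NCutYX: the function name `ncut_value`, the toy similarity matrix, and the label vectors are illustrative assumptions only. Lower values indicate clusters that are weakly connected to the rest of the graph.

```python
import numpy as np

def ncut_value(W, labels):
    """Standard normalized cut of a disjoint partition (illustrative only).

    W: symmetric, nonnegative similarity matrix of shape (n, n).
    labels: length-n sequence of integer cluster assignments.
    """
    W = np.asarray(W, dtype=float)
    labels = np.asarray(labels)
    degree = W.sum(axis=1)  # total similarity attached to each node
    total = 0.0
    for k in np.unique(labels):
        inside = labels == k
        cut = W[inside][:, ~inside].sum()  # similarity leaving cluster k
        vol = degree[inside].sum()         # volume of cluster k
        if vol > 0:
            total += cut / vol
    return total

# Hypothetical toy data: two tightly linked pairs with weak cross links.
W = np.array([
    [0.0, 1.0, 0.1, 0.0],
    [1.0, 0.0, 0.0, 0.1],
    [0.1, 0.0, 0.0, 1.0],
    [0.0, 0.1, 1.0, 0.0],
])
good = ncut_value(W, [0, 0, 1, 1])  # keeps the strong pairs together
bad = ncut_value(W, [0, 1, 0, 1])   # splits the strong pairs
print(good < bad)
```

The overlapping approach described in the abstract relaxes the requirement that each gene receives exactly one label; this sketch only shows the disjoint baseline that such a relaxation starts from.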