Wang Niya, Hoffman Eric P, Chen Lulu, Chen Li, Zhang Zhen, Liu Chunyu, Yu Guoqiang, Herrington David M, Clarke Robert, Wang Yue
Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA 22203, USA.
Research Center for Genetic Medicine, Children's National Medical Center, Washington, DC 20007, USA.
Sci Rep. 2016 Jan 7;6:18909. doi: 10.1038/srep18909.
Tissue heterogeneity is both a major confounding factor and an underexploited information source. While a handful of reports have demonstrated the potential of supervised computational methods to deconvolute tissue heterogeneity, these approaches require a priori information on the marker genes or composition of known subpopulations. To address the critical absence of validated marker genes for many (including novel) subpopulations, we describe convex analysis of mixtures (CAM), a fully unsupervised in silico method that identifies subpopulation marker genes directly from the original mixed gene expression data in scatter space and can improve molecular analyses in many biological contexts. Validated with predesigned mixtures and applied to gene expression data from peripheral leukocytes, brain tissue, and the yeast cell cycle, CAM revealed novel marker genes that were otherwise undetectable using existing methods. Importantly, CAM requires no a priori information on the number, identity, or composition of the subpopulations present in mixed samples, and does not require the presence of pure subpopulations in sample space. This advantage is significant because CAM can achieve all of its goals using only a small number of heterogeneous samples, and is more powerful at distinguishing between phenotypically similar subpopulations.