Duan Junbo, Zhang Ji-Gang, Wan Mingxi, Deng Hong-Wen, Wang Yu-Ping
Department of Biomedical Engineering, Xi'an Jiaotong University, Xi'an, P. R. China.
J Bioinform Comput Biol. 2014 Aug;12(4):1450021. doi: 10.1142/S0219720014500218. Epub 2014 Aug 19.
Copy number variations (CNVs) can serve as significant biomarkers, and next-generation sequencing (NGS) enables their detection at high resolution. However, how to extract features from CNVs and apply them to genomic studies such as population clustering has become a major challenge. In this paper, we propose a novel method for population clustering based on CNVs detected from NGS data. First, CNVs are extracted from each sample to form a feature matrix. Then, this feature matrix is decomposed into a source matrix and a weight matrix by non-negative matrix factorization (NMF). The source matrix consists of common CNVs shared by all samples from the same group, and the weight matrix indicates the corresponding level of these CNVs in each sample. Therefore, NMF of CNVs can differentiate samples from different ethnic groups, i.e., perform population clustering. To validate the approach, we applied it to both simulated data and two real data sets from the 1000 Genomes Project. The results on simulated data demonstrate that the proposed method can recover the true common CNVs with high quality. The results of the first real data analysis show that the proposed method can cluster two family trios with different ancestries into two ethnic groups, and the results of the second show that the method can be applied to whole-genome data with a large sample size consisting of multiple groups. Both results demonstrate the potential of the proposed method for population clustering.
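The decomposition described in the abstract can be illustrated with a minimal sketch. The abstract does not state which NMF algorithm, matrix orientation, or cluster-assignment rule the authors used, so everything below is an assumption made for illustration: the feature matrix `X` is taken as regions by samples, the factorization uses the standard multiplicative updates for the Frobenius objective, and each sample is assigned to the source with the largest weight. The function names (`nmf`, `cluster_samples`, `make_toy_data`) are hypothetical, and the data are synthetic, not taken from the 1000 Genomes Project.

```python
import numpy as np


def nmf(X, k, n_iter=1000, seed=0, eps=1e-9):
    """Factor a non-negative matrix X (regions x samples) as X ~ S @ W.

    S (regions x k) plays the role of the source matrix and
    W (k x samples) the weight matrix. Multiplicative updates are an
    assumed choice; the abstract does not name the algorithm.
    """
    rng = np.random.default_rng(seed)
    n_regions, n_samples = X.shape
    S = rng.random((n_regions, k)) + eps
    W = rng.random((k, n_samples)) + eps
    for _ in range(n_iter):
        # Each update keeps the factors non-negative by construction.
        W *= (S.T @ X) / (S.T @ S @ W + eps)
        S *= (X @ W.T) / (S @ (W @ W.T) + eps)
    # Rescale so every source has unit norm; the product S @ W is unchanged.
    scale = np.linalg.norm(S, axis=0) + eps
    return S / scale, W * scale[:, None]


def cluster_samples(W):
    """Assign each sample (column of W) to its highest-weight source."""
    return np.argmax(W, axis=0)


def make_toy_data(seed=0):
    """Synthetic example: two groups of three samples, each group sharing
    one block of elevated copy-number signal, plus small positive noise."""
    rng = np.random.default_rng(seed)
    n_regions = 40
    profile_a = np.zeros(n_regions)
    profile_a[5:12] = 1.0   # CNV region shared by group A (made up)
    profile_b = np.zeros(n_regions)
    profile_b[25:33] = 1.0  # CNV region shared by group B (made up)
    X = np.column_stack([profile_a] * 3 + [profile_b] * 3)
    return X + 0.05 * rng.random(X.shape)


X = make_toy_data()
S, W = nmf(X, k=2)
labels = cluster_samples(W)
print(labels)
```

In this toy setting the first three samples should receive one label and the last three the other; which group gets label 0 depends on the random initialization. Normalizing the columns of `S` before taking the arg-max is a design choice that keeps the weights comparable across sources.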