Hu Hao, Liu Xiang, Jin Wenfei, Hilger Ropers H, Wienker Thomas F
Department of Human Molecular Genetics, Max-Planck Institute for Molecular Genetics, Berlin, 14195, Germany.
BlackBerry Deutschland GmbH, Bochum, 44799, Germany.
Sci Rep. 2015 May 15;5:10247. doi: 10.1038/srep10247.
Sample tagging is designed to identify accidental sample mix-ups, a major issue in re-sequencing studies. In this work, we develop a model to measure the information content of SNPs, so that we can optimize a panel of SNPs that approaches the maximal information for discrimination. The analysis shows that as few as 60 optimized SNPs can differentiate the individuals in a population as large as the present world population, and that only 30 optimized SNPs are in practice sufficient to label up to 100,000 individuals. In simulated populations of 100,000 individuals, the average Hamming distance generated by the optimized set of 30 SNPs is larger than 18, and the duality frequency is lower than 1 in 10,000. This strategy of sample discrimination proves robust across large sample sizes and different datasets. The optimized sets of SNPs are designed for whole-exome sequencing, and a program is provided for SNP selection that allows customized SNP numbers and genes of interest. A sample-tagging plan based on this framework will improve re-sequencing projects in terms of reliability and cost-effectiveness.
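The quantities named in the abstract can be illustrated with a minimal sketch. It is not the published model or the provided SNP-selection program: it assumes Hardy-Weinberg genotype frequencies, uses Shannon entropy as a stand-in for "information content", treats SNPs as independent, and interprets "duality" as two individuals sharing an identical tag. All function names (`genotype_entropy`, `simulate_tags`, `summarize`) are hypothetical. Under these assumptions, two random individuals differ at a SNP with minor allele frequency 0.5 with probability 1 - (0.25² + 0.5² + 0.25²) = 0.625, so 30 such SNPs give an expected Hamming distance of 30 × 0.625 = 18.75.

```python
import math
import random
from itertools import combinations


def genotype_entropy(maf: float) -> float:
    """Shannon entropy (bits) of a biallelic SNP genotype.

    Assumes Hardy-Weinberg proportions p^2, 2pq, q^2 (an assumption of
    this sketch, not necessarily the measure used in the paper).
    """
    p, q = maf, 1.0 - maf
    freqs = [p * p, 2 * p * q, q * q]
    return -sum(f * math.log2(f) for f in freqs if f > 0)


def simulate_tags(n_individuals, mafs, rng):
    """Draw one genotype (0, 1 or 2 minor alleles) per SNP per individual.

    SNPs are drawn independently, i.e. no linkage disequilibrium.
    """
    tags = []
    for _ in range(n_individuals):
        tag = tuple((rng.random() < m) + (rng.random() < m) for m in mafs)
        tags.append(tag)
    return tags


def hamming(a, b):
    """Number of SNP positions at which two tags differ."""
    return sum(x != y for x, y in zip(a, b))


def summarize(tags):
    """Return (mean pairwise Hamming distance, number of identical pairs)."""
    total = 0
    identical = 0
    n_pairs = 0
    for a, b in combinations(tags, 2):
        d = hamming(a, b)
        total += d
        identical += d == 0
        n_pairs += 1
    return total / n_pairs, identical


if __name__ == "__main__":
    rng = random.Random(0)      # fixed seed for reproducibility
    mafs = [0.5] * 30           # hypothetical panel: 30 SNPs, MAF 0.5
    tags = simulate_tags(500, mafs, rng)
    mean_dist, identical = summarize(tags)
    print(f"entropy per SNP: {genotype_entropy(0.5):.2f} bits")
    print(f"mean Hamming distance: {mean_dist:.2f}")
    print(f"identical pairs: {identical}")
```

The population here is kept at 500 individuals so that the all-pairs comparison stays fast; scaling to the 100,000 individuals mentioned in the abstract would call for sampling pairs or hashing tags rather than enumerating every pair.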