Du Yonghong, Martin Joshua S, McGee John, Yang Yuchen, Liu Eric Yi, Sun Yingrui, Geihs Matthias, Kong Xuejun, Zhou Eric Lingfeng, Li Yun, Huang Jie
School of Statistics, Beijing Normal University, Beijing, China.
Department of Genetics, University of North Carolina Chapel Hill, Chapel Hill, North Carolina, United States of America.
PLoS One. 2017 Sep 19;12(9):e0182438. doi: 10.1371/journal.pone.0182438. eCollection 2017.
In the current era of precision medicine, a growing number of samples are being genotyped and sequenced. Both researchers and commercial companies expend significant time and resources to reduce the error rate. However, sample mix-up rates of between 0.1% and 1% have been reported, not to mention the possibly higher mix-up rate during downstream genetic reporting. Even at the low end of this estimate, this translates to a large number of mislabeled samples, especially given the projected one billion people who will be sequenced within the next decade. Here, we first describe a method to identify a small set of single nucleotide polymorphisms (SNPs) that can uniquely identify a personal genome, using allele frequencies of five major continental populations reported by the 1000 Genomes Project and the ExAC Consortium. To make this panel more informative, we added four SNPs that are commonly used to predict ABO blood type and another two SNPs that can predict sex. We then implement a web interface (http://qrcme.tech), nicknamed QRC (for QR code based Concordance check), which can extract the relevant ID SNPs from raw genetic data, encode their genotypes as a quick response (QR) code, and compare QR codes to report the concordance of the underlying genetic datasets. The resulting 80 fingerprinting SNPs represent a significant decrease in complexity and in the number of markers used for genetic data labelling and tracking. Our method and web tool are easily accessible to both researchers and members of the general public who consider the accuracy of complex genetic data a prerequisite for precision medicine.
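The extract, encode, and compare workflow described in the abstract can be illustrated with a minimal Python sketch. It is not the QRC implementation: the panel IDs (`snp_a` to `snp_d`), the dosage coding (0, 1, or 2 copies of the alternate allele, with `-` for a missing call), and the function names `encode_fingerprint` and `concordance` are all assumptions made for illustration. The resulting string is only the kind of payload a QR code could carry; rendering an actual QR image is left out.

```python
from typing import Dict, List, Optional

# Hypothetical panel: placeholder IDs, NOT the SNPs selected in the paper.
PANEL: List[str] = ["snp_a", "snp_b", "snp_c", "snp_d"]


def encode_fingerprint(
    genotypes: Dict[str, Optional[int]], panel: List[str] = PANEL
) -> str:
    """Encode panel genotypes as one character per SNP.

    Each genotype is an alternate-allele dosage (0, 1 or 2).
    SNPs that are absent or set to None are written as '-'.
    """
    symbols = []
    for snp in panel:
        dosage = genotypes.get(snp)
        if dosage is None:
            symbols.append("-")
        elif dosage in (0, 1, 2):
            symbols.append(str(dosage))
        else:
            raise ValueError(f"invalid dosage {dosage!r} for {snp}")
    return "".join(symbols)


def concordance(fp_a: str, fp_b: str) -> Optional[float]:
    """Fraction of SNPs with identical calls, ignoring missing calls.

    Returns None when no SNP is called in both fingerprints.
    """
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must cover the same panel")
    compared = [(a, b) for a, b in zip(fp_a, fp_b) if a != "-" and b != "-"]
    if not compared:
        return None
    matches = sum(a == b for a, b in compared)
    return matches / len(compared)


# Two toy datasets that should come from the same person but disagree at one SNP.
dataset_1 = {"snp_a": 0, "snp_b": 1, "snp_c": 2, "snp_d": 1}
dataset_2 = {"snp_a": 0, "snp_b": 1, "snp_c": None, "snp_d": 2}

fp_1 = encode_fingerprint(dataset_1)
fp_2 = encode_fingerprint(dataset_2)
score = concordance(fp_1, fp_2)
```

Here the two fingerprints are compared only at the three SNPs called in both datasets, so a missing call lowers the number of comparisons rather than counting as a mismatch. Any threshold for flagging a possible sample mix-up would be a separate design choice that the abstract does not specify.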