Lam Ching-Wan
Department of Chemical Pathology, The Chinese University of Hong Kong, Prince of Wales Hospital, Hong Kong, China.
Clin Chim Acta. 2008 Mar;389(1-2):7-13. doi: 10.1016/j.cca.2007.11.011. Epub 2007 Nov 23.
Indels (insertions/deletions) are an important class of DNA sequence variation because of their high frequency in the human genome, their deleterious effects on the reading frame and protein expression, and their association with common diseases and disease susceptibility. In a recent study of one individual whose whole genome was sequenced, 292,102 heterozygous indels and 559,473 homozygous indels were identified. Decrypting such a large number of heterozygous indels is computationally intensive and requires efficient algorithms. However, current algorithms for decrypting heterozygous indels cannot be applied to genomes sequenced without precedent, because they cannot run without a reference sequence or reference sequence trace for the sequenced genome.
A new algorithm for de novo decrypting of heterozygous indels was conceived around isolating the indel sequence directly from the genotype (diploid) sequence. A universal algorithm is described here for heterozygous indel detection, indel size determination, and de novo decrypting of the indel sequence, without subtracting the diploid DNA sequence from a reference sequence or reference sequence trace.
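The idea of working from the diploid sequence alone can be illustrated with a minimal sketch. It is not the algorithm of this paper, whose details the abstract does not give. It assumes the diploid read is written in IUPAC ambiguity codes, that the only heterozygosity in the read comes from a single indel (no SNPs, no base-calling noise), and that downstream of the indel the two alleles are the same sequence offset by the indel size d, so the calls at positions i and i + d must share a base. The function names `superpose`, `find_indel_size`, and `decrypt_indel` are hypothetical.

```python
# Illustrative sketch only: reference-free handling of one heterozygous
# indel in a diploid read written in IUPAC codes. Assumes no SNPs and no noise.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "Y": "CT",
         "S": "CG", "W": "AT", "K": "GT", "M": "AC"}
CODE = {frozenset(bases): code for code, bases in IUPAC.items()}


def superpose(allele_1, allele_2):
    """Simulate a diploid read by overlaying two alleles base by base."""
    return "".join(CODE[frozenset((a, b))] for a, b in zip(allele_1, allele_2))


def find_indel_size(diploid, max_size):
    """Return (onset, size) of a heterozygous indel, or None if none fits.

    onset is the first ambiguous position; size is the smallest shift d for
    which every call from onset onward shares a base with the call d later.
    """
    calls = [frozenset(IUPAC[c]) for c in diploid]
    onset = next((i for i, s in enumerate(calls) if len(s) == 2), None)
    if onset is None:
        return None
    n = len(calls)
    for d in range(1, max_size + 1):
        if onset + d >= n:
            break
        if all(calls[i] & calls[i + d] for i in range(onset, n - d)):
            return onset, d
    return None


def decrypt_indel(diploid, onset, d):
    """Recover the d bases present in the longer allele only.

    Unresolvable positions are reported as 'N'.
    """
    calls = [frozenset(IUPAC[c]) for c in diploid]
    n = len(calls)

    def long_base(i):
        # calls[i] holds the longer allele's base at i and its base at i + d.
        if len(calls[i]) == 1:
            return next(iter(calls[i]))
        if i + d >= n:
            return None
        shared = calls[i] & calls[i + d]
        # A single shared base must be the longer allele's base at i + d;
        # otherwise follow the chain one more shift downstream.
        nxt = next(iter(shared)) if len(shared) == 1 else long_base(i + d)
        if nxt is None:
            return None
        return next(iter(calls[i] - {nxt}))

    return "".join(long_base(i) or "N" for i in range(onset, onset + d))


# Usage: simulate a 3-base heterozygous deletion, then analyse the diploid
# read without consulting the original sequence.
sequence = "GATTACAGGCTCATGTCAAGCTTGA"
diploid = superpose(sequence, sequence[:5] + sequence[8:])
print(diploid)                            # GATTASMKSMTSWYRWSMWKSW
print(find_indel_size(diploid, 10))       # (5, 3)
print(decrypt_indel(diploid, 5, 3))       # CAG
```

In this sketch a wrong shift survives only if every compared pair of calls happens to share a base by chance, which becomes less likely as the read gets longer. The recursion in `decrypt_indel` is adequate for short reads; a long read would call for an iterative version.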
The results obtained by this algorithm are identical to those obtained by PolyPhred and PolyScan. Unlike those programs, the new algorithm is not computationally intensive for large indels, is independent of sequencing technology, and applies to genotype data derived from all existing sequencing platforms. A read of only 29 bases is enough to reduce the false detection rate (FDR) to 1 in a million.
This algorithm is unique among existing algorithms in performing indel detection, size determination, and decrypting simultaneously. Because this universal approach eliminates the need for a reference sequence or reference sequence trace, it is also unique in being able to decrypt genomes sequenced without precedent. Given the high frequency of heterozygous indels in the human genome, this universal algorithm will greatly reduce the time required for post-sequencing data analysis in whole-genome sequencing of an individual for the practice of personalized medicine.