Canzar Stefan, Salzberg Steven L
Proc IEEE Inst Electr Electron Eng. 2017 Mar;105(3):436-458. doi: 10.1109/JPROC.2015.2455551. Epub 2015 Sep 7.
Ultra-high-throughput next-generation sequencing (NGS) technology allows us to determine the sequence of nucleotides of many millions of DNA molecules in parallel. Accompanied by a dramatic reduction in cost since its introduction in 2004, NGS technology has provided a new way of addressing a wide range of biological and biomedical questions, from the study of human genetic disease to the analysis of gene expression, protein-DNA interactions, and patterns of DNA methylation. The data generated by NGS instruments comprise huge numbers of very short DNA sequences, or 'reads', that carry little information by themselves. These reads therefore have to be pieced together by well-engineered algorithms to reconstruct biologically meaningful measurements, such as the level of expression of a gene. To solve this complex, high-dimensional puzzle, reads must be mapped back to a reference genome to determine their origin. Due to sequencing errors and to genuine differences between the reference genome and the individual being sequenced, this mapping process must be tolerant of mismatches, insertions, and deletions. Although optimal alignment algorithms to solve this problem have long been available, the practical requirements of aligning hundreds of millions of short reads to the 3 billion base pair long human genome have stimulated the development of new, more efficient methods, which today are used routinely throughout the world for the analysis of NGS data.
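To make the mapping problem concrete, the following is a minimal Python sketch of the kind of optimal, error-tolerant alignment the abstract alludes to. It is an illustration under stated assumptions, not the method of any particular read mapper: the function name `best_placement`, the unit cost for each mismatch, insertion, or deletion, and the toy sequences are all hypothetical choices made for this example. The dynamic program lets the read start anywhere in the reference at no cost and reports the smallest number of edits needed to place it.

```python
def best_placement(read: str, reference: str) -> tuple[int, int]:
    """Return (min_edits, end) for the best approximate occurrence of
    `read` in `reference`; `end` is the exclusive end index in `reference`.

    Assumption for illustration: mismatches, insertions, and deletions
    each cost 1 (plain edit distance).
    """
    n = len(reference)
    # Row 0: an empty read prefix matches at any reference position for free,
    # which is what allows the read to start anywhere in the reference.
    prev = [0] * (n + 1)
    for i in range(1, len(read) + 1):
        # Column 0: i read bases aligned against no reference bases.
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            mismatch = 0 if read[i - 1] == reference[j - 1] else 1
            curr[j] = min(
                prev[j - 1] + mismatch,  # match or mismatch
                prev[j] + 1,             # extra base in the read
                curr[j - 1] + 1,         # extra base in the reference
            )
        prev = curr
    # The best placement may end at any reference position.
    end = min(range(n + 1), key=lambda j: prev[j])
    return prev[end], end


if __name__ == "__main__":
    toy_reference = "ACGTACGTTAGC"  # hypothetical toy sequence
    print(best_placement("ACGTT", toy_reference))  # exact occurrence
    print(best_placement("ACGAT", toy_reference))  # needs one edit
```

The running time is proportional to the read length times the reference length. That is acceptable for a toy example, but it suggests why applying such a table directly to hundreds of millions of reads against a 3 billion base pair genome is impractical, and why more efficient methods were developed.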