Schwartz Russell
Department of Biological Sciences, Carnegie Mellon University, 4400 Fifth Avenue, Pittsburgh, PA 15213, USA.
Appl Bioinformatics. 2004;3(2-3):181-91. doi: 10.2165/00822942-200403020-00012.
While the shared consensus genetic sequence of our species contains a great deal of information about our common biology, there is also much to be learned from the subtle genetic variations across our species. These variations are believed to be generally of little or no direct functional significance and to predominantly reflect the chance accumulation of small genetic changes since our emergence as a species. They therefore carry little useful information when observed in a single individual. When tallied across a whole population, though, these chance mutations can teach us a great deal about our evolutionary history and the patterns of inheritance in particular individuals. Specifically, frequently observed patterns of single nucleotide polymorphisms (SNPs) in a population can identify segments of chromosome that have been passed down largely intact through long stretches of our evolution. Finding these frequently conserved chromosomal segments, or haplotypes, and developing methods to identify haplotype patterns in particular individuals will in turn help us to identify the segments that carry genetic factors influencing risk for many common human diseases. To make the best use of these data, we will need to develop new models for the encoding of information in genome variations (the "language of genetic variation") and new algorithms for fitting datasets to those models. This article surveys past work by the author and colleagues on this problem, utilising computational methods for locating frequent patterns in haploid sequence data and for "parsing" sequences so as to optimally explain them given knowledge of the general population structure. The author's recent work in this area has been compiled into a set of computational tools available at http://www-2.cs.cmu.edu/~russells/software/hapmotif.html.
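The two tasks named in the abstract, locating frequent patterns in haploid sequence data and "parsing" a sequence in terms of those patterns, can be illustrated with a toy sketch. This is not the author's method or the hapmotif software: the function names `frequent_motifs` and `parse`, the 0/1 allele encoding, the count threshold, and the "fewest pieces" objective are all assumptions chosen only to make the idea concrete.

```python
from collections import Counter


def frequent_motifs(haplotypes, min_len=2, min_count=2):
    """Count contiguous SNP patterns, keyed by (start site, alleles).

    Hypothetical helper: keeps patterns seen in at least min_count haplotypes.
    """
    counts = Counter()
    for hap in haplotypes:
        for start in range(len(hap)):
            for end in range(start + min_len, len(hap) + 1):
                counts[(start, hap[start:end])] += 1
    return {motif: c for motif, c in counts.items() if c >= min_count}


def parse(hap, motifs):
    """Split hap into the fewest pieces by dynamic programming.

    Each piece is either a frequent motif at its site or a single SNP
    (the single-SNP fallback guarantees that a parse always exists).
    """
    n = len(hap)
    best = [0] + [None] * n   # best[i]: fewest pieces covering hap[:i]
    back = [0] * (n + 1)      # back[i]: start of the last piece ending at i
    for end in range(1, n + 1):
        for start in range(end):
            piece = hap[start:end]
            if end - start == 1 or (start, piece) in motifs:
                cost = best[start] + 1
                if best[end] is None or cost < best[end]:
                    best[end] = cost
                    back[end] = start
    pieces = []
    end = n
    while end > 0:
        start = back[end]
        pieces.append((start, hap[start:end]))
        end = start
    return pieces[::-1]


# Made-up population of four haplotypes over four SNP sites.
population = ["0011", "0010", "1111", "0011"]
motifs = frequent_motifs(population)
print(parse("0010", motifs))
```

In this made-up population, a haplotype that matches a common pattern parses as a single piece, while a rarer one is explained as a common prefix followed by an isolated site. A realistic model would weigh pattern frequencies probabilistically rather than simply minimising the number of pieces.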