Yang Albert C-C, Goldberger Ary L, Peng C-K
Cardiovascular Division and Margret and H.A. Rey Institute for Nonlinear Dynamics in Medicine, Beth Israel Deaconess Medical Center/Harvard Medical School, Boston, Massachusetts 02215, USA.
J Comput Biol. 2005 Oct;12(8):1103-16. doi: 10.1089/cmb.2005.12.1103.
Measures of genetic distance based on alignment methods are confined to sequences that are conserved and identifiable in all organisms under study. A number of alignment-free techniques based on either statistical linguistics or information theory have been developed to overcome the limitations of alignment methods. We present a novel alignment-free approach to measuring the similarity among genetic sequences that incorporates elements from both word rank order-frequency statistics and information theory. We first validate this method on human influenza A viral genomes and on the human mitochondrial DNA database. We then apply the method to study the origin of the SARS coronavirus. We find that the majority of the SARS genome is most closely related to group 1 coronaviruses, with smaller regions matching sequences from groups 2 and 3. The information-based similarity index provides a new tool for measuring the similarity between datasets based on their information content and may have a wide range of applications in the large-scale analysis of genomic databases.
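The abstract describes the approach only at a high level, so the following is a minimal sketch of how a measure combining word rank order-frequency statistics with information theory might look, not the authors' published formula. It assumes that "words" are overlapping k-mers, that each word's rank difference between the two sequences is weighted by its Shannon entropy contribution, and that ties in frequency are broken alphabetically. The names `kmer_counts`, `rank_words`, and `ibs_distance`, and the default `k=3`, are hypothetical choices made for illustration.

```python
from collections import Counter
from math import log


def kmer_counts(seq, k):
    """Count overlapping words of length k in a sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def rank_words(counts, vocab):
    """Rank words by descending frequency (rank 1 = most frequent).

    Ties are broken alphabetically; this tie rule is an assumption.
    Words absent from the sequence have count 0 and are ranked last.
    """
    ordered = sorted(vocab, key=lambda w: (-counts[w], w))
    return {w: i + 1 for i, w in enumerate(ordered)}


def entropy_term(count, total):
    """Shannon entropy contribution -p*log(p) of one word."""
    if count == 0:
        return 0.0
    p = count / total
    return -p * log(p)


def ibs_distance(seq_a, seq_b, k=3):
    """Entropy-weighted rank-order distance in [0, 1] (illustrative only)."""
    ca, cb = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    vocab = sorted(set(ca) | set(cb))
    if len(vocab) < 2:
        return 0.0
    ra, rb = rank_words(ca, vocab), rank_words(cb, vocab)
    na, nb = sum(ca.values()), sum(cb.values())
    # Weight each word by its summed entropy contribution in both sequences.
    weights = {w: entropy_term(ca[w], na) + entropy_term(cb[w], nb)
               for w in vocab}
    z = sum(weights.values())
    if z == 0:
        # Degenerate case: each sequence consists of one repeated word.
        return 0.0
    weighted = sum(abs(ra[w] - rb[w]) * weights[w] for w in vocab)
    # Normalize by the weight total and the largest possible rank gap.
    return weighted / (z * (len(vocab) - 1))


if __name__ == "__main__":
    a = "ATGCGATACGCTTGAGGCTA"
    b = "TTTTTTAAAAAACCCCCCGG"
    print(ibs_distance(a, a, k=2))
    print(ibs_distance(a, b, k=2))
```

Under these assumptions, a sequence compared with itself yields a distance of zero, and the measure is symmetric in its two arguments. Words that are rare in both sequences contribute little, so the distance is dominated by shifts in the ranks of informative words.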