'Genome order index' should not be used for defining compositional constraints in nucleotide sequences--a case study of the Z-curve.

Affiliation

McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA.

Publication information

Biol Direct. 2010 Feb 17;5:10. doi: 10.1186/1745-6150-5-10.

DOI:10.1186/1745-6150-5-10

PMID:20158921

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2841071/

Abstract

BACKGROUND

The Z-curve is a three-dimensional representation of DNA sequences that was proposed over a decade ago and has been extensively applied to sequence segmentation, horizontal gene transfer detection, and sequence analysis. Based on the Z-curve, a "genome order index" was proposed, defined as $S = a^2 + c^2 + t^2 + g^2$, where a, c, t, and g are the frequencies of the nucleotides A, C, T, and G, respectively. This index was found to be smaller than 1/3 for almost all tested genomes, which was taken as support for the existence of a constraint on genome composition. A geometric explanation for this constraint has been suggested: each genome is represented by a point P whose distances from the four faces of a regular tetrahedron are given by the frequencies a, c, t, and g. The proponents of the index claimed that an inscribed sphere of radius $r = 1/\sqrt{3}$ contains almost all points corresponding to various genomes, implying that $S < r^2$. The distribution of the points P obtained by S was studied using the Z-curve.
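The definition of S above can be illustrated with a minimal Python sketch. The function names and the decision to ignore characters other than A, C, G, and T are assumptions made for illustration; they are not taken from the paper.

```python
from collections import Counter


def nucleotide_frequencies(seq: str) -> dict[str, float]:
    """Return the relative frequencies of A, C, G and T in seq.

    Characters other than A, C, G, T (for example N) are ignored;
    that choice is an assumption of this sketch.
    """
    counts = Counter(base for base in seq.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {base: counts[base] / total for base in "ACGT"}


def genome_order_index(seq: str) -> float:
    """Compute S = a^2 + c^2 + t^2 + g^2 from the base frequencies."""
    freqs = nucleotide_frequencies(seq)
    return sum(p * p for p in freqs.values())
```

Because the four frequencies sum to 1, S can only take values between 1/4 (all four bases equally frequent) and 1 (a single base), so the 1/3 threshold sits close to the lower end of the possible range.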

RESULTS

In this work, we studied the basic properties of the Z-curve using the "genome order index" as a case study. We show that (1) the calculation of the radius of the inscribed sphere of a regular tetrahedron is incorrect, (2) the S index is narrowly distributed, (3) based on the second parity rule, the S index can be derived directly from the Shannon entropy and is, therefore, redundant, and (4) the Z-curve suffers from overdimensionality, and the dimension that stands for GC content alone suffices to represent any given genome.
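Point (1) can be examined with elementary geometry. The following is a sketch under the setup described in the Background, where the distances from P to the four faces equal a, c, t, and g, with a + c + t + g = 1. The symbols V, A_f, h, and r are introduced here for illustration only, and the paper's own argument may be framed differently.

```latex
% V: volume, A_f: area of one face, h: height, r: inradius (illustrative symbols)
% Splitting the tetrahedron into four pyramids with apex P and the faces as bases:
V = \tfrac{1}{3} A_f \, h = \tfrac{1}{3} A_f \, (a + c + t + g)
  \;\Rightarrow\; h = a + c + t + g = 1
% The centre of the inscribed sphere is at distance r from every face:
V = 4 \cdot \tfrac{1}{3} A_f \, r
  \;\Rightarrow\; r = \tfrac{h}{4} = \tfrac{1}{4}
```

Under these assumptions the inscribed sphere has radius 1/4, which differs from the value $1/\sqrt{3}$ quoted in the Background.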

CONCLUSION

The "genome order index" S does not represent a constraint on nucleotide composition. Moreover, S can be easily computed from the Gini-Simpson index and derived directly from entropy, and is therefore redundant. Overall, the Z-curve and S are overcomplicated measures of GC content and the Shannon H index, respectively.
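The redundancy argument can be made concrete with a short Python sketch. It assumes the second parity rule holds exactly (a = t and c = g), so that all four frequencies follow from GC content alone. The function names are hypothetical, and the Gini-Simpson index is taken in its usual form, 1 minus the sum of squared frequencies.

```python
import math


def frequencies_from_gc(gc: float) -> tuple[float, float, float, float]:
    """Return (a, c, g, t) for a given GC content, assuming a = t and c = g."""
    if not 0.0 <= gc <= 1.0:
        raise ValueError("gc must lie in [0, 1]")
    at_half = (1.0 - gc) / 2.0
    gc_half = gc / 2.0
    return at_half, gc_half, gc_half, at_half


def order_index(freqs) -> float:
    """S: the sum of squared base frequencies."""
    return sum(p * p for p in freqs)


def gini_simpson(freqs) -> float:
    """Gini-Simpson index, 1 - sum(p_i^2), which equals 1 - S."""
    return 1.0 - order_index(freqs)


def shannon_entropy(freqs) -> float:
    """Shannon entropy in bits; terms with p = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in freqs if p > 0)


if __name__ == "__main__":
    for gc in (0.25, 0.40, 0.50, 0.65):
        freqs = frequencies_from_gc(gc)
        print(gc, order_index(freqs), gini_simpson(freqs), shannon_entropy(freqs))
```

Under this assumption S reduces to gc² - gc + 1/2 and the entropy to 1 plus the binary entropy of gc, so both are functions of GC content only. In this simplified setting, S < 1/3 is equivalent to a GC content between roughly 0.211 and 0.789.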

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c5aa/2841071/a9d8c3ceb7e7/1745-6150-5-10-1.jpg

Similar articles

'Genome order index' should not be used for defining compositional constraints in nucleotide sequences--a case study of the Z-curve.

Biol Direct. 2010 Feb 17;5:10. doi: 10.1186/1745-6150-5-10.

A rebuttal to the comments on the genome order index and the Z-curve.

Biol Direct. 2011 Feb 16;6:10. doi: 10.1186/1745-6150-6-10.

'Genome order index' should not be used for defining compositional constraints in nucleotide sequences.

Comput Biol Chem. 2008 Apr;32(2):147. doi: 10.1016/j.compbiolchem.2007.11.003. Epub 2007 Dec 15.

GC content and genome length in Chargaff compliant genomes.

Biochem Biophys Res Commun. 2007 Feb 2;353(1):207-10. doi: 10.1016/j.bbrc.2006.12.008. Epub 2006 Dec 11.

A rebuttal to the comments on the genome order index.

Comput Biol Chem. 2009 Aug;33(4):350. doi: 10.1016/j.compbiolchem.2008.11.001. Epub 2008 Nov 14.

A nucleotide composition constraint of genome sequences.

Comput Biol Chem. 2004 Apr;28(2):149-53. doi: 10.1016/j.compbiolchem.2004.02.002.

The Z curve database: a graphic representation of genome sequences.

Bioinformatics. 2003 Mar 22;19(5):593-9. doi: 10.1093/bioinformatics/btg041.

Quantitative analysis and assessment of base composition asymmetry and gene orientation bias in bacterial genomes.

FEBS Lett. 2019 May;593(9):918-925. doi: 10.1002/1873-3468.13374. Epub 2019 Apr 11.

Constraint on di-nucleotides by codon usage bias in bacterial genomes.

Gene. 2014 Feb 15;536(1):18-28. doi: 10.1016/j.gene.2013.11.098. Epub 2013 Dec 11.

Quantitative analysis of correlation between AT and GC biases among bacterial genomes.

PLoS One. 2017 Feb 3;12(2):e0171408. doi: 10.1371/journal.pone.0171408. eCollection 2017.

Cited by

Principal Component Analyses (PCA)-based findings in population genetic studies are highly biased and must be reevaluated.

Sci Rep. 2022 Aug 29;12(1):14683. doi: 10.1038/s41598-022-14395-4.

A rebuttal to the comments on the genome order index and the Z-curve.

Biol Direct. 2011 Feb 16;6:10. doi: 10.1186/1745-6150-6-10.

References

Relations between Shannon entropy and genome order index in segmenting DNA sequences.

Phys Rev E Stat Nonlin Soft Matter Phys. 2009 Apr;79(4 Pt 1):041918. doi: 10.1103/PhysRevE.79.041918. Epub 2009 Apr 21.

A rebuttal to the comments on the genome order index.

Comput Biol Chem. 2009 Aug;33(4):350. doi: 10.1016/j.compbiolchem.2008.11.001. Epub 2008 Nov 14.

'Genome order index' should not be used for defining compositional constraints in nucleotide sequences.

Comput Biol Chem. 2008 Apr;32(2):147. doi: 10.1016/j.compbiolchem.2007.11.003. Epub 2007 Dec 15.

A test of Chargaff's second rule.

Biochem Biophys Res Commun. 2006 Feb 3;340(1):90-4. doi: 10.1016/j.bbrc.2005.11.160. Epub 2005 Dec 7.

A nucleotide composition constraint of genome sequences.

Comput Biol Chem. 2004 Apr;28(2):149-53. doi: 10.1016/j.compbiolchem.2004.02.002.

Isochore structures in the mouse genome.

Genomics. 2004 Mar;83(3):384-94. doi: 10.1016/j.ygeno.2003.09.011.

An isochore map of the human genome based on the Z curve method.

Gene. 2003 Oct 23;317(1-2):127-35. doi: 10.1016/s0378-1119(03)00665-6.

Identification of genomic islands in the genome of Bacillus cereus by comparative analysis with Bacillus anthracis.

Physiol Genomics. 2003 Dec 16;16(1):19-23. doi: 10.1152/physiolgenomics.00170.2003.

Identification of isochore boundaries in the human genome using the technique of wavelet multiresolution analysis.

Biochem Biophys Res Commun. 2003 Nov 7;311(1):215-22. doi: 10.1016/j.bbrc.2003.09.198.

A novel method to calculate the G+C content of genomic DNA sequences.

J Biomol Struct Dyn. 2001 Oct;19(2):333-41. doi: 10.1080/07391102.2001.10506743.
