CIDMA - Center for Research and Development in Mathematics and Applications, Department of Mathematics, University of Aveiro, 3810-193 Aveiro, Portugal.
J Theor Biol. 2013 Oct 21;335:153-9. doi: 10.1016/j.jtbi.2013.06.032. Epub 2013 Jul 2.
Previous studies have suggested that Chargaff's second rule may hold for relatively long words (above 10 nucleotides), but this has not been conclusively shown. In particular, the following questions remain open: Is the phenomenon of symmetry statistically significant? If so, what is the word length above which significance is lost? Can deviations in symmetry due to the finite size of the data be identified? This work addresses these questions by studying word symmetries in the human genome, chromosomes and transcriptome. To rule out finite-length effects, the results are compared with those obtained from random control sequences built to satisfy Chargaff's second parity rule. We use several techniques to evaluate the phenomenon of symmetry, including Pearson's correlation coefficient, total variational distance, a novel word symmetry distance, as well as traditional and equivalence statistical tests. We conclude that word symmetries are statistically significant in the human genome for word lengths up to 6 nucleotides. For longer words, we present evidence that the phenomenon may not be as prevalent as previously thought.
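As an illustration of two of the measures named above, the following minimal sketch counts overlapping words of length k on a single strand and compares each word's count with that of its reverse complement, using Pearson's correlation coefficient and a total variation distance between the two count profiles. The function names (`reverse_complement`, `word_counts`, `symmetry_measures`) and the handling of non-ACGT symbols are assumptions made for this example. It does not reproduce the paper's word symmetry distance, its statistical tests, or its random control sequences.

```python
from collections import Counter
from itertools import product
from math import sqrt

# Translation table for complementing a DNA word (hypothetical helper).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    """Return the reverse complement of a word over the alphabet ACGT."""
    return word.translate(COMPLEMENT)[::-1]


def word_counts(seq, k):
    """Count overlapping words of length k, skipping words with non-ACGT symbols."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set("ACGT"):
            counts[word] += 1
    return counts


def symmetry_measures(seq, k):
    """Return (Pearson r, total variation distance) between the counts of
    each word and the counts of its reverse complement."""
    counts = word_counts(seq.upper(), k)
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    x = [counts[w] for w in words]
    y = [counts[reverse_complement(w)] for w in words]
    total = sum(x)

    # Pearson's correlation coefficient between the two count vectors.
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x > 0 and var_y > 0:
        r = cov / sqrt(var_x * var_y)
    else:
        r = float("nan")  # undefined when all counts are equal

    # Total variation distance between the two relative-frequency profiles.
    if total:
        tvd = 0.5 * sum(abs(a - b) for a, b in zip(x, y)) / total
    else:
        tvd = float("nan")
    return r, tvd
```

A sequence that equals its own reverse complement, such as `"AAACGCGTTT"`, is exactly symmetric under these measures (distance 0), while a homopolymer such as `"AAAA"` is maximally asymmetric for k = 1 (distance 1). Real genomic data would fall in between, which is why the abstract compares against control sequences before drawing conclusions.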