Clustering of DNA words and biological function: a proof of principle.

Affiliation

Dpto. de Genética, Facultad de Ciencias, Universidad de Granada, Campus de Fuentenueva s/n, 18071-Granada, Spain.

Publication information

J Theor Biol. 2012 Mar 21;297:127-36. doi: 10.1016/j.jtbi.2011.12.024. Epub 2011 Dec 30.

DOI:10.1016/j.jtbi.2011.12.024
PMID:22226985
Abstract

Relevant words in literary texts (key words) are known to be clustered, while common words are randomly distributed. Given the clustered distribution of many functional genome elements, we hypothesize that the biological text par excellence, the DNA sequence, might behave in the same way: k-length words (k-mers) with a clear function may be spatially clustered along the one-dimensional chromosome sequence, while less important, non-functional words may be randomly distributed. To explore this linguistic analogy, we calculate a clustering coefficient for each k-mer (k = 2-9 bp) in human and mouse chromosome sequences, and then check whether clustered words are enriched in the functional part of the genome. First, we found a general positive trend relating clustering level to word enrichment within exons and transcription factor binding sites (TFBSs), while a much weaker relation exists for repeats and no relation at all exists for introns. Second, we found that 38.45% of the 200 top-clustered 8-mers, but only 7.70% of the non-clustered words, are represented in known motif databases. Third, enrichment/depletion experiments show that highly clustered words are significantly enriched in exons and TFBSs, while they are depleted in introns and repetitive DNA. Considering exons and TFBSs together, 1417 (72.26%) of the top-clustered 8-mers in human and 1385 (72.97%) in mouse showed a statistically significant association with either exons or TFBSs, strongly supporting the link between word clustering and biological function. Lastly, we identified a subset of clustered, diagnostic words that are enriched in exons but depleted in introns, and that might therefore help to discriminate between these two gene regions. The clustering of DNA words thus appears as a novel principle for detecting functionality in genome sequences. As evolutionary conservation is not a prerequisite, the proof of principle described here may open new ways to detect species-specific functional DNA sequences and to improve gene and promoter predictions, thus contributing to the quest for function in the genome.
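
The abstract does not give the formula for the clustering coefficient, so the following Python sketch is only one plausible reading of the idea, not the authors' implementation. It assumes the coefficient is the standard deviation of the gaps between consecutive occurrences of a word divided by their mean, rescaled by the value expected for randomly placed words (gaps roughly geometric, ratio sqrt(1 - p)). The function names `kmer_positions` and `clustering_coefficient` are hypothetical.

```python
import math
import statistics


def kmer_positions(seq: str, word: str) -> list[int]:
    """Return start positions of every (possibly overlapping) occurrence of word."""
    positions = []
    start = seq.find(word)
    while start != -1:
        positions.append(start)
        start = seq.find(word, start + 1)
    return positions


def clustering_coefficient(seq: str, word: str) -> float:
    """Gap spread of a word relative to random placement (assumed definition).

    Values near 1 suggest random placement, values well above 1 suggest
    clustering, and values below 1 suggest unusually regular spacing.
    """
    pos = kmer_positions(seq, word)
    if len(pos) < 3:
        raise ValueError("need at least three occurrences to estimate gap spread")
    # p is the per-position occurrence rate of the word in this sequence.
    p = len(pos) / (len(seq) - len(word) + 1)
    if p >= 1.0:
        raise ValueError("word occurs at every position; coefficient undefined")
    gaps = [b - a for a, b in zip(pos, pos[1:])]
    sigma = statistics.pstdev(gaps) / statistics.mean(gaps)
    # Geometric gaps have std/mean = sqrt(1 - p); divide to normalize.
    return sigma / math.sqrt(1.0 - p)


if __name__ == "__main__":
    # Toy sequences (illustrative only, not genomic data).
    evenly_spaced = ("A" * 8 + "CG") * 20
    clustered = "CG" * 10 + "A" * 180 + "CG" * 10
    print(clustering_coefficient(evenly_spaced, "CG"))
    print(clustering_coefficient(clustered, "CG"))
```

On the two toy sequences, the evenly spaced word scores below 1 and the word packed into two blocks scores above 1, which is the qualitative behaviour the abstract relies on when ranking k-mers by clustering level.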

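
The enrichment/depletion experiments mentioned in the abstract can be illustrated with a simple one-sided binomial test. The abstract does not name the statistical test the authors used, so this is a sketch under assumptions: a word is called enriched in a region class (for example exons) when far more of its occurrences fall inside that class than the class's share of the sequence would predict. The function names and the numbers in the usage example are hypothetical.

```python
from math import comb


def binomial_tail(k: int, n: int, f: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, f)."""
    return sum(comb(n, i) * f**i * (1 - f) ** (n - i) for i in range(k, n + 1))


def enrichment(inside: int, total: int, region_fraction: float) -> tuple[float, float]:
    """Return (observed/expected ratio, one-sided p-value for enrichment).

    inside          -- occurrences of the word that fall in the region class
    total           -- all occurrences of the word in the sequence
    region_fraction -- share of sequence positions covered by the region class
    """
    expected = total * region_fraction
    return inside / expected, binomial_tail(inside, total, region_fraction)


if __name__ == "__main__":
    # Made-up counts: 30 of 100 occurrences lie in a class covering 10% of the sequence.
    ratio, p_value = enrichment(inside=30, total=100, region_fraction=0.10)
    print(f"ratio={ratio:.2f}, p={p_value:.3g}")
```

A ratio above 1 with a small p-value points to enrichment. Depletion can be tested the same way with the lower tail, P(X <= k). When many words are tested at once, as with all 8-mers, a multiple-testing correction would also be needed.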

Similar articles

1
Clustering of DNA words and biological function: a proof of principle.
J Theor Biol. 2012 Mar 21;297:127-36. doi: 10.1016/j.jtbi.2011.12.024. Epub 2011 Dec 30.
2
DNA clustering and genome complexity.
Comput Biol Chem. 2014 Dec;53 Pt A:71-8. doi: 10.1016/j.compbiolchem.2014.08.011. Epub 2014 Aug 23.
3
Integrating genomic data to predict transcription factor binding.
Genome Inform. 2005;16(1):83-94.
4
Composition-sensitive analysis of the human genome for regulatory signals.
In Silico Biol. 2003;3(1-2):145-71. Epub 2003 Jun 27.
5
[Analysis, identification and correction of some errors of model refseqs appeared in NCBI Human Gene Database by in silico cloning and experimental verification of novel human genes].
Yi Chuan Xue Bao. 2004 May;31(5):431-43.
6
Comparison of human chromosome 21 conserved nongenic sequences (CNGs) with the mouse and dog genomes shows that their selective constraint is independent of their genic environment.
Genome Res. 2004 May;14(5):852-9. doi: 10.1101/gr.1934904. Epub 2004 Apr 12.
7
Distribution of exonic splicing enhancer elements in human genes.
Genomics. 2005 Sep;86(3):329-36. doi: 10.1016/j.ygeno.2005.05.011.
8
Structural analysis and promoter characterization of the human collagenase-3 gene (MMP13).
Genomics. 1997 Mar 1;40(2):222-33. doi: 10.1006/geno.1996.4554.
9
Distinguishing Functional DNA Words; A Method for Measuring Clustering Levels.
Sci Rep. 2017 Jan 27;7:41543. doi: 10.1038/srep41543.
10
Identification of functional transcription factor binding sites using closely related Saccharomyces species.
Genome Res. 2005 May;15(5):701-9. doi: 10.1101/gr.3578205. Epub 2005 Apr 18.

Cited by

1
Analyzing similarities in genome sequences.
Eur Phys J E Soft Matter. 2018 Jan 19;41(1):8. doi: 10.1140/epje/i2018-11609-8.
2
PISMA: A Visual Representation of Motif Distribution in DNA Sequences.
Bioinform Biol Insights. 2017 Mar 30;11:1177932217700907. doi: 10.1177/1177932217700907. eCollection 2017.
3
Evolutionary mechanism and biological functions of 8-mers containing CG dinucleotide in yeast.
Chromosome Res. 2017 Jun;25(2):173-189. doi: 10.1007/s10577-017-9554-z. Epub 2017 Feb 9.
4
Distinguishing Functional DNA Words; A Method for Measuring Clustering Levels.
Sci Rep. 2017 Jan 27;7:41543. doi: 10.1038/srep41543.
5
Model of the Dynamic Construction Process of Texts and Scaling Laws of Words Organization in Language Systems.
PLoS One. 2016 Dec 22;11(12):e0168971. doi: 10.1371/journal.pone.0168971. eCollection 2016.
6
Extracting DNA words based on the sequence features: non-uniform distribution and integrity.
Theor Biol Med Model. 2016 Jan 25;13:2. doi: 10.1186/s12976-016-0028-3.
7
An improved alignment-free model for DNA sequence similarity metric.
BMC Bioinformatics. 2014 Sep 28;15(1):321. doi: 10.1186/1471-2105-15-321.