A k-mer grammar analysis to uncover maize regulatory architecture.

Author information

Institute for Genomic Diversity, Cornell University, 175 Biotechnology Building, Ithaca, 14853, NY, USA.

USDA-ARS, Research Geneticist, USDA ARS Robert Holley Center, Ithaca, 14853, NY, USA.

Publication information

BMC Plant Biol. 2019 Mar 15;19(1):103. doi: 10.1186/s12870-019-1693-2.

DOI:10.1186/s12870-019-1693-2

PMID:30876396

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6419808/

Abstract

BACKGROUND

Only a small percentage of the genome sequence is involved in the regulation of gene expression, but identifying this portion biochemically is expensive and laborious. In species like maize, with diverse intergenic regions and many repetitive elements, this is an especially challenging problem that limits the transfer of data from one line to another. While regulatory regions are rare, they have characteristic chromatin contexts and sequence organization (the grammar) by which they can be identified.

RESULTS

We developed a computational framework to exploit this sequence arrangement. The models learn to classify regulatory regions based on sequence features (k-mers). To do this, we borrowed two approaches from the field of natural language processing: (1) "bag-of-words", which is commonly used for differentially weighting key words in tasks like sentiment analysis, and (2) a vector-space model using word2vec (vector-k-mers), which captures semantic and linguistic relationships between words. We built "bag-of-k-mers" and "vector-k-mers" models that distinguish between regulatory and non-regulatory regions with an average accuracy above 90%. Our "bag-of-k-mers" models achieved higher overall accuracy, while the "vector-k-mers" models were more useful in highlighting key groups of sequences within the regulatory regions.
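The two representations described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the k-mer length, the weighting scheme, or the classifier, so the function names (`kmers`, `bag_of_kmers`, `skipgram_pairs`), the smoothed TF-IDF weighting, and the window size are all assumptions chosen for illustration. The sketch shows how a DNA sequence can be treated as a "sentence" of overlapping k-mers, how a weighted bag-of-k-mers can be built from it, and how the (center, context) pairs that a skip-gram word2vec model trains on could be generated.

```python
import math
from collections import Counter


def kmers(seq, k):
    """Return the overlapping k-mers of seq (window of width k, step 1)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def bag_of_kmers(seqs, k):
    """Return one {k-mer: weight} dict per sequence.

    Assumed weighting: term frequency times a smoothed inverse document
    frequency, log((1 + n) / (1 + df)). A k-mer present in every sequence
    therefore gets weight 0, because it cannot separate the classes.
    """
    counts = [Counter(kmers(s, k)) for s in seqs]
    n = len(seqs)
    # Document frequency: number of sequences that contain each k-mer.
    df = Counter(km for c in counts for km in c)
    weighted = []
    for c in counts:
        total = sum(c.values())
        weighted.append({
            km: (cnt / total) * math.log((1 + n) / (1 + df[km]))
            for km, cnt in c.items()
        })
    return weighted


def skipgram_pairs(tokens, window):
    """Return (center, context) pairs within +/- window positions.

    These are the training pairs a skip-gram word2vec model consumes;
    the embedding training itself is not shown here.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


if __name__ == "__main__":
    toy = ["ACGT", "ACGA"]  # hypothetical toy sequences
    print(bag_of_kmers(toy, 3))
    print(skipgram_pairs(kmers(toy[0], 3), 1))
```

In practice the weighted dictionaries would be fed to a classifier, and the pairs to an embedding trainer. Both steps are left out here because the abstract does not specify them.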

CONCLUSIONS

These models now provide powerful tools to annotate regulatory regions in other maize lines beyond the reference, at low cost and with high accuracy.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f659/6419808/5d5a1efb07bd/12870_2019_1693_Fig1_HTML.jpg
