

A k-mer grammar analysis to uncover maize regulatory architecture.

Author information

Institute for Genomic Diversity, Cornell University, 175 Biotechnology Building, Ithaca, 14853, NY, USA.

USDA-ARS, Research Geneticist, USDA ARS Robert Holley Center, Ithaca, 14853, NY, USA.

Publication information

BMC Plant Biol. 2019 Mar 15;19(1):103. doi: 10.1186/s12870-019-1693-2.

Abstract

BACKGROUND

Only a small percentage of the genome sequence is involved in regulating gene expression, but identifying this portion biochemically is expensive and laborious. In species like maize, with diverse intergenic regions and many repetitive elements, the problem is especially challenging and limits how far data from one line can be applied to another. Although regulatory regions are rare, they have characteristic chromatin contexts and a characteristic sequence organization (the grammar) by which they can be identified.

RESULTS

We developed a computational framework to exploit this sequence arrangement. The models learn to classify regulatory regions based on sequence features - k-mers. To do this, we borrowed two approaches from the field of natural language processing: (1) "bag-of-words" which is commonly used for differentially weighting key words in tasks like sentiment analyses, and (2) a vector-space model using word2vec (vector-k-mers), that captures semantic and linguistic relationships between words. We built "bag-of-k-mers" and "vector-k-mers" models that distinguish between regulatory and non-regulatory regions with an average accuracy above 90%. Our "bag-of-k-mers" achieved higher overall accuracy, while the "vector-k-mers" models were more useful in highlighting key groups of sequences within the regulatory regions.
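To make the "bag-of-k-mers" idea concrete, the following is a minimal sketch in Python rather than the authors' implementation. It treats overlapping k-mers as words, counts them per sequence, and applies a TF-IDF style weight as one common way to weight words differentially. The function names, the choice of TF-IDF, and the toy sequences are assumptions made for illustration; the abstract does not state the value of k, the weighting scheme, or the classifier used.

```python
from collections import Counter
import math


def kmers(seq, k):
    """Split a DNA sequence into overlapping k-mers (the 'words')."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def bag_of_kmers(seq, k):
    """Count k-mer occurrences, discarding their order and position."""
    return Counter(kmers(seq, k))


def tfidf(bags):
    """Weight each count by the k-mer's rarity across sequences.

    Assumption: plain TF-IDF (count * log(N / document frequency)),
    chosen only to illustrate differential weighting.
    """
    n = len(bags)
    # Iterating over a Counter yields each distinct k-mer once,
    # so df[kmer] is the number of sequences containing that k-mer.
    df = Counter(kmer for bag in bags for kmer in bag)
    return [
        {kmer: count * math.log(n / df[kmer]) for kmer, count in bag.items()}
        for bag in bags
    ]


# Hypothetical toy sequences, not data from the study.
bags = [bag_of_kmers(s, 2) for s in ("ACGT", "ACGA")]
weights = tfidf(bags)
```

In this toy case the k-mers shared by both sequences (AC, CG) receive zero weight, while the k-mers unique to one sequence (GT, GA) receive a positive weight. The resulting weight dictionaries could serve as feature vectors for any standard classifier, and the ordered token lists returned by `kmers` are the kind of "sentences" a word2vec-style embedding model would take as input.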

CONCLUSIONS

These models now provide powerful tools to annotate regulatory regions in other maize lines beyond the reference, at low cost and with high accuracy.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f659/6419808/5d5a1efb07bd/12870_2019_1693_Fig1_HTML.jpg
