Predicting Tissue-Specific mRNA and Protein Abundance in Maize: A Machine Learning Approach.

Author information

Cho Kyoung Tak, Sen Taner Z, Andorf Carson M

Affiliations

Department of Computer Science, Iowa State University, Ames, IA, United States.

USDA-ARS, Crop Improvement and Genetics Research Unit, Albany, CA, United States.

Publication information

Front Artif Intell. 2022 May 26;5:830170. doi: 10.3389/frai.2022.830170. eCollection 2022.

Abstract

Machine learning and modeling approaches have been used to classify protein sequences for a broad set of tasks, including predicting protein function, structure, expression, and localization. Some recent studies have successfully predicted whether a given gene is expressed as mRNA or even potentially translated into protein, but because not all genes are expressed in every condition and tissue, predicting condition-specific expression remains a challenge. To address this gap, we developed a machine learning approach to predict tissue-specific gene expression across 23 different tissues in maize, based solely on DNA promoter and protein sequences. For class labels, we defined high and low expression levels for mRNA and protein abundance, and we optimized classifiers by systematically exploring various methods and combinations of k-mer sequences in a two-phase approach. In the first phase, we developed Markov model classifiers for each tissue and built a feature vector from their predictions. In the second phase, the feature vector was used as input to a Bayesian network for final classification. Our results show that these methods can achieve classification accuracy of up to 95% when predicting gene expression for individual tissues. By relying on sequence alone, our method works in settings where costly experimental data are unavailable, and it reveals useful insights into the functional, evolutionary, and regulatory characteristics of genes.
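
The two-phase idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the Markov order, the smoothing scheme, the expression thresholds, or the Bayesian network structure, so the add-one smoothing, the log-likelihood comparison, and every function name below are assumptions made for illustration. A Bernoulli naive Bayes classifier (the simplest Bayesian network structure) stands in for the second-phase Bayesian network, and the toy sequences are invented.

```python
import math
from collections import defaultdict

DNA = "ACGT"  # assumed alphabet for promoter sequences


def train_markov(seqs, k):
    """Count order-k transitions: context k-mer -> next symbol."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in seqs:
        for i in range(len(s) - k):
            counts[s[i:i + k]][s[i + k]] += 1
    return counts


def log_likelihood(seq, counts, k, alphabet=DNA):
    """Log-probability of seq under the model (add-one smoothing, assumed)."""
    total = 0.0
    for i in range(len(seq) - k):
        ctx = counts.get(seq[i:i + k], {})
        total += math.log((ctx.get(seq[i + k], 0) + 1) /
                          (sum(ctx.values()) + len(alphabet)))
    return total


def markov_predict(seq, high_model, low_model, k):
    """Phase 1: return 1 ('high') if the high-expression model fits better."""
    return int(log_likelihood(seq, high_model, k) >
               log_likelihood(seq, low_model, k))


def feature_vector(seq, tissue_models, k):
    """One phase-1 prediction per tissue, in a fixed tissue order."""
    return [markov_predict(seq, hi, lo, k) for hi, lo in tissue_models]


class BernoulliNB:
    """Phase-2 stand-in: naive Bayes over binary phase-1 predictions."""

    def fit(self, X, y):
        self.classes = sorted(set(y))
        n_feat = len(X[0])
        self.prior, self.p = {}, {}
        for c in self.classes:
            rows = [x for x, label in zip(X, y) if label == c]
            self.prior[c] = math.log(len(rows) / len(X))
            # Laplace-smoothed estimate of P(feature = 1 | class)
            self.p[c] = [(sum(r[j] for r in rows) + 1) / (len(rows) + 2)
                         for j in range(n_feat)]
        return self

    def predict(self, x):
        def score(c):
            return self.prior[c] + sum(
                math.log(p if v else 1 - p) for v, p in zip(x, self.p[c]))
        return max(self.classes, key=score)


# Invented toy data: two "tissues" with opposite sequence preferences.
k = 1
gc_rich = ["GCGCGCGC", "CGCGCGCG"]
at_rich = ["ATATATAT", "TATATATA"]
tissue_models = [
    (train_markov(gc_rich, k), train_markov(at_rich, k)),  # tissue 0
    (train_markov(at_rich, k), train_markov(gc_rich, k)),  # tissue 1
]

# Phase 1 output for each training sequence becomes the phase-2 input.
X = [feature_vector(s, tissue_models, k) for s in gc_rich + at_rich]
y = [1, 1, 0, 0]  # hypothetical labels: 1 = high expression in tissue 0
nb = BernoulliNB().fit(X, y)
print(nb.predict(feature_vector("GCGCGC", tissue_models, k)))
```

In this sketch each tissue contributes one binary feature, so a 23-tissue setup would yield a 23-element vector per gene. A real pipeline would also need held-out evaluation, which the toy example omits.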

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8c1d/9204276/21976bed092e/frai-05-830170-g0001.jpg
