RNA-seq assistant: machine learning based methods to identify more transcriptional regulated genes.

Affiliations

Institute for Cellular and Molecular Biology, The University of Texas at Austin, 2506 Speedway, NMS 5.324, Austin, TX, 78712, USA.

Department of Molecular Biosciences, The University of Texas at Austin, 2506 Speedway, NMS 5.324, Austin, TX, 78712, USA.

Publication information

BMC Genomics. 2018 Jul 20;19(1):546. doi: 10.1186/s12864-018-4932-2.

Abstract

BACKGROUND

Quality controls are applied at several stages of sample preparation and data analysis to ensure that RNA-seq results are reproducible and reliable, yet the detection of certain differentially expressed genes (DEGs) remains limited and biased. Whether the transcriptional dynamics of a gene are captured accurately depends on the experimental design and operation and on the subsequent data analysis. Each step of that analysis, including read alignment, transcript quantification, normalization, and the statistical method used to call DEGs, can affect the accuracy and sensitivity of DEG analysis and produce false positives or false negatives. Machine learning (ML) is a multidisciplinary field that draws on computer science, artificial intelligence, computational statistics, and information theory to construct algorithms that learn from existing data sets and make predictions on new ones. ML-based differential network analysis has been applied to predict stress-responsive genes by learning the patterns of 32 expression characteristics of known stress-related genes. In addition, because epigenetic regulation plays a critical role in gene expression, DNA and histone methylation data have proven powerful for ML-based prediction of gene expression in many systems, including lung cancer cells. ML-based methods are therefore a promising way to identify DEGs that traditional RNA-seq methods miss.

RESULTS

We identified the 23 most informative features by assessing the performance of three feature selection algorithms combined with five classification methods on training and testing data sets. After comprehensive comparison, we found that the model combining InfoGain feature selection with Logistic Regression classification is powerful for DEG prediction. The power and performance of the ML-based prediction were further validated by predicting ethylene-regulated gene expression, followed by qRT-PCR.
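The feature-ranking step described above can be illustrated with a minimal sketch of information gain, the quantity that the name InfoGain suggests. This is not the authors' implementation: the feature names (`mark_a`, `mark_b`, `mark_c`), the toy labels, and the helper functions are all hypothetical, and the features are assumed to be already discretized. The classification step (for example, Logistic Regression trained on the selected features) is left out.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def info_gain(feature_values, labels):
    """Information gain H(Y) - H(Y | X) for one discrete feature X."""
    total = len(labels)
    groups = {}
    # Group the class labels by the value the feature takes for each gene.
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average of the label entropy within each feature value.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(table, labels, top_k):
    """Return the names of the top_k features with the highest information gain."""
    scores = {name: info_gain(values, labels) for name, values in table.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical toy data (not from the paper): 1 = DEG, 0 = non-DEG,
# one row position per gene, one made-up binary feature per key.
labels = [1, 1, 1, 0, 0, 0]
table = {
    "mark_a": [1, 1, 1, 0, 0, 0],  # separates the two classes perfectly
    "mark_b": [1, 0, 1, 0, 1, 0],  # nearly uninformative
    "mark_c": [1, 1, 0, 0, 0, 0],  # partially informative
}

top = rank_features(table, labels, top_k=2)
print(top)
```

In a workflow like the one the abstract describes, the retained features would then be passed to each candidate classifier, and the feature selection and classifier pairings would be compared on held-out test data.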

CONCLUSIONS

Our study shows that combining an ML-based method with RNA-seq greatly improves the sensitivity of DEG identification.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/014f/6053725/5ad7bd392b57/12864_2018_4932_Fig1_HTML.jpg
