Biomarker Identification from RNA-Seq Data using a Robust Statistical Approach.

Author information

Akond Zobaer, Alam Munirul, Mollah Md Nurul Haque

Affiliations

Agricultural Statistics and Information & Communication Technology (ASICT) Division, Bangladesh Agricultural Research Institute (BARI), Joydebpur, Gazipur-1701, Bangladesh.

Institute of Environmental Science, University of Rajshahi-6205, Bangladesh.

Publication information

Bioinformation. 2018 Apr 30;14(4):153-163. doi: 10.6026/97320630014153. eCollection 2018.

Abstract

Identifying biomarkers through differentially expressed genes (DEGs) is an important task in characterizing transcriptomic data from RNA sequencing, made possible by advances in next-generation sequencing (NGS) technology. Several statistical methods identify DEGs from high-dimensional RNA-seq count data across groups or conditions, including edgeR, SAMSeq, and voom-limma. However, these methods yield many false positives and low accuracy in the presence of outliers. We describe a robust t-statistic method that overcomes these drawbacks and evaluate it on both simulated and real RNA-seq datasets. At 20% outliers, the reported performance is 61.2% sensitivity, 35.2% specificity, 21.6% MER, 6.9% FDR, 74.5% AUC, 78.4% ACC, 93.1% PPV, and 35.2% NPV. In a real dataset comparing HIV viremic and aviremic states, the robust t-test identified 409 DE genes with p-values < 0.05. Of these, 28 are up-regulated and 381 are down-regulated according to the log2 fold change (FC) approach at a threshold of 1.5. The up-regulated genes form three clusters, and 11 genes are highly associated with HIV-1/AIDS. Protein-protein interaction (PPI) analysis of the up-regulated genes using the STRING database found 21 genes with strong associations among themselves. These results demonstrate the identification of potential biomarkers from RNA-seq data using a robust t-statistic model.
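
For readers unfamiliar with the performance measures listed in the abstract, the sketch below shows how most of them are conventionally computed from a confusion matrix of called versus true DEGs. It assumes the standard textbook definitions, which the abstract does not spell out; the class `Confusion`, the function `deg_metrics`, and the example counts are hypothetical and are not taken from the paper. AUC is omitted because it depends on ranking scores rather than on a single confusion matrix.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Counts of called vs. true DEGs (hypothetical container, not from the paper)."""
    tp: int  # true DEGs called DE
    fp: int  # non-DEGs called DE
    tn: int  # non-DEGs called non-DE
    fn: int  # true DEGs called non-DE


def deg_metrics(c: Confusion) -> dict:
    """Standard confusion-matrix measures; assumes every denominator is nonzero."""
    total = c.tp + c.fp + c.tn + c.fn
    return {
        "sensitivity": c.tp / (c.tp + c.fn),
        "specificity": c.tn / (c.tn + c.fp),
        "MER": (c.fp + c.fn) / total,      # misclassification error rate
        "FDR": c.fp / (c.tp + c.fp),       # false discovery rate
        "ACC": (c.tp + c.tn) / total,      # accuracy
        "PPV": c.tp / (c.tp + c.fp),       # positive predictive value
        "NPV": c.tn / (c.tn + c.fn),       # negative predictive value
    }


# Made-up counts for illustration only; they do not reproduce the paper's results.
example = Confusion(tp=80, fp=10, tn=890, fn=20)
metrics = deg_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Under these definitions FDR and PPV sum to 1, as do MER and ACC, which agrees with the reported pairs 6.9% / 93.1% and 21.6% / 78.4%.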

