Robustification of Linear Regression and Its Application in Genome-Wide Association Studies.

Authors

Alamin Md, Sultana Most Humaira, Xu Haiming, Mollah Md Nurul Haque

Affiliations

Institute of Crop Science and Institute of Bioinformatics, College of Agriculture and Biotechnology, Zhejiang University, Hangzhou, China.

Bioinformatics Lab, Department of Statistics, University of Rajshahi, Rajshahi, Bangladesh.

Publication information

Front Genet. 2020 Jun 8;11:549. doi: 10.3389/fgene.2020.00549. eCollection 2020.

Abstract

Regression analysis is one of the most popular statistical techniques for exploring the relationship between a response (dependent) variable and one or more explanatory (independent) variables. To test the overall significance of a regression, the F-statistic is used when the parameters are estimated by least-squares estimators (LSEs), whereas the likelihood ratio test (LRT) statistic is used when they are estimated by maximum likelihood estimators (MLEs). However, in the presence of outlying observations, both procedures produce misleading results and often fail to provide a good fit to the reasonable region of the dataset. Moreover, outliers occur frequently in real datasets, including molecular OMICS datasets. Hence, this study robustifies MLE-based regression analysis by maximizing the β-likelihood function. The tuning parameter β is selected by cross-validation. For β = 0, the proposed method reduces to classical MLE-based regression analysis. We examine the performance of the proposed method on both synthetic and real data. The simulation results indicate that the proposed method outperforms traditional methods in terms of parameter estimates and mean square errors in the presence of both outliers and high-leverage points. The relative efficiency analysis shows that, for a normal error distribution with high dimension and outliers, the proposed estimator is less affected than popular estimators, including S, MM, and fast-S. The real data analysis also demonstrates that the proposed method is robust to data contamination and overcomes the drawbacks of the traditional methods. Genome-wide association studies (GWAS) using the proposed method identify the vital gene influencing hypertension and iron level in the liver and spleen of mice. Furthermore, GWAS based on the proposed method identified 15 and 21 significant SNPs for chalkiness degree and chalkiness percentage, respectively. Variants of these SNPs might provide new resources for grain quality traits and could be used in further molecular and physiological analyses to improve rice grain quality. These results offer an important basis for further understanding of robust regression analysis, which might be applied in various fields, including business, genetics, and bioinformatics.
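To make the idea of β-weighted estimation concrete, the following is a minimal sketch and not the authors' implementation. It assumes normal errors and uses the standard fixed-point form of a minimum β-divergence fit, in which each observation receives the weight exp(−β r² / (2σ²)) and the coefficients are updated by weighted least squares. The function name `fit_beta_regression`, the fixed value β = 0.3 (the abstract selects β by cross-validation instead), the OLS/MAD starting values, and the simulated data are all assumptions made for illustration.

```python
import numpy as np


def fit_beta_regression(X, y, beta=0.3, max_iter=200, tol=1e-8):
    """Hypothetical helper: normal-error linear regression fitted by
    iteratively reweighted least squares with beta-type weights."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Assumed starting values: OLS coefficients and a MAD-based scale.
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ coef
    mad = np.median(np.abs(resid - np.median(resid)))
    sigma2 = (1.4826 * mad) ** 2
    w = np.ones_like(y)
    for _ in range(max_iter):
        # Observations with large residuals are down-weighted exponentially;
        # with beta = 0 every weight equals 1.
        w = np.exp(-beta * resid**2 / (2.0 * sigma2))
        Xw = X * w[:, None]
        new_coef = np.linalg.solve(Xw.T @ X, Xw.T @ y)  # weighted least squares
        resid = y - X @ new_coef
        # The (1 + beta) factor keeps the variance consistent under normality.
        new_sigma2 = (1.0 + beta) * np.sum(w * resid**2) / np.sum(w)
        converged = (np.max(np.abs(new_coef - coef)) < tol
                     and abs(new_sigma2 - sigma2) < tol)
        coef, sigma2 = new_coef, new_sigma2
        if converged:
            break
    return coef, sigma2, w


# Simulated example (assumed data): true intercept 1, true slope 2,
# with 10% of the responses shifted upward to act as outliers.
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + 0.5 * rng.normal(size=n)
y[:20] += 15.0

ols = np.linalg.lstsq(X, y, rcond=None)[0]
robust, sigma2, w = fit_beta_regression(X, y, beta=0.3)
print("OLS coefficients:     ", ols)
print("beta = 0.3 estimates: ", robust)
```

With β = 0 all weights equal 1, so the update is ordinary least squares, which mirrors the abstract's statement that the method reduces to the classical MLE in that case. In practice β would be chosen over a grid by cross-validation rather than fixed in advance.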

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8240/7295010/7b7b959cb3ac/fgene-11-00549-g0001.jpg
