Quantile regression for challenging cases of eQTL mapping.

Affiliation

Quantitative and Computational Biology, Department of Biological Sciences, University of Southern California, USA.

Publication information

Brief Bioinform. 2020 Sep 25;21(5):1756-1765. doi: 10.1093/bib/bbz097.

DOI:10.1093/bib/bbz097

PMID:31688892

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC7673343/

Abstract

Mapping expression quantitative trait loci (eQTLs) facilitates interpretation of the regulatory path from genetic variants to their associated diseases or traits. High-throughput sequencing of RNA (RNA-seq) has expedited the exploration of these regulatory variants. However, eQTL mapping is usually confronted with analytical challenges caused by overdispersion and excessive dropouts in RNA-seq data. The heavy-tailed distribution of gene expression violates the assumption of Gaussian-distributed errors in the linear regression used for eQTL detection, which results in increased Type I or Type II errors. Applying a rank-based inverse normal transformation (INT) can make the expression values more normally distributed. However, INT causes information loss and leads to uninterpretable effect size estimates. After a comprehensive examination of the impact of overdispersion and excessive dropouts, we propose applying a robust model, quantile regression, to map eQTLs for genes with a high degree of overdispersion or a large number of dropouts. Simulation studies show that quantile regression has the desired robustness to outliers and dropouts, and that it significantly improves eQTL mapping. In a real data analysis, the most significant eQTL discoveries differ between quantile regression and the conventional linear model. This discrepancy becomes more prominent when the dropout effect or the overdispersion effect is large. Together, the results suggest that quantile regression provides more reliable and accurate eQTL mapping than conventional linear models, and that it deserves more attention in large-scale eQTL mapping.
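The robustness argument can be illustrated with a minimal sketch, which is not the authors' implementation. It fits a median (tau = 0.5) regression line of expression on genotype by minimizing the pinball (check) loss and compares it with ordinary least squares. The function names, the synthetic toy data, and the brute-force search over lines through pairs of observations are all assumptions made for illustration; a real analysis would use an established quantile regression solver and include covariates.

```python
from itertools import combinations


def pinball_loss(residual, tau):
    """Check loss: tau * r for r >= 0, (1 - tau) * |r| for r < 0."""
    return residual * (tau - (residual < 0))


def fit_quantile_line(x, y, tau=0.5):
    """Simple quantile regression of y on a single predictor x.

    An optimal line for this problem passes through at least two
    observations, so we try every such line and keep the one with the
    smallest total pinball loss. This is O(n^3) and only meant for toy data.
    """
    best = None
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        if x1 == x2:
            continue  # a vertical line cannot be a regression fit
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        loss = sum(
            pinball_loss(yi - (intercept + slope * xi), tau)
            for xi, yi in zip(x, y)
        )
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    return best[1], best[2]


def fit_ols_line(x, y):
    """Ordinary least squares fit of y on a single predictor x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Synthetic toy data (an assumption, not from the paper): genotype coded as
# 0/1/2 copies of the alternative allele, true effect size 1.0 per copy.
noise = [-0.4, -0.3, -0.2, -0.1, 0.0, 0.1, 0.2, 0.3, 0.4]
genotype, expression = [], []
for g in (0, 1, 2):
    for e in noise:
        genotype.append(g)
        expression.append(5.0 + 1.0 * g + e)

# Two extreme expression values in the genotype-0 group mimic the heavy
# tail that overdispersion can produce.
genotype += [0, 0]
expression += [50.0, 50.0]

qr_intercept, qr_slope = fit_quantile_line(genotype, expression, tau=0.5)
ols_intercept, ols_slope = fit_ols_line(genotype, expression)

print(f"median regression slope: {qr_slope:.3f}")
print(f"least squares slope:     {ols_slope:.3f}")
```

On this toy data, the two outliers pull the least-squares slope well away from the simulated effect, while the median regression slope stays close to it. Unlike an estimate obtained after INT, the slope remains in the original expression units.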
