Department of Informatics, Technical University of Munich, Boltzmannstr. 3, 85748 Garching, Germany.
Department of Informatics, Technical University of Munich, Boltzmannstr. 3, 85748 Garching, Germany; Quantitative Biosciences Munich, Gene Center, Department of Biochemistry, Ludwig-Maximilians Universität München, Feodor-Lynen-Str. 25, 81377 München, Germany.
Am J Hum Genet. 2018 Dec 6;103(6):907-917. doi: 10.1016/j.ajhg.2018.10.025. Epub 2018 Nov 29.
RNA sequencing (RNA-seq) is gaining popularity as a complementary assay to genome sequencing for precisely identifying the molecular causes of rare disorders. A powerful approach is to identify aberrant gene expression levels as potential pathogenic events. However, existing methods for detecting aberrant read counts in RNA-seq data either lack assessments of statistical significance, so that establishing cutoffs is arbitrary, or rely on subjective manual corrections for confounders. Here, we describe OUTRIDER (Outlier in RNA-Seq Finder), an algorithm developed to address these issues. The algorithm uses an autoencoder to model read-count expectations according to the gene covariation resulting from technical, environmental, or common genetic variations. Given these expectations, the RNA-seq read counts are assumed to follow a negative binomial distribution with a gene-specific dispersion. Outliers are then identified as read counts that significantly deviate from this distribution. The model is automatically fitted to achieve the best recall of artificially corrupted data. Precision-recall analyses using simulated outlier read counts demonstrated the importance of controlling for covariation and significance-based thresholds. OUTRIDER is open source and includes functions for filtering out genes not expressed in a dataset, for identifying outlier samples with too many aberrantly expressed genes, and for detecting aberrant gene expression on the basis of false-discovery-rate-adjusted p values. Overall, OUTRIDER provides an end-to-end solution for identifying aberrantly expressed genes and is suitable for use by rare-disease diagnostic platforms.
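The significance-testing step described above can be illustrated with a minimal sketch. It assumes the expected count (`mu`) and the gene-specific dispersion (`theta`) are already available, whereas OUTRIDER obtains these from its autoencoder fit, which is not reproduced here. The function names, the two-sided p-value convention, and the choice of Benjamini-Hochberg as the false-discovery-rate procedure are illustrative assumptions and are not taken from the OUTRIDER package.

```python
import math


def nb_log_pmf(k, mu, theta):
    """Log-probability of count k under a negative binomial distribution
    parameterized by mean mu and dispersion (size) theta."""
    return (
        math.lgamma(k + theta) - math.lgamma(theta) - math.lgamma(k + 1)
        + theta * math.log(theta / (theta + mu))
        + k * math.log(mu / (theta + mu))
    )


def nb_two_sided_pvalue(k, mu, theta):
    """Two-sided p value: twice the smaller tail probability, capped at 1.
    (Assumed convention; the abstract does not spell out the formula.)"""
    pmf_k = math.exp(nb_log_pmf(k, mu, theta))
    # P(X <= k) by direct summation; adequate for a sketch, slow for huge k.
    lower = sum(math.exp(nb_log_pmf(i, mu, theta)) for i in range(k + 1))
    # P(X >= k) = 1 - P(X <= k) + P(X = k), clamped against rounding error.
    upper = max(0.0, 1.0 - lower + pmf_k)
    return min(1.0, 2.0 * min(lower, upper))


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def find_outliers(counts, mu, theta, alpha=0.05):
    """Indices of samples whose count deviates significantly from the
    negative binomial expectation after FDR adjustment."""
    pvalues = [nb_two_sided_pvalue(k, mu, theta) for k in counts]
    adjusted = benjamini_hochberg(pvalues)
    return [i for i, p in enumerate(adjusted) if p < alpha]


# Hypothetical read counts for one gene across five samples.
example_counts = [95, 110, 102, 5, 98]
flagged = find_outliers(example_counts, mu=100.0, theta=10.0)
```

In this toy example only the sample with 5 reads is flagged, because its lower-tail probability under the assumed distribution is far below the threshold while the other counts lie close to the expected mean. In a real analysis the expectation would differ per gene and per sample, and the adjustment would run over all tested genes rather than a single one.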