Department of Molecular and Systems Biology.
Department of Environmental Health Sciences, Arnold School of Public Health, University of South Carolina, Columbia, SC, 29208, USA.
Bioinformatics. 2018 Jun 1;34(11):1868-1874. doi: 10.1093/bioinformatics/bty026.
Molecular subtypes of cancers and autoimmune disease, defined by transcriptomic profiling, have provided insight into disease pathogenesis, molecular heterogeneity and therapeutic responses. However, technical biases inherent to different gene expression profiling platforms present a unique problem when analyzing data generated from different studies. Currently, there is a lack of effective methods designed to eliminate platform-based bias. We present a method to normalize and classify RNA-seq data using machine learning classifiers trained on DNA microarray data and molecular subtypes in two datasets: breast invasive carcinoma (BRCA) and colorectal cancer (CRC).
Multiple analyses show that feature-specific quantile normalization (FSQN) successfully removes platform-based bias from RNA-seq data, regardless of feature scaling or machine learning algorithm. We achieve up to 98% accuracy for BRCA data and 97% accuracy for CRC data in assigning molecular subtypes to RNA-seq data normalized using FSQN, with a support vector machine trained exclusively on DNA microarray data. Maximum accuracy is achieved when the RNA-seq dataset being normalized contains at least 25 samples. FSQN allows comparison of RNA-seq data to existing DNA microarray datasets. Using these techniques, we can successfully leverage information from existing gene expression data in new analyses, even when different platforms were used for gene expression profiling.
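To make the idea of normalizing each feature separately concrete, the following is a minimal sketch in Python, not the code of the FSQN R package. It assumes that, for every gene, the RNA-seq values are ranked across samples and replaced by the matching quantiles of that gene's microarray distribution. The function names (`target_quantiles`, `normalize_feature`, `fsqn_sketch`), the dictionary layout (gene name mapped to a list of per-sample values), and the linear interpolation between target values are illustrative assumptions, and ties are not given any special treatment.

```python
from typing import Dict, List, Sequence


def target_quantiles(target: Sequence[float], n: int) -> List[float]:
    """Return n evenly spaced quantiles of `target` (linear interpolation)."""
    ordered = sorted(target)
    m = len(ordered)
    if n == 1:
        # A single sample is mapped to the median position of the target.
        positions = [(m - 1) / 2]
    else:
        positions = [i * (m - 1) / (n - 1) for i in range(n)]
    quantiles = []
    for pos in positions:
        lo = int(pos)
        hi = min(lo + 1, m - 1)
        frac = pos - lo
        quantiles.append(ordered[lo] * (1 - frac) + ordered[hi] * frac)
    return quantiles


def normalize_feature(values: Sequence[float], target: Sequence[float]) -> List[float]:
    """Map one gene's values onto the target distribution for that gene.

    The sample with the smallest value receives the smallest target quantile,
    and so on. Ties keep their input order (an assumption of this sketch).
    """
    n = len(values)
    quantiles = target_quantiles(target, n)
    order = sorted(range(n), key=lambda i: values[i])
    result = [0.0] * n
    for rank, idx in enumerate(order):
        result[idx] = quantiles[rank]
    return result


def fsqn_sketch(
    test: Dict[str, Sequence[float]],
    target: Dict[str, Sequence[float]],
) -> Dict[str, List[float]]:
    """Normalize every gene shared by `test` (RNA-seq) and `target` (microarray)."""
    return {
        gene: normalize_feature(values, target[gene])
        for gene, values in test.items()
        if gene in target
    }


if __name__ == "__main__":
    # Toy values for two hypothetical genes; not data from the study.
    rnaseq = {"GENE_A": [5.0, 1.0, 3.0], "GENE_B": [0.2, 0.9, 0.5]}
    microarray = {"GENE_A": [10.0, 20.0, 30.0], "GENE_B": [7.0, 8.0, 9.0, 10.0, 11.0]}
    print(fsqn_sketch(rnaseq, microarray))
```

Because each gene is mapped through its own ranks, the result depends on how well the RNA-seq samples cover that gene's range, which is consistent with the abstract's observation that accuracy is highest when at least 25 samples are normalized together.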
FSQN has been submitted as an R package to CRAN. All code used for this study is available on GitHub (https://github.com/jenniferfranks/FSQN).
michael.l.whitfield@dartmouth.edu.
Supplementary data are available at Bioinformatics online.