Castillo Daniel, Gálvez Juan Manuel, Herrera Luis Javier, Román Belén San, Rojas Fernando, Rojas Ignacio
Department of Computer Architecture and Technology, University of Granada, Periodista Rafael Gómez Montero, 2, Granada, 18014, Spain.
BMC Bioinformatics. 2017 Nov 21;18(1):506. doi: 10.1186/s12859-017-1925-0.
Nowadays, many public repositories containing large microarray gene expression datasets are available. However, microarray technology is less powerful and accurate than more recent Next Generation Sequencing technologies, such as RNA-Seq. Even so, information from microarrays is reliable and robust, so it can be exploited by integrating microarray data with RNA-Seq data. Additionally, extracting information from RNA-Seq and acquiring a large number of RNA-Seq samples still entail very high costs in terms of time and computational resources. This paper proposes a new model to find the gene signature of breast cancer cell lines through the integration of heterogeneous data from different breast cancer datasets, obtained from microarray and RNA-Seq technologies. Data integration is expected to lend more robust statistical significance to the results obtained. Finally, a classification method is proposed in order to test the robustness of the Differentially Expressed Genes when unseen data is presented for diagnosis.
The proposed data integration allows gene expression samples from different technologies to be analyzed together. The most significant genes of the whole integrated data were obtained through the intersection of three gene sets: the expressed genes identified within the microarray data alone, within the RNA-Seq data alone, and within the integrated data from both technologies. This intersection reveals 98 possible technology-independent biomarkers. Two different heterogeneous datasets were distinguished for the classification tasks: a training dataset for gene expression identification and classifier validation, and a test dataset with unseen data for testing the classifier. Both achieved high classification accuracy, supporting the validity of the obtained set of genes as possible biomarkers for breast cancer. Through a feature selection process, a final small subset made up of six genes was considered for breast cancer diagnosis.
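The three-way intersection described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the function name `technology_independent_genes` and the gene identifiers are hypothetical placeholders, and the three input lists are assumed to come from whatever differential expression analysis was run on each dataset.

```python
def technology_independent_genes(microarray_genes, rnaseq_genes, integrated_genes):
    """Return the genes reported by all three analyses, sorted for reproducibility."""
    shared = set(microarray_genes) & set(rnaseq_genes) & set(integrated_genes)
    return sorted(shared)


# Placeholder identifiers (assumption): these are NOT genes reported in the paper.
microarray_genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
rnaseq_genes = ["GENE_B", "GENE_C", "GENE_D", "GENE_E"]
integrated_genes = ["GENE_C", "GENE_D", "GENE_F"]

shared = technology_independent_genes(microarray_genes, rnaseq_genes, integrated_genes)
print(shared)  # ['GENE_C', 'GENE_D']
```

A gene survives only if it appears in every list, which is why the result is a candidate set that does not depend on a single technology.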
This work proposes a novel data integration stage in the traditional gene expression analysis pipeline through the combination of heterogeneous data from microarray and RNA-Seq technologies. Available samples have been successfully classified using a subset of six genes obtained by a feature selection method. Consequently, a new classification and diagnosis tool was built, and its performance was validated using previously unseen samples.
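The train-then-test-on-unseen-samples protocol can be illustrated with a toy example. The abstract does not name the classifier, so the sketch below swaps in a simple nearest-centroid rule purely for illustration. The expression values, the class labels `"control"` and `"cancer"`, and the use of two features instead of six are all assumptions made to keep the example short.

```python
from statistics import mean


def fit_centroids(samples, labels):
    """Compute the mean expression vector of each class from the training data."""
    centroids = {}
    for label in set(labels):
        rows = [s for s, l in zip(samples, labels) if l == label]
        centroids[label] = [mean(column) for column in zip(*rows)]
    return centroids


def predict(centroids, sample):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: squared_distance(centroids[label], sample))


def accuracy(centroids, samples, labels):
    """Fraction of samples whose predicted class matches the true label."""
    hits = sum(predict(centroids, s) == l for s, l in zip(samples, labels))
    return hits / len(labels)


# Toy expression values (assumption): two selected genes per sample, invented numbers.
train_x = [[1.0, 5.0], [1.2, 4.8], [4.0, 1.0], [4.2, 1.1]]
train_y = ["control", "control", "cancer", "cancer"]
# Held-out samples that play no part in fitting the centroids.
test_x = [[0.9, 5.1], [4.1, 0.9]]
test_y = ["control", "cancer"]

centroids = fit_centroids(train_x, train_y)
test_accuracy = accuracy(centroids, test_x, test_y)
print(test_accuracy)
```

Keeping the test samples out of `fit_centroids` is the point of the design: the reported accuracy then reflects behavior on data the classifier has not seen.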