Gold Maxwell P, LeNail Alexander, Fraenkel Ernest
Department of Biological Engineering, Massachusetts Institute of Technology, 21 Ames St. Cambridge, MA, 02139, USA.
Pac Symp Biocomput. 2019;24:374-385.
When analyzing biological data, it can be helpful to consider gene sets, or predefined groups of biologically related genes. Methods exist for identifying gene sets that differ between conditions, but large public datasets from consortium projects and single-cell RNA-Sequencing have opened the door for gene set analysis using more sophisticated machine learning techniques, such as autoencoders and variational autoencoders. We present shallow sparsely-connected autoencoders (SSCAs) and variational autoencoders (SSCVAs) as tools for projecting gene-level data onto gene sets. We tested these approaches on single-cell RNA-Sequencing data from blood cells and on RNA-Sequencing data from breast cancer patients. Both SSCA and SSCVA can recover known biological features from these datasets, and the SSCVA method often outperforms SSCA (and six existing gene set scoring algorithms) on classification and prediction tasks.
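The idea of projecting gene-level data onto gene sets through a sparsely-connected layer can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes one hidden node per gene set, a binary mask that allows a connection only where a gene belongs to the set, a linear encoder, and a decoder that reuses the encoder weights. The toy gene sets, the function names, and the weight initialization are all hypothetical, and no training loop or variational component is shown.

```python
import random

# Hypothetical toy inputs: four genes and two illustrative gene sets.
genes = ["CD3E", "CD19", "MS4A1", "CD8A"]
gene_sets = {
    "T_CELL_TOY": ["CD3E", "CD8A"],
    "B_CELL_TOY": ["CD19", "MS4A1"],
}


def build_mask(genes, gene_sets):
    """Return a (n_sets x n_genes) 0/1 matrix: 1 where the gene is in the set."""
    index = {gene: i for i, gene in enumerate(genes)}
    mask = [[0] * len(genes) for _ in gene_sets]
    for row, members in zip(mask, gene_sets.values()):
        for gene in members:
            if gene in index:
                row[index[gene]] = 1
    return mask


def init_weights(mask, seed=0):
    """Draw small random weights, zeroed wherever the mask forbids a connection."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.1) * m for m in row] for row in mask]


def encode(x, weights, mask):
    """Project one gene-level profile onto gene set scores (linear, assumed)."""
    return [
        sum(w * m * xi for w, m, xi in zip(w_row, m_row, x))
        for w_row, m_row in zip(weights, mask)
    ]


def decode(scores, weights, mask):
    """Reconstruct gene-level values from set scores using tied weights (assumed)."""
    n_genes = len(mask[0])
    return [
        sum(scores[k] * weights[k][j] * mask[k][j] for k in range(len(scores)))
        for j in range(n_genes)
    ]


mask = build_mask(genes, gene_sets)
weights = init_weights(mask)

# Expression profile ordered like `genes`; only the B-cell toy genes are expressed.
profile = [0.0, 5.0, 3.0, 0.0]
set_scores = encode(profile, weights, mask)
reconstruction = decode(set_scores, weights, mask)
```

Because the mask is reapplied in every pass, each hidden node depends only on the genes in its own set, which is what makes the node readable as a gene set score. In a trained model the mask would also have to be applied after every weight update so that forbidden connections stay at zero.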